Re: [PATCH 1/1] net: pegasus: simplify logical constraint

2016-05-18 Thread Petko Manolov
On 16-05-18 20:40:51, Heinrich Schuchardt wrote:
> If !count is true, count < 4 is also true.

Yep, you're right.  However, gcc optimizes away the first condition.  What you 
really got me thinking about is whether 4 is the right number.  I guess I shall 
refer to the HW documentation.


Petko


> Signed-off-by: Heinrich Schuchardt 
> ---
>  drivers/net/usb/pegasus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> index 36cd7f0..9bbe0161 100644
> --- a/drivers/net/usb/pegasus.c
> +++ b/drivers/net/usb/pegasus.c
> @@ -473,7 +473,7 @@ static void read_bulk_callback(struct urb *urb)
>   goto goon;
>   }
>  
> - if (!count || count < 4)
> + if (count < 4)
>   goto goon;
>  
>   rx_status = buf[count - 2];
> -- 
> 2.1.4
> 
> 


RE: [PATCH linux-firmware] qed: Add FW 8.10.5.0

2016-05-18 Thread Yuval Mintz
> Hi,
> 
> Please consider applying this to `linux-firmware'.
> 
> Thanks,
> Yuval

I don't like to nag [and surely not so early after sending this], but do
you have an ETA for when you're going to flush the linux-firmware pipe?
[Asking as last time due to the holidays it took ~3 weeks]


RE: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-18 Thread Dexuan Cui
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, May 19, 2016 12:13
> To: Dexuan Cui 
> Cc: KY Srinivasan ; o...@aepfle.de;
> gre...@linuxfoundation.org; jasow...@redhat.com; linux-
> ker...@vger.kernel.org; j...@perches.com; netdev@vger.kernel.org;
> a...@canonical.com; de...@linuxdriverproject.org; Haiyang Zhang
> 
> Subject: Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
> 
> 
> I'm travelling and very busy with the merge window.  So sorry I won't be able
> to think about this for some time.

David, 
Sure, I understand.

Please let me recap my last mail:

1) I'll replace my statically-allocated per-connection "send/recv bufs" with
dynamically allocated ones, so no buffer is used when there is no traffic.

2) The other kind of buffer, i.e. the multi-page "VMBus send/recv ringbuffer", is
a must IMO due to the host side's design of the feature: every connection needs
its own ringbuffer, which takes several pages (2~3 pages at least; 5 pages
should suffice for good performance). The ringbuffer can be accessed by the
host at any time, so IMO the pages can't be swappable.

I understand net-next is closed now. I'm going to post the next version
after 4.7-rc1 is out in several weeks.

If you could give me some suggestions, I would definitely be happy to take them.

Thanks!
-- Dexuan


[PATCH net V2] tuntap: correctly wake up process during uninit

2016-05-18 Thread Jason Wang
We used to check dev->reg_state against NETREG_REGISTERED after each
time we were woken up. But after commit 9e641bdcfa4e ("net-tun:
restructure tun_do_read for better sleep/wakeup efficiency"), it uses
skb_recv_datagram(), which does not check dev->reg_state. As a result,
if we delete a tun/tap device while a process is blocked in a read, the
device will wait forever for the reference count held by that process.

Fix this by setting RCV_SHUTDOWN, which is checked in
skb_recv_datagram(), before trying to wake up the process during uninit.

Fixes: 9e641bdcfa4e ("net-tun: restructure tun_do_read for better
sleep/wakeup efficiency")
Cc: Eric Dumazet 
Cc: Xi Wang 
Cc: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
- The patch is needed for -stable.
- Changes from v1: remove unnecessary NETREG_REGISTERED check in tun_do_read()
---
 drivers/net/tun.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 425e983..e16487c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -580,11 +580,13 @@ static void tun_detach_all(struct net_device *dev)
for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
BUG_ON(!tfile);
+   tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
RCU_INIT_POINTER(tfile->tun, NULL);
--tun->numqueues;
}
list_for_each_entry(tfile, &tun->disabled, next) {
+   tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
RCU_INIT_POINTER(tfile->tun, NULL);
}
@@ -641,6 +643,7 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file, bool skip_filte
goto out;
}
tfile->queue_index = tun->numqueues;
+   tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
rcu_assign_pointer(tfile->tun, tun);
rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
tun->numqueues++;
@@ -1491,9 +1494,6 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct 
tun_file *tfile,
if (!iov_iter_count(to))
return 0;
 
-   if (tun->dev->reg_state != NETREG_REGISTERED)
-   return -EIO;
-
/* Read frames from queue */
skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
  &peeked, &off, &err);
-- 
2.7.4



Re: [PATCH net] tuntap: correctly wake up process during uninit

2016-05-18 Thread Jason Wang



On 2016-05-18 21:01, Eric Dumazet wrote:

On Wed, 2016-05-18 at 18:58 +0800, Jason Wang wrote:

We used to check dev->reg_state against NETREG_REGISTERED after each
time we were woken up. But after commit 9e641bdcfa4e ("net-tun:
restructure tun_do_read for better sleep/wakeup efficiency"), it uses
skb_recv_datagram(), which does not check dev->reg_state. As a result,
if we delete a tun/tap device while a process is blocked in a read, the
device will wait forever for the reference count held by that process.

Fix this by setting RCV_SHUTDOWN, which is checked in
skb_recv_datagram(), before trying to wake up the process during uninit.

Fixes: 9e641bdcfa4e ("net-tun: restructure tun_do_read for better
sleep/wakeup efficiency")




Ok.


Cc: Eric Dumazet 
Cc: Xi Wang 
Cc: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
The patch is needed for -stable.
---
  drivers/net/tun.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 425e983..752d849 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -580,11 +580,13 @@ static void tun_detach_all(struct net_device *dev)
for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
BUG_ON(!tfile);
+   tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
RCU_INIT_POINTER(tfile->tun, NULL);
--tun->numqueues;
}
list_for_each_entry(tfile, &tun->disabled, next) {
+   tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
RCU_INIT_POINTER(tfile->tun, NULL);
}
@@ -641,6 +643,7 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file, bool skip_filte
goto out;
}
tfile->queue_index = tun->numqueues;
+   tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
rcu_assign_pointer(tfile->tun, tun);
rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
tun->numqueues++;
Is the "if (tun->dev->reg_state != NETREG_REGISTERED) return -EIO;"
check still needed then ?

Thanks.




No need since we've checked tun before; will remove this in V2.

Thanks



Re: [PATCH 1/1 RFC] net/phy: Add Lantiq PHY driver

2016-05-18 Thread John Crispin
Hi,

On 18/05/2016 18:24, Florian Fainelli wrote:
> CC'ing Andrew, John,
> 

Also CC'ing Matthias and Hauke. We have had a driver in OpenWrt/LEDE for
several years that seems a little more complete than this one.

https://git.lede-project.org/?p=source.git;a=blob;f=target/linux/lantiq/patches-4.4/0023-NET-PHY-adds-driver-for-lantiq-PHY11G.patch;h=93bb4275ec1d261f398afb8fdc879c1dd973f997;hb=HEAD

John


> On 05/18/2016 09:03 AM, Alexander Stein wrote:
>> This currently only supports PEF7071 and allows to specify max-speed and
>> is able to read the LED configuration from device-tree.
>>
>> Signed-off-by: Alexander Stein 
>> ---
>> The main purpose for now is to set a LED configuration from device tree and
>> to limit the maximum speed. The latter is, in my case, hardware limited:
>> although the MAC and its link partner support 1000MBit/s, they would try to
>> use that but eventually fail because the magnetics only support 100MBit/s. So
>> limit the maximum link speed supported directly from the start.
> 
> The 'max-speed' parsing that you do in the driver should not be needed,
> PHYLIB takes care of that already see
> drivers/net/phy/phy_device.c::of_set_phy_supported
> 
> For LEDs, we had a patch series floating around adding LED triggers [1],
> and it seems to me like the LEDs class subsystem would be a good fit for
> controlling PHY LEDs, possibly with the help of PHYLIB when it comes to
> doing the low-level work of registering LEDs and their names with the
> LEDS subsystem.
> 
> [1]: http://lists.openwall.net/netdev/2016/03/23/61
> 
>>
>> As this is a RFC I skipped the device tree binding doc.
> 
> Too bad, that's probably what needs to be discussed here, because the
> driver looks pretty reasonable otherwise.
> 
>>
>>  drivers/net/phy/Kconfig  |   5 ++
>>  drivers/net/phy/Makefile |   1 +
>>  drivers/net/phy/lantiq.c | 167 
>> +++
>>  3 files changed, 173 insertions(+)
>>  create mode 100644 drivers/net/phy/lantiq.c
>>
>> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
>> index 3e28f7a..c004885 100644
>> --- a/drivers/net/phy/Kconfig
>> +++ b/drivers/net/phy/Kconfig
>> @@ -119,6 +119,11 @@ config STE10XP
>>  ---help---
>>This is the driver for the STe100p and STe101p PHYs.
>>  
>> +config LANTIQ_PHY
>> +tristate "Driver for Lantiq PHYs"
>> +---help---
>> +  Supports the PEF7071 PHYs.
>> +
>>  config LSI_ET1011C_PHY
>>  tristate "Driver for LSI ET1011C PHY"
>>  ---help---
>> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
>> index 8ad4ac6..e886549 100644
>> --- a/drivers/net/phy/Makefile
>> +++ b/drivers/net/phy/Makefile
>> @@ -38,3 +38,4 @@ obj-$(CONFIG_MDIO_SUN4I)   += mdio-sun4i.o
>>  obj-$(CONFIG_MDIO_MOXART)   += mdio-moxart.o
>>  obj-$(CONFIG_AMD_XGBE_PHY)  += amd-xgbe-phy.o
>>  obj-$(CONFIG_MDIO_BCM_UNIMAC)   += mdio-bcm-unimac.o
>> +obj-$(CONFIG_LANTIQ_PHY)+= lantiq.o
>> diff --git a/drivers/net/phy/lantiq.c b/drivers/net/phy/lantiq.c
>> new file mode 100644
>> index 000..876a7d1
>> --- /dev/null
>> +++ b/drivers/net/phy/lantiq.c
>> @@ -0,0 +1,167 @@
>> +/*
>> + * Driver for Lantiq PHYs
>> + *
>> + * Author: Alexander Stein 
>> + *
>> + * Copyright (c) 2015-2016 SYS TEC electronic GmbH
>> + *
>> + * This program is free software; you can redistribute  it and/or modify it
>> + * under  the terms of  the GNU General  Public License as published by the
>> + * Free Software Foundation;  either version 2 of the  License, or (at your
>> + * option) any later version.
>> + *
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#define PHY_ID_PEF7071  0xd565a401
>> +
>> +#define MII_LANTIQ_MMD_CTRL_REG 0x0d
>> +#define MII_LANTIQ_MMD_REGDATA_REG  0x0e
>> +#define OP_DATA 1
>> +
>> +struct lantiqphy_led_ctrl {
>> +const char *property;
>> +u32 regnum;
>> +};
>> +
>> +static int lantiq_extended_write(struct phy_device *phydev,
>> + u8 mode, u32 dev_addr, u32 regnum, u16 val)
>> +{
>> +phy_write(phydev, MII_LANTIQ_MMD_CTRL_REG, dev_addr);
>> +phy_write(phydev, MII_LANTIQ_MMD_REGDATA_REG, regnum);
>> +phy_write(phydev, MII_LANTIQ_MMD_CTRL_REG, (mode << 14) | dev_addr);
>> +return phy_write(phydev, MII_LANTIQ_MMD_REGDATA_REG, val);
>> +}
>> +
>> +static int lantiq_of_load_led_config(struct phy_device *phydev,
>> + struct device_node *of_node,
>> + const struct lantiqphy_led_ctrl *leds,
>> + u8 entries)
>> +{
>> +u16 val;
>> +int i;
>> +int ret = 0;
>> +
>> +for (i = 0; i < entries; i++) {
>> +if (!of_property_read_u16(of_node, leds[i].property, &val)) {
>> +ret = lantiq_extended_write(phydev, OP_DATA, 0x1f,
>> + 

Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 22:05 -0600, David Ahern wrote:

> You think it is ok to send a request to the kernel, the kernel says "I 
> can't do it" and the command says nothing to the user? That is current 
> behavior. How on Earth is that acceptable?

I don't know. Tell me what is acceptable on a 'dump many sockets' where
some of them can be killed, but not all of them.

What I do know is that you sent totally buggy patches.

If you want to 'fix' something, please send a patch that we can agree
on, i.e. not breaking existing scripts.


Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-18 Thread David Miller

I'm travelling and very busy with the merge window.  So sorry I won't be able
to think about this for some time.


Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches

2016-05-18 Thread David Miller
From: Jeff Kirsher 
Date: Wed, 18 May 2016 14:27:58 -0700

> On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote:
>> This patch series is meant to be applied after:
>> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6
>> tunneling
>> 
>> The first patch addresses an issue we already resolved in the GREv4 and is
>> now present in GREv6 with the introduction of FOU/GUE for IPv6 based GRE
>> tunnels.
>> 
>> The second patch goes through and enables IPv6 tunnel offloads for the Intel
>> NICs that already support the IPv4 based IP-in-IP tunnel offloads.  I have
>> only done a bit of touch testing but have seen ~20 Gb/s over an i40e
>> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is still
>> passing traffic at around the same rate.  I plan to do further testing but
>> with these patches present it should enable a wider audience to be able to
>> test the new features introduced in Tom's patchset with hardware offloads.
>> 
>> ---
>> 
>> Alexander Duyck (2):
>>   ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled
>> with FOU/GUE
>>   intel: Add support for IPv6 IP-in-IP offload
> 
> Dave, I have this series added to my queue.

Why would you if it depends upon Tom's series, as mentioned above, which
isn't even in my tree yet?


Re: [GIT] Networking

2016-05-18 Thread David Miller
From: Linus Torvalds 
Date: Wed, 18 May 2016 11:45:06 -0700

> David, do you happen to recall that merge conflict? I think you must
> have removed that "skb_info" variable declaration and initialization
> manually (due to the "unused variable" warning, which in turn was due
> to the incorrect merge of the actual conflict), because I think git
> would have merged that line into the result.

Yes, I know I buggered this merge conflict and Kalle said he'd have
a fix coming my way ASAP.

Sorry, I was travelling today, so I'll catch up with this tomorrow.


Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-18 Thread David Ahern

On 5/18/16 9:47 PM, Eric Dumazet wrote:

On Wed, 2016-05-18 at 21:02 -0600, David Ahern wrote:

On 5/18/16 6:55 PM, Lorenzo Colitti wrote:

On Wed, May 18, 2016 at 3:35 AM,  wrote:

Would it be acceptable to have a separate column which displays the result of 
the sock destroy operation per socket.
State... Killed
ESTAB Y
TIME_WAIT N


Fine by me, but... what problem are we trying to address? People who
compile their own kernels and don't turn CONFIG_INET_DIAG_DESTROY, and
then are confused why it doesn't work? Seems like we could fix that by
turning CONFIG_INET_DIAG_DESTROY on by default. CCing the people who
commented on the original SOCK_DESTROY patch to see if they have
opinions.


The problem is proper feedback to a user. If the kernel supports an
action but the action is not enabled, the user should get some message to
that effect. Doing nothing and exiting 0 is just wrong.


So, let's say the filter found 123456 sockets matching the filter, and
12345 could be killed.

What would be exit status of ss command ?


Again, I couldn't care less if the exit status is 0 as long as the user is
given an "A request failed because the operation is not supported" message.
That is feedback.




In this case, there is no black/white answer.

It looks like you have specific needs, you should probably add an option
to ss to have a specific behavior.

But saying current behavior is 'wrong' is subjective.


You think it is ok to send a request to the kernel, the kernel says "I 
can't do it" and the command says nothing to the user? That is current 
behavior. How on Earth is that acceptable?


I believe my last proposal is that the user gets a single "I could not 
do what you asked" message and a nice little summary:


N sockets closed
M sockets failed

If M = total number of sockets then perhaps there is a bigger problem -- 
like a config option is not enabled.


Re: [patch net-next 1/3] mlxsw: spectrum: Reduce number of supported 802.1D bridges

2016-05-18 Thread David Miller
From: Jiri Pirko 
Date: Wed, 18 May 2016 14:40:50 +0200

> This patch, commit b555cf4a50c17a9714715a2d7c8574dca1a7b356 ("mlxsw: spectrum:
> Reduce number of supported 802.1D bridges") reduced the resources the
> device allocates during init. Newer firmware versions now force a hard
> limit on these resources and thus require this patch.
> 
> We would like to ask this commit be backported to stable kernels, so
> that they can be used with newer firmware versions.

Breaking existing working drivers with a firmware update is a terrible
user experience, and in my opinion not a reason to do a -stable
backport.  Fix how you manage this stuff instead.

This new behavior should be requested by drivers that can handle it,
rather than retroactively forced upon existing setups.


Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 21:02 -0600, David Ahern wrote:
> On 5/18/16 6:55 PM, Lorenzo Colitti wrote:
> > On Wed, May 18, 2016 at 3:35 AM,  wrote:
> >> Would it be acceptable to have a separate column which displays the result 
> >> of the sock destroy operation per socket.
> >> State... Killed
> >> ESTAB Y
> >> TIME_WAIT N
> >
> > Fine by me, but... what problem are we trying to address? People who
> > compile their own kernels and don't turn CONFIG_INET_DIAG_DESTROY, and
> > then are confused why it doesn't work? Seems like we could fix that by
> > turning CONFIG_INET_DIAG_DESTROY on by default. CCing the people who
> > commented on the original SOCK_DESTROY patch to see if they have
> > opinions.
> 
> The problem is proper feedback to a user. If the kernel supports an 
> action but the action is not enabled, the user should get some message to 
> that effect. Doing nothing and exiting 0 is just wrong.

So, let's say the filter found 123456 sockets matching the filter, and
12345 could be killed.

What would be exit status of ss command ?

In this case, there is no black/white answer.

It looks like you have specific needs, you should probably add an option
to ss to have a specific behavior.

But saying current behavior is 'wrong' is subjective.




Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-18 Thread David Ahern

On 5/18/16 6:55 PM, Lorenzo Colitti wrote:

On Wed, May 18, 2016 at 3:35 AM,  wrote:

Would it be acceptable to have a separate column which displays the result of 
the sock destroy operation per socket.
State... Killed
ESTAB Y
TIME_WAIT N


Fine by me, but... what problem are we trying to address? People who
compile their own kernels and don't turn CONFIG_INET_DIAG_DESTROY, and
then are confused why it doesn't work? Seems like we could fix that by
turning CONFIG_INET_DIAG_DESTROY on by default. CCing the people who
commented on the original SOCK_DESTROY patch to see if they have
opinions.


The problem is proper feedback to a user. If the kernel supports an 
action but the action is not enabled, the user should get some message to 
that effect. Doing nothing and exiting 0 is just wrong.





If it is not supported from kernel, maybe print U (unsupported) for this.


In current code there is no way to distinguish U from N because in
both cases the error will be EOPNOTSUPP. It's certainly possible to
change SOCK_DESTROY to return something else (e.g., EBADFD) to
indicate "kernel supports closing this type of socket, but it can't be
closed due to the state it's in". In hindsight, perhaps I should have
done that from the start.

Regardless, we still have the problem of what to do if the user says
"ss -K dport = :443" and we encounter a UDP socket connected to port
443. Options:

1. Silently skip. If the tool prints something, it means it closed it.
2. Abort with an error message.
3. Skip the socket and print an error every time this happens.
4. Skip the socket and print an error the first time this happens.

Personally I still think #1 is the best option.


5. Print an error the first time and a summary at the end.

If the filter matches N sockets and all N fail with EOPNOTSUPP, give the 
user a message saying that all failed with an unsupported-operation error, 
which could mean the CONFIG option is not enabled or that the sockets 
cannot be forcibly closed.


Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-18 Thread gre...@linuxfoundation.org
On Thu, May 19, 2016 at 12:59:09AM +, Dexuan Cui wrote:
> > From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On 
> > Behalf
> > Of Dexuan Cui
> > Sent: Tuesday, May 17, 2016 10:46
> > To: David Miller 
> > Cc: o...@aepfle.de; gre...@linuxfoundation.org; jasow...@redhat.com;
> > linux-ker...@vger.kernel.org; j...@perches.com; netdev@vger.kernel.org;
> > a...@canonical.com; de...@linuxdriverproject.org; Haiyang Zhang
> > 
> > Subject: RE: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
> > 
> > > From: David Miller [mailto:da...@davemloft.net]
> > > Sent: Monday, May 16, 2016 1:16
> > > To: Dexuan Cui 
> > > Cc: gre...@linuxfoundation.org; netdev@vger.kernel.org; linux-
> > > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> > > a...@canonical.com; jasow...@redhat.com; cav...@redhat.com; KY
> > > Srinivasan ; Haiyang Zhang ;
> > > j...@perches.com; vkuzn...@redhat.com
> > > Subject: Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM 
> > > Sockets(hv_sock)
> > >
> > > From: Dexuan Cui 
> > > Date: Sun, 15 May 2016 09:52:42 -0700
> > >
> > > > Changes since v10
> > > >
> > > > 1) add module params: send_ring_page, recv_ring_page. They can be used
> > to
> > > > enlarge the ringbuffer size to get better performance, e.g.,
> > > > # modprobe hv_sock  recv_ring_page=16 send_ring_page=16
> > > > By default, recv_ring_page is 3 and send_ring_page is 2.
> > > >
> > > > 2) add module param max_socket_number (the default is 1024).
> > > > A user can enlarge the number to create more than 1024 hv_sock sockets.
> > > > By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
> > > > (Here 1+1 means 1 page for send/recv buffers per connection, 
> > > > respectively.)
> > >
> > > This is papering around my objections, and creates module parameters which
> > > I am fundamentally against.
> > >
> > > You're making the facility unusable by default, just to work around my
> > > memory consumption concerns.
> > >
> > > What will end up happening is that everyone will simply increase the
> > > values.
> > >
> > > You're not really addressing the core issue, and I will be ignoring your
> > > future submissions of this change until you do.
> > 
> > David,
> > I am sorry I came across as ignoring your feedback; that was not my 
> > intention.
> > The current host side design for this feature is such that each socket 
> > connection
> > needs its own channel, which consists of
> > 
> > 1.A ring buffer for host to guest communication
> > 2.A ring buffer for guest to host communication
> > 
> > The memory for the ring buffers has to be pinned down as this will be 
> > accessed
> > both from interrupt level in Linux guest and from the host OS at any time.
> > 
> > To address your concerns, I am planning to re-implement both the receive 
> > path
> > and the send path so that no additional pinned memory will be needed.
> > 
> > Receive Path:
> > When the application does a read on the socket, we will dynamically allocate
> > the buffer and perform the read operation on the incoming ring buffer. Since
> > we will be in the process context, we can sleep here and will set the
> > "GFP_KERNEL | __GFP_NOFAIL" flags. This buffer will be freed once the
> > application consumes all the data.
> > 
> > Send Path:
> > On the send side, we will construct the payload to be sent directly on the
> > outgoing ringbuffer.
> > 
> > So, with these changes, the only memory that will be pinned down will be the
> > memory for the ring buffers on a per-connection basis and this memory will 
> > be
> > pinned down until the connection is torn down.
> > 
> > Please let me know if this addresses your concerns.
> > 
> > -- Dexuan
> 
> Hi David,
> Ping. Really appreciate your comment.

Don't wait for people to respond to random design questions, go work on
the code and figure out if it is workable or not yourself.  Then post
patches.  We aren't responsible for your work, you are.

greg k-h


RE: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

2016-05-18 Thread Dexuan Cui
> From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On Behalf
> Of Dexuan Cui
> Sent: Tuesday, May 17, 2016 10:46
> To: David Miller 
> Cc: o...@aepfle.de; gre...@linuxfoundation.org; jasow...@redhat.com;
> linux-ker...@vger.kernel.org; j...@perches.com; netdev@vger.kernel.org;
> a...@canonical.com; de...@linuxdriverproject.org; Haiyang Zhang
> 
> Subject: RE: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
> 
> > From: David Miller [mailto:da...@davemloft.net]
> > Sent: Monday, May 16, 2016 1:16
> > To: Dexuan Cui 
> > Cc: gre...@linuxfoundation.org; netdev@vger.kernel.org; linux-
> > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> > a...@canonical.com; jasow...@redhat.com; cav...@redhat.com; KY
> > Srinivasan ; Haiyang Zhang ;
> > j...@perches.com; vkuzn...@redhat.com
> > Subject: Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
> >
> > From: Dexuan Cui 
> > Date: Sun, 15 May 2016 09:52:42 -0700
> >
> > > Changes since v10
> > >
> > > 1) add module params: send_ring_page, recv_ring_page. They can be used
> to
> > > enlarge the ringbuffer size to get better performance, e.g.,
> > > # modprobe hv_sock  recv_ring_page=16 send_ring_page=16
> > > By default, recv_ring_page is 3 and send_ring_page is 2.
> > >
> > > 2) add module param max_socket_number (the default is 1024).
> > > A user can enlarge the number to create more than 1024 hv_sock sockets.
> > > By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
> > > (Here 1+1 means 1 page for send/recv buffers per connection, 
> > > respectively.)
> >
> > This is papering around my objections, and creates module parameters which
> > I am fundamentally against.
> >
> > You're making the facility unusable by default, just to work around my
> > memory consumption concerns.
> >
> > What will end up happening is that everyone will simply increase the
> > values.
> >
> > You're not really addressing the core issue, and I will be ignoring your
> > future submissions of this change until you do.
> 
> David,
> I am sorry I came across as ignoring your feedback; that was not my intention.
> The current host side design for this feature is such that each socket 
> connection
> needs its own channel, which consists of
> 
> 1.A ring buffer for host to guest communication
> 2.A ring buffer for guest to host communication
> 
> The memory for the ring buffers has to be pinned down as this will be accessed
> both from interrupt level in Linux guest and from the host OS at any time.
> 
> To address your concerns, I am planning to re-implement both the receive path
> and the send path so that no additional pinned memory will be needed.
> 
> Receive Path:
> When the application does a read on the socket, we will dynamically allocate
> the buffer and perform the read operation on the incoming ring buffer. Since
> we will be in the process context, we can sleep here and will set the
> "GFP_KERNEL | __GFP_NOFAIL" flags. This buffer will be freed once the
> application consumes all the data.
> 
> Send Path:
> On the send side, we will construct the payload to be sent directly on the
> outgoing ringbuffer.
> 
> So, with these changes, the only memory that will be pinned down will be the
> memory for the ring buffers on a per-connection basis and this memory will be
> pinned down until the connection is torn down.
> 
> Please let me know if this addresses your concerns.
> 
> -- Dexuan

Hi David,
Ping. Really appreciate your comment.

 Thanks,
-- Dexuan


Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-18 Thread Lorenzo Colitti
On Wed, May 18, 2016 at 3:35 AM,  wrote:
> Would it be acceptable to have a separate column which displays the result of 
> the sock destroy operation per socket.
> State... Killed
> ESTAB Y
> TIME_WAIT N

Fine by me, but... what problem are we trying to address? People who
compile their own kernels and don't turn CONFIG_INET_DIAG_DESTROY, and
then are confused why it doesn't work? Seems like we could fix that by
turning CONFIG_INET_DIAG_DESTROY on by default. CCing the people who
commented on the original SOCK_DESTROY patch to see if they have
opinions.

> If it is not supported from kernel, maybe print U (unsupported) for this.

In current code there is no way to distinguish U from N because in
both cases the error will be EOPNOTSUPP. It's certainly possible to
change SOCK_DESTROY to return something else (e.g., EBADFD) to
indicate "kernel supports closing this type of socket, but it can't be
closed due to the state it's in". In hindsight, perhaps I should have
done that from the start.

Regardless, we still have the problem of what to do if the user says
"ss -K dport = :443" and we encounter a UDP socket connected to port
443. Options:

1. Silently skip. If the tool prints something, it means it closed it.
2. Abort with an error message.
3. Skip the socket and print an error every time this happens.
4. Skip the socket and print an error the first time this happens.

Personally I still think #1 is the best option.


[PATCH] net/atm: sk_err_soft must be positive

2016-05-18 Thread Stefan Hajnoczi
The sk_err and sk_err_soft fields are positive errno values and
userspace applications rely on this when using getsockopt(SO_ERROR).

ATM code places an -errno into sk_err_soft in sigd_send() and returns it
from svc_addparty()/svc_dropparty().

Although I am not familiar with ATM code I came to this conclusion
because:

1. sigd_send() msg->type cases as_okay and as_error both have:

   sk->sk_err = -msg->reply;

   while the as_addparty and as_dropparty cases have:

   sk->sk_err_soft = msg->reply;

   This is the source of the inconsistency.

2. svc_addparty() returns an -errno and assumes sk_err_soft is also an
   -errno:

   if (flags & O_NONBLOCK) {
   error = -EINPROGRESS;
   goto out;
   }
   ...
   error = xchg(&sk->sk_err_soft, 0);
   out:
   release_sock(sk);
   return error;

   This shows that sk_err_soft is indeed being treated as an -errno.

This patch ensures that sk_err_soft is always a positive errno.

Signed-off-by: Stefan Hajnoczi 
---
This patch is untested and potentially affects the getsockopt(SO_ERROR) ABI for
a specific case in ATM.  I leave it to the maintainer to decide whether this
inconsistency should be fixed or not.

 net/atm/signaling.c | 2 +-
 net/atm/svc.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/atm/signaling.c b/net/atm/signaling.c
index 4fd6af4..adb6e3d 100644
--- a/net/atm/signaling.c
+++ b/net/atm/signaling.c
@@ -124,7 +124,7 @@ as_indicate_complete:
break;
case as_addparty:
case as_dropparty:
-   sk->sk_err_soft = msg->reply;
+   sk->sk_err_soft = -msg->reply;
/* < 0 failure, otherwise ep_ref */
	clear_bit(ATM_VF_WAITING, &vcc->flags);
break;
diff --git a/net/atm/svc.c b/net/atm/svc.c
index 3fa0a9e..878563a 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -546,7 +546,7 @@ static int svc_addparty(struct socket *sock, struct 
sockaddr *sockaddr,
schedule();
}
finish_wait(sk_sleep(sk), &wait);
-   error = xchg(&sk->sk_err_soft, 0);
+   error = -xchg(&sk->sk_err_soft, 0);
 out:
release_sock(sk);
return error;
@@ -573,7 +573,7 @@ static int svc_dropparty(struct socket *sock, int ep_ref)
error = -EUNATCH;
goto out;
}
-   error = xchg(&sk->sk_err_soft, 0);
+   error = -xchg(&sk->sk_err_soft, 0);
 out:
release_sock(sk);
return error;
-- 
2.5.5



Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches

2016-05-18 Thread Jeff Kirsher
On Wed, 2016-05-18 at 16:19 -0700, Alexander Duyck wrote:
> On Wed, May 18, 2016 at 2:27 PM, Jeff Kirsher
>  wrote:
> > On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote:
> >> This patch series is meant to be applied after:
> >> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6
> >> tunneling
> >>
> >> The first patch addresses an issue we already resolved in the GREv4
> and
> >> is
> >> now present in GREv6 with the introduction of FOU/GUE for IPv6 based
> GRE
> >> tunnels.
> >>
> >> The second patch goes through and enables IPv6 tunnel offloads for the
> >> Intel
> >> NICs that already support the IPv4 based IP-in-IP tunnel offloads.  I
> >> have
> >> only done a bit of touch testing but have seen ~20 Gb/s over an i40e
> >> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is
> still
> >> passing traffic at around the same rate.  I plan to do further testing
> >> but
> >> with these patches present it should enable a wider audience to be
> able
> >> to
> >> test the new features introduced in Tom's patchset with hardware
> >> offloads.
> >>
> >> ---
> >>
> >> Alexander Duyck (2):
> >>   ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled
> >> with FOU/GUE
> >>   intel: Add support for IPv6 IP-in-IP offload
> >
> > Dave, I have this series added to my queue.
> 
> Jeff,
> 
> If Tom's patches make it in for 4.7, then I would like to see if we
> could push these patches for "net" since essentially the first patch
> is a fix for Tom's earlier patches and the second is needed in order
> to really be able to test Tom's patches with a driver that actually
> supports a hardware offload.

Yeah, I was thinking that very same thing.  I will wait to see if Dave
sucks this into 4.7 or not and plan accordingly.  I also figured out that
your two patch series needs to have Tom's series applied before hand, so I
will be adding Tom's series to my tree just for testing purposes.



Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches

2016-05-18 Thread Alexander Duyck
On Wed, May 18, 2016 at 2:27 PM, Jeff Kirsher
 wrote:
> On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote:
>> This patch series is meant to be applied after:
>> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6
>> tunneling
>>
>> The first patch addresses an issue we already resolved in the GREv4 and
>> is
>> now present in GREv6 with the introduction of FOU/GUE for IPv6 based GRE
>> tunnels.
>>
>> The second patch goes through and enables IPv6 tunnel offloads for the
>> Intel
>> NICs that already support the IPv4 based IP-in-IP tunnel offloads.  I
>> have
>> only done a bit of touch testing but have seen ~20 Gb/s over an i40e
>> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is still
>> passing traffic at around the same rate.  I plan to do further testing
>> but
>> with these patches present it should enable a wider audience to be able
>> to
>> test the new features introduced in Tom's patchset with hardware
>> offloads.
>>
>> ---
>>
>> Alexander Duyck (2):
>>   ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled
>> with FOU/GUE
>>   intel: Add support for IPv6 IP-in-IP offload
>
> Dave, I have this series added to my queue.

Jeff,

If Tom's patches make it in for 4.7, then I would like to see if we
could push these patches for "net" since essentially the first patch
is a fix for Tom's earlier patches and the second is needed in order
to really be able to test Tom's patches with a driver that actually
supports a hardware offload.

Thanks.

- Alex


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-18 Thread Francois Romieu
Lino Sanfilippo  :
[...]
> what about the (only compile tested) code below?

I may have misunderstood some parts but it nonetheless seems broken.

> The smp_wmb() in tx function combined with the smp_rmb() in tx_clean ensures
> that the CPU running tx_clean sees consistent values for info, data and skb 
> (thus no need to check for validity of all three values any more).
> The mb() fulfills several tasks:
> 1. makes sure that DMA writes to descriptor are completed before the HW is
> informed.

"DMA writes" == "CPU writes" ?

> 2. On multi processor systems: ensures that txbd_curr is updated (this is 
> paired
> with the smp_mb() at the end of tx_clean).

Smells like using barrier side-effects to control smp coherency. It isn't
the recommended style.

> 3. Ensure we see the most recent value for tx_dirty. With this we do not have 
> to
> recheck after we stopped the tx queue.
> 
> 
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -162,8 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   struct sk_buff *skb = tx_buff->skb;
>   unsigned int info = le32_to_cpu(txbd->info);
>  
> - if ((info & FOR_EMAC) || !txbd->data || !skb)
> + if (info & FOR_EMAC) {
> + /* Make sure we see consistent values for info, skb
> +  * and data.
> +  */
> + smp_rmb();
>   break;
> + }

?

smp_rmb should appear before the variables you want coherency for.

>  
>   if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>   stats->tx_errors++;
> @@ -679,36 +684,33 @@ static int arc_emac_tx(struct sk_buff *skb, struct 
> net_device *ndev)
> dma_unmap_addr_set(&priv->tx_buff[*txbd_curr], addr, addr);
> dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>  
> - priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
>  
> - /* Make sure pointer to data buffer is set */
> - wmb();
> + priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> + priv->tx_buff[*txbd_curr].skb = skb;
>  
> - skb_tx_timestamp(skb);
> + /* Make sure info is set after data and skb with respect to
> +  * other tx_clean().
> +  */
> + smp_wmb();
>  
>   *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);

Afaik smp_wmb() does not imply wmb(). So priv->txbd[*txbd_curr].data and
*info (aka priv->txbd[*txbd_curr].info) are not necessarily written in
an orderly manner.

>  
> - /* Make sure info word is set */
> - wmb();
> -
> - priv->tx_buff[*txbd_curr].skb = skb;
> -
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;

With this change it's possible that tx_clean() reads new value for
tx_curr and old value (0) for *info.

>  
> - /* Ensure that tx_clean() sees the new txbd_curr before
> + /* 1.Ensure that tx_clean() sees the new txbd_curr before
>* checking the queue status. This prevents an unneeded wake
>* of the queue in tx_clean().
> +  * 2.Ensure that all values are written to RAM and to DMA
> +  * before hardware is informed.

(I am not sure what "DMA" is supposed to mean here.)

> +  * 3.Ensure we see the most recent value for tx_dirty.
>*/
> - smp_mb();
> + mb();
>  
> - if (!arc_emac_tx_avail(priv)) {
> + if (!arc_emac_tx_avail(priv))
>   netif_stop_queue(ndev);
> - /* Refresh tx_dirty */
> - smp_mb();
> - if (arc_emac_tx_avail(priv))
> - netif_start_queue(ndev);
> - }

Xmit thread| Clean thread

mb();

arc_emac_tx_avail() test with old
tx_dirty - tx_clean has not issued
any mb yet - and new tx_curr

 smp_mb();

 if (netif_queue_stopped(ndev) && ...
 netif_wake_queue(ndev);

netif_stop_queue()

-> queue stopped.

You can't remove the revalidation step.

arc_emac_tx_avail() is essentially pessimistic. Even if arc_emac_tx_avail()
was "right", there would be a tx_clean window between arc_emac_tx_avail()
and netif_stop_queue().

> +
> + skb_tx_timestamp(skb);

You don't want to issue skb_tx_timestamp after releasing control of the
descriptor (*info = ...): skb may be long gone.

-- 
Ueimor


Re: [PATCH v7 net-next 02/16] net: define gso types for IPx over IPv4 and IPv6

2016-05-18 Thread Jeff Kirsher
On Wed, 2016-05-18 at 09:06 -0700, Tom Herbert wrote:
> This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
> SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
> NETIF_F_GSO_IPXIP6. These are used to described IP in IP
> tunnel and what the outer protocol is. The inner protocol
> can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
> SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
> are removed (these are both instances of SKB_GSO_IPXIP4).
> SKB_GSO_IPXIP6 will be used when support for GSO with IP
> encapsulation over IPv6 is added.
> 
> Signed-off-by: Tom Herbert 

Acked-by: Jeff Kirsher 
For the Intel Ethernet driver changes...

> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  5 ++---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c |  5 ++---
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  3 +--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  3 +--
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  3 +--
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c   |  3 +--
>  drivers/net/ethernet/intel/igb/igb_main.c |  3 +--
>  drivers/net/ethernet/intel/igbvf/netdev.c |  3 +--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  3 +--
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
>  include/linux/netdev_features.h   | 12 ++--
>  include/linux/netdevice.h |  4 ++--
>  include/linux/skbuff.h    |  4 ++--
>  net/core/ethtool.c    |  4 ++--
>  net/ipv4/af_inet.c    |  2 +-
>  net/ipv4/ipip.c   |  2 +-
>  net/ipv6/ip6_offload.c    |  4 ++--
>  net/ipv6/sit.c    |  4 ++--
>  net/netfilter/ipvs/ip_vs_xmit.c   | 17 +++--
>  19 files changed, 37 insertions(+), 50 deletions(-)




Re: [PATCH net-next 1/4] scsi_tcp: block BH in TCP callbacks

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 15:48 -0500, Mike Christie wrote:

> Reviewed and tested. Thanks
> 
> Acked-by: Mike Christie 

Excellent, thanks Mike !




Re: [PATCH] net: suppress warnings on dev_alloc_skb

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 17:09 -0400, Neil Horman wrote:

> Oh, my bad, I misread what he was saying.  Yeah, thats the way to go.

Please submit a v2 ;)

Thanks





Re: [RFC PATCH net] e1000e: keep vlan interfaces functional after rxvlan off

2016-05-18 Thread Jeff Kirsher
On Tue, 2016-05-17 at 15:03 -0400, Jarod Wilson wrote:
> I've got a bug report about an e1000e interface, where a vlan interface
> is
> set up on top of it:
> 
> $ ip link add link ens1f0 name ens1f0.99 type vlan id 99
> $ ip link set ens1f0 up
> $ ip link set ens1f0.99 up
> $ ip addr add 192.168.99.92 dev ens1f0.99
> 
> At this point, I can ping another host on vlan 99, ip 192.168.99.91.
> However, if I do the following:
> 
> $ ethtool -K ens1f0 rxvlan off
> 
> Then no traffic passes on ens1f0.99. It comes back if I toggle rxvlan on
> again. I'm not sure if this is actually intended behavior, or if there's
> a
> lack of software vlan stripping fallback, or what, but things continue to
> work if I simply don't call e1000e_vlan_strip_disable() if there are
> active vlans (plagiarizing a function from the e1000 driver here) on the
> interface.
> 
> Also slipped a related-ish fix to the kerneldoc text for
> e1000e_vlan_strip_disable here...
> 
> CC: Jeff Kirsher 
> CC: intel-wired-...@lists.osuosl.org
> CC: netdev@vger.kernel.org
> Signed-off-by: Jarod Wilson 
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)

Raanan, please review this patch.  Even though it is an RFC I will be
adding it to my queue for testing.
http://patchwork.ozlabs.org/patch/623238/



Re: [PATCHv2 bluetooth-next 04/10] ndisc: add addr_len parameter to ndisc_opt_addr_space

2016-05-18 Thread Michael Richardson

Alexander Aring  wrote:
> What I did in the patch series to store short address in neighbour
> private data, which makes everything 802.15.4 6lowpan specific.

FYI: there are a whole family of network types which have multiple
addresses (long/short).  This includes:
1) BTLE - RFC 7668 IPv6 over BLUETOOTH(R) Low Energy
2) G.9959 - RFC 7428 Transmission of IPv6 Packets over ITU-T G.9959 Networks
3) BACnet and DECT will be defined.  Some of these have 1 byte short
addresses...
4) the various LPWAN people might be able to do IPv6... we'll see.

I think that we the 802.15.4 specific stuff is fine for now, and we can
figure out it might need generalization later on.

--
]   Never tell me the odds! | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works| network architect  [
] m...@sandelman.ca  http://www.sandelman.ca/|   ruby on rails[





Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches

2016-05-18 Thread Jeff Kirsher
On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote:
> This patch series is meant to be applied after:
> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6
> tunneling
> 
> The first patch addresses an issue we already resolved in the GREv4 and
> is
> now present in GREv6 with the introduction of FOU/GUE for IPv6 based GRE
> tunnels.
> 
> The second patch goes through and enables IPv6 tunnel offloads for the
> Intel
> NICs that already support the IPv4 based IP-in-IP tunnel offloads.  I
> have
> only done a bit of touch testing but have seen ~20 Gb/s over an i40e
> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is still
> passing traffic at around the same rate.  I plan to do further testing
> but
> with these patches present it should enable a wider audience to be able
> to
> test the new features introduced in Tom's patchset with hardware
> offloads.
> 
> ---
> 
> Alexander Duyck (2):
>   ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled
> with FOU/GUE
>   intel: Add support for IPv6 IP-in-IP offload

Dave, I have this series added to my queue.



Re: [PATCH net] bpf: rather use get_random_int for randomizations

2016-05-18 Thread Alexei Starovoitov
On Wed, May 18, 2016 at 07:17:48AM -0700, Eric Dumazet wrote:
> On Wed, 2016-05-18 at 15:28 +0200, Hannes Frederic Sowa wrote:
> 
> > I don't consider this a big thing, I just mentioned that we probably
> > shouldn't use prandom_u32 if the value somehow could leak to user space
> > and should be used for security.
> 
> Yes, I was mostly trying to understand if you had real security issues
> there or some general concerns ;)

agree with Eric. Frankly I wouldn't do it, but since it's a trivial
patch and if it makes security folks less worried then why not.
Acked-by: Alexei Starovoitov 



Re: [PATCH] net: suppress warnings on dev_alloc_skb

2016-05-18 Thread Neil Horman
On Wed, May 18, 2016 at 01:48:15PM -0700, Alexander Duyck wrote:
> On Wed, May 18, 2016 at 12:29 PM, Neil Horman  wrote:
> > On Wed, May 18, 2016 at 10:57:21AM -0700, Alexander Duyck wrote:
> >> On Wed, May 18, 2016 at 9:01 AM, Eric Dumazet  
> >> wrote:
> >> > On Wed, 2016-05-18 at 11:25 -0400, Neil Horman wrote:
> >> >> Noticed an allocation failure in a network driver the other day on a 32 
> >> >> bit
> >> >> system:
> >> >>
> >> >> DMA-API: debugging out of memory - disabling
> >> >> bnx2fc: adapter_lookup: hba NULL
> >> >> lldpad: page allocation failure. order:0, mode:0x4120
> >> >> Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
> >> >> Call Trace:
> >> >>  [] ? printk+0x19/0x23
> >> >>  [] ? __alloc_pages_nodemask+0x664/0x830
> >> >>  [] ? free_object+0x82/0xa0
> >> >>  [] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
> >> >>  [] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
> >> >>  [] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
> >> >>  [] ? ixgbe_configure+0x589/0xc00 [ixgbe]
> >> >>  [] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
> >> >>  [] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
> >> >>  [] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
> >> >>  [] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
> >> >>  [] ? dcb_doit+0x10ed/0x16d0
> >> >> ...
> >> >
> >> >
> >> > Well, maybe this call site (via ixgbe_configure_rx_ring()) should be
> >> > using GFP_KERNEL instead of GFP_ATOMIC.
> >>
> >> The problem is the ixgbe driver is using the same function for
> >> allocating memory in the softirq and in the init.  As such it has to
> >> default to GFP_ATOMIC so that it doesn't screw up the NAPI allocation
> >> case.
> >>
> >
> > I suppose a happy medium would be to extend dev_alloc_pages to accept a gfp
> > argument, and then have the function itself only set __GFP_NOWARN, if 
> > ATOMIC is
> > also set?
> 
> Why bother?  I really think Eric's patch is the way to go.
> 
> We can already pass gfp flags to __dev_alloc_pages so if we need to
> pass flags we can use that.  Otherwise for the users of dev_alloc_page
> and dev_alloc_pages we can just pass GFP_ATOMIC | __GFP_NOWARN like
> Eric did in his example patch.
> 
> - Alex
> 
Oh, my bad, I misread what he was saying.  Yeah, thats the way to go.

Neil



Re: [PATCH net-next 1/4] scsi_tcp: block BH in TCP callbacks

2016-05-18 Thread Mike Christie
On 05/17/2016 07:44 PM, Eric Dumazet wrote:
> iscsi_sw_tcp_data_ready() and iscsi_sw_tcp_state_change() were
> using read_lock(>sk_callback_lock) which is fine if caller
> disabled BH.
> 
> TCP stack no longer has this requirement and can run from
> process context.
> 
> Use read_lock_bh() variant to restore previous assumption.
> 
> Ideally this code could use RCU instead...
> 
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
> Signed-off-by: Eric Dumazet 
> Cc: Mike Christie 
> Cc: Venkatesh Srinivas 
> ---
>  drivers/scsi/iscsi_tcp.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index 2e4c82f8329c..ace4f1f41b8e 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -131,10 +131,10 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk)
>   struct iscsi_tcp_conn *tcp_conn;
>   read_descriptor_t rd_desc;
>  
> - read_lock(&sk->sk_callback_lock);
> + read_lock_bh(&sk->sk_callback_lock);
>   conn = sk->sk_user_data;
>   if (!conn) {
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>   return;
>   }
>   tcp_conn = conn->dd_data;
> @@ -154,7 +154,7 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk)
>   /* If we had to (atomically) map a highmem page,
>    * unmap it now. */
>   iscsi_tcp_segment_unmap(&tcp_conn->in.segment);
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>  }
>  
>  static void iscsi_sw_tcp_state_change(struct sock *sk)
> @@ -165,10 +165,10 @@ static void iscsi_sw_tcp_state_change(struct sock *sk)
>   struct iscsi_session *session;
>   void (*old_state_change)(struct sock *);
>  
> - read_lock(&sk->sk_callback_lock);
> + read_lock_bh(&sk->sk_callback_lock);
>   conn = sk->sk_user_data;
>   if (!conn) {
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>   return;
>   }
>   session = conn->session;
> @@ -179,7 +179,7 @@ static void iscsi_sw_tcp_state_change(struct sock *sk)
>   tcp_sw_conn = tcp_conn->dd_data;
>   old_state_change = tcp_sw_conn->old_state_change;
>  
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>  
>   old_state_change(sk);
>  }
> 

Reviewed and tested. Thanks

Acked-by: Mike Christie 


Re: [PATCH] net: suppress warnings on dev_alloc_skb

2016-05-18 Thread Alexander Duyck
On Wed, May 18, 2016 at 12:29 PM, Neil Horman  wrote:
> On Wed, May 18, 2016 at 10:57:21AM -0700, Alexander Duyck wrote:
>> On Wed, May 18, 2016 at 9:01 AM, Eric Dumazet  wrote:
>> > On Wed, 2016-05-18 at 11:25 -0400, Neil Horman wrote:
>> >> Noticed an allocation failure in a network driver the other day on a 32 
>> >> bit
>> >> system:
>> >>
>> >> DMA-API: debugging out of memory - disabling
>> >> bnx2fc: adapter_lookup: hba NULL
>> >> lldpad: page allocation failure. order:0, mode:0x4120
>> >> Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
>> >> Call Trace:
>> >>  [] ? printk+0x19/0x23
>> >>  [] ? __alloc_pages_nodemask+0x664/0x830
>> >>  [] ? free_object+0x82/0xa0
>> >>  [] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
>> >>  [] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
>> >>  [] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
>> >>  [] ? ixgbe_configure+0x589/0xc00 [ixgbe]
>> >>  [] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
>> >>  [] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
>> >>  [] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
>> >>  [] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
>> >>  [] ? dcb_doit+0x10ed/0x16d0
>> >> ...
>> >
>> >
>> > Well, maybe this call site (via ixgbe_configure_rx_ring()) should be
>> > using GFP_KERNEL instead of GFP_ATOMIC.
>>
>> The problem is the ixgbe driver is using the same function for
>> allocating memory in the softirq and in the init.  As such it has to
>> default to GFP_ATOMIC so that it doesn't screw up the NAPI allocation
>> case.
>>
>
> I suppose a happy medium would be to extend dev_alloc_pages to accept a gfp
> argument, and then have the function itself only set __GFP_NOWARN, if ATOMIC 
> is
> also set?

Why bother?  I really think Eric's patch is the way to go.

We can already pass gfp flags to __dev_alloc_pages so if we need to
pass flags we can use that.  Otherwise for the users of dev_alloc_page
and dev_alloc_pages we can just pass GFP_ATOMIC | __GFP_NOWARN like
Eric did in his example patch.

- Alex


Re: [RFC][PATCH net] bpf: Use mount_nodev not mount_ns to mount the bpf filesystem

2016-05-18 Thread Hannes Frederic Sowa
On 18.05.2016 22:43, Daniel Borkmann wrote:
> On 05/18/2016 04:56 PM, Eric W. Biederman wrote:
>> Hannes Frederic Sowa  writes:
>>> On 18.05.2016 01:12, Eric W. Biederman wrote:

 While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
 bpf filesystem.  Looking at the code I saw a broken usage of mount_ns
 with current->nsproxy->mnt_ns. As the code does not acquire a reference
 to the mount namespace it can not possibly be correct to store the
 mount
 namespace on the superblock as it does.

 Replace mount_ns with mount_nodev so that each mount of the bpf
 filesystem returns a distinct instance, and the code is not utterly
 broken.

 Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
 Signed-off-by: "Eric W. Biederman" 
 ---

 No one should care about this change, as userspace typically only
 mounts
 things once and does not depend on things in one mount do not
 showing up
 in another.  Can someone who actually uses the bpf filesystem please
 verify this.
> [...]
> 
> LGTM.
> 
> Acked-by: Daniel Borkmann 
> 
>>> The idea is to have the bpf filesystem as a singeleton per mnt-namespace
>>> to prevent endless instances being created and kernel resources being
>>> hogged by pinning them to hard to discover bpf mounts.
> 
> Eric, please send the patch officially and feel free to add my Ack. Given
> the circumstances, moving to mount_nodev() seems the best way forward. To
> also address above mentioned concern from Hannes, we need to remove the
> FS_USERNS_MOUNT flag along with the change. It looks like the fix is best
> addressed in a single patch if you want to include it. If not, we can
> otherwise send it separately as well, I don't mind.

I agree. Would make most sense to make the change in one patch. Later on
we can reason about if it makes sense to use the net namespace to split
bpf maps and programs or maybe even introduce a new primitive for that.

Thanks,
Hannes



Re: [RFC][PATCH net] bpf: Use mount_nodev not mount_ns to mount the bpf filesystem

2016-05-18 Thread Daniel Borkmann

On 05/18/2016 04:56 PM, Eric W. Biederman wrote:

Hannes Frederic Sowa  writes:

On 18.05.2016 01:12, Eric W. Biederman wrote:


While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
bpf filesystem.  Looking at the code I saw a broken usage of mount_ns
with current->nsproxy->mnt_ns. As the code does not acquire a reference
to the mount namespace it can not possibly be correct to store the mount
namespace on the superblock as it does.

Replace mount_ns with mount_nodev so that each mount of the bpf
filesystem returns a distinct instance, and the code is not utterly
broken.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: "Eric W. Biederman" 
---

No one should care about this change, as userspace typically only mounts
things once and does not depend on things in one mount do not showing up
in another.  Can someone who actually uses the bpf filesystem please
verify this.

[...]

LGTM.

Acked-by: Daniel Borkmann 


The idea is to have the bpf filesystem as a singleton per mnt-namespace
to prevent endless instances being created and kernel resources being
hogged by pinning them to hard to discover bpf mounts.


Eric, please send the patch officially and feel free to add my Ack. Given
the circumstances, moving to mount_nodev() seems the best way forward. To
also address above mentioned concern from Hannes, we need to remove the
FS_USERNS_MOUNT flag along with the change. It looks like the fix is best
addressed in a single patch if you want to include it. If not, we can
otherwise send it separately as well, I don't mind.

Thanks for your feedback!

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 8f94ca1..b2aefa2 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -378,7 +378,7 @@ static int bpf_fill_super(struct super_block *sb, void 
*data, int silent)
 static struct dentry *bpf_mount(struct file_system_type *type, int flags,
const char *dev_name, void *data)
 {
-   return mount_ns(type, flags, current->nsproxy->mnt_ns, bpf_fill_super);
+   return mount_nodev(type, flags, data, bpf_fill_super);
 }

 static struct file_system_type bpf_fs_type = {
@@ -386,7 +386,6 @@ static struct file_system_type bpf_fs_type = {
.name   = "bpf",
.mount  = bpf_mount,
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT,
 };

 MODULE_ALIAS_FS("bpf");
--
1.9.3




Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-18 Thread Lino Sanfilippo
On 18.05.2016 02:01, Francois Romieu wrote:

> The smp_wmb() and wmb() could be made side-by-side once *info is
> updated but I don't see the adequate idiom to improve the smp_wmb + wmb
> combo. :o/
> 
>> And the wmb() looks like it should be a dma_wmb().
> 
> I see two points against it:
> - it could be too late for skb_tx_timestamp().
> - arc_emac_tx_clean must not see an index update before the device
>   got a chance to acquire the descriptor. arc_emac_tx_clean can't
>   tell the difference between an about-to-be-released descriptor
>   and a returned-from-device one.
> 

Hi,

what about the (only compile tested) code below?

The smp_wmb() in tx function combined with the smp_rmb() in tx_clean ensures
that the CPU running tx_clean sees consistent values for info, data and skb 
(thus no need to check for validity of all three values any more).

The mb() fulfills several tasks:
1. makes sure that DMA writes to descriptor are completed before the HW is
informed.
2. On multi processor systems: ensures that txbd_curr is updated (this is paired
with the smp_mb() at the end of tx_clean).
3. Ensure we see the most recent value for tx_dirty. With this we do not have to
recheck after we stopped the tx queue.


--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -162,8 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
struct sk_buff *skb = tx_buff->skb;
unsigned int info = le32_to_cpu(txbd->info);
 
-   if ((info & FOR_EMAC) || !txbd->data || !skb)
+   if (info & FOR_EMAC) {
+   /* Make sure we see consistent values for info, skb
+* and data.
+*/
+   smp_rmb();
break;
+   }
 
if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
stats->tx_errors++;
@@ -679,36 +684,33 @@ static int arc_emac_tx(struct sk_buff *skb, struct 
net_device *ndev)
dma_unmap_addr_set(&priv->tx_buff[*txbd_curr], addr, addr);
dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
-   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
 
-   /* Make sure pointer to data buffer is set */
-   wmb();
+   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
+   priv->tx_buff[*txbd_curr].skb = skb;
 
-   skb_tx_timestamp(skb);
+   /* Make sure info is set after data and skb with respect to
+* other tx_clean().
+*/
+   smp_wmb();
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-   /* Make sure info word is set */
-   wmb();
-
-   priv->tx_buff[*txbd_curr].skb = skb;
-
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-   /* Ensure that tx_clean() sees the new txbd_curr before
+   /* 1.Ensure that tx_clean() sees the new txbd_curr before
 * checking the queue status. This prevents an unneeded wake
 * of the queue in tx_clean().
+* 2.Ensure that all values are written to RAM and to DMA
+* before hardware is informed.
+* 3.Ensure we see the most recent value for tx_dirty.
 */
-   smp_mb();
+   mb();
 
-   if (!arc_emac_tx_avail(priv)) {
+   if (!arc_emac_tx_avail(priv))
netif_stop_queue(ndev);
-   /* Refresh tx_dirty */
-   smp_mb();
-   if (arc_emac_tx_avail(priv))
-   netif_start_queue(ndev);
-   }
+
+   skb_tx_timestamp(skb);
 
arc_reg_set(priv, R_STATUS, TXPL_MASK);
 
-- 
2.7.0

Regards,
Lino


Re: [PATCH] Revert "phy: add support for a reset-gpio specification"

2016-05-18 Thread Guenter Roeck
On Wed, May 18, 2016 at 09:52:22AM -0700, Florian Fainelli wrote:
> On 05/18/2016 09:05 AM, Fabio Estevam wrote:
> > Commit da47b4572056 ("phy: add support for a reset-gpio specification")
> > causes the following xtensa qemu crash according to Guenter Roeck:
> > 
> > [9.366256] libphy: ethoc-mdio: probed
> > [9.367389]  (null): could not attach to PHY
> > [9.368555]  (null): failed to probe MDIO bus
> > [9.371540] Unable to handle kernel paging request at virtual address 
> > 001c
> > [9.371540]  pc = d0320926, ra = 903209d1
> > [9.375358] Oops: sig: 11 [#1]
> > 
> > This reverts commit da47b4572056487fd7941c26f73b3e8815ff712a.
> > 
> > Reported-by: Guenter Roeck 
> > Signed-off-by: Fabio Estevam 
> 
> Acked-by: Florian Fainelli 
> 
Tested-by: Guenter Roeck 


Re: [PATCH] net: suppress warnings on dev_alloc_skb

2016-05-18 Thread Neil Horman
On Wed, May 18, 2016 at 10:57:21AM -0700, Alexander Duyck wrote:
> On Wed, May 18, 2016 at 9:01 AM, Eric Dumazet  wrote:
> > On Wed, 2016-05-18 at 11:25 -0400, Neil Horman wrote:
> >> Noticed an allocation failure in a network driver the other day on a 32 bit
> >> system:
> >>
> >> DMA-API: debugging out of memory - disabling
> >> bnx2fc: adapter_lookup: hba NULL
> >> lldpad: page allocation failure. order:0, mode:0x4120
> >> Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
> >> Call Trace:
> >>  [] ? printk+0x19/0x23
> >>  [] ? __alloc_pages_nodemask+0x664/0x830
> >>  [] ? free_object+0x82/0xa0
> >>  [] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
> >>  [] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
> >>  [] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
> >>  [] ? ixgbe_configure+0x589/0xc00 [ixgbe]
> >>  [] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
> >>  [] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
> >>  [] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
> >>  [] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
> >>  [] ? dcb_doit+0x10ed/0x16d0
> >> ...
> >
> >
> > Well, maybe this call site (via ixgbe_configure_rx_ring()) should be
> > using GFP_KERNEL instead of GFP_ATOMIC.
> 
> The problem is the ixgbe driver is using the same function for
> allocating memory in the softirq and in the init.  As such it has to
> default to GFP_ATOMIC so that it doesn't screw up the NAPI allocation
> case.
> 

I suppose a happy medium would be to extend dev_alloc_pages to accept a gfp
argument, and then have the function itself set __GFP_NOWARN only when
GFP_ATOMIC is also set?

Neil



Re: [GIT] Networking

2016-05-18 Thread Kalle Valo
Linus Torvalds  writes:

> On Wed, May 18, 2016 at 11:58 AM, Kalle Valo  wrote:
>>
>> It would be best if you could send a patch either directly to Dave or
>> Linus to resolve this quickly.
>
> I'm committing my patch myself right now, since this bug makes my
> laptop useless, and I will take credit for finding and testing it on
> my own

Kiitti :)

> even if it was apparently also discussed independently on the
> networking list ;)

Yeah, sorry about taking this too long.

-- 
Kalle Valo


Re: [GIT] Networking

2016-05-18 Thread Coelho, Luciano
On Wed, 2016-05-18 at 12:00 -0700, Linus Torvalds wrote:
> On Wed, May 18, 2016 at 11:58 AM, Kalle Valo 
> wrote:
> > 
> > 
> > It would be best if you could send a patch either directly to Dave
> > or
> > Linus to resolve this quickly.
> I'm committing my patch myself right now, since this bug makes my
> laptop useless, and I will take credit for finding and testing it on
> my own even if it was apparently also discussed independently on the
> networking list ;)

Great! :)

You beat me by a few minutes, even though I had the whole day to play
with it. :\

--
Cheers,
Luca.

[ANNOUNCE] iproute2 4.6

2016-05-18 Thread Stephen Hemminger
Update to the iproute2 utility to support new features in Linux 4.6.
Major items are improvements to bridge mdb management and bpf.
Also adds support for the new devlink infrastructure.

Source:
  http://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.6.0.tar.gz

Repository:
  git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---
Daniel Borkmann (5):
  vxlan: add support to set flow label
  geneve: add support to set flow label
  tc, bpf: add new csum and tunnel signatures
  tc, bpf: further improve error reporting
  tc, bpf: add support for map pre/allocation

David Ahern (2):
  vrf: Add support for slave_info
  ip link: Add support for kernel side filtering

Edward Cree (1):
  geneve: fix IPv6 remote address reporting

Elad Raz (1):
  bridge: mdb: add support for offloaded mdb entries

Eric Dumazet (2):
  iplink: display number of rx/tx queues
  ss: take care of unknown min_rtt

Gustavo Zacarias (1):
  iproute2: tc_bpf.c: fix building with musl libc

Jamal Hadi Salim (3):
  tc: introduce IFE action
  tc: don't ignore ok as an action branch
  tc simple action update and breakage

Jeff Harris (1):
  ip: neigh: Fix leftover attributes message during flush

Jesse Gross (2):
  vxlan: Follow kernel defaults for outer UDP checksum.
  geneve: Add support for configuring UDP checksums.

Jiri Benc (2):
  ip link gre: create interfaces in external mode correctly
  ip link gre: print only relevant info in external mode

Jiri Pirko (13):
  include: add linked list implementation from kernel
  add devlink tool
  devlink: fix "devlink port" help message
  list: add list_for_each_entry_reverse macro
  list: add list_add_tail helper
  devlink: introduce pr_out_port_handle helper
  devlink: introduce helper to print out nice names (ifnames)
  devlink: split dl_argv_parse_put to parse and put parts
  devlink: introduce dump filtering function
  devlink: allow to parse both devlink and port handle in the same time
  devlink: add manpage for shared buffer

Luca Lemmo (3):
  tc: f_u32: add missing spaces around operators
  tc: f_u32: trivial coding style cleanups
  tc: q_{codel,fq_codel}: add missing space in help text

Marco Varlese (1):
  fix get_addr() and get_prefix() error messages

Nicolas Dichtel (1):
  iplink: display IFLA_PHYS_PORT_NAME

Nikolay Aleksandrov (5):
  bridge: mdb: add user-space support for extended attributes
  bridge: mdb: add support for extended router port information
  bridge: fdb: add support to filter by vlan id
  bridge: mdb: add support to filter by vlan id
  bridge: vlan: add support to filter by vlan id

Phil Sutter (22):
  tc/p_ip.c: Minor coding style cleanup
  tc: pedit: Fix for big-endian systems
  tc: pedit: Fix raw op
  testsuite: add a test for tc pedit action
  doc/tc-filters.tex: Drop overly subjective paragraphs
  tc: connmark, pedit: Rename BRANCH to CONTROL
  man: tc-csum.8: Add an example
  man: tc-mirred.8: Reword man page a bit, add generic mirror example
  man: tc-police.8: Emphasize on the two rate control mechanisms
  man: tc-skbedit.8: Elaborate a bit on TX queues
  tc/m_vlan.c: mention CONTROL option in help text
  man: tc-vlan.8: Describe CONTROL option
  color: introduce color helpers and COLOR_CLEAR
  ipaddress: colorize peer, broadcast and anycast addresses as well
  make format_host non-reentrant by default
  utils: make rt_addr_n2a() non-reentrant by default
  lib/utils: introduce format_host_rta()
  lib/utils: introduce rt_addr_n2a_rta()
  lib/ll_addr: improve ll_addr_n2a() a bit
  ip-link: Support printing VF trust setting
  ss: Drop silly assignment
  ss: Fix accidental state filter override

Quentin Monnet (1):
  tc: add bash-completion function

Stephen Hemminger (22):
  Update header files from net-next
  iplink: display rx nohandler stats
  ss: display not_sent and min_rtt info
  tc: code cleanup
  ip: code cleanup
  bridge: code cleanup
  misc: fix style issues
  update kernel headers to 4.6 (pre rc1)
  netconf: replace macro with a function
  scrub out whitespace issues
  devlink: ignore build result
  ip: only display phys attributes with details option
  ip: whitespace cleanup
  devlink: remove unused code
  devlink: remove more unused code
  v4.6.0

Zhang Shengju (2):
  man: update netconf manual for new attributes
  netconf: add support for ignore route attribute

subas...@codeaurora.org (1):
  ss: Remove unused argument from kill_inet_sock



Re: [GIT] Networking

2016-05-18 Thread Linus Torvalds
On Wed, May 18, 2016 at 11:58 AM, Kalle Valo  wrote:
>
> It would be best if you could send a patch either directly to Dave or
> Linus to resolve this quickly.

I'm committing my patch myself right now, since this bug makes my
laptop useless, and I will take credit for finding and testing it on
my own even if it was apparently also discussed independently on the
networking list ;)

Linus


[PATCH] iwlwifi: mvm: fix merge damage in tx.c

2016-05-18 Thread Luca Coelho
From: Luca Coelho 

During the merge in commit 909b27f70643 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"), there was a
small merge damage where one instance of info was not converted into
skb_info.  Fix this.

Signed-off-by: Luca Coelho 
---
 drivers/net/wireless/intel/iwlwifi/mvm/tx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c 
b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index 8802109..c53aa0f 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -211,6 +211,7 @@ void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct sk_buff 
*skb,
struct iwl_tx_cmd *tx_cmd,
struct ieee80211_tx_info *info, u8 sta_id)
 {
+   struct ieee80211_tx_info *skb_info = IEEE80211_SKB_CB(skb);
struct ieee80211_hdr *hdr = (void *)skb->data;
__le16 fc = hdr->frame_control;
u32 tx_flags = le32_to_cpu(tx_cmd->tx_flags);
@@ -294,7 +295,7 @@ void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct sk_buff 
*skb,
tx_cmd->tx_flags = cpu_to_le32(tx_flags);
/* Total # bytes to be transmitted */
tx_cmd->len = cpu_to_le16((u16)skb->len +
-   (uintptr_t)info->driver_data[0]);
+   (uintptr_t)skb_info->driver_data[0]);
tx_cmd->life_time = cpu_to_le32(TX_CMD_LIFE_TIME_INFINITE);
tx_cmd->sta_id = sta_id;
 
-- 
2.8.1



Re: [GIT] Networking

2016-05-18 Thread Kalle Valo
"Coelho, Luciano"  writes:

> Kalle, David, what is the status with the fix that is on the way via
> your trees?

It would be best if you could send a patch either directly to Dave or
Linus to resolve this quickly.

-- 
Kalle Valo


Re: [GIT] Networking

2016-05-18 Thread Linus Torvalds
On Wed, May 18, 2016 at 11:45 AM, Linus Torvalds
 wrote:
>
> From what I can tell, there's a merge bug in commit 909b27f70643,
> where David seems to have lost some of the changes to
> iwl_mvm_set_tx_cmd().
>
> I do not know if that's the reason for the problem I see. But I will test.

Yes. The attached patch that fixes the incorrect merge seems to fix
things for me.

That should mean that the assumption that this problem existed in v4.6
too was wrong, because the incorrect merge came in later. I think
Luciano misunderstood "v4.6+" to mean plain v4.6.

Reinoud Koornstra, does this patch fix things for you too?

   Linus
 drivers/net/wireless/intel/iwlwifi/mvm/tx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c 
b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index 880210917a6f..c53aa0f220e0 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -211,6 +211,7 @@ void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct sk_buff 
*skb,
struct iwl_tx_cmd *tx_cmd,
struct ieee80211_tx_info *info, u8 sta_id)
 {
+   struct ieee80211_tx_info *skb_info = IEEE80211_SKB_CB(skb);
struct ieee80211_hdr *hdr = (void *)skb->data;
__le16 fc = hdr->frame_control;
u32 tx_flags = le32_to_cpu(tx_cmd->tx_flags);
@@ -294,7 +295,7 @@ void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct sk_buff 
*skb,
tx_cmd->tx_flags = cpu_to_le32(tx_flags);
/* Total # bytes to be transmitted */
tx_cmd->len = cpu_to_le16((u16)skb->len +
-   (uintptr_t)info->driver_data[0]);
+   (uintptr_t)skb_info->driver_data[0]);
tx_cmd->life_time = cpu_to_le32(TX_CMD_LIFE_TIME_INFINITE);
tx_cmd->sta_id = sta_id;
 


Re: [GIT] Networking

2016-05-18 Thread Coelho, Luciano
On Wed, 2016-05-18 at 11:45 -0700, Linus Torvalds wrote:
> On Wed, May 18, 2016 at 7:23 AM, Coelho, Luciano
>  wrote:
> > 
> > 
> > I can confirm that 4.6 contains the same bug.  And reverting the
> > patch
> > I mentioned does solve the problem...
> > 
> > The same patch works fine in our internal tree.  I'll have to
> > figure
> > out together with Emmanuel what the problem actually is.
> Hmm.
> 
> From what I can tell, there's a merge bug in commit 909b27f70643,
> where David seems to have lost some of the changes to
> iwl_mvm_set_tx_cmd().
> 
> The reason seems to be a conflict with d8fe484470dd, where David
> missed the fact that "info->driver_data[0]" had become
> "skb_info->driver_data[0]", and then he removed the skb_info because
> it was unused.
> 
> I do not know if that's the reason for the problem I see. But I will
> test.
> 
> David, do you happen to recall that merge conflict? I think you must
> have removed that "skb_info" variable declaration and initialization
> manually (due to the "unused variable" warning, which in turn was due
> to the incorrect merge of the actual conflict), because I think git
> would have merged that line into the result.

Actually I just tested it and indeed it seems to be the merge damage
(which we discussed extensively in the linux-wireless mailing list)
that causes this problem.  The "4.6 doesn't work either" thing was a
false alarm.

If the merge damage is fixed this way, the problem is gone:

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index b5f7c36..ae2ecf6 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@ -211,6 +211,7 @@ void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct
sk_buff *skb,
struct iwl_tx_cmd *tx_cmd,
struct ieee80211_tx_info *info, u8 sta_id)
 {
+   struct ieee80211_tx_info *skb_info = IEEE80211_SKB_CB(skb);
struct ieee80211_hdr *hdr = (void *)skb->data;
__le16 fc = hdr->frame_control;
u32 tx_flags = le32_to_cpu(tx_cmd->tx_flags);
@@ -294,7 +295,7 @@ void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct
sk_buff *skb,
tx_cmd->tx_flags = cpu_to_le32(tx_flags);
/* Total # bytes to be transmitted */
tx_cmd->len = cpu_to_le16((u16)skb->len +
-   (uintptr_t)info->driver_data[0]);
+   (uintptr_t)skb_info->driver_data[0]);
tx_cmd->life_time = cpu_to_le32(TX_CMD_LIFE_TIME_INFINITE);
tx_cmd->sta_id = sta_id;


Kalle, David, what is the status with the fix that is on the way via
your trees?

--
Cheers,
Luca.

Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-18 Thread Stephen Hemminger
On Tue, 17 May 2016 12:35:53 -0600
subas...@codeaurora.org wrote:

> On 2016-05-16 20:29, Lorenzo Colitti wrote:
> > On Tue, May 17, 2016 at 11:24 AM, David Ahern  
> > wrote:
> >> As I mentioned we can print the unsupported once or per socket matched 
> >> and
> >> with the socket params. e.g.,
> >> 
> >> +   } else if (errno == EOPNOTSUPP) {
> >> +   printf("Operation not supported for:\n");
> >> +   inet_show_sock(h, diag_arg->f, 
> >> diag_arg->protocol);
> >> 
> >> Actively suppressing all error messages is just wrong. I get the 
> >> flooding
> >> issue so I'm fine with just printing it once.
> > 
> > I disagree, but then I'm the one who wrote it in the first place, so
> > you wouldn't expect me to agree. :-) Let's see what Stephen says.
> 
> Hi Lorenzo
> 
> Would it be acceptable to have a separate column which displays the
> result of the sock destroy operation per socket?
> State... Killed
> ESTAB Y
> TIME_WAIT N
> 
> If it is not supported from kernel, maybe print U (unsupported) for 
> this.

When you guys come to a conclusion, then I will review the result.
Right now neither solution looks good.



Re: [GIT] Networking

2016-05-18 Thread Linus Torvalds
On Wed, May 18, 2016 at 7:23 AM, Coelho, Luciano
 wrote:
>
> I can confirm that 4.6 contains the same bug.  And reverting the patch
> I mentioned does solve the problem...
>
> The same patch works fine in our internal tree.  I'll have to figure
> out together with Emmanuel what the problem actually is.

Hmm.

From what I can tell, there's a merge bug in commit 909b27f70643,
where David seems to have lost some of the changes to
iwl_mvm_set_tx_cmd().

The reason seems to be a conflict with d8fe484470dd, where David
missed the fact that "info->driver_data[0]" had become
"skb_info->driver_data[0]", and then he removed the skb_info because
it was unused.

I do not know if that's the reason for the problem I see. But I will test.

David, do you happen to recall that merge conflict? I think you must
have removed that "skb_info" variable declaration and initialization
manually (due to the "unused variable" warning, which in turn was due
to the incorrect merge of the actual conflict), because I think git
would have merged that line into the result.

  Linus


[PATCH 1/1] net: pegasus: simplify logical constraint

2016-05-18 Thread Heinrich Schuchardt
If !count is true, count < 4 is also true.

Signed-off-by: Heinrich Schuchardt 
---
 drivers/net/usb/pegasus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
index 36cd7f0..9bbe0161 100644
--- a/drivers/net/usb/pegasus.c
+++ b/drivers/net/usb/pegasus.c
@@ -473,7 +473,7 @@ static void read_bulk_callback(struct urb *urb)
goto goon;
}
 
-   if (!count || count < 4)
+   if (count < 4)
goto goon;
 
rx_status = buf[count - 2];
-- 
2.1.4



Re: [PATCH net 2/2] RDS: TCP: Avoid rds connection churn from rogue SYNs

2016-05-18 Thread Santosh Shilimkar

On 5/18/2016 10:06 AM, Sowmini Varadhan wrote:

When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.

Signed-off-by: Sowmini Varadhan 
---

Acked-by: Santosh Shilimkar 


Re: [PATCH net 1/2] RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp

2016-05-18 Thread Santosh Shilimkar

On 5/18/2016 10:06 AM, Sowmini Varadhan wrote:

There are two instances where we want to terminate RDS-TCP: when
exiting the netns or during module unload. In either case, the
termination sequence is to stop the listen socket, mark the
rtn->rds_tcp_listen_sock as null, and flush any accept workqs.
Thus any workqs that get flushed at this point will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.

Signed-off-by: Sowmini Varadhan 
---
 net/rds/tcp_listen.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index be263cd..e10b422 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -80,6 +80,9 @@ int rds_tcp_accept_one(struct socket *sock)
int conn_state;
struct sock *nsk;

+   if (!sock) /* module unload or netns delete in progress */
+   return -ENETUNREACH;
+

New error code type in RDS but seems appropriate :-)
Patch looks good to me.
Acked-by: Santosh Shilimkar 


Re: Make TCP work better with re-ordered frames?

2016-05-18 Thread Yuchung Cheng
On Wed, May 18, 2016 at 9:02 AM, Eric Dumazet  wrote:
>
> On Wed, 2016-05-18 at 08:46 -0700, Ben Greear wrote:
>
> > I will work on captures...do you care if it is from transmitter or 
> > receiver's perspective?
>
> Receiver would probably be more useful.
Can I get sender side traces too? I am working on improving TCP
reordering at the sender side. Thanks.
>
>


Re: [PATCH] net: suppress warnings on dev_alloc_skb

2016-05-18 Thread Alexander Duyck
On Wed, May 18, 2016 at 9:01 AM, Eric Dumazet  wrote:
> On Wed, 2016-05-18 at 11:25 -0400, Neil Horman wrote:
>> Noticed an allocation failure in a network driver the other day on a 32 bit
>> system:
>>
>> DMA-API: debugging out of memory - disabling
>> bnx2fc: adapter_lookup: hba NULL
>> lldpad: page allocation failure. order:0, mode:0x4120
>> Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
>> Call Trace:
>>  [] ? printk+0x19/0x23
>>  [] ? __alloc_pages_nodemask+0x664/0x830
>>  [] ? free_object+0x82/0xa0
>>  [] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
>>  [] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
>>  [] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
>>  [] ? ixgbe_configure+0x589/0xc00 [ixgbe]
>>  [] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
>>  [] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
>>  [] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
>>  [] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
>>  [] ? dcb_doit+0x10ed/0x16d0
>> ...
>
>
> Well, maybe this call site (via ixgbe_configure_rx_ring()) should be
> using GFP_KERNEL instead of GFP_ATOMIC.

The problem is the ixgbe driver is using the same function for
allocating memory in the softirq and in the init.  As such it has to
default to GFP_ATOMIC so that it doesn't screw up the NAPI allocation
case.

> Otherwise, if you are unlucky, not a single page would be allocated and
> RX ring buffer would be empty.
>
> So the 'fix' could be limited to GFP_ATOMIC callers ?
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 
> c413c588a24f854be9e4df78d8a6872b6b1ff9f3..61b923f1520845145a5470d752b278d283cbb348
>  100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2467,7 +2467,7 @@ static inline struct page *__dev_alloc_pages(gfp_t 
> gfp_mask,
>
>  static inline struct page *dev_alloc_pages(unsigned int order)
>  {
> -   return __dev_alloc_pages(GFP_ATOMIC, order);
> +   return __dev_alloc_pages(GFP_ATOMIC | __GFP_NOWARN, order);
>  }
>
>  /**
> @@ -2485,7 +2485,7 @@ static inline struct page *__dev_alloc_page(gfp_t 
> gfp_mask)
>
>  static inline struct page *dev_alloc_page(void)
>  {
> -   return __dev_alloc_page(GFP_ATOMIC);
> +   return dev_alloc_pages(0);
>  }
>
>  /**
>
>

I agree.  This is likely a much better way to go.

- Alex


Re: [PATCH net-next 1/4] scsi_tcp: block BH in TCP callbacks

2016-05-18 Thread Mike Christie
On 05/17/2016 07:44 PM, Eric Dumazet wrote:
> iscsi_sw_tcp_data_ready() and iscsi_sw_tcp_state_change() were
> using read_lock(>sk_callback_lock) which is fine if caller
> disabled BH.
> 
> TCP stack no longer has this requirement and can run from
> process context.
> 
> Use read_lock_bh() variant to restore previous assumption.
> 
> Ideally this code could use RCU instead...
> 
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
> Signed-off-by: Eric Dumazet 
> Cc: Mike Christie 
> Cc: Venkatesh Srinivas 
> ---
>  drivers/scsi/iscsi_tcp.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index 2e4c82f8329c..ace4f1f41b8e 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -131,10 +131,10 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk)
>   struct iscsi_tcp_conn *tcp_conn;
>   read_descriptor_t rd_desc;
>  
> - read_lock(&sk->sk_callback_lock);
> + read_lock_bh(&sk->sk_callback_lock);
>   conn = sk->sk_user_data;
>   if (!conn) {
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>   return;
>   }
>   tcp_conn = conn->dd_data;
> @@ -154,7 +154,7 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk)
>   /* If we had to (atomically) map a highmem page,
>    * unmap it now. */
>   iscsi_tcp_segment_unmap(&tcp_conn->in.segment);
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>  }
>  
>  static void iscsi_sw_tcp_state_change(struct sock *sk)
> @@ -165,10 +165,10 @@ static void iscsi_sw_tcp_state_change(struct sock *sk)
>   struct iscsi_session *session;
>   void (*old_state_change)(struct sock *);
>  
> - read_lock(&sk->sk_callback_lock);
> + read_lock_bh(&sk->sk_callback_lock);
>   conn = sk->sk_user_data;
>   if (!conn) {
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>   return;
>   }
>   session = conn->session;
> @@ -179,7 +179,7 @@ static void iscsi_sw_tcp_state_change(struct sock *sk)
>   tcp_sw_conn = tcp_conn->dd_data;
>   old_state_change = tcp_sw_conn->old_state_change;
>  
> - read_unlock(&sk->sk_callback_lock);
> + read_unlock_bh(&sk->sk_callback_lock);
>  
>   old_state_change(sk);
>  }

Can I just confirm that nested bh lock calls like:

spin_lock_bh(lock1);
spin_lock_bh(lock2);

do something

spin_unlock_bh(lock2);
spin_unlock_bh(lock1);

is ok? It seems smatch sometimes warns about this.

I found this thread

https://lkml.org/lkml/2011/1/25/232

which says it is ok.


[net-next PATCH 1/2] ip6_gre: Do not allow segmentation offloads when GRE_CSUM is enabled with FOU/GUE

2016-05-18 Thread Alexander Duyck
This patch addresses the same issue we had for IPv4 where enabling GRE with
an inner checksum cannot be supported with FOU/GUE due to the fact that
they will jump past the GRE header as it is treated like a tunnel header.

Signed-off-by: Alexander Duyck 
---
 net/ipv6/ip6_gre.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 6fb1b89d0178..af503f518278 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1355,11 +1355,15 @@ static int ip6gre_newlink(struct net *src_net, struct 
net_device *dev,
dev->hw_features|= GRE6_FEATURES;
 
if (!(nt->parms.o_flags & TUNNEL_SEQ)) {
-   /* TCP segmentation offload is not supported when we
-* generate output sequences.
+   /* TCP offload with GRE SEQ is not supported, nor
+* can we support 2 levels of outer headers requiring
+* an update.
 */
-   dev->features|= NETIF_F_GSO_SOFTWARE;
-   dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+   if (!(nt->parms.o_flags & TUNNEL_CSUM) ||
+   (nt->encap.type == TUNNEL_ENCAP_NONE)) {
+   dev->features|= NETIF_F_GSO_SOFTWARE;
+   dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+   }
 
/* Can use a lockless transmit, unless we generate
 * output sequences



[net-next PATCH 2/2] intel: Add support for IPv6 IP-in-IP offload

2016-05-18 Thread Alexander Duyck
This patch adds support for offloading IPXIP6 type packets that represent
either IPv4 or IPv6 encapsulated inside of an IPv6 outer IP header.  In
addition with this change we should also be able to support FOU
encapsulated traffic with outer IPv6 headers.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c   |1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |1 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |1 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |1 +
 drivers/net/ethernet/intel/igb/igb_main.c |1 +
 drivers/net/ethernet/intel/igbvf/netdev.c |1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |1 +
 8 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 14e9c99e9d36..c059d09f3257 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9113,6 +9113,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
   NETIF_F_GSO_GRE  |
   NETIF_F_GSO_GRE_CSUM |
   NETIF_F_GSO_IPXIP4   |
+  NETIF_F_GSO_IPXIP6   |
   NETIF_F_GSO_UDP_TUNNEL   |
   NETIF_F_GSO_UDP_TUNNEL_CSUM  |
   NETIF_F_GSO_PARTIAL  |
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 0a8122c00ae2..55f151fca1dc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2285,6 +2285,7 @@ static int i40e_tso(struct sk_buff *skb, u8 *hdr_len, u64 
*cd_type_cmd_tso_mss)
if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
 SKB_GSO_GRE_CSUM |
 SKB_GSO_IPXIP4 |
+SKB_GSO_IPXIP6 |
 SKB_GSO_UDP_TUNNEL |
 SKB_GSO_UDP_TUNNEL_CSUM)) {
if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 2d0f9f15..be99189da925 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1560,6 +1560,7 @@ static int i40e_tso(struct sk_buff *skb, u8 *hdr_len, u64 
*cd_type_cmd_tso_mss)
if (skb_shinfo(skb)->gso_type & (SKB_GSO_GRE |
 SKB_GSO_GRE_CSUM |
 SKB_GSO_IPXIP4 |
+SKB_GSO_IPXIP6 |
 SKB_GSO_UDP_TUNNEL |
 SKB_GSO_UDP_TUNNEL_CSUM)) {
if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 520c5b3b6cb4..eac057b88055 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2231,6 +2231,7 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
   NETIF_F_GSO_GRE  |
   NETIF_F_GSO_GRE_CSUM |
   NETIF_F_GSO_IPXIP4   |
+  NETIF_F_GSO_IPXIP6   |
   NETIF_F_GSO_UDP_TUNNEL   |
   NETIF_F_GSO_UDP_TUNNEL_CSUM  |
   NETIF_F_GSO_PARTIAL  |
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b1a5cdb77088..ef3d642f5ff2 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2419,6 +2419,7 @@ static int igb_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 #define IGB_GSO_PARTIAL_FEATURES (NETIF_F_GSO_GRE | \
  NETIF_F_GSO_GRE_CSUM | \
  NETIF_F_GSO_IPXIP4 | \
+ NETIF_F_GSO_IPXIP6 | \
  NETIF_F_GSO_UDP_TUNNEL | \
  NETIF_F_GSO_UDP_TUNNEL_CSUM)
 
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index 79b907f1a520..b0778ba65083 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -2764,6 +2764,7 @@ static int igbvf_probe(struct pci_dev *pdev, const struct 
pci_device_id 

[net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches

2016-05-18 Thread Alexander Duyck
This patch series is meant to be applied after:
[PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling

The first patch addresses an issue we already resolved in the GREv4 and is
now present in GREv6 with the introduction of FOU/GUE for IPv6 based GRE
tunnels.

The second patch goes through and enables IPv6 tunnel offloads for the Intel
NICs that already support the IPv4 based IP-in-IP tunnel offloads.  I have
only done a bit of touch testing but have seen ~20 Gb/s over an i40e
interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is still
passing traffic at around the same rate.  I plan to do further testing but
with these patches present it should enable a wider audience to be able to
test the new features introduced in Tom's patchset with hardware offloads.

---

Alexander Duyck (2):
  ip6_gre: Do not allow segmentation offloads when GRE_CSUM is enabled with
FOU/GUE
  intel: Add support for IPv6 IP-in-IP offload


 drivers/net/ethernet/intel/i40e/i40e_main.c   |1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |1 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |1 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |1 +
 drivers/net/ethernet/intel/igb/igb_main.c |1 +
 drivers/net/ethernet/intel/igbvf/netdev.c |1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |1 +
 net/ipv6/ip6_gre.c|   12 
 9 files changed, 16 insertions(+), 4 deletions(-)

--


Re: [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling

2016-05-18 Thread Tom Herbert
On Wed, May 18, 2016 at 10:32 AM, Alexander Duyck
 wrote:
> On Wed, May 18, 2016 at 9:06 AM, Tom Herbert  wrote:
>> This patch set:
>>   - Fixes GRE6 to process translate flags correctly from configuration
>>   - Adds support for GSO and GRO for ip6ip6 and ip4ip6
>>   - Add support for FOU and GUE in IPv6
>>   - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE
>>   - Fixes ip6_input to deal with UDP encapsulations
>>   - Some other minor fixes
>>
>> v2:

Re: [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling

2016-05-18 Thread Alexander Duyck
On Wed, May 18, 2016 at 9:06 AM, Tom Herbert  wrote:
> This patch set:
>   - Fixes GRE6 to process translate flags correctly from configuration
>   - Adds support for GSO and GRO for ip6ip6 and ip4ip6
>   - Add support for FOU and GUE in IPv6
>   - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE
>   - Fixes ip6_input to deal with UDP encapsulations
>   - Some other minor fixes
>
> v2:
>   - Removed a check of GSO types in MPLS
>   - Define GSO type SKB_GSO_IPXIP6 and SKB_GSO_IPXIP4 (based on input
> from Alexander)
>   - Don't define GSO types specifically for IP6IP6 and IP4IP6, above
> fix makes that unnecessary
>   - Don't bother clearing encapsulation flag in UDP tunnel segment
> (another item suggested by Alexander).
>
> v3:
>   - Address some minor comments from Alexander
>
> v4:
>   - Rebase on changes to fix IP TX tunnels
>   - Fix MTU issues in ip4ip6, ip6ip6
>   - Add test data for above
>
> v5:
>   - Address feedback from Shmulik Ladkani regarding extension header
> code that does not return the next header but instead relies
> on returning value via nhoff. Solution here is to fix EH
> processing to return nexthdr value.
>   - Refactored IPv4 encaps so that we won't need to create
> a ip6_tunnel_core.c when adding encap support IPv6.
>
> v6:
>   - Fix build issues with regard to new GSO constants
>   - Fix MTU calculation issues in ip6_tunnel.c pointed out by Alex
>   - Add encap_hlen into headroom for GREv6 to work with FOU/GUE
>
> v7:
>   - Added skb_set_inner_ipproto to ip4ip6 and ip6ip6
>   - Clarified max_headroom in ip6_tnl_xmit
>   - Set features for IPv6 tunnels
>   - Other cleanup suggested by Alexander
>   - Above fixes throughput performance issues in ip4ip6 and ip6ip6,
> updated test results to reflect that
>
> Tested: Various cases of IP tunnels with netperf TCP_STREAM and TCP_RR.
>
> - IPv4/GRE/GUE/IPv6 with RCO
>   1 TCP_STREAM
> 6616 Mbps
>   200 TCP_RR
> 1244043 tps
> 141/243/446 90/95/99% latencies
> 86.61% CPU utilization
>
> - IPv6/GRE/GUE/IPv6 with RCO
>   1 TCP_STREAM
> 6940 Mbps
>   200 TCP_RR
> 1270903 tps
> 138/236/440 90/95/99% latencies
> 87.51% CPU utilization
>
>  - IP6IP6
>   1 TCP_STREAM
> 5307 Mbps
>   200 TCP_RR
> 498981 tps
> 388/498/631 90/95/99% latencies
> 19.75% CPU utilization (1 CPU saturated)
>
>  - IP6IP6/GUE with RCO
>   1 TCP_STREAM
> 5575 Mbps
>   200 TCP_RR
> 1233818 tps
> 143/244/451 90/95/99% latencies
> 87.57 CPU utilization
>
>  - IP4IP6
>   1 TCP_STREAM
> 5235 Mbps
>   200 TCP_RR
> 763774 tps
> 250/318/466 90/95/99% latencies
> 35.25% CPU utilization (1 CPU saturated)
>
>  - IP4IP6/GUE with RCO
>   1 TCP_STREAM
> 5337 Mbps
>   200 TCP_RR
> 1196385 tps
> 148/251/460 90/95/99% latencies
> 87.56 CPU utilization
>
>  - GRE with keyid
>   200 TCP_RR
> 744173 tps
> 258/332/461 90/95/99% latencies
> 34.59% CPU utilization (1 CPU saturated)
>
>
> Tom Herbert (16):
>   gso: Remove arbitrary checks for unsupported GSO
>   net: define gso types for IPx over IPv4 and IPv6
>   ipv6: Fix nexthdr for reinjection
>   ipv6: Change "final" protocol processing for encapsulation
>   net: Cleanup encap items in ip_tunnels.h
>   fou: Call setup_udp_tunnel_sock
>   fou: Split out {fou,gue}_build_header
>   fou: Support IPv6 in fou
>   ip6_tun: Add infrastructure for doing encapsulation
>   fou: Add encap ops for IPv6 tunnels
>   ip6_gre: Add support for fou/gue encapsulation
>   ip6_tunnel: Add support for fou/gue encapsulation
>   ipv6: Set features for IPv6 tunnels
>   ip6ip6: Support for GSO/GRO
>   ip4ip6: Support for GSO/GRO
>   ipv6: Don't reset inner headers in ip6_tnl_xmit
>
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   5 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c |   5 +-
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   3 +-
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   3 +-
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c   |   3 +-
>  drivers/net/ethernet/intel/igb/igb_main.c |   3 +-
>  drivers/net/ethernet/intel/igbvf/netdev.c |   3 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +-
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
>  include/linux/netdev_features.h   |  12 +-
>  include/linux/netdevice.h |   4 +-
>  include/linux/skbuff.h|   4 +-
>  include/net/fou.h |  10 +-
>  include/net/inet_common.h |   5 +
>  include/net/ip6_tunnel.h  |  58 +++
>  include/net/ip_tunnels.h  |  76 +++--
>  

Re: [PATCH net-next 1/4] scsi_tcp: block BH in TCP callbacks

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 12:21 -0500, Mike Christie wrote:

> Can I just confirm that nested bh lock calls like:
> 
> spin_lock_bh(lock1);
> spin_lock_bh(lock2);
> 
> do something
> 
> spin_unlock_bh(lock2);
> spin_unlock_bh(lock1);
> 
> is ok? It seems smatch sometimes warns about this.

It is ok.

More generally

local_bh_disable();
local_bh_disable();

..

local_bh_enable();
local_bh_enable();

is ok, we already have a lot of them.
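Eric's point can be modeled in plain C: local_bh_disable()/local_bh_enable() behave like a depth counter (the kernel keeps this in the preempt count), so balanced nesting is always safe and BHs only run again once the count drops back to zero. A minimal userspace sketch of that counting behaviour — the names are illustrative, not the kernel implementation:

```c
#include <assert.h>

/* Illustrative model only: the kernel tracks BH-disable nesting in
 * the per-CPU preempt count; softirqs run again only at depth zero. */
static int bh_disable_depth;

static void model_local_bh_disable(void)
{
	bh_disable_depth++;
}

static void model_local_bh_enable(void)
{
	assert(bh_disable_depth > 0);	/* unbalanced enable is a bug */
	bh_disable_depth--;
}

static int bh_enabled(void)
{
	return bh_disable_depth == 0;
}
```

With this model, the nested spin_lock_bh(lock1); spin_lock_bh(lock2) pattern from the question maps to two disables and two enables: BHs stay off until the outermost enable.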






Re: [net-next RFC 1/1] net sched actions: introduce timestamp for first time used

2016-05-18 Thread Cong Wang
On Tue, May 17, 2016 at 4:38 PM, Jamal Hadi Salim  wrote:
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index 014f9a6..f581e01 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -156,6 +156,7 @@ static int tcf_bpf_dump(struct sk_buff *skb, struct 
> tc_action *act,
>
> tm.install = jiffies_to_clock_t(jiffies - prog->tcf_tm.install);
> tm.lastuse = jiffies_to_clock_t(jiffies - prog->tcf_tm.lastuse);
> +   tm.firstuse = jiffies_to_clock_t(jiffies - prog->tcf_tm.firstuse);
> tm.expires = jiffies_to_clock_t(prog->tcf_tm.expires);


I think it is time to add a wrapper for these tm.XX =
jiffies_to_clock_t(XXX) assignments.
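One possible shape for the helper Cong suggests — the helper name and the stand-in types below are illustrative sketches, not code from the thread or the kernel tree:

```c
#include <assert.h>

/* Userspace stand-ins so the sketch is self-contained. */
typedef unsigned long u64_like;
static u64_like jiffies = 1000;			/* pretend current jiffies */

static u64_like jiffies_to_clock_t(u64_like j)
{
	return j / 10;				/* pretend 10 jiffies per tick */
}

struct tcf_t_like {
	u64_like install, lastuse, firstuse, expires;
};

/* The suggested wrapper: convert all action timestamps in one call
 * instead of repeating the jiffies_to_clock_t() lines per action. */
static void tcf_tm_dump(struct tcf_t_like *dtm, const struct tcf_t_like *stm)
{
	dtm->install  = jiffies_to_clock_t(jiffies - stm->install);
	dtm->lastuse  = jiffies_to_clock_t(jiffies - stm->lastuse);
	dtm->firstuse = jiffies_to_clock_t(jiffies - stm->firstuse);
	dtm->expires  = jiffies_to_clock_t(stm->expires);
}
```

Each action's dump path would then collapse to a single tcf_tm_dump(&tm, &prog->tcf_tm) call.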


Re: [PATCH v4] net: sock: move ->sk_shutdown out of bitfields.

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 19:19 +0300, Andrey Ryabinin wrote:
> ->sk_shutdown bits share one bitfield with some other bits in sock struct,
> such as ->sk_no_check_[r,t]x, ->sk_userlocks ...
> sock_setsockopt() may write to these bits, while holding the socket lock.
> 
> In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only
> unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
> to corrupting these bits.
> 
> Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte.
> This will not change the 'struct sock' size since ->sk_shutdown moved into
> previously unused 16-bit hole.
> 
> Signed-off-by: Andrey Ryabinin 
> Suggested-by: Hannes Frederic Sowa 
> ---

Acked-by: Eric Dumazet 




Re: [PATCH net] fou: avoid using sk_user_data before it is initialised

2016-05-18 Thread Tom Herbert
On Wed, May 18, 2016 at 10:07 AM, Cong Wang  wrote:
> On Wed, May 18, 2016 at 9:30 AM, Tom Herbert  wrote:
>> On Wed, May 18, 2016 at 1:30 AM, Simon Horman
>>  wrote:
>>> During initialisation sk->sk_user_data should be initialised before
>>> it is referenced via a call to gue_encap_init().
>>>
>> I think this should be fixed by the proposed patch "fou:
>> Call setup_udp_tunnel_sock".
>
> That is 6/16, do you have a fix for stable to backport?

Your patch to use fou directly is good.

Tom


Re: [net-next PATCH 2/2] net sched: actions use tcf_lastuse_update

2016-05-18 Thread Cong Wang
On Tue, May 17, 2016 at 2:19 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> Signed-off-by: Jamal Hadi Salim 

Acked-by:  Cong Wang 


Re: [net-next PATCH 1/2] net sched: indentation and other stylistic fixes

2016-05-18 Thread Cong Wang
On Tue, May 17, 2016 at 2:19 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> Signed-off-by: Jamal Hadi Salim 

Looks good to me. I assume this patch passes checkpatch.pl. ;)

Acked-by:  Cong Wang 


Re: [PATCH net] fou: avoid using sk_user_data before it is initialised

2016-05-18 Thread Cong Wang
On Wed, May 18, 2016 at 9:30 AM, Tom Herbert  wrote:
> On Wed, May 18, 2016 at 1:30 AM, Simon Horman
>  wrote:
>> During initialisation sk->sk_user_data should be initialised before
>> it is referenced via a call to gue_encap_init().
>>
> I think this should be fixed by the proposed patch "fou:
> Call setup_udp_tunnel_sock".

That is 6/16, do you have a fix for stable to backport?


[PATCH net 0/2] RDS: TCP: connection spamming fixes

2016-05-18 Thread Sowmini Varadhan
We have been testing the RDS-TCP code with a connection spammer
that sends incoming SYNs to the RDS listen port well after 
an rds-tcp connection has been established, and found a few 
race-windows that are fixed by this patch series.

Patch 1 avoids a null pointer deref when an incoming SYN 
shows up when a netns is being dismantled, or when the 
rds-tcp module is being unloaded. 

Patch 2 addresses the case when a SYN is received after the
connection arbitration algorithm has converged: the incoming
SYN should not needlessly quiesce the transmit path, and it
should not result in needless TCP connection resets due to
re-execution of the connection arbitration logic.

Sowmini Varadhan (2):
  RDS: TCP: rds_tcp_accept_worker() must exit gracefully when
terminating rds-tcp
  RDS: TCP: Avoid rds connection churn from rogue SYNs

 net/rds/tcp_listen.c |   13 +
 1 files changed, 9 insertions(+), 4 deletions(-)
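The "connection arbitration algorithm" referenced here breaks the tie between two simultaneously initiated connections by comparing the endpoint addresses in host byte order, as in the ntohl(inet->inet_saddr) < ntohl(inet->inet_daddr) test in patch 2. A hedged userspace sketch of why that comparison is a deterministic tie-break (both peers evaluate it with the arguments swapped, so exactly one side resets its incoming connection; the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Each side compares (its local address, the peer address), both in
 * host byte order.  Because the comparison inverts when the arguments
 * swap, exactly one of the two racing connections survives. */
static int arbitration_resets_incoming(uint32_t local_be, uint32_t peer_be)
{
	return ntohl(local_be) < ntohl(peer_be);
}
```

The same idea guards against the rogue-SYN churn: a late SYN re-runs the comparison and loses deterministically instead of tearing down the established connection.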



[PATCH net 2/2] RDS: TCP: Avoid rds connection churn from rogue SYNs

2016-05-18 Thread Sowmini Varadhan
When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.

Signed-off-by: Sowmini Varadhan 
---
 net/rds/tcp_listen.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index e10b422..bc387c2 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -132,11 +132,13 @@ int rds_tcp_accept_one(struct socket *sock)
 * so we must quiesce any send threads before resetting
 * c_transport_data.
 */
-   wait_event(conn->c_waitq,
-  !test_bit(RDS_IN_XMIT, &conn->c_flags));
-   if (ntohl(inet->inet_saddr) < ntohl(inet->inet_daddr)) {
+   if (ntohl(inet->inet_saddr) < ntohl(inet->inet_daddr) ||
+   !conn->c_outgoing) {
goto rst_nsk;
-   } else if (rs_tcp->t_sock) {
+   } else {
+   atomic_set(&conn->c_state, RDS_CONN_CONNECTING);
+   wait_event(conn->c_waitq,
+  !test_bit(RDS_IN_XMIT, &conn->c_flags));
rds_tcp_restore_callbacks(rs_tcp->t_sock, rs_tcp);
conn->c_outgoing = 0;
}
-- 
1.7.1



[PATCH net 1/2] RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp

2016-05-18 Thread Sowmini Varadhan
There are two instances where we want to terminate RDS-TCP: when
exiting the netns or during module unload. In either case, the
termination sequence is to stop the listen socket, mark the
rtn->rds_tcp_listen_sock as null, and flush any accept workqs.
Thus any workqs that get flushed at this point will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.

Signed-off-by: Sowmini Varadhan 
---
 net/rds/tcp_listen.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index be263cd..e10b422 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -80,6 +80,9 @@ int rds_tcp_accept_one(struct socket *sock)
int conn_state;
struct sock *nsk;
 
+   if (!sock) /* module unload or netns delete in progress */
+   return -ENETUNREACH;
+
ret = sock_create_kern(sock_net(sock->sk), sock->sk->sk_family,
   sock->sk->sk_type, sock->sk->sk_protocol,
   &new_sock);
-- 
1.7.1



Re: [PATCH 1/1 RFC] net/phy: Add Lantiq PHY driver

2016-05-18 Thread Andrew Lunn
> For LEDs, we had a patch series floating around adding LED triggers [1],
> and it seems to me like the LEDs class subsystem would be a good fit for
> controlling PHY LEDs, possibly with the help of PHYLIB when it comes to
> doing the low-level work of registering LEDs and their names with the
> LEDS subsystem.
> 
> [1]: http://lists.openwall.net/netdev/2016/03/23/61

That patch fizzled out. I got the feeling it was pushing the
capabilities of the coder. I do however think it is a reasonable path
to follow for PHY LEDs.

I took a quick look at the datasheet and the controlling of the LEDs
is very flexible. It should not be a problem to expose some of that
functionality via LED triggers.

  Andrew


Re: [PATCH] Revert "phy: add support for a reset-gpio specification"

2016-05-18 Thread Florian Fainelli
On 05/18/2016 09:05 AM, Fabio Estevam wrote:
> Commit da47b4572056 ("phy: add support for a reset-gpio specification")
> causes the following xtensa qemu crash according to Guenter Roeck:
> 
> [9.366256] libphy: ethoc-mdio: probed
> [9.367389]  (null): could not attach to PHY
> [9.368555]  (null): failed to probe MDIO bus
> [9.371540] Unable to handle kernel paging request at virtual address 
> 001c
> [9.371540]  pc = d0320926, ra = 903209d1
> [9.375358] Oops: sig: 11 [#1]
> 
> This reverts commit da47b4572056487fd7941c26f73b3e8815ff712a.
> 
> Reported-by: Guenter Roeck 
> Signed-off-by: Fabio Estevam 

Acked-by: Florian Fainelli 

Thanks!
-- 
Florian


[PATCH v4] net: sock: move ->sk_shutdown out of bitfields.

2016-05-18 Thread Andrey Ryabinin
->sk_shutdown bits share one bitfield with some other bits in sock struct,
such as ->sk_no_check_[r,t]x, ->sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.

In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.

Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since ->sk_shutdown moved into
previously unused 16-bit hole.

Signed-off-by: Andrey Ryabinin 
Suggested-by: Hannes Frederic Sowa 
---

Changes since v1:
 - move sk_shutdown into a separate byte instead of locking
AF_UNIX sockets.

Changes since v2:
 - reorder bitfields, so that sk_type/sk_protocol still fit in
separate 2-bytes/byte.

Changes since v3:
  - padding and comment per Eric.

 include/net/sock.h | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index c9c8b19..649d2a8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -382,8 +382,13 @@ struct sock {
atomic_tsk_omem_alloc;
int sk_sndbuf;
struct sk_buff_head sk_write_queue;
+
+   /*
+* Because of non atomicity rules, all
+* changes are protected by socket lock.
+*/
kmemcheck_bitfield_begin(flags);
-   unsigned intsk_shutdown  : 2,
+   unsigned intsk_padding : 2,
sk_no_check_tx : 1,
sk_no_check_rx : 1,
sk_userlocks : 4,
@@ -391,6 +396,7 @@ struct sock {
sk_type  : 16;
 #define SK_PROTOCOL_MAX U8_MAX
kmemcheck_bitfield_end(flags);
+
int sk_wmem_queued;
gfp_t   sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
@@ -418,6 +424,7 @@ struct sock {
struct timer_list   sk_timer;
ktime_t sk_stamp;
u16 sk_tsflags;
+   u8  sk_shutdown;
u32 sk_tskey;
struct socket   *sk_socket;
void*sk_user_data;
-- 
2.7.3
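The corruption mode Andrey describes comes from how C bitfields work: members declared in one bitfield group share a storage unit, so a store to any member is a non-atomic read-modify-write of the whole word, and two lock domains touching different members can clobber each other. A userspace illustration of the layout point — the struct and field names are simplified stand-ins for struct sock, and the sizeof assumption holds on common ABIs:

```c
#include <assert.h>
#include <stddef.h>

/* All four members share one unsigned int storage unit: writing any
 * of them rewrites the others' bits too (read-modify-write). */
struct packed_flags {
	unsigned int shutdown    : 2;	/* like the old sk_shutdown */
	unsigned int no_check_tx : 1;
	unsigned int no_check_rx : 1;
	unsigned int userlocks   : 4;
};

/* The fix's shape: shutdown moved into its own byte, so a plain
 * byte store can no longer disturb the remaining bitfield members. */
struct split_flags {
	unsigned int no_check_tx : 1;
	unsigned int no_check_rx : 1;
	unsigned int userlocks   : 4;
	unsigned char shutdown;		/* separate byte, independent store */
};
```

In the kernel patch the new byte lands in a previously unused hole, which is why sizeof(struct sock) does not change.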



Re: [PATCH net-next] tuntap: introduce tx skb ring

2016-05-18 Thread Michael S. Tsirkin
On Wed, May 18, 2016 at 09:41:03AM -0700, Eric Dumazet wrote:
> On Wed, 2016-05-18 at 19:26 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 18, 2016 at 10:13:59AM +0200, Jesper Dangaard Brouer wrote:
> > > I agree. It is sad to see everybody is implementing the same thing,
> > > open coding an array/circular based ring buffer.  This kind of code is
> > > hard to maintain and get right with barriers etc.  We can achieve the
> > > same performance with a generic implementation, by inlining the help
> > > function calls.
> > 
> > So my testing seems to show that at least for the common usecase
> > in networking, which isn't lockless, circular buffer
> > with indices does not perform that well, because
> > each index access causes a cache line to bounce between
> > CPUs, and index access causes stalls due to the dependency.
> 
> 
> Yes.
> 
> > 
> > By comparison, an array of pointers where NULL means invalid
> > and !NULL means valid, can be updated without messing up barriers
> > at all and does not have this issue.
> 
> Right but then you need appropriate barriers.
> 
> > 
> > You also mentioned cache pressure caused by using large queues, and I
> > think it's a significant issue. tun has a queue of 1000 entries by
> > default and that's 8K already.
> > 
> > So, I had an idea: with an array of pointers we could actually use
> > only part of the ring as long as it stays mostly empty.
> > We do want to fill at least two cache lines to prevent producer
> > and consumer from writing over the same cache line all the time.
> > This is SKB_ARRAY_MIN_SIZE logic below.
> > 
> > Pls take a look at the implementation below.  It's a straight port from 
> > virtio
> > unit test, so should work fine, except the SKB_ARRAY_MIN_SIZE hack that
> > I added.  Today I run out of time for testing this.  Posting for early
> > flames/feedback.
> > 
> > It's using skb pointers, but switching to void * would be easy at the cost
> > of type safety, though it appears that people want lockless push
> > etc so I'm not sure of the value.
> > 
> > --->
> > skb_array: array based FIFO for skbs
> > 
> > A simple array based FIFO of pointers.
> > Intended for net stack so uses skbs for type
> > safety, but we can replace it with void *
> > if others find it useful outside of net stack.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > 
> > ---
> > 
> > diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> > new file mode 100644
> > index 000..a67cc8b
> > --- /dev/null
> > +++ b/include/linux/skb_array.h
> > @@ -0,0 +1,116 @@
> > +/*
> > + * See Documentation/skbular-buffers.txt for more information.
> > + */
> > +
> > +#ifndef _LINUX_SKB_ARRAY_H
> > +#define _LINUX_SKB_ARRAY_H 1
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct sk_buff;
> > +
> > +struct skb_array {
> > +   int producer cacheline_aligned_in_smp;
> > +   spinlock_t producer_lock;
> > +   int consumer cacheline_aligned_in_smp;
> > +   spinlock_t consumer_lock;
> > +   /* Shared consumer/producer data */
> > +   int size cacheline_aligned_in_smp; /* max entries in queue */
> > +   struct sk_buff **queue;
> > +};
> > +
> > +#define SKB_ARRAY_MIN_SIZE (2 * (0x1 << cache_line_size()) / \
> > +   sizeof (struct sk_buff *))
> > +
> > +static inline int __skb_array_produce(struct skb_array *a,
> > +  struct sk_buff *skb)
> > +{
> > +   /* Try to start from beginning: good for cache utilization as we'll
> > +* keep reusing the same cache line.
> > +* Produce at least SKB_ARRAY_MIN_SIZE entries before trying to do this,
> > +* to reduce bouncing cache lines between them.
> > +*/
> > +   if (a->producer >= SKB_ARRAY_MIN_SIZE && !a->queue[0])
> 
> a->queue[0] might be set by consumer, you probably need a barrier.

I think not - we write to the same place below and two accesses to
same address are never reordered.

> > +   a->producer = 0;
> > +   if (a->queue[a->producer])
> > +   return -ENOSPC;
> > +   a->queue[a->producer] = skb;
> > +   if (unlikely(++a->producer > a->size))
> > +   a->producer = 0;
> > +   return 0;
> > +}
> > +
> > +static inline int skb_array_produce_bh(struct skb_array *a,
> > +  struct sk_buff *skb)
> > +{
> > +   int ret;
> > +
+   spin_lock_bh(&a->producer_lock);
+   ret = __skb_array_produce(a, skb);
+   spin_unlock_bh(&a->producer_lock);
> > +
> > +   return ret;
> > +}
> > +
> > +static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
> > +{
> > +   if (a->queue[a->consumer])
> > +   return a->queue[a->consumer];
> > +
> > +   /* Check whether producer started at the beginning. */
> > +   if (unlikely(a->consumer >= SKB_ARRAY_MIN_SIZE && a->queue[0])) {
> > +   a->consumer = 0;
> > +   return a->queue[0];
> > +   }
> > +
> > +   return NULL;
> > +}
> > +
> > +static 

Re: [PATCH net] fou: avoid using sk_user_data before it is initialised

2016-05-18 Thread Cong Wang
On Wed, May 18, 2016 at 1:30 AM, Simon Horman
 wrote:
> During initialisation sk->sk_user_data should be initialised before
> it is referenced via a call to gue_encap_init().

Or just use 'fou' directly?

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index eeec7d6..0b7a983 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -453,7 +453,7 @@ static int fou_encap_init(struct sock *sk, struct
fou *fou, struct fou_cfg *cfg)
udp_sk(sk)->encap_rcv = fou_udp_recv;
udp_sk(sk)->gro_receive = fou_gro_receive;
udp_sk(sk)->gro_complete = fou_gro_complete;
-   fou_from_sock(sk)->protocol = cfg->protocol;
+   fou->protocol = cfg->protocol;

return 0;
 }


Re: [Intel-wired-lan] [PATCH next-queue] ixgbe: netdev->vlan_features shouldn't have the vlan related flag

2016-05-18 Thread Xin Long
On Wed, May 18, 2016 at 11:07 PM, Alexander Duyck
 wrote:
> On Wed, May 18, 2016 at 1:55 AM, Xin Long  wrote:
>> vlan_features is used to set the vlan_dev->features when we create
>> a vlan device. it shouldn't have the vlan related flag, like
>> NETIF_F_HW_VLAN_CTAG_FILTER, which will cause vlan_dev create fail.
>> the call trace is as follow:
>>
>> [ 5604.264429] Call Trace:
>> [ 5604.278980]  [] dump_stack+0x63/0x84
>> [ 5604.341499]  [] __warn+0xd1/0xf0
>> [ 5604.382004]  [] warn_slowpath_fmt+0x5f/0x80
>> [ 5604.454602]  [] ? find_next_bit+0x19/0x20
>> [ 5604.541940]  [] register_netdevice+0x3c2/0x490
>> [ 5604.631744]  [] register_vlan_dev+0x133/0x290 [8021q]
>> [ 5604.710346]  [] vlan_newlink+0xbc/0xf0 [8021q]
>> [ 5604.789945]  [] rtnl_newlink+0x6c2/0x880
>> [ 5604.854000]  [] ? nla_parse+0xa3/0x100
>> [ 5604.889974]  [] ? rtnl_newlink+0x15c/0x880
>> [ 5604.951987]  [] rtnetlink_rcv_msg+0xa4/0x240
>> [ 5605.017614]  [] ? sock_has_perm+0x70/0x90
>> [ 5605.083120]  [] ? __alloc_skb+0x8d/0x2b0
>> [ 5605.147939]  [] ? rtnetlink_rcv+0x30/0x30
>> [ 5605.194973]  [] netlink_rcv_skb+0xa7/0xc0
>> [ 5605.246380]  [] rtnetlink_rcv+0x28/0x30
>> [ 5605.308998]  [] netlink_unicast+0x178/0x240
>> [ 5605.375020]  [] netlink_sendmsg+0x32e/0x3b0
>> [ 5605.463066]  [] sock_sendmsg+0x38/0x50
>> [ 5605.523910]  [] ___sys_sendmsg+0x279/0x290
>> [ 5605.574178]  [] ? filemap_map_pages+0x252/0x2d0
>> [ 5605.675281]  [] ? mem_cgroup_commit_charge+0x85/0x100
>> [ 5605.748882]  [] __sys_sendmsg+0x54/0x90
>> [ 5605.811931]  [] SyS_sendmsg+0x12/0x20
>> [ 5605.873955]  [] do_syscall_64+0x62/0x110
>> [ 5605.931006]  [] entry_SYSCALL64_slow_path+0x25/0x25
>> [ 5606.012017] ---[ end trace 11d7fa6a696c0c02 ]---
>>
>> it's from register_netdevice:
>>
>> if (((dev->hw_features | dev->features) &
>>  NETIF_F_HW_VLAN_CTAG_FILTER) &&
>> (!dev->netdev_ops->ndo_vlan_rx_add_vid ||
>>  !dev->netdev_ops->ndo_vlan_rx_kill_vid)) {
>> netdev_WARN(dev, "Buggy VLAN acceleration in driver!\n");
>> ret = -EINVAL;
>> goto err_uninit;
>> }
>>
>> the reason is vlan dev's features has NETIF_F_HW_VLAN_CTAG_FILTER flag,
>> but no ndo_vlan_rx_add_vid nor ndo_vlan_rx_kill_vid.
>>
>> we will fix it by put setting netdev->features' vlan related flags behind
>> using features to set netdev->vlan_features.
>>
>> Signed-off-by: Xin Long 
>
> There is already a fix for this in Dave Miller's net-next tree.  Take
> a look at commit 5eee87cd51df "ixgbe: Fix VLAN features error".
Ah, yes, just saw it, same patch. ;)
Thanks, Alex

>
> Thanks.
>
> - Alex


Re: [PATCH net-next] tuntap: introduce tx skb ring

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 19:26 +0300, Michael S. Tsirkin wrote:
> On Wed, May 18, 2016 at 10:13:59AM +0200, Jesper Dangaard Brouer wrote:
> > I agree. It is sad to see everybody is implementing the same thing,
> > open coding an array/circular based ring buffer.  This kind of code is
> > hard to maintain and get right with barriers etc.  We can achieve the
> > same performance with a generic implementation, by inlining the help
> > function calls.
> 
> So my testing seems to show that at least for the common usecase
> in networking, which isn't lockless, circular buffer
> with indices does not perform that well, because
> each index access causes a cache line to bounce between
> CPUs, and index access causes stalls due to the dependency.


Yes.

> 
> By comparison, an array of pointers where NULL means invalid
> and !NULL means valid, can be updated without messing up barriers
> at all and does not have this issue.

Right but then you need appropriate barriers.

> 
> You also mentioned cache pressure caused by using large queues, and I
> think it's a significant issue. tun has a queue of 1000 entries by
> default and that's 8K already.
> 
> So, I had an idea: with an array of pointers we could actually use
> only part of the ring as long as it stays mostly empty.
> We do want to fill at least two cache lines to prevent producer
> and consumer from writing over the same cache line all the time.
> This is SKB_ARRAY_MIN_SIZE logic below.
> 
> Pls take a look at the implementation below.  It's a straight port from virtio
> unit test, so should work fine, except the SKB_ARRAY_MIN_SIZE hack that
> I added.  Today I run out of time for testing this.  Posting for early
> flames/feedback.
> 
> It's using skb pointers, but switching to void * would be easy at the cost
> of type safety, though it appears that people want lockless push
> etc so I'm not sure of the value.
> 
> --->
> skb_array: array based FIFO for skbs
> 
> A simple array based FIFO of pointers.
> Intended for net stack so uses skbs for type
> safety, but we can replace it with void *
> if others find it useful outside of net stack.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> 
> diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> new file mode 100644
> index 000..a67cc8b
> --- /dev/null
> +++ b/include/linux/skb_array.h
> @@ -0,0 +1,116 @@
> +/*
> + * See Documentation/skbular-buffers.txt for more information.
> + */
> +
> +#ifndef _LINUX_SKB_ARRAY_H
> +#define _LINUX_SKB_ARRAY_H 1
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct sk_buff;
> +
> +struct skb_array {
> + int producer cacheline_aligned_in_smp;
> + spinlock_t producer_lock;
> + int consumer cacheline_aligned_in_smp;
> + spinlock_t consumer_lock;
> + /* Shared consumer/producer data */
> + int size cacheline_aligned_in_smp; /* max entries in queue */
> + struct sk_buff **queue;
> +};
> +
> +#define SKB_ARRAY_MIN_SIZE (2 * (0x1 << cache_line_size()) / \
> + sizeof (struct sk_buff *))
> +
> +static inline int __skb_array_produce(struct skb_array *a,
> +struct sk_buff *skb)
> +{
> + /* Try to start from beginning: good for cache utilization as we'll
> +  * keep reusing the same cache line.
> +  * Produce at least SKB_ARRAY_MIN_SIZE entries before trying to do this,
> +  * to reduce bouncing cache lines between them.
> +  */
> + if (a->producer >= SKB_ARRAY_MIN_SIZE && !a->queue[0])

a->queue[0] might be set by consumer, you probably need a barrier.

> + a->producer = 0;
> + if (a->queue[a->producer])
> + return -ENOSPC;
> + a->queue[a->producer] = skb;
> + if (unlikely(++a->producer > a->size))
> + a->producer = 0;
> + return 0;
> +}
> +
> +static inline int skb_array_produce_bh(struct skb_array *a,
> +struct sk_buff *skb)
> +{
> + int ret;
> +
> + spin_lock_bh(&a->producer_lock);
> + ret = __skb_array_produce(a, skb);
> + spin_unlock_bh(&a->producer_lock);
> +
> + return ret;
> +}
> +
> +static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
> +{
> + if (a->queue[a->consumer])
> + return a->queue[a->consumer];
> +
> + /* Check whether producer started at the beginning. */
> + if (unlikely(a->consumer >= SKB_ARRAY_MIN_SIZE && a->queue[0])) {
> + a->consumer = 0;
> + return a->queue[0];
> + }
> +
> + return NULL;
> +}
> +
> +static inline void __skb_array_consume(struct skb_array *a)
> +{
> + a->queue[a->consumer++] = NULL;
> + if (unlikely(++a->consumer > a->size))

a->consumer is incremented twice ?

> + a->consumer = 0;
> +}
> +
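The NULL-as-empty convention Michael argues for can be shown in a small userspace sketch: the producer writes a pointer into an empty (NULL) slot and the consumer NULLs it after reading, so validity travels with the data word itself rather than through shared producer/consumer indices. This is a single-threaded illustration only — a real SMP version needs the barriers this thread debates — and it uses a single increment per operation, avoiding the double increment Eric flags above:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

struct ptr_ring_sketch {
	int producer, consumer;
	void *queue[RING_SIZE];	/* NULL slot == empty */
};

static int ring_produce(struct ptr_ring_sketch *r, void *p)
{
	if (r->queue[r->producer])
		return -1;			/* ring is full */
	r->queue[r->producer] = p;		/* entry becomes valid */
	if (++r->producer >= RING_SIZE)
		r->producer = 0;
	return 0;
}

static void *ring_consume(struct ptr_ring_sketch *r)
{
	void *p = r->queue[r->consumer];

	if (!p)
		return NULL;			/* ring is empty */
	r->queue[r->consumer] = NULL;		/* entry becomes free */
	if (++r->consumer >= RING_SIZE)
		r->consumer = 0;
	return p;
}
```

Note that producer and consumer each read and write only their own index; the only shared state is the queue entries themselves, which is the property that avoids the index cache-line bouncing discussed above.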






Re: [PATCH net] fou: avoid using sk_user_data before it is initialised

2016-05-18 Thread Tom Herbert
On Wed, May 18, 2016 at 1:30 AM, Simon Horman
 wrote:
> During initialisation sk->sk_user_data should be initialised before
> it is referenced via a call to gue_encap_init().
>
I think this should be fixed by the proposed patch "fou:
Call setup_udp_tunnel_sock".

Thanks,
Tom

> Found by bisection after noticing the following:
>
> $ ip fou add port  ipproto 47
> [0.383417] BUG: unable to handle kernel NULL pointer dereference at 0008
> [0.384132] IP: [] fou_nl_cmd_add_port+0x1e1/0x230
> [0.384707] PGD 1fafc067 PUD 1fb72067 PMD 0
> [0.385110] Oops: 0002 [#1] SMP
> [0.385387] Modules linked in:
> [0.385667] CPU: 0 PID: 55 Comm: ip Not tainted 4.6.0-03623-g0b7962a6c4a3 #430
> [0.386244] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [0.386244] task: 88001fb9cac0 ti: 88001fbc8000 task.ti: 88001fbc8000
> [0.386244] RIP: 0010:[]  [] fou_nl_cmd_add_port+0x1e1/0x230
> [0.386244] RSP: 0018:88001fbcbb78  EFLAGS: 00010246
> [0.386244] RAX: 0001 RBX: 88001fb8eb40 RCX: 002f
> [0.386244] RDX:  RSI:  RDI: 880019fcafc0
> [0.386244] RBP: 880019fcaf80 R08: 8130c370 R09: 880019fcaf80
> [0.386244] R10: 880019e12b8c R11:  R12: 0004
> [0.386244] R13: 0014 R14: 88001fb1a300 R15: 816634c0
> [0.386244] FS:  7f016eb4d700() GS:88001a20() knlGS:
> [0.386244] CS:  0010 DS:  ES:  CR0: 80050033
> [0.386244] CR2: 0008 CR3: 1fb69000 CR4: 06b0
> [0.386244] Stack:
> [0.386244]  88001faaea24 8800192426c0 0002002f0001 
> [0.386244]     b822
> [0.386244]  81461480 88001faaea14 0004 812b0e17
> [0.386244] Call Trace:
> [0.386244]  [] ? genl_family_rcv_msg+0x197/0x320
> [0.386244]  [] ? genl_family_rcv_msg+0x320/0x320
> [0.386244]  [] ? genl_rcv_msg+0x70/0xb0
> [0.386244]  [] ? netlink_rcv_skb+0xa1/0xc0
> [0.386244]  [] ? genl_rcv+0x24/0x40
> [0.386244]  [] ? netlink_unicast+0x143/0x1d0
> [0.386244]  [] ? netlink_sendmsg+0x366/0x390
> [0.386244]  [] ? rw_copy_check_uvector+0x68/0x110
> [0.386244]  [] ? sock_sendmsg+0x10/0x20
> [0.386244]  [] ? ___sys_sendmsg+0x1f1/0x200
> [0.386244]  [] ? pipe_write+0x1a0/0x420
> [0.386244]  [] ? do_filp_open+0x92/0xe0
> [0.386244]  [] ? __sys_sendmsg+0x41/0x70
> [0.386244]  [] ? entry_SYSCALL_64_fastpath+0x13/0x8f
> [0.386244] Code: 4c 24 12 48 8b 93 28 02 00 00 48 c7 83 68 03 00 00 e0 76 32 81 48 c7 83 78 03 00 00 50 61 32 81 48 c7 83 80 03 00 00 e0 64 32 81 <88> 4a 08 e9 20 ff ff ff 4c 89 e7 bb 8e ff ff ff e8 1a 34 07 00
> [0.386244] RIP  [] fou_nl_cmd_add_port+0x1e1/0x230
> [0.386244]  RSP 
> [0.386244] CR2: 0008
> [0.407176] ---[ end trace 13bf0d24a4b7f9c3 ]---
>
> Fixes: d92283e338f6 ("fou: change to use UDP socket GRO")
> Signed-off-by: Simon Horman 
> ---
>  net/ipv4/fou.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
> index eeec7d60e5fd..5f9634915cf2 100644
> --- a/net/ipv4/fou.c
> +++ b/net/ipv4/fou.c
> @@ -488,6 +488,7 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
> }
>
> sk = sock->sk;
> +   sk->sk_user_data = fou;
>
> fou->flags = cfg->flags;
> fou->port = cfg->udp_config.local_udp_port;
> @@ -514,7 +515,6 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
> udp_sk(sk)->encap_type = 1;
> udp_encap_enable();
>
> -   sk->sk_user_data = fou;
> fou->sock = sock;
>
> inet_inc_convert_csum(sk);
> --
> 2.1.4
>
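
The fix above is an initialization-ordering issue: the encap receive path
dereferences sk_user_data, so it must be assigned before the socket can hand
packets to the callback. A hypothetical, self-contained C model of that
ordering (fou_setup and the struct layouts are invented for illustration,
this is not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the ordering bug fixed above: the receive callback
 * dereferences sk_user_data, so user data must be published before
 * the callback is registered.  All names here are illustrative. */
struct fou { int protocol; };

struct sock {
	void *user_data;	/* stands in for sk->sk_user_data */
	int (*encap_rcv)(struct sock *sk);
};

static int fou_udp_recv(struct sock *sk)
{
	struct fou *fou = sk->user_data;

	/* with the buggy ordering this is the NULL dereference above */
	return fou ? fou->protocol : -1;
}

static void fou_setup(struct sock *sk, struct fou *fou)
{
	sk->user_data = fou;		/* assigned first, as in the fix */
	sk->encap_rcv = fou_udp_recv;	/* only now may packets arrive */
}
```

If the two assignments in fou_setup were swapped and a packet arrived in
between, the callback would see a NULL fou, which is the crash in the trace
above.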


Re: [PATCH net-next] tuntap: introduce tx skb ring

2016-05-18 Thread Michael S. Tsirkin
On Wed, May 18, 2016 at 10:13:59AM +0200, Jesper Dangaard Brouer wrote:
> I agree. It is sad to see everybody is implementing the same thing,
> open coding an array/circular based ring buffer.  This kind of code is
> hard to maintain and get right with barriers etc.  We can achieve the
> same performance with a generic implementation, by inlining the helper
> function calls.

So my testing seems to show that at least for the common usecase
in networking, which isn't lockless, circular buffer
with indices does not perform that well, because
each index access causes a cache line to bounce between
CPUs, and index access causes stalls due to the dependency.

By comparison, an array of pointers where NULL means invalid
and !NULL means valid, can be updated without messing up barriers
at all and does not have this issue.

You also mentioned cache pressure caused by using large queues, and I
think it's a significant issue. tun has a queue of 1000 entries by
default and that's 8K already.

So, I had an idea: with an array of pointers we could actually use
only part of the ring as long as it stays mostly empty.
We do want to fill at least two cache lines to prevent producer
and consumer from writing over the same cache line all the time.
This is SKB_ARRAY_MIN_SIZE logic below.

Pls take a look at the implementation below.  It's a straight port from virtio
unit test, so should work fine, except the SKB_ARRAY_MIN_SIZE hack that
I added.  Today I ran out of time for testing this.  Posting for early
flames/feedback.

It's using skb pointers, but switching to void * would be easy at the cost
of type safety, though it appears that people want lockless push
etc., so I'm not sure of the value.

--->
skb_array: array based FIFO for skbs

A simple array based FIFO of pointers.
Intended for net stack so uses skbs for type
safety, but we can replace with void *
if others find it useful outside of net stack.

Signed-off-by: Michael S. Tsirkin 

---

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
new file mode 100644
index 000..a67cc8b
--- /dev/null
+++ b/include/linux/skb_array.h
@@ -0,0 +1,116 @@
+/*
+ * See Documentation/circular-buffers.txt for more information.
+ */
+
+#ifndef _LINUX_SKB_ARRAY_H
+#define _LINUX_SKB_ARRAY_H 1
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct sk_buff;
+
+struct skb_array {
+   int producer cacheline_aligned_in_smp;
+   spinlock_t producer_lock;
+   int consumer cacheline_aligned_in_smp;
+   spinlock_t consumer_lock;
+   /* Shared consumer/producer data */
+   int size cacheline_aligned_in_smp; /* max entries in queue */
+   struct sk_buff **queue;
+};
+
+#define SKB_ARRAY_MIN_SIZE (2 * (0x1 << cache_line_size()) / \
+   sizeof (struct sk_buff *))
+
+static inline int __skb_array_produce(struct skb_array *a,
+  struct sk_buff *skb)
+{
+   /* Try to start from beginning: good for cache utilization as we'll
+* keep reusing the same cache line.
+* Produce at least SKB_ARRAY_MIN_SIZE entries before trying to do this,
+* to reduce bouncing cache lines between them.
+*/
+   if (a->producer >= SKB_ARRAY_MIN_SIZE && !a->queue[0])
+   a->producer = 0;
+   if (a->queue[a->producer])
+   return -ENOSPC;
+   a->queue[a->producer] = skb;
+   if (unlikely(++a->producer > a->size))
+   a->producer = 0;
+   return 0;
+}
+
+static inline int skb_array_produce_bh(struct skb_array *a,
+  struct sk_buff *skb)
+{
+   int ret;
+
+   spin_lock_bh(&a->producer_lock);
+   ret = __skb_array_produce(a, skb);
+   spin_unlock_bh(&a->producer_lock);
+
+   return ret;
+}
+
+static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
+{
+   if (a->queue[a->consumer])
+   return a->queue[a->consumer];
+
+   /* Check whether producer started at the beginning. */
+   if (unlikely(a->consumer >= SKB_ARRAY_MIN_SIZE && a->queue[0])) {
+   a->consumer = 0;
+   return a->queue[0];
+   }
+
+   return NULL;
+}
+
+static inline void __skb_array_consume(struct skb_array *a)
+{
+   a->queue[a->consumer++] = NULL;
+   if (unlikely(++a->consumer > a->size))
+   a->consumer = 0;
+}
+
+static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
+{
+   struct sk_buff *skb;
+
+   spin_lock_bh(&a->producer_lock);
+   skb = __skb_array_peek(a);
+   if (skb)
+   __skb_array_consume(a);
+   spin_unlock_bh(&a->producer_lock);
+
+   return skb;
+}
+
+static inline int skb_array_init(struct skb_array *a, int size, gfp_t gfp)
+{
+   a->queue = kmalloc(ALIGN(size * sizeof *(a->queue), SMP_CACHE_BYTES),
+  gfp);
+   if (!a->queue)
+   return -ENOMEM;
+
+   a->size = 
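
As a companion to the patch above, here is a hypothetical single-threaded
userspace sketch of the same NULL-slot FIFO idea: an empty slot is NULL and a
full slot is non-NULL, so producer and consumer never read each other's index.
The spinlocks and the SKB_ARRAY_MIN_SIZE wrap heuristic are left out, and each
side advances its index exactly once per operation, avoiding the double
increment questioned at the top of this digest.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the pointer-array FIFO; RING_SIZE is arbitrary. */
#define RING_SIZE 8

struct ring {
	int producer;
	int consumer;
	void *queue[RING_SIZE];
};

static int ring_produce(struct ring *r, void *item)
{
	if (r->queue[r->producer])
		return -1;			/* ring is full */
	r->queue[r->producer] = item;
	if (++r->producer >= RING_SIZE)
		r->producer = 0;
	return 0;
}

static void *ring_consume(struct ring *r)
{
	void *item = r->queue[r->consumer];

	if (!item)
		return NULL;			/* ring is empty */
	r->queue[r->consumer] = NULL;
	if (++r->consumer >= RING_SIZE)
		r->consumer = 0;
	return item;
}
```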

Re: [PATCH 1/1 RFC] net/phy: Add Lantiq PHY driver

2016-05-18 Thread Florian Fainelli
CC'ing Andrew, John,

On 05/18/2016 09:03 AM, Alexander Stein wrote:
> This currently only supports the PEF7071 and allows specifying max-speed and
> is able to read the LED configuration from device-tree.
> 
> Signed-off-by: Alexander Stein 
> ---
> The main purpose for now is to set a LED configuration from device tree and
> to limit the maximum speed. The latter one is, in my case, hardware limited.
> As the MAC and its link partner support 1000MBit/s, they would try to use that
> but will eventually fail due to the magnetics only supporting 100MBit/s. So
> limit the maximum link speed supported directly from the start.

The 'max-speed' parsing that you do in the driver should not be needed;
PHYLIB takes care of that already, see
drivers/net/phy/phy_device.c::of_set_phy_supported

For LEDs, we had a patch series floating around adding LED triggers [1],
and it seems to me like the LEDs class subsystem would be a good fit for
controlling PHY LEDs, possibly with the help of PHYLIB when it comes to
doing the low-level work of registering LEDs and their names with the
LEDS subsystem.

[1]: http://lists.openwall.net/netdev/2016/03/23/61

> 
> As this is a RFC I skipped the device tree binding doc.

Too bad, that's probably what needs to be discussed here, because the
driver looks pretty reasonable otherwise.

> 
>  drivers/net/phy/Kconfig  |   5 ++
>  drivers/net/phy/Makefile |   1 +
>  drivers/net/phy/lantiq.c | 167 +++
>  3 files changed, 173 insertions(+)
>  create mode 100644 drivers/net/phy/lantiq.c
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index 3e28f7a..c004885 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -119,6 +119,11 @@ config STE10XP
>   ---help---
> This is the driver for the STe100p and STe101p PHYs.
>  
> +config LANTIQ_PHY
> + tristate "Driver for Lantiq PHYs"
> + ---help---
> +   Supports the PEF7071 PHYs.
> +
>  config LSI_ET1011C_PHY
>   tristate "Driver for LSI ET1011C PHY"
>   ---help---
> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
> index 8ad4ac6..e886549 100644
> --- a/drivers/net/phy/Makefile
> +++ b/drivers/net/phy/Makefile
> @@ -38,3 +38,4 @@ obj-$(CONFIG_MDIO_SUN4I)+= mdio-sun4i.o
>  obj-$(CONFIG_MDIO_MOXART)+= mdio-moxart.o
>  obj-$(CONFIG_AMD_XGBE_PHY)   += amd-xgbe-phy.o
>  obj-$(CONFIG_MDIO_BCM_UNIMAC)+= mdio-bcm-unimac.o
> +obj-$(CONFIG_LANTIQ_PHY) += lantiq.o
> diff --git a/drivers/net/phy/lantiq.c b/drivers/net/phy/lantiq.c
> new file mode 100644
> index 000..876a7d1
> --- /dev/null
> +++ b/drivers/net/phy/lantiq.c
> @@ -0,0 +1,167 @@
> +/*
> + * Driver for Lantiq PHYs
> + *
> + * Author: Alexander Stein 
> + *
> + * Copyright (c) 2015-2016 SYS TEC electronic GmbH
> + *
> + * This program is free software; you can redistribute  it and/or modify it
> + * under  the terms of  the GNU General  Public License as published by the
> + * Free Software Foundation;  either version 2 of the  License, or (at your
> + * option) any later version.
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define PHY_ID_PEF7071   0xd565a401
> +
> +#define MII_LANTIQ_MMD_CTRL_REG  0x0d
> +#define MII_LANTIQ_MMD_REGDATA_REG   0x0e
> +#define OP_DATA  1
> +
> +struct lantiqphy_led_ctrl {
> + const char *property;
> + u32 regnum;
> +};
> +
> +static int lantiq_extended_write(struct phy_device *phydev,
> +  u8 mode, u32 dev_addr, u32 regnum, u16 val)
> +{
> + phy_write(phydev, MII_LANTIQ_MMD_CTRL_REG, dev_addr);
> + phy_write(phydev, MII_LANTIQ_MMD_REGDATA_REG, regnum);
> + phy_write(phydev, MII_LANTIQ_MMD_CTRL_REG, (mode << 14) | dev_addr);
> + return phy_write(phydev, MII_LANTIQ_MMD_REGDATA_REG, val);
> +}
> +
> +static int lantiq_of_load_led_config(struct phy_device *phydev,
> +  struct device_node *of_node,
> +  const struct lantiqphy_led_ctrl *leds,
> +  u8 entries)
> +{
> + u16 val;
> + int i;
> + int ret = 0;
> +
> + for (i = 0; i < entries; i++) {
> + if (!of_property_read_u16(of_node, leds[i].property, &val)) {
> + ret = lantiq_extended_write(phydev, OP_DATA, 0x1f,
> + leds[i].regnum, val);
> + if (ret) {
> + dev_err(&phydev->dev, "Error writing register 0x1f.%04x (%d)\n",
> + leds[i].regnum, ret);
> + break;
> + }
> + }
> + }
> +
> + return ret;
> +}
> +
> +static const struct lantiqphy_led_ctrl leds[] = {
> + {
> + .property = "led0h",
> + .regnum = 0x01e2,
> + },
> + 
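
Since the RFC skips the binding document, a hypothetical sketch of the node it
implies is shown below, based only on what the driver reads: standard PHY
properties plus 16-bit LED properties such as led0h. The property values and
the /bits/ 16 encoding here are illustrative, not an agreed binding.

```dts
ethernet-phy@0 {
	reg = <0>;
	max-speed = <100>;
	/* hypothetical value for LED0H (register 0x1f.0x01e2) */
	led0h = /bits/ 16 <0x0070>;
};
```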

[PATCH] Revert "phy: add support for a reset-gpio specification"

2016-05-18 Thread Fabio Estevam
Commit da47b4572056 ("phy: add support for a reset-gpio specification")
causes the following xtensa qemu crash according to Guenter Roeck:

[9.366256] libphy: ethoc-mdio: probed
[9.367389]  (null): could not attach to PHY
[9.368555]  (null): failed to probe MDIO bus
[9.371540] Unable to handle kernel paging request at virtual address 001c
[9.371540]  pc = d0320926, ra = 903209d1
[9.375358] Oops: sig: 11 [#1]

This reverts commit da47b4572056487fd7941c26f73b3e8815ff712a.

Reported-by: Guenter Roeck 
Signed-off-by: Fabio Estevam 
---
 Documentation/devicetree/bindings/net/phy.txt | 3 ---
 drivers/net/phy/phy_device.c  | 8 
 2 files changed, 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index c00a9a8..bc1c3c8 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -35,8 +35,6 @@ Optional Properties:
 - broken-turn-around: If set, indicates the PHY device does not correctly
   release the turn around line low at the end of a MDIO transaction.
 
-- reset-gpios: Reference to a GPIO used to reset the phy.
-
 Example:
 
 ethernet-phy@0 {
@@ -44,5 +42,4 @@ ethernet-phy@0 {
interrupt-parent = <4>;
interrupts = <35 1>;
reg = <0>;
-   reset-gpios = < 17 GPIO_ACTIVE_LOW>;
 };
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 307f72a..e977ba9 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -34,7 +34,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 
@@ -1571,16 +1570,9 @@ static int phy_probe(struct device *dev)
struct device_driver *drv = phydev->mdio.dev.driver;
struct phy_driver *phydrv = to_phy_driver(drv);
int err = 0;
-   struct gpio_descs *reset_gpios;
 
phydev->drv = phydrv;
 
-   /* take phy out of reset */
-   reset_gpios = devm_gpiod_get_array_optional(dev, "reset",
-   GPIOD_OUT_LOW);
-   if (IS_ERR(reset_gpios))
-   return PTR_ERR(reset_gpios);
-
/* Disable the interrupt if the PHY doesn't support it
 * but the interrupt is still a valid one
 */
-- 
1.9.1



[PATCH v7 net-next 11/16] ip6_gre: Add support for fou/gue encapsulation

2016-05-18 Thread Tom Herbert
Add netlink and setup for encapsulation

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_gre.c | 79 +++---
 1 file changed, 75 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4541fa5..6fb1b89 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -729,7 +729,7 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu)
 
t->tun_hlen = gre_calc_hlen(t->parms.o_flags);
 
-   t->hlen = t->tun_hlen;
+   t->hlen = t->encap_hlen + t->tun_hlen;
 
t_hlen = t->hlen + sizeof(struct ipv6hdr);
 
@@ -1022,9 +1022,7 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
}
 
tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
-
-   tunnel->hlen = tunnel->tun_hlen;
-
+   tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
 
dev->hard_header_len = LL_MAX_HEADER + t_hlen;
@@ -1290,15 +1288,57 @@ static void ip6gre_tap_setup(struct net_device *dev)
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
 }
 
+static bool ip6gre_netlink_encap_parms(struct nlattr *data[],
+  struct ip_tunnel_encap *ipencap)
+{
+   bool ret = false;
+
+   memset(ipencap, 0, sizeof(*ipencap));
+
+   if (!data)
+   return ret;
+
+   if (data[IFLA_GRE_ENCAP_TYPE]) {
+   ret = true;
+   ipencap->type = nla_get_u16(data[IFLA_GRE_ENCAP_TYPE]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_FLAGS]) {
+   ret = true;
+   ipencap->flags = nla_get_u16(data[IFLA_GRE_ENCAP_FLAGS]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_SPORT]) {
+   ret = true;
+   ipencap->sport = nla_get_be16(data[IFLA_GRE_ENCAP_SPORT]);
+   }
+
+   if (data[IFLA_GRE_ENCAP_DPORT]) {
+   ret = true;
+   ipencap->dport = nla_get_be16(data[IFLA_GRE_ENCAP_DPORT]);
+   }
+
+   return ret;
+}
+
 static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
 {
struct ip6_tnl *nt;
struct net *net = dev_net(dev);
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
+   struct ip_tunnel_encap ipencap;
int err;
 
nt = netdev_priv(dev);
+
+   if (ip6gre_netlink_encap_parms(data, &ipencap)) {
+   int err = ip6_tnl_encap_setup(nt, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
+
ip6gre_netlink_parms(data, &nt->parms);
 
if (ip6gre_tunnel_find(net, &nt->parms, dev->type))
@@ -1345,10 +1385,18 @@ static int ip6gre_changelink(struct net_device *dev, struct nlattr *tb[],
struct net *net = nt->net;
struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
struct __ip6_tnl_parm p;
+   struct ip_tunnel_encap ipencap;
 
if (dev == ign->fb_tunnel_dev)
return -EINVAL;
 
+   if (ip6gre_netlink_encap_parms(data, &ipencap)) {
+   int err = ip6_tnl_encap_setup(nt, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
+
ip6gre_netlink_parms(data, &p);
 
t = ip6gre_tunnel_locate(net, &p, 0);
@@ -1400,6 +1448,14 @@ static size_t ip6gre_get_size(const struct net_device *dev)
nla_total_size(4) +
/* IFLA_GRE_FLAGS */
nla_total_size(4) +
+   /* IFLA_GRE_ENCAP_TYPE */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_FLAGS */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_SPORT */
+   nla_total_size(2) +
+   /* IFLA_GRE_ENCAP_DPORT */
+   nla_total_size(2) +
0;
 }
 
@@ -1422,6 +1478,17 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
nla_put_be32(skb, IFLA_GRE_FLOWINFO, p->flowinfo) ||
nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags))
goto nla_put_failure;
+
+   if (nla_put_u16(skb, IFLA_GRE_ENCAP_TYPE,
+   t->encap.type) ||
+   nla_put_be16(skb, IFLA_GRE_ENCAP_SPORT,
+t->encap.sport) ||
+   nla_put_be16(skb, IFLA_GRE_ENCAP_DPORT,
+t->encap.dport) ||
+   nla_put_u16(skb, IFLA_GRE_ENCAP_FLAGS,
+   t->encap.flags))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -1440,6 +1507,10 @@ static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX + 1] = {
[IFLA_GRE_ENCAP_LIMIT] = { .type = NLA_U8 },
[IFLA_GRE_FLOWINFO]= { .type = NLA_U32 },
[IFLA_GRE_FLAGS]   = { .type = NLA_U32 },
+   [IFLA_GRE_ENCAP_TYPE]   = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_FLAGS]  = { .type = NLA_U16 },
+   [IFLA_GRE_ENCAP_SPORT]  = { .type = NLA_U16 },
+   

[PATCH v7 net-next 05/16] net: Cleanup encap items in ip_tunnels.h

2016-05-18 Thread Tom Herbert
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they can be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.

Signed-off-by: Tom Herbert 
---
 include/net/ip_tunnels.h  | 76 ---
 net/ipv4/ip_tunnel.c  | 45 
 net/ipv4/ip_tunnel_core.c |  4 +++
 3 files changed, 62 insertions(+), 63 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d916b43..dbf 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -171,22 +171,6 @@ struct ip_tunnel_net {
struct ip_tunnel __rcu *collect_md_tun;
 };
 
-struct ip_tunnel_encap_ops {
-   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
-   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
-   u8 *protocol, struct flowi4 *fl4);
-};
-
-#define MAX_IPTUN_ENCAP_OPS 8
-
-extern const struct ip_tunnel_encap_ops __rcu *
-   iptun_encaps[MAX_IPTUN_ENCAP_OPS];
-
-int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op,
-   unsigned int num);
-int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op,
-   unsigned int num);
-
 static inline void ip_tunnel_key_init(struct ip_tunnel_key *key,
  __be32 saddr, __be32 daddr,
  u8 tos, u8 ttl, __be32 label,
@@ -251,8 +235,6 @@ void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
const struct iphdr *tnl_params, const u8 protocol);
 int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd);
-int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
-   u8 *protocol, struct flowi4 *fl4);
 int __ip_tunnel_change_mtu(struct net_device *dev, int new_mtu, bool strict);
 int ip_tunnel_change_mtu(struct net_device *dev, int new_mtu);
 
@@ -271,9 +253,67 @@ int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[],
 int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
  struct ip_tunnel_parm *p);
 void ip_tunnel_setup(struct net_device *dev, int net_id);
+
+struct ip_tunnel_encap_ops {
+   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
+   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
+   u8 *protocol, struct flowi4 *fl4);
+};
+
+#define MAX_IPTUN_ENCAP_OPS 8
+
+extern const struct ip_tunnel_encap_ops __rcu *
+   iptun_encaps[MAX_IPTUN_ENCAP_OPS];
+
+int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op,
+   unsigned int num);
+int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op,
+   unsigned int num);
+
 int ip_tunnel_encap_setup(struct ip_tunnel *t,
  struct ip_tunnel_encap *ipencap);
 
+static inline int ip_encap_hlen(struct ip_tunnel_encap *e)
+{
+   const struct ip_tunnel_encap_ops *ops;
+   int hlen = -EINVAL;
+
+   if (e->type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (e->type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(iptun_encaps[e->type]);
+   if (likely(ops && ops->encap_hlen))
+   hlen = ops->encap_hlen(e);
+   rcu_read_unlock();
+
+   return hlen;
+}
+
+static inline int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
+ u8 *protocol, struct flowi4 *fl4)
+{
+   const struct ip_tunnel_encap_ops *ops;
+   int ret = -EINVAL;
+
+   if (t->encap.type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (t->encap.type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(iptun_encaps[t->encap.type]);
+   if (likely(ops && ops->build_header))
+   ret = ops->build_header(skb, &t->encap, protocol, fl4);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 /* Extract dsfield from inner protocol */
 static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph,
   const struct sk_buff *skb)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index a69ed94..d8f5e0a 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -443,29 +443,6 @@ drop:
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_rcv);
 
-static int ip_encap_hlen(struct ip_tunnel_encap *e)
-{
-   const struct ip_tunnel_encap_ops *ops;
-   int hlen = -EINVAL;
-
-   if (e->type == TUNNEL_ENCAP_NONE)
-   return 0;
-
-   if (e->type >= MAX_IPTUN_ENCAP_OPS)
-   return -EINVAL;
-
-   

[PATCH v7 net-next 10/16] fou: Add encap ops for IPv6 tunnels

2016-05-18 Thread Tom Herbert
This patch add a new fou6 module that provides encapsulation
operations for IPv6.

Signed-off-by: Tom Herbert 
---
 include/net/fou.h |   2 +-
 net/ipv6/Makefile |   1 +
 net/ipv6/fou6.c   | 140 ++
 3 files changed, 142 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv6/fou6.c

diff --git a/include/net/fou.h b/include/net/fou.h
index 7d2fda2..f5cc691 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -9,7 +9,7 @@
 #include 
 
 size_t fou_encap_hlen(struct ip_tunnel_encap *e);
-static size_t gue_encap_hlen(struct ip_tunnel_encap *e);
+size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
 int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
   u8 *protocol, __be16 *sport, int type);
diff --git a/net/ipv6/Makefile b/net/ipv6/Makefile
index 5e9d6bf..7ec3129 100644
--- a/net/ipv6/Makefile
+++ b/net/ipv6/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_IPV6_VTI) += ip6_vti.o
 obj-$(CONFIG_IPV6_SIT) += sit.o
 obj-$(CONFIG_IPV6_TUNNEL) += ip6_tunnel.o
 obj-$(CONFIG_IPV6_GRE) += ip6_gre.o
+obj-$(CONFIG_NET_FOU) += fou6.o
 
 obj-y += addrconf_core.o exthdrs_core.o ip6_checksum.o ip6_icmp.o
 obj-$(CONFIG_INET) += output_core.o protocol.o $(ipv6-offload)
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
new file mode 100644
index 000..c972d0b
--- /dev/null
+++ b/net/ipv6/fou6.c
@@ -0,0 +1,140 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void fou6_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  struct flowi6 *fl6, u8 *protocol, __be16 sport)
+{
+   struct udphdr *uh;
+
+   skb_push(skb, sizeof(struct udphdr));
+   skb_reset_transport_header(skb);
+
+   uh = udp_hdr(skb);
+
+   uh->dest = e->dport;
+   uh->source = sport;
+   uh->len = htons(skb->len);
+   udp6_set_csum(!(e->flags & TUNNEL_ENCAP_FLAG_CSUM6), skb,
+ &fl6->saddr, &fl6->daddr, skb->len);
+
+   *protocol = IPPROTO_UDP;
+}
+
+int fou6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+ u8 *protocol, struct flowi6 *fl6)
+{
+   __be16 sport;
+   int err;
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
+   SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+   err = __fou_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
+   fou6_build_udp(skb, e, fl6, protocol, sport);
+
+   return 0;
+}
+EXPORT_SYMBOL(fou6_build_header);
+
+int gue6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+ u8 *protocol, struct flowi6 *fl6)
+{
+   __be16 sport;
+   int err;
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ?
+   SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+   err = __gue_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
+   fou6_build_udp(skb, e, fl6, protocol, sport);
+
+   return 0;
+}
+EXPORT_SYMBOL(gue6_build_header);
+
+#ifdef CONFIG_NET_FOU_IP_TUNNELS
+
+static const struct ip6_tnl_encap_ops fou_ip6tun_ops = {
+   .encap_hlen = fou_encap_hlen,
+   .build_header = fou6_build_header,
+};
+
+static const struct ip6_tnl_encap_ops gue_ip6tun_ops = {
+   .encap_hlen = gue_encap_hlen,
+   .build_header = gue6_build_header,
+};
+
+static int ip6_tnl_encap_add_fou_ops(void)
+{
+   int ret;
+
+   ret = ip6_tnl_encap_add_ops(&fou_ip6tun_ops, TUNNEL_ENCAP_FOU);
+   if (ret < 0) {
+   pr_err("can't add fou6 ops\n");
+   return ret;
+   }
+
+   ret = ip6_tnl_encap_add_ops(&gue_ip6tun_ops, TUNNEL_ENCAP_GUE);
+   if (ret < 0) {
+   pr_err("can't add gue6 ops\n");
+   ip6_tnl_encap_del_ops(&fou_ip6tun_ops, TUNNEL_ENCAP_FOU);
+   return ret;
+   }
+
+   return 0;
+}
+
+static void ip6_tnl_encap_del_fou_ops(void)
+{
+   ip6_tnl_encap_del_ops(&fou_ip6tun_ops, TUNNEL_ENCAP_FOU);
+   ip6_tnl_encap_del_ops(&gue_ip6tun_ops, TUNNEL_ENCAP_GUE);
+}
+
+#else
+
+static int ip6_tnl_encap_add_fou_ops(void)
+{
+   return 0;
+}
+
+static void ip6_tnl_encap_del_fou_ops(void)
+{
+}
+
+#endif
+
+static int __init fou6_init(void)
+{
+   int ret;
+
+   ret = ip6_tnl_encap_add_fou_ops();
+
+   return ret;
+}
+
+static void __exit fou6_fini(void)
+{
+   ip6_tnl_encap_del_fou_ops();
+}
+
+module_init(fou6_init);
+module_exit(fou6_fini);
+MODULE_AUTHOR("Tom Herbert ");
+MODULE_LICENSE("GPL");
-- 
2.8.0.rc2
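
fou6_build_udp() above is a plain header fill: push a UDP header, set the
ports, and store the header-plus-payload length in network byte order. A
hypothetical userspace sketch of just that fill (checksum handling, which the
real code does via udp6_set_csum(), is omitted, and build_udp is an invented
name):

```c
#include <assert.h>
#include <arpa/inet.h>	/* htons()/ntohs() */
#include <stddef.h>
#include <stdint.h>

/* Fixed 8-byte UDP header layout. */
struct udp_hdr {
	uint16_t source;
	uint16_t dest;
	uint16_t len;
	uint16_t check;
};

/* Ports arrive already in network byte order, as in the patch. */
static void build_udp(struct udp_hdr *uh, uint16_t sport_net,
		      uint16_t dport_net, size_t payload_len)
{
	uh->source = sport_net;
	uh->dest = dport_net;
	/* like uh->len = htons(skb->len): header + payload */
	uh->len = htons((uint16_t)(sizeof(*uh) + payload_len));
	uh->check = 0;	/* real code fills or disables the checksum */
}
```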



[PATCH v7 net-next 03/16] ipv6: Fix nexthdr for reinjection

2016-05-18 Thread Tom Herbert
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).

This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_input.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f185cbc..d35dff2 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -236,6 +236,7 @@ resubmit:
nhoff = IP6CB(skb)->nhoff;
nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit_final:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -263,10 +264,21 @@ resubmit:
goto discard;
 
ret = ipprot->handler(skb);
-   if (ret > 0)
-   goto resubmit;
-   else if (ret == 0)
+   if (ret > 0) {
+   if (ipprot->flags & INET6_PROTO_FINAL) {
+   /* Not an extension header, most likely UDP
+* encapsulation. Use return value as nexthdr
+* protocol not nhoff (which presumably is
+* not set by handler).
+*/
+   nexthdr = ret;
+   goto resubmit_final;
+   } else {
+   goto resubmit;
+   }
+   } else if (ret == 0) {
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDELIVERS);
+   }
} else {
if (!raw) {
if (xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-- 
2.8.0.rc2
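
The control flow of the patch above can be modeled compactly: a handler
returning 0 means the packet was delivered; a positive return normally
triggers a re-read of nexthdr from the packet, but for a protocol flagged
FINAL the return value itself is the next protocol number. A hypothetical
sketch with invented protocol numbers and handler names:

```c
#include <assert.h>

#define PROTO_FINAL 0x1
#define NPROTO 8

struct proto {
	int flags;
	int (*handler)(void);
};

static int encap_handler(void) { return 3; }	/* FINAL: 3 is next proto */
static int final_handler(void) { return 0; }	/* packet delivered */

static struct proto protos[NPROTO];

/* returns the protocol number that finally delivered the packet */
static int deliver(const unsigned char *pkt, int nhoff)
{
	int nexthdr = pkt[nhoff];

	for (;;) {
		const struct proto *p = &protos[nexthdr];
		int ret = p->handler();

		if (ret == 0)
			return nexthdr;
		if (p->flags & PROTO_FINAL)
			nexthdr = ret;		/* resubmit_final path */
		else
			nexthdr = pkt[nhoff];	/* plain resubmit path */
	}
}
```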



[PATCH v7 net-next 06/16] fou: Call setup_udp_tunnel_sock

2016-05-18 Thread Tom Herbert
Use helper function to set up UDP tunnel related information for a fou
socket.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 50 --
 1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index eeec7d6..6cbc725 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -448,31 +448,13 @@ static void fou_release(struct fou *fou)
kfree_rcu(fou, rcu);
 }
 
-static int fou_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-   udp_sk(sk)->encap_rcv = fou_udp_recv;
-   udp_sk(sk)->gro_receive = fou_gro_receive;
-   udp_sk(sk)->gro_complete = fou_gro_complete;
-   fou_from_sock(sk)->protocol = cfg->protocol;
-
-   return 0;
-}
-
-static int gue_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-   udp_sk(sk)->encap_rcv = gue_udp_recv;
-   udp_sk(sk)->gro_receive = gue_gro_receive;
-   udp_sk(sk)->gro_complete = gue_gro_complete;
-
-   return 0;
-}
-
 static int fou_create(struct net *net, struct fou_cfg *cfg,
  struct socket **sockp)
 {
struct socket *sock = NULL;
struct fou *fou = NULL;
struct sock *sk;
+   struct udp_tunnel_sock_cfg tunnel_cfg;
int err;
 
/* Open UDP socket */
@@ -491,33 +473,33 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
fou->flags = cfg->flags;
fou->port = cfg->udp_config.local_udp_port;
+   fou->type = cfg->type;
+   fou->sock = sock;
+
+   memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
+   tunnel_cfg.encap_type = 1;
+   tunnel_cfg.sk_user_data = fou;
+   tunnel_cfg.encap_destroy = NULL;
 
/* Initial for fou type */
switch (cfg->type) {
case FOU_ENCAP_DIRECT:
-   err = fou_encap_init(sk, fou, cfg);
-   if (err)
-   goto error;
+   tunnel_cfg.encap_rcv = fou_udp_recv;
+   tunnel_cfg.gro_receive = fou_gro_receive;
+   tunnel_cfg.gro_complete = fou_gro_complete;
+   fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
-   err = gue_encap_init(sk, fou, cfg);
-   if (err)
-   goto error;
+   tunnel_cfg.encap_rcv = gue_udp_recv;
+   tunnel_cfg.gro_receive = gue_gro_receive;
+   tunnel_cfg.gro_complete = gue_gro_complete;
break;
default:
err = -EINVAL;
goto error;
}
 
-   fou->type = cfg->type;
-
-   udp_sk(sk)->encap_type = 1;
-   udp_encap_enable();
-
-   sk->sk_user_data = fou;
-   fou->sock = sock;
-
-   inet_inc_convert_csum(sk);
+   setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 
sk->sk_allocation = GFP_ATOMIC;
 
-- 
2.8.0.rc2
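
The refactor above replaces two near-identical *_encap_init() helpers with one
zeroed config struct that is filled per type and then applied by a single
setup call. A hypothetical userspace model of that shape (make_cfg and the
callback names are invented, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ENCAP_DIRECT, ENCAP_GUE };

struct tunnel_cfg {
	int encap_type;
	int (*encap_rcv)(void);
	void *user_data;
};

static int fou_recv(void) { return 1; }
static int gue_recv(void) { return 2; }

static int make_cfg(int type, void *user_data, struct tunnel_cfg *cfg)
{
	memset(cfg, 0, sizeof(*cfg));	/* as in the patch's memset() */
	cfg->encap_type = 1;
	cfg->user_data = user_data;

	switch (type) {
	case ENCAP_DIRECT:
		cfg->encap_rcv = fou_recv;
		break;
	case ENCAP_GUE:
		cfg->encap_rcv = gue_recv;
		break;
	default:
		return -1;		/* the -EINVAL path above */
	}
	return 0;
}
```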



[PATCH v7 net-next 09/16] ip6_tun: Add infrastructure for doing encapsulation

2016-05-18 Thread Tom Herbert
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting the encap hlen, setting up encap on a tunnel, and performing
the encapsulation operation.

Signed-off-by: Tom Herbert 
---
 include/net/ip6_tunnel.h  | 58 +
 net/ipv4/ip_tunnel_core.c |  5 +++
 net/ipv6/ip6_tunnel.c | 94 ---
 3 files changed, 144 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index fb9e015..d325c81 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -52,10 +52,68 @@ struct ip6_tnl {
__u32 o_seqno;  /* The last output seqno */
int hlen;   /* tun_hlen + encap_hlen */
int tun_hlen;   /* Precalculated header length */
+   int encap_hlen; /* Encap header length (FOU,GUE) */
+   struct ip_tunnel_encap encap;
int mlink;
+};
 
+struct ip6_tnl_encap_ops {
+   size_t (*encap_hlen)(struct ip_tunnel_encap *e);
+   int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
+   u8 *protocol, struct flowi6 *fl6);
 };
 
+extern const struct ip6_tnl_encap_ops __rcu *
+   ip6tun_encaps[MAX_IPTUN_ENCAP_OPS];
+
+int ip6_tnl_encap_add_ops(const struct ip6_tnl_encap_ops *ops,
+ unsigned int num);
+int ip6_tnl_encap_del_ops(const struct ip6_tnl_encap_ops *ops,
+ unsigned int num);
+int ip6_tnl_encap_setup(struct ip6_tnl *t,
+   struct ip_tunnel_encap *ipencap);
+
+static inline int ip6_encap_hlen(struct ip_tunnel_encap *e)
+{
+   const struct ip6_tnl_encap_ops *ops;
+   int hlen = -EINVAL;
+
+   if (e->type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (e->type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(ip6tun_encaps[e->type]);
+   if (likely(ops && ops->encap_hlen))
+   hlen = ops->encap_hlen(e);
+   rcu_read_unlock();
+
+   return hlen;
+}
+
+static inline int ip6_tnl_encap(struct sk_buff *skb, struct ip6_tnl *t,
+   u8 *protocol, struct flowi6 *fl6)
+{
+   const struct ip6_tnl_encap_ops *ops;
+   int ret = -EINVAL;
+
+   if (t->encap.type == TUNNEL_ENCAP_NONE)
+   return 0;
+
+   if (t->encap.type >= MAX_IPTUN_ENCAP_OPS)
+   return -EINVAL;
+
+   rcu_read_lock();
+   ops = rcu_dereference(ip6tun_encaps[t->encap.type]);
+   if (likely(ops && ops->build_header))
+   ret = ops->build_header(skb, &t->encap, protocol, fl6);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 /* Tunnel encapsulation limit destination sub-option */
 
 struct ipv6_tlv_tnl_enc_lim {
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index cc66a20..afd6b59 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,10 @@ const struct ip_tunnel_encap_ops __rcu *
iptun_encaps[MAX_IPTUN_ENCAP_OPS] __read_mostly;
 EXPORT_SYMBOL(iptun_encaps);
 
+const struct ip6_tnl_encap_ops __rcu *
+   ip6tun_encaps[MAX_IPTUN_ENCAP_OPS] __read_mostly;
+EXPORT_SYMBOL(ip6tun_encaps);
+
 void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
   __be32 src, __be32 dst, __u8 proto,
   __u8 tos, __u8 ttl, __be16 df, bool xnet)
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index e79330f..64ddbeac 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1010,7 +1010,8 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
struct dst_entry *dst = NULL, *ndst = NULL;
struct net_device *tdev;
int mtu;
-   unsigned int max_headroom = sizeof(struct ipv6hdr);
+   unsigned int psh_hlen = sizeof(struct ipv6hdr) + t->encap_hlen;
+   unsigned int max_headroom = psh_hlen;
int err = -1;
 
/* NBMA tunnel */
@@ -1063,7 +1064,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
 t->parms.name);
goto tx_err_dst_release;
}
-   mtu = dst_mtu(dst) - sizeof(*ipv6h);
+   mtu = dst_mtu(dst) - psh_hlen;
if (encap_limit >= 0) {
max_headroom += 8;
mtu -= 8;
@@ -1124,11 +1125,18 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
skb->encapsulation = 1;
}
 
+   /* Calculate max headroom for all the headers and adjust
+* needed_headroom if necessary.
+*/
max_headroom = LL_RESERVED_SPACE(dst->dev) + sizeof(struct ipv6hdr)
-   + dst->header_len;
+   + dst->header_len + t->hlen;
if (max_headroom > 

[PATCH v7 net-next 08/16] fou: Support IPv6 in fou

2016-05-18 Thread Tom Herbert
This patch adds receive path support for IPv6 with fou.

- Add address family to fou structure for open sockets. This supports
  AF_INET and AF_INET6. Lookups for fou ports are performed on both the
  port number and family.
- In fou and gue receive, adjust tot_len in the IPv4 header or
  payload_len in the IPv6 header based on address family.
- Allow AF_INET6 in FOU_ATTR_AF netlink attribute.

Signed-off-by: Tom Herbert 
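To illustrate the length fixup (illustrative Python, not part of the patch):
IPv4's tot_len includes the IP header, while IPv6's payload_len excludes the
fixed 40-byte header, but in both cases only the pulled encapsulation bytes
are subtracted:

```python
import struct

UDP_HLEN = 8  # bytes pulled for plain FOU (no GUE header)

def adjust_ipv4_tot_len(iphdr: bytearray, pulled: int) -> None:
    # IPv4 tot_len (bytes 2..3, network order) keeps counting the IP
    # header itself; only the pulled bytes are subtracted.
    tot_len, = struct.unpack_from("!H", iphdr, 2)
    struct.pack_into("!H", iphdr, 2, tot_len - pulled)

def adjust_ipv6_payload_len(ip6hdr: bytearray, pulled: int) -> None:
    # IPv6 payload_len (bytes 4..5) excludes the fixed header, but the
    # adjustment is the same: drop the pulled bytes.
    payload_len, = struct.unpack_from("!H", ip6hdr, 4)
    struct.pack_into("!H", ip6hdr, 4, payload_len - pulled)

# toy headers: IPv4 tot_len = 60, IPv6 payload_len = 40
v4 = bytearray(20); struct.pack_into("!H", v4, 2, 60)
v6 = bytearray(40); struct.pack_into("!H", v6, 4, 40)
adjust_ipv4_tot_len(v4, UDP_HLEN)
adjust_ipv6_payload_len(v6, UDP_HLEN)
print(struct.unpack_from("!H", v4, 2)[0], struct.unpack_from("!H", v6, 4)[0])  # 52 32
```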
---
 net/ipv4/fou.c | 47 +++
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index f4f2ddd..5f9207c 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -21,6 +21,7 @@ struct fou {
u8 protocol;
u8 flags;
__be16 port;
+   u8 family;
u16 type;
struct list_head list;
struct rcu_head rcu;
@@ -47,14 +48,17 @@ static inline struct fou *fou_from_sock(struct sock *sk)
return sk->sk_user_data;
 }
 
-static int fou_recv_pull(struct sk_buff *skb, size_t len)
+static int fou_recv_pull(struct sk_buff *skb, struct fou *fou, size_t len)
 {
-   struct iphdr *iph = ip_hdr(skb);
-
/* Remove 'len' bytes from the packet (UDP header and
 * FOU header if present).
 */
-   iph->tot_len = htons(ntohs(iph->tot_len) - len);
+   if (fou->family == AF_INET)
+   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   else
+   ipv6_hdr(skb)->payload_len =
+   htons(ntohs(ipv6_hdr(skb)->payload_len) - len);
+
__skb_pull(skb, len);
skb_postpull_rcsum(skb, udp_hdr(skb), len);
skb_reset_transport_header(skb);
@@ -68,7 +72,7 @@ static int fou_udp_recv(struct sock *sk, struct sk_buff *skb)
if (!fou)
return 1;
 
-   if (fou_recv_pull(skb, sizeof(struct udphdr)))
+   if (fou_recv_pull(skb, fou, sizeof(struct udphdr)))
goto drop;
 
return -fou->protocol;
@@ -141,7 +145,11 @@ static int gue_udp_recv(struct sock *sk, struct sk_buff 
*skb)
 
hdrlen = sizeof(struct guehdr) + optlen;
 
-   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   if (fou->family == AF_INET)
+   ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len);
+   else
+   ipv6_hdr(skb)->payload_len =
+   htons(ntohs(ipv6_hdr(skb)->payload_len) - len);
 
/* Pull csum through the guehdr now . This can be used if
 * there is a remote checksum offload.
@@ -426,7 +434,8 @@ static int fou_add_to_port_list(struct net *net, struct fou 
*fou)
 
mutex_lock(&fn->fou_lock);
list_for_each_entry(fout, &fn->fou_list, list) {
-   if (fou->port == fout->port) {
+   if (fou->port == fout->port &&
+   fou->family == fout->family) {
mutex_unlock(&fn->fou_lock);
return -EALREADY;
}
@@ -471,8 +480,9 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
sk = sock->sk;
 
-   fou->flags = cfg->flags;
fou->port = cfg->udp_config.local_udp_port;
+   fou->family = cfg->udp_config.family;
+   fou->flags = cfg->flags;
fou->type = cfg->type;
fou->sock = sock;
 
@@ -524,12 +534,13 @@ static int fou_destroy(struct net *net, struct fou_cfg 
*cfg)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
__be16 port = cfg->udp_config.local_udp_port;
+   u8 family = cfg->udp_config.family;
int err = -EINVAL;
struct fou *fou;
 
mutex_lock(&fn->fou_lock);
list_for_each_entry(fou, &fn->fou_list, list) {
-   if (fou->port == port) {
+   if (fou->port == port && fou->family == family) {
fou_release(fou);
err = 0;
break;
@@ -567,8 +578,15 @@ static int parse_nl_config(struct genl_info *info,
if (info->attrs[FOU_ATTR_AF]) {
u8 family = nla_get_u8(info->attrs[FOU_ATTR_AF]);
 
-   if (family != AF_INET)
-   return -EINVAL;
+   switch (family) {
+   case AF_INET:
+   break;
+   case AF_INET6:
+   cfg->udp_config.ipv6_v6only = 1;
+   break;
+   default:
+   return -EAFNOSUPPORT;
+   }
 
cfg->udp_config.family = family;
}
@@ -659,6 +677,7 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct 
genl_info *info)
struct fou_cfg cfg;
struct fou *fout;
__be16 port;
+   u8 family;
int ret;
 
ret = parse_nl_config(info, &cfg);
@@ -668,6 +687,10 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct 
genl_info *info)
if (port == 0)
return -EINVAL;
 
+   family = cfg.udp_config.family;
+   if (family != 

[PATCH v7 net-next 01/16] gso: Remove arbitrary checks for unsupported GSO

2016-05-18 Thread Tom Herbert
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place, the stack should not allow setting it.

This is a code simplification, especially when adding new GSO types.

Signed-off-by: Tom Herbert 
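For illustration, the pattern being removed is an allow-list test of the
form below (Python sketch with made-up bit values; the real SKB_GSO_* bits
live in include/linux/skbuff.h). Any newly added GSO type fails it until
every copy of the list is updated:

```python
SKB_GSO_TCPV4 = 1 << 0
SKB_GSO_DODGY = 1 << 2
SKB_GSO_NEWTYPE = 1 << 20  # a newly added type, not in the old allow-mask

OLD_ALLOWED = SKB_GSO_TCPV4 | SKB_GSO_DODGY

def old_check(gso_type):
    # The removed pattern: reject anything outside the allow-list.
    return not (gso_type & ~OLD_ALLOWED)

# A stack-set combination with a new type trips the stale allow-list even
# though the stack itself constructed it; this motivates the removal.
print(old_check(SKB_GSO_TCPV4))                    # True
print(old_check(SKB_GSO_TCPV4 | SKB_GSO_NEWTYPE))  # False
```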
---
 net/ipv4/af_inet.c | 18 --
 net/ipv4/gre_offload.c | 14 --
 net/ipv4/tcp_offload.c | 19 ---
 net/ipv4/udp_offload.c | 10 --
 net/ipv6/ip6_offload.c | 18 --
 net/ipv6/udp_offload.c | 13 -
 net/mpls/mpls_gso.c| 11 +--
 7 files changed, 1 insertion(+), 102 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2e6e65f..7f08d45 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1205,24 +1205,6 @@ static struct sk_buff *inet_gso_segment(struct sk_buff 
*skb,
int ihl;
int id;
 
-   if (unlikely(skb_shinfo(skb)->gso_type &
-~(SKB_GSO_TCPV4 |
-  SKB_GSO_UDP |
-  SKB_GSO_DODGY |
-  SKB_GSO_TCP_ECN |
-  SKB_GSO_GRE |
-  SKB_GSO_GRE_CSUM |
-  SKB_GSO_IPIP |
-  SKB_GSO_SIT |
-  SKB_GSO_TCPV6 |
-  SKB_GSO_UDP_TUNNEL |
-  SKB_GSO_UDP_TUNNEL_CSUM |
-  SKB_GSO_TCP_FIXEDID |
-  SKB_GSO_TUNNEL_REMCSUM |
-  SKB_GSO_PARTIAL |
-  0)))
-   goto out;
-
skb_reset_network_header(skb);
nhoff = skb_network_header(skb) - skb_mac_header(skb);
if (unlikely(!pskb_may_pull(skb, sizeof(*iph
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index e88190a..ecd1e09 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -26,20 +26,6 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
int gre_offset, outer_hlen;
bool need_csum, ufo;
 
-   if (unlikely(skb_shinfo(skb)->gso_type &
-   ~(SKB_GSO_TCPV4 |
- SKB_GSO_TCPV6 |
- SKB_GSO_UDP |
- SKB_GSO_DODGY |
- SKB_GSO_TCP_ECN |
- SKB_GSO_TCP_FIXEDID |
- SKB_GSO_GRE |
- SKB_GSO_GRE_CSUM |
- SKB_GSO_IPIP |
- SKB_GSO_SIT |
- SKB_GSO_PARTIAL)))
-   goto out;
-
if (!skb->encapsulation)
goto out;
 
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 02737b6..5c59649 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -83,25 +83,6 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
-   int type = skb_shinfo(skb)->gso_type;
-
-   if (unlikely(type &
-~(SKB_GSO_TCPV4 |
-  SKB_GSO_DODGY |
-  SKB_GSO_TCP_ECN |
-  SKB_GSO_TCP_FIXEDID |
-  SKB_GSO_TCPV6 |
-  SKB_GSO_GRE |
-  SKB_GSO_GRE_CSUM |
-  SKB_GSO_IPIP |
-  SKB_GSO_SIT |
-  SKB_GSO_UDP_TUNNEL |
-  SKB_GSO_UDP_TUNNEL_CSUM |
-  SKB_GSO_TUNNEL_REMCSUM |
-  0) ||
-!(type & (SKB_GSO_TCPV4 |
-  SKB_GSO_TCPV6
-   goto out;
 
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 6b7459c..81f253b 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -209,16 +209,6 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
 
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
/* Packet is from an untrusted source, reset gso_segs. */
-   int type = skb_shinfo(skb)->gso_type;
-
-   if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY |
- SKB_GSO_UDP_TUNNEL |
- SKB_GSO_UDP_TUNNEL_CSUM |
- SKB_GSO_TUNNEL_REMCSUM |
- SKB_GSO_IPIP |
- 

[PATCH v7 net-next 16/16] ipv6: Don't reset inner headers in ip6_tnl_xmit

2016-05-18 Thread Tom Herbert
Since iptunnel_handle_offloads() is called in all paths, we can
probably drop the block in ip6_tnl_xmit that was checking for
skb->encapsulation and resetting the inner headers.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_tunnel.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 823dad1..7b0481e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1120,11 +1120,6 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
ipv6_push_nfrag_opts(skb, &opt.ops, &proto, NULL);
}
 
-   if (likely(!skb->encapsulation)) {
-   skb_reset_inner_headers(skb);
-   skb->encapsulation = 1;
-   }
-
/* Calculate max headroom for all the headers and adjust
 * needed_headroom if necessary.
 */
-- 
2.8.0.rc2



[PATCH v7 net-next 15/16] ip4ip6: Support for GSO/GRO

2016-05-18 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 include/net/inet_common.h |  5 +
 net/ipv4/af_inet.c| 12 +++-
 net/ipv6/ip6_offload.c| 33 -
 net/ipv6/ip6_tunnel.c |  5 +
 4 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 109e3ee..5d68342 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -39,6 +39,11 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short 
family,
 int inet_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
 
+struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb);
+int inet_gro_complete(struct sk_buff *skb, int nhoff);
+struct sk_buff *inet_gso_segment(struct sk_buff *skb,
+netdev_features_t features);
+
 static inline void inet_ctl_sock_destroy(struct sock *sk)
 {
if (sk)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 25040b1..377424e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1192,8 +1192,8 @@ int inet_sk_rebuild_header(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_sk_rebuild_header);
 
-static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
-   netdev_features_t features)
+struct sk_buff *inet_gso_segment(struct sk_buff *skb,
+netdev_features_t features)
 {
bool udpfrag = false, fixedid = false, encap;
struct sk_buff *segs = ERR_PTR(-EINVAL);
@@ -1280,9 +1280,9 @@ static struct sk_buff *inet_gso_segment(struct sk_buff 
*skb,
 out:
return segs;
 }
+EXPORT_SYMBOL(inet_gso_segment);
 
-static struct sk_buff **inet_gro_receive(struct sk_buff **head,
-struct sk_buff *skb)
+struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 {
const struct net_offload *ops;
struct sk_buff **pp = NULL;
@@ -1398,6 +1398,7 @@ out:
 
return pp;
 }
+EXPORT_SYMBOL(inet_gro_receive);
 
 static struct sk_buff **ipip_gro_receive(struct sk_buff **head,
 struct sk_buff *skb)
@@ -1449,7 +1450,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, 
int len, int *addr_len)
return -EINVAL;
 }
 
-static int inet_gro_complete(struct sk_buff *skb, int nhoff)
+int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
@@ -1479,6 +1480,7 @@ out_unlock:
 
return err;
 }
+EXPORT_SYMBOL(inet_gro_complete);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 332d6a0..22e90e5 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 
 #include "ip6_offload.h"
 
@@ -268,6 +269,21 @@ static struct sk_buff **sit_ip6ip6_gro_receive(struct 
sk_buff **head,
return ipv6_gro_receive(head, skb);
 }
 
+static struct sk_buff **ip4ip6_gro_receive(struct sk_buff **head,
+  struct sk_buff *skb)
+{
+   /* GRO receive for IPv4 encapsulated in IPv6 */
+
+   if (NAPI_GRO_CB(skb)->encap_mark) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   NAPI_GRO_CB(skb)->encap_mark = 1;
+
+   return inet_gro_receive(head, skb);
+}
+
 static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 {
const struct net_offload *ops;
@@ -307,6 +323,13 @@ static int ip6ip6_gro_complete(struct sk_buff *skb, int 
nhoff)
return ipv6_gro_complete(skb, nhoff);
 }
 
+static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   skb->encapsulation = 1;
+   skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6;
+   return inet_gro_complete(skb, nhoff);
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
@@ -324,6 +347,14 @@ static const struct net_offload sit_offload = {
},
 };
 
+static const struct net_offload ip4ip6_offload = {
+   .callbacks = {
+   .gso_segment= inet_gso_segment,
+   .gro_receive= ip4ip6_gro_receive,
+   .gro_complete   = ip4ip6_gro_complete,
+   },
+};
+
 static const struct net_offload ip6ip6_offload = {
.callbacks = {
.gso_segment= ipv6_gso_segment,
@@ -331,7 +362,6 @@ static const struct net_offload ip6ip6_offload = {
.gro_complete   = ip6ip6_gro_complete,
},
 };
-
 static int __init ipv6_offload_init(void)
 {
 
@@ -344,6 +374,7 @@ static int __init ipv6_offload_init(void)
 
inet_add_offload(&sit_offload, IPPROTO_IPV6);
inet6_add_offload(&ip6ip6_offload, IPPROTO_IPV6);
+   inet6_add_offload(&ip4ip6_offload, IPPROTO_IPIP);
 
return 

[PATCH v7 net-next 12/16] ip6_tunnel: Add support for fou/gue encapsulation

2016-05-18 Thread Tom Herbert
Add netlink attributes and setup for fou/gue encapsulation on ip6 tunnels.

Signed-off-by: Tom Herbert 
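For reference (not part of the change), a configuration along these lines
should exercise the new attributes once iproute2 understands them; the
addresses and port number below are examples only:

```shell
# Open a GUE receive port on an IPv6 UDP socket (example port).
ip -6 fou add port 5555 gue

# ip6ip6 tunnel carried inside GUE/UDP (example addresses).
ip link add name tun1 type ip6tnl mode ip6ip6 \
    local 2001:db8::1 remote 2001:db8::2 \
    encap gue encap-sport auto encap-dport 5555
```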
---
 net/ipv6/ip6_tunnel.c | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 64ddbeac..74b35e4 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1797,13 +1797,55 @@ static void ip6_tnl_netlink_parms(struct nlattr *data[],
parms->proto = nla_get_u8(data[IFLA_IPTUN_PROTO]);
 }
 
+static bool ip6_tnl_netlink_encap_parms(struct nlattr *data[],
+   struct ip_tunnel_encap *ipencap)
+{
+   bool ret = false;
+
+   memset(ipencap, 0, sizeof(*ipencap));
+
+   if (!data)
+   return ret;
+
+   if (data[IFLA_IPTUN_ENCAP_TYPE]) {
+   ret = true;
+   ipencap->type = nla_get_u16(data[IFLA_IPTUN_ENCAP_TYPE]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_FLAGS]) {
+   ret = true;
+   ipencap->flags = nla_get_u16(data[IFLA_IPTUN_ENCAP_FLAGS]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_SPORT]) {
+   ret = true;
+   ipencap->sport = nla_get_be16(data[IFLA_IPTUN_ENCAP_SPORT]);
+   }
+
+   if (data[IFLA_IPTUN_ENCAP_DPORT]) {
+   ret = true;
+   ipencap->dport = nla_get_be16(data[IFLA_IPTUN_ENCAP_DPORT]);
+   }
+
+   return ret;
+}
+
 static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev,
   struct nlattr *tb[], struct nlattr *data[])
 {
struct net *net = dev_net(dev);
struct ip6_tnl *nt, *t;
+   struct ip_tunnel_encap ipencap;
 
nt = netdev_priv(dev);
+
+   if (ip6_tnl_netlink_encap_parms(data, )) {
+   int err = ip6_tnl_encap_setup(nt, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
+
ip6_tnl_netlink_parms(data, &nt->parms);
 
t = ip6_tnl_locate(net, &nt->parms, 0);
@@ -1820,10 +1862,17 @@ static int ip6_tnl_changelink(struct net_device *dev, 
struct nlattr *tb[],
struct __ip6_tnl_parm p;
struct net *net = t->net;
struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
+   struct ip_tunnel_encap ipencap;
 
if (dev == ip6n->fb_tnl_dev)
return -EINVAL;
 
+   if (ip6_tnl_netlink_encap_parms(data, )) {
+   int err = ip6_tnl_encap_setup(t, &ipencap);
+
+   if (err < 0)
+   return err;
+   }
ip6_tnl_netlink_parms(data, &p);
 
t = ip6_tnl_locate(net, &p, 0);
@@ -1864,6 +1913,14 @@ static size_t ip6_tnl_get_size(const struct net_device 
*dev)
nla_total_size(4) +
/* IFLA_IPTUN_PROTO */
nla_total_size(1) +
+   /* IFLA_IPTUN_ENCAP_TYPE */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_FLAGS */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_SPORT */
+   nla_total_size(2) +
+   /* IFLA_IPTUN_ENCAP_DPORT */
+   nla_total_size(2) +
0;
 }
 
@@ -1881,6 +1938,17 @@ static int ip6_tnl_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_u32(skb, IFLA_IPTUN_FLAGS, parm->flags) ||
nla_put_u8(skb, IFLA_IPTUN_PROTO, parm->proto))
goto nla_put_failure;
+
+   if (nla_put_u16(skb, IFLA_IPTUN_ENCAP_TYPE,
+   tunnel->encap.type) ||
+   nla_put_be16(skb, IFLA_IPTUN_ENCAP_SPORT,
+tunnel->encap.sport) ||
+   nla_put_be16(skb, IFLA_IPTUN_ENCAP_DPORT,
+tunnel->encap.dport) ||
+   nla_put_u16(skb, IFLA_IPTUN_ENCAP_FLAGS,
+   tunnel->encap.flags))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -1904,6 +1972,10 @@ static const struct nla_policy 
ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
[IFLA_IPTUN_FLOWINFO]   = { .type = NLA_U32 },
[IFLA_IPTUN_FLAGS]  = { .type = NLA_U32 },
[IFLA_IPTUN_PROTO]  = { .type = NLA_U8 },
+   [IFLA_IPTUN_ENCAP_TYPE] = { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_FLAGS]= { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_SPORT]= { .type = NLA_U16 },
+   [IFLA_IPTUN_ENCAP_DPORT]= { .type = NLA_U16 },
 };
 
 static struct rtnl_link_ops ip6_link_ops __read_mostly = {
-- 
2.8.0.rc2



[PATCH v7 net-next 13/16] ipv6: Set features for IPv6 tunnels

2016-05-18 Thread Tom Herbert
We need to set the dev features; use the same values that are used in GREv6.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_tunnel.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 74b35e4..cabf492 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1640,6 +1640,11 @@ static const struct net_device_ops ip6_tnl_netdev_ops = {
.ndo_get_iflink = ip6_tnl_get_iflink,
 };
 
+#define IPXIPX_FEATURES (NETIF_F_SG |  \
+NETIF_F_FRAGLIST | \
+NETIF_F_HIGHDMA |  \
+NETIF_F_GSO_SOFTWARE | \
+NETIF_F_HW_CSUM)
 
 /**
  * ip6_tnl_dev_setup - setup virtual tunnel device
@@ -1659,6 +1664,10 @@ static void ip6_tnl_dev_setup(struct net_device *dev)
dev->addr_len = sizeof(struct in6_addr);
dev->features |= NETIF_F_LLTX;
netif_keep_dst(dev);
+
+   dev->features   |= IPXIPX_FEATURES;
+   dev->hw_features|= IPXIPX_FEATURES;
+
/* This perm addr will be used as interface identifier by IPv6 */
dev->addr_assign_type = NET_ADDR_RANDOM;
eth_random_addr(dev->perm_addr);
-- 
2.8.0.rc2



[PATCH v7 net-next 02/16] net: define gso types for IPx over IPv4 and IPv6

2016-05-18 Thread Tom Herbert
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to describe IP-in-IP
tunnels and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.

Signed-off-by: Tom Herbert 
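A sketch of the new naming (illustrative Python; the bit values are made
up, the real ones live in include/linux/netdev_features.h): one bit
describes the outer protocol, while the inner protocol is deduced from the
other GSO bits, so separate IPIP and SIT bits are no longer needed:

```python
SKB_GSO_TCPV4, SKB_GSO_TCPV6 = 1 << 0, 1 << 4
SKB_GSO_IPXIP4, SKB_GSO_IPXIP6 = 1 << 10, 1 << 11

def describe(gso_type):
    # Outer protocol from the IPXIP bit, inner protocol from the TCP bits.
    outer = "IPv4" if gso_type & SKB_GSO_IPXIP4 else "IPv6"
    inner = "TCPv4" if gso_type & SKB_GSO_TCPV4 else "TCPv6"
    return f"{inner} over {outer}"

print(describe(SKB_GSO_IPXIP4 | SKB_GSO_TCPV6))  # old SIT case: TCPv6 over IPv4
print(describe(SKB_GSO_IPXIP6 | SKB_GSO_TCPV4))  # new IPv6-outer case: TCPv4 over IPv6
```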
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  5 ++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  5 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  3 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  3 +--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  3 +--
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |  3 +--
 drivers/net/ethernet/intel/igb/igb_main.c |  3 +--
 drivers/net/ethernet/intel/igbvf/netdev.c |  3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  3 +--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 +--
 include/linux/netdev_features.h   | 12 ++--
 include/linux/netdevice.h |  4 ++--
 include/linux/skbuff.h|  4 ++--
 net/core/ethtool.c|  4 ++--
 net/ipv4/af_inet.c|  2 +-
 net/ipv4/ipip.c   |  2 +-
 net/ipv6/ip6_offload.c|  4 ++--
 net/ipv6/sit.c|  4 ++--
 net/netfilter/ipvs/ip_vs_xmit.c   | 17 +++--
 19 files changed, 37 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index d465bd7..0a5b770 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13259,12 +13259,11 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct 
pci_dev *pdev,
NETIF_F_RXHASH | NETIF_F_HW_VLAN_CTAG_TX;
if (!chip_is_e1x) {
dev->hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |
-   NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT;
+   NETIF_F_GSO_IPXIP4;
dev->hw_enc_features =
NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 |
-   NETIF_F_GSO_IPIP |
-   NETIF_F_GSO_SIT |
+   NETIF_F_GSO_IPXIP4 |
NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL;
}
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 5a0dca3..72a2eff 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6311,7 +6311,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG |
   NETIF_F_TSO | NETIF_F_TSO6 |
   NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
-  NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT |
+  NETIF_F_GSO_IPXIP4 |
   NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM |
   NETIF_F_GSO_PARTIAL | NETIF_F_RXHASH |
   NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO;
@@ -6321,8 +6321,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
NETIF_F_TSO | NETIF_F_TSO6 |
NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE |
NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM |
-   NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT |
-   NETIF_F_GSO_PARTIAL;
+   NETIF_F_GSO_IPXIP4 | NETIF_F_GSO_PARTIAL;
dev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM |
NETIF_F_GSO_GRE_CSUM;
dev->vlan_features = dev->hw_features | NETIF_F_HIGHDMA;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 1cd0ebf..242a1ff 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9083,8 +9083,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
   NETIF_F_TSO6 |
   NETIF_F_GSO_GRE  |
   NETIF_F_GSO_GRE_CSUM |
-  NETIF_F_GSO_IPIP |
-  NETIF_F_GSO_SIT  |
+  

[PATCH v7 net-next 04/16] ipv6: Change "final" protocol processing for encapsulation

2016-05-18 Thread Tom Herbert
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum
can only be performed once.

This patch adds a check of whether a final protocol has been seen. The
rules are:
  - If a final protocol has not been seen, any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
  - If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For subsequent final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).

Signed-off-by: Tom Herbert 
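The rules above can be sketched as a small state machine (illustrative
Python; the protocol names are stand-ins for the inet6_protos handlers):

```python
# Minimal sketch of the dispatch rule in ip6_input_finish().
FINAL = {"udp", "tcp", "ipv6"}      # INET6_PROTO_FINAL handlers
NON_FINAL = {"hopopts", "routing"}  # extension headers

def process(chain):
    have_final = False
    actions = []
    for proto in chain:
        if have_final and proto in NON_FINAL:
            return "discard"        # no extension headers after a final proto
        if proto in FINAL and not have_final:
            have_final = True
            actions.append("postpull_rcsum")  # final actions taken only once
    return actions

print(process(["hopopts", "udp", "ipv6"]))  # UDP encapsulation: ok, one rcsum
print(process(["udp", "routing"]))          # EH after final proto: discard
```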
---
 net/ipv6/ip6_input.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d35dff2..94611e4 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -223,6 +223,7 @@ static int ip6_input_finish(struct net *net, struct sock 
*sk, struct sk_buff *sk
unsigned int nhoff;
int nexthdr;
bool raw;
+   bool have_final = false;
 
/*
 *  Parse extension headers
@@ -242,9 +243,21 @@ resubmit_final:
if (ipprot) {
int ret;
 
-   if (ipprot->flags & INET6_PROTO_FINAL) {
+   if (have_final) {
+   if (!(ipprot->flags & INET6_PROTO_FINAL)) {
+   /* Once we've seen a final protocol don't
+* allow encapsulation on any non-final
+* ones. This allows foo in UDP encapsulation
+* to work.
+*/
+   goto discard;
+   }
+   } else if (ipprot->flags & INET6_PROTO_FINAL) {
const struct ipv6hdr *hdr;
 
+   /* Only do this once for first final protocol */
+   have_final = true;
+
/* Free reference early: we don't need it any more,
   and it may hold ip_conntrack module loaded
   indefinitely. */
-- 
2.8.0.rc2



[PATCH v7 net-next 14/16] ip6ip6: Support for GSO/GRO

2016-05-18 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 net/ipv6/ip6_offload.c | 24 +---
 net/ipv6/ip6_tunnel.c  |  5 +
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 787e55f..332d6a0 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -253,9 +253,11 @@ out:
return pp;
 }
 
-static struct sk_buff **sit_gro_receive(struct sk_buff **head,
-   struct sk_buff *skb)
+static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head,
+  struct sk_buff *skb)
 {
+   /* Common GRO receive for SIT and IP6IP6 */
+
if (NAPI_GRO_CB(skb)->encap_mark) {
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
@@ -298,6 +300,13 @@ static int sit_gro_complete(struct sk_buff *skb, int nhoff)
return ipv6_gro_complete(skb, nhoff);
 }
 
+static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+   skb->encapsulation = 1;
+   skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6;
+   return ipv6_gro_complete(skb, nhoff);
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
@@ -310,11 +319,19 @@ static struct packet_offload ipv6_packet_offload 
__read_mostly = {
 static const struct net_offload sit_offload = {
.callbacks = {
.gso_segment= ipv6_gso_segment,
-   .gro_receive= sit_gro_receive,
+   .gro_receive= sit_ip6ip6_gro_receive,
.gro_complete   = sit_gro_complete,
},
 };
 
+static const struct net_offload ip6ip6_offload = {
+   .callbacks = {
+   .gso_segment= ipv6_gso_segment,
+   .gro_receive= sit_ip6ip6_gro_receive,
+   .gro_complete   = ip6ip6_gro_complete,
+   },
+};
+
 static int __init ipv6_offload_init(void)
 {
 
@@ -326,6 +343,7 @@ static int __init ipv6_offload_init(void)
dev_add_offload(&ipv6_packet_offload);
 
inet_add_offload(&sit_offload, IPPROTO_IPV6);
+   inet6_add_offload(&ip6ip6_offload, IPPROTO_IPV6);
 
return 0;
 }
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index cabf492..d26d226 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1242,6 +1242,11 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;
 
+   if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
+   return -1;
+
+   skb_set_inner_ipproto(skb, IPPROTO_IPV6);
+
err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,
   IPPROTO_IPV6);
if (err != 0) {
-- 
2.8.0.rc2



[PATCH v7 net-next 07/16] fou: Split out {fou,gue}_build_header

2016-05-18 Thread Tom Herbert
Create __fou_build_header and __gue_build_header. These implement the
protocol generic parts of building the fou and gue header.
fou_build_header and gue_build_header implement the IPv4 specific
functions and call the __*_build_header functions.

Signed-off-by: Tom Herbert 
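The shape of the split, as an illustrative Python sketch (names and the
port-selection helper are simplified stand-ins): the __* core does the
protocol-generic work, and thin per-family wrappers add the UDP header:

```python
def flow_src_port(skb):
    # Stand-in for udp_flow_src_port(): derive a source port from the flow.
    return 40000 + (hash(skb) % 1000)

def __fou_build_header(skb, configured_sport):
    # Generic part: offload handling plus source-port selection.
    return configured_sport or flow_src_port(skb)

def fou_build_header_v4(skb, configured_sport):
    sport = __fou_build_header(skb, configured_sport)
    return ("udp4", sport)  # IPv4-specific part: fou_build_udp()

def fou_build_header_v6(skb, configured_sport):
    sport = __fou_build_header(skb, configured_sport)
    return ("udp6", sport)  # IPv6 wrapper added later in the series

print(fou_build_header_v4("pkt", 7777))  # ('udp4', 7777)
```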
---
 include/net/fou.h |  8 
 net/ipv4/fou.c| 47 +--
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/include/net/fou.h b/include/net/fou.h
index 19b8a0c..7d2fda2 100644
--- a/include/net/fou.h
+++ b/include/net/fou.h
@@ -11,9 +11,9 @@
 size_t fou_encap_hlen(struct ip_tunnel_encap *e);
 static size_t gue_encap_hlen(struct ip_tunnel_encap *e);
 
-int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4);
-int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4);
+int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type);
+int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type);
 
 #endif
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 6cbc725..f4f2ddd 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -780,6 +780,22 @@ static void fou_build_udp(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
*protocol = IPPROTO_UDP;
 }
 
+int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type)
+{
+   int err;
+
+   err = iptunnel_handle_offloads(skb, type);
+   if (err)
+   return err;
+
+   *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
+   skb, 0, 0, false);
+
+   return 0;
+}
+EXPORT_SYMBOL(__fou_build_header);
+
 int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 u8 *protocol, struct flowi4 *fl4)
 {
@@ -788,26 +804,21 @@ int fou_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
__be16 sport;
int err;
 
-   err = iptunnel_handle_offloads(skb, type);
+   err = __fou_build_header(skb, e, protocol, &sport, type);
if (err)
return err;
 
-   sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-  skb, 0, 0, false);
fou_build_udp(skb, e, fl4, protocol, sport);
 
return 0;
 }
 EXPORT_SYMBOL(fou_build_header);
 
-int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
-u8 *protocol, struct flowi4 *fl4)
+int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+  u8 *protocol, __be16 *sport, int type)
 {
-   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
-  SKB_GSO_UDP_TUNNEL;
struct guehdr *guehdr;
size_t hdrlen, optlen = 0;
-   __be16 sport;
void *data;
bool need_priv = false;
int err;
@@ -826,8 +837,8 @@ int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return err;
 
/* Get source port (based on flow hash) before skb_push */
-   sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
-  skb, 0, 0, false);
+   *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
+   skb, 0, 0, false);
 
hdrlen = sizeof(struct guehdr) + optlen;
 
@@ -872,6 +883,22 @@ int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
 
}
 
+   return 0;
+}
+EXPORT_SYMBOL(__gue_build_header);
+
+int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+u8 *protocol, struct flowi4 *fl4)
+{
+   int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM :
+  SKB_GSO_UDP_TUNNEL;
+   __be16 sport;
+   int err;
+
+   err = __gue_build_header(skb, e, protocol, &sport, type);
+   if (err)
+   return err;
+
fou_build_udp(skb, e, fl4, protocol, sport);
 
return 0;
-- 
2.8.0.rc2



[PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling

2016-05-18 Thread Tom Herbert
This patch set:
  - Fixes GRE6 to process translate flags correctly from configuration
  - Adds support for GSO and GRO for ip6ip6 and ip4ip6
  - Add support for FOU and GUE in IPv6
  - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE
  - Fixes ip6_input to deal with UDP encapsulations
  - Some other minor fixes

v2:
  - Removed a check of GSO types in MPLS
  - Define GSO type SKB_GSO_IPXIP6 and SKB_GSO_IPXIP4 (based on input
from Alexander)
  - Don't define GSO types specifically for IP6IP6 and IP4IP6, above
fix makes that unnecessary
  - Don't bother clearing encapsulation flag in UDP tunnel segment
(another item suggested by Alexander).

v3:
  - Address some minor comments from Alexander

v4:
  - Rebase on changes to fix IP TX tunnels
  - Fix MTU issues in ip4ip6, ip6ip6
  - Add test data for above

v5:
  - Address feedback from Shmulik Ladkani regarding extension header
code that does not return the next header but instead relies
on returning the value via nhoff. The solution here is to fix EH
processing to return nexthdr value.
  - Refactored IPv4 encaps so that we won't need to create
a ip6_tunnel_core.c when adding encap support IPv6.

v6:
  - Fix build issues with regard to new GSO constants
  - Fix MTU calculation issues in ip6_tunnel.c pointed out by Alex
  - Add encap_hlen into headroom for GREv6 to work with FOU/GUE

v7:
  - Added skb_set_inner_ipproto to ip4ip6 and ip6ip6
  - Clarified max_headroom in ip6_tnl_xmit
  - Set features for IPv6 tunnels
  - Other cleanup suggested by Alexander
  - Above fixes throughput performance issues in ip4ip6 and ip6ip6,
updated test results to reflect that

Tested: Various cases of IP tunnels with netperf TCP_STREAM and TCP_RR.

- IPv4/GRE/GUE/IPv6 with RCO
  1 TCP_STREAM
6616 Mbps
  200 TCP_RR
1244043 tps
141/243/446 90/95/99% latencies
86.61% CPU utilization

- IPv6/GRE/GUE/IPv6 with RCO
  1 TCP_STREAM
6940 Mbps
  200 TCP_RR
1270903 tps
138/236/440 90/95/99% latencies
87.51% CPU utilization

 - IP6IP6
  1 TCP_STREAM
5307 Mbps
  200 TCP_RR
498981 tps
388/498/631 90/95/99% latencies
19.75% CPU utilization (1 CPU saturated)

 - IP6IP6/GUE with RCO
  1 TCP_STREAM
5575 Mbps
  200 TCP_RR
1233818 tps
143/244/451 90/95/99% latencies
87.57% CPU utilization

 - IP4IP6
  1 TCP_STREAM
5235 Mbps
  200 TCP_RR
763774 tps
250/318/466 90/95/99% latencies
35.25% CPU utilization (1 CPU saturated)

 - IP4IP6/GUE with RCO
  1 TCP_STREAM
5337 Mbps
  200 TCP_RR
1196385 tps
148/251/460 90/95/99% latencies
87.56% CPU utilization

 - GRE with keyid
  200 TCP_RR
744173 tps
258/332/461 90/95/99% latencies
34.59% CPU utilization (1 CPU saturated)
  

Tom Herbert (16):
  gso: Remove arbitrary checks for unsupported GSO
  net: define gso types for IPx over IPv4 and IPv6
  ipv6: Fix nexthdr for reinjection
  ipv6: Change "final" protocol processing for encapsulation
  net: Cleanup encap items in ip_tunnels.h
  fou: Call setup_udp_tunnel_sock
  fou: Split out {fou,gue}_build_header
  fou: Support IPv6 in fou
  ip6_tun: Add infrastructure for doing encapsulation
  fou: Add encap ops for IPv6 tunnels
  ip6_gre: Add support for fou/gue encapsulation
  ip6_tunnel: Add support for fou/gue encapsulation
  ipv6: Set features for IPv6 tunnels
  ip6ip6: Support for GSO/GRO
  ip4ip6: Support for GSO/GRO
  ipv6: Don't reset inner headers in ip6_tnl_xmit

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   5 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   3 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   3 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |   3 +-
 drivers/net/ethernet/intel/igb/igb_main.c |   3 +-
 drivers/net/ethernet/intel/igbvf/netdev.c |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
 include/linux/netdev_features.h   |  12 +-
 include/linux/netdevice.h |   4 +-
 include/linux/skbuff.h|   4 +-
 include/net/fou.h |  10 +-
 include/net/inet_common.h |   5 +
 include/net/ip6_tunnel.h  |  58 +++
 include/net/ip_tunnels.h  |  76 +++--
 net/core/ethtool.c|   4 +-
 net/ipv4/af_inet.c|  32 +---
 net/ipv4/fou.c| 144 +---
 net/ipv4/gre_offload.c|  14 --
 net/ipv4/ip_tunnel.c  |  45 -
 

[PATCH 1/1 RFC] net/phy: Add Lantiq PHY driver

2016-05-18 Thread Alexander Stein
This currently only supports the PEF7071, allows specifying max-speed, and
is able to read the LED configuration from the device tree.

Signed-off-by: Alexander Stein 
---
The main purpose for now is to set an LED configuration from the device tree
and to limit the maximum speed. The latter is, in my case, a hardware limit:
as the MAC and its link partner both support 1000 Mbit/s, they would try to
use that rate but eventually fail because the magnetics only support
100 Mbit/s. So limit the maximum supported link speed right from the start.

As this is an RFC I skipped the device tree binding doc.

 drivers/net/phy/Kconfig  |   5 ++
 drivers/net/phy/Makefile |   1 +
 drivers/net/phy/lantiq.c | 167 +++
 3 files changed, 173 insertions(+)
 create mode 100644 drivers/net/phy/lantiq.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 3e28f7a..c004885 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -119,6 +119,11 @@ config STE10XP
---help---
  This is the driver for the STe100p and STe101p PHYs.
 
+config LANTIQ_PHY
+   tristate "Driver for Lantiq PHYs"
+   ---help---
+ Supports the PEF7071 PHYs.
+
 config LSI_ET1011C_PHY
tristate "Driver for LSI ET1011C PHY"
---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 8ad4ac6..e886549 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -38,3 +38,4 @@ obj-$(CONFIG_MDIO_SUN4I)  += mdio-sun4i.o
 obj-$(CONFIG_MDIO_MOXART)  += mdio-moxart.o
 obj-$(CONFIG_AMD_XGBE_PHY) += amd-xgbe-phy.o
 obj-$(CONFIG_MDIO_BCM_UNIMAC)  += mdio-bcm-unimac.o
+obj-$(CONFIG_LANTIQ_PHY)   += lantiq.o
diff --git a/drivers/net/phy/lantiq.c b/drivers/net/phy/lantiq.c
new file mode 100644
index 000..876a7d1
--- /dev/null
+++ b/drivers/net/phy/lantiq.c
@@ -0,0 +1,167 @@
+/*
+ * Driver for Lantiq PHYs
+ *
+ * Author: Alexander Stein 
+ *
+ * Copyright (c) 2015-2016 SYS TEC electronic GmbH
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define PHY_ID_PEF7071 0xd565a401
+
+#define MII_LANTIQ_MMD_CTRL_REG	0x0d
+#define MII_LANTIQ_MMD_REGDATA_REG	0x0e
+#define OP_DATA	1
+
+struct lantiqphy_led_ctrl {
+   const char *property;
+   u32 regnum;
+};
+
+static int lantiq_extended_write(struct phy_device *phydev,
+u8 mode, u32 dev_addr, u32 regnum, u16 val)
+{
+   phy_write(phydev, MII_LANTIQ_MMD_CTRL_REG, dev_addr);
+   phy_write(phydev, MII_LANTIQ_MMD_REGDATA_REG, regnum);
+   phy_write(phydev, MII_LANTIQ_MMD_CTRL_REG, (mode << 14) | dev_addr);
+   return phy_write(phydev, MII_LANTIQ_MMD_REGDATA_REG, val);
+}
+
+static int lantiq_of_load_led_config(struct phy_device *phydev,
+struct device_node *of_node,
+const struct lantiqphy_led_ctrl *leds,
+u8 entries)
+{
+   u16 val;
+   int i;
+   int ret = 0;
+
+   for (i = 0; i < entries; i++) {
+   if (!of_property_read_u16(of_node, leds[i].property, &val)) {
+   ret = lantiq_extended_write(phydev, OP_DATA, 0x1f,
+   leds[i].regnum, val);
+   if (ret) {
+   dev_err(&phydev->dev, "Error writing register 0x1f.%04x (%d)\n",
+   leds[i].regnum, ret);
+   break;
+   }
+   }
+   }
+
+   return ret;
+}
+
+static const struct lantiqphy_led_ctrl leds[] = {
+   {
+   .property = "led0h",
+   .regnum = 0x01e2,
+   },
+   {
+   .property = "led0l",
+   .regnum = 0x01e3,
+   },
+   {
+   .property = "led1h",
+   .regnum = 0x01e4,
+   },
+   {
+   .property = "led1l",
+   .regnum = 0x01e5,
+   },
+   {
+   .property = "led2h",
+   .regnum = 0x01e6,
+   },
+   {
+   .property = "led2l",
+   .regnum = 0x01e7,
+   },
+};
+
+static int lantiqphy_config_init(struct phy_device *phydev)
+{
+   struct device *dev = &phydev->dev;
+   struct device_node *of_node = dev->of_node;
+   u32 max_speed;
+
+   if (!of_node && dev->parent->of_node)
+   of_node = dev->parent->of_node;
+
+   if (of_node) {
+   lantiq_of_load_led_config(phydev, of_node, leds,
+ 

Re: Make TCP work better with re-ordered frames?

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 08:46 -0700, Ben Greear wrote:

> I will work on captures...do you care if it is from transmitter or receiver's 
> perspective?

Receiver would probably be more useful.




Re: [PATCH] net: suppress warnings on dev_alloc_skb

2016-05-18 Thread Eric Dumazet
On Wed, 2016-05-18 at 11:25 -0400, Neil Horman wrote:
> Noticed an allocation failure in a network driver the other day on a 32 bit
> system:
> 
> DMA-API: debugging out of memory - disabling
> bnx2fc: adapter_lookup: hba NULL
> lldpad: page allocation failure. order:0, mode:0x4120
> Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
> Call Trace:
>  [] ? printk+0x19/0x23
>  [] ? __alloc_pages_nodemask+0x664/0x830
>  [] ? free_object+0x82/0xa0
>  [] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
>  [] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
>  [] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
>  [] ? ixgbe_configure+0x589/0xc00 [ixgbe]
>  [] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
>  [] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
>  [] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
>  [] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
>  [] ? dcb_doit+0x10ed/0x16d0
> ...


Well, maybe this call site (via ixgbe_configure_rx_ring()) should be
using GFP_KERNEL instead of GFP_ATOMIC.

Otherwise, if you are unlucky, not a single page would be allocated and
RX ring buffer would be empty.

So the 'fix' could be limited to GFP_ATOMIC callers ?

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c413c588a24f854be9e4df78d8a6872b6b1ff9f3..61b923f1520845145a5470d752b278d283cbb348 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2467,7 +2467,7 @@ static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
 
 static inline struct page *dev_alloc_pages(unsigned int order)
 {
-   return __dev_alloc_pages(GFP_ATOMIC, order);
+   return __dev_alloc_pages(GFP_ATOMIC | __GFP_NOWARN, order);
 }
 
 /**
@@ -2485,7 +2485,7 @@ static inline struct page *__dev_alloc_page(gfp_t gfp_mask)
 
 static inline struct page *dev_alloc_page(void)
 {
-   return __dev_alloc_page(GFP_ATOMIC);
+   return dev_alloc_pages(0);
 }
 
 /**




Re: Make TCP work better with re-ordered frames?

2016-05-18 Thread Ben Greear

On 05/18/2016 08:25 AM, Eric Dumazet wrote:

On Wed, 2016-05-18 at 08:07 -0700, Ben Greear wrote:


On 05/18/2016 07:29 AM, Eric Dumazet wrote:

On Wed, 2016-05-18 at 07:00 -0700, Ben Greear wrote:

We are investigating a system that has fairly poor TCP throughput
with the 3.17 and 4.0 kernels, but evidently it worked pretty well
with 3.14 (I should be able to verify 3.14 later today).

One thing I notice is that a UDP download test shows lots of reordered
frames, so I am thinking maybe TCP is running slow because of this.

(We see about 800Mbps UDP download, but only 500Mbps TCP, even when
using 100 concurrent TCP streams.)

Is there some way to tune the TCP stack to better handle reordered frames?


Nothing yet. Are you the sender or the receiver ?

You really want to avoid reorders as much as possible.

Are you telling us something broke in the networking layers between 3.14 and
3.17, leading to reorders?


I am both sender and receiver, through an access-controller and wifi AP as DUT.
The sender is Intel 1G NIC, so I suspect it is not causing reordering, which
indicates most likely DUT is to blame.

Using several off-the-shelf APs in our lab we do not see this problem.

I am not certain yet what is the difference, but customer reports 600+Mbps
with their older code, and best I can get is around 500Mbps with newer stuff.

Lots of stuff changed though (ath10k firmware, user-space at least slightly,
kernel, etc), so possibly the regression is elsewhere.



You possibly could send me some pcap (limited to the headers, using -s
128 for example) and limited to few flows, not the whole of them ;)

TCP reorders are tricky for the receiver : It sends a lot of SACK (one
for every incoming packet, instead of the normal rule of sending one ACK
for two incoming packets)

Increasing number of ACK might impact half-duplex networks, but also
considerably increase cpu processing time.


I will work on captures... do you care if it is from the transmitter's or the
receiver's perspective?

Thanks,
Ben








--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: Crashes in -next due to 'phy: add support for a reset-gpio specification'

2016-05-18 Thread Guenter Roeck
On Tue, May 17, 2016 at 10:01:37PM -0700, Florian Fainelli wrote:
> On 17/05/2016 21:37, Guenter Roeck wrote:
> > Hi,
> > 
> > my xtensa qemu tests crash in -next as follows.
> > 
> > [ ... ]
> > 
> > [9.366256] libphy: ethoc-mdio: probed
> > [9.367389]  (null): could not attach to PHY
> > [9.368555]  (null): failed to probe MDIO bus
> > [9.371540] Unable to handle kernel paging request at virtual address
> > 001c
> > [9.371540]  pc = d0320926, ra = 903209d1
> > [9.375358] Oops: sig: 11 [#1]
> > [9.376081] PREEMPT
> > [9.377080] CPU: 0 PID: 1 Comm: swapper Not tainted
> > 4.6.0-next-20160517 #1
> > [9.378397] task: d7c2c000 ti: d7c3 task.ti: d7c3
> > [9.379394] a00: 903209d1 d7c31bd0 d7fb5810 0001 
> >  d7f45c00 d7c31bd0
> > [9.382298] a08:     00060100
> > d04b0c10 d7f45dfc d7c31bb0
> > [9.385732] pc: d0320926, ps: 00060110, depc: 0018, excvaddr:
> > 001c
> > [9.387061] lbeg: d0322e35, lend: d0322e57 lcount: , sar:
> > 0011
> > [9.388173]
> > Stack: d7c31be0 00060700 d7f45c00 d7c31bd0 9021d509 d7c31c30 d7f45c00
> > 
> >d0485dcc d0485dcc d7fb5810 d7c2c000  d7c31c30 d7f45c00
> > d025befc
> >d0485dcc d7c3 d7f45c34 d7c31bf0 9021c985 d7c31c50 d7f45c00
> > d7f45c34
> > [9.396652] Call Trace:
> > [9.397469]  [] __device_release_driver+0x7d/0x98
> > [9.398869]  [] device_release_driver+0x15/0x20
> > [9.400247]  [] bus_remove_device+0xc1/0xd4
> > [9.401569]  [] device_del+0x109/0x15c
> > [9.402794]  [] phy_mdio_device_remove+0xd/0x18
> > [9.404124]  [] mdiobus_unregister+0x40/0x5c
> > [9.405444]  [] ethoc_probe+0x534/0x5b8
> > [9.406742]  [] platform_drv_probe+0x28/0x48
> > [9.408122]  [] driver_probe_device+0x101/0x234
> > [9.409499]  [] __driver_attach+0x7d/0x98
> > [9.410809]  [] bus_for_each_dev+0x30/0x5c
> > [9.412104]  [] driver_attach+0x14/0x18
> > [9.413385]  [] bus_add_driver+0xc9/0x198
> > [9.414686]  [] driver_register+0x70/0xa0
> > [9.416001]  [] __platform_driver_register+0x24/0x28
> > [9.417463]  [] ethoc_driver_init+0x10/0x14
> > [9.418824]  [] do_one_initcall+0x80/0x1ac
> > [9.420083]  [] kernel_init_freeable+0x131/0x198
> > [9.421504]  [] kernel_init+0xc/0xb0
> > [9.422693]  [] ret_from_kernel_thread+0x8/0xc
> > 
> > Bisect points to commit da47b4572056 ("phy: add support for a reset-gpio
> > specification").
> > Bisect log is attached. Reverting the patch fixes the problem.
> 
> Aside from what you pointed out, this patch was still in discussion when
> it got merged, since we got a concurrent patch from Sergei which tries
> to deal with the same kind of problem.
> 
> Do you mind sending a revert, or I can do that first thing in the morning.
> 
I don't think I'll find the time to do that today, and also I would like
to hear from Dave what his preferences are.

Note that this now also affects mainline.

Guenter

