date:20160122

Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

2016-01-22 Thread Shaohui Xie

___
From: Andrew Lunn 
Sent: Friday, January 22, 2016 5:12 AM
To: Shaohui Xie
Cc: Sebastian Hesselbarth; Florian Fainelli; shh@gmail.com; 
devicet...@vger.kernel.org; netdev@vger.kernel.org; 
linuxppc-...@lists.ozlabs.org; da...@davemloft.net; Shaohui Xie
Subject: Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

On Tue, Jan 19, 2016 at 05:00:35AM +, Shaohui Xie wrote:
> > -Original Message-
> > From: Andrew Lunn [mailto:and...@lunn.ch]
> > Sent: Monday, January 18, 2016 11:15 PM
> > To: Shaohui Xie
> > Cc: Sebastian Hesselbarth; Florian Fainelli; shh@gmail.com;
> > devicet...@vger.kernel.org; netdev@vger.kernel.org; linuxppc-
> > d...@lists.ozlabs.org; da...@davemloft.net; Shaohui Xie
> > Subject: Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR
> >
> > > [S.H] the fsl backplane, e.g. 10GBASE-KR, needs software to handle
> > > link training, It's to train link partner, and trained by link partner
> > parallel.
> > >
> > > But if media type is not copper, e.g. optical module, we won't need this.
> >
> > So what we actually need to know is copper vs fibre?

> Copper is not enough to indicate backplane, since backplane is
> always copper, but copper is not always backplane.

>O.K, lets try again

[S.H]Seems I did not get your point, Sorry for the inconvenient.

>If it is copper backplane you need to perform training.

>Looking at the driver probe function, it is either 1000BASE-KX, no
>training needed, or else it is 10GBASE-RK and training is needed.

>Looking at fsl_backplane_config_aneg() you expect phydev->speed to be
>set, and from the speed you then kick of either KR autoneg or KX
>autoneg. Could you also start the training at this point? Use the
>speed to indicate if training is needed?

 [S.H]The training cannot be started at this point, yet, because it's based on 
autoneg result, only when both sides autoneg-ed to 10G-KR, then to start
the training.

Besides the driver, generally speaking, "copper + speed" is not enough to 
indicate 
it's backplane, for ex. "copper + 1000" does not mean it has to be 1000BASE-KX, 
it could be SGMII, hence cannot use KX autoneg.

If putting backplane property to phy.txt is not good, I can put it to fsl 
specific
binding, like the second patch 2/3 did.

Thank you!
Shaohui

Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

2016-01-22 Thread Sebastian Hesselbarth

On 22.01.2016 09:15, Shaohui Xie wrote:

___
From: Andrew Lunn 
Sent: Friday, January 22, 2016 5:12 AM
To: Shaohui Xie
Cc: Sebastian Hesselbarth; Florian Fainelli; shh@gmail.com; 
devicet...@vger.kernel.org; netdev@vger.kernel.org; 
linuxppc-...@lists.ozlabs.org; da...@davemloft.net; Shaohui Xie
Subject: Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

On Tue, Jan 19, 2016 at 05:00:35AM +, Shaohui Xie wrote:

-Original Message-
From: Andrew Lunn [mailto:and...@lunn.ch]
Sent: Monday, January 18, 2016 11:15 PM
To: Shaohui Xie
Cc: Sebastian Hesselbarth; Florian Fainelli; shh@gmail.com;
devicet...@vger.kernel.org; netdev@vger.kernel.org; linuxppc-
d...@lists.ozlabs.org; da...@davemloft.net; Shaohui Xie
Subject: Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

[S.H] the fsl backplane, e.g. 10GBASE-KR, needs software to handle
link training, It's to train link partner, and trained by link partner

parallel.

But if media type is not copper, e.g. optical module, we won't need this.

So what we actually need to know is copper vs fibre?

Copper is not enough to indicate backplane, since backplane is
always copper, but copper is not always backplane.

O.K, lets try again

[S.H]Seems I did not get your point, Sorry for the inconvenient.

If it is copper backplane you need to perform training.

Looking at the driver probe function, it is either 1000BASE-KX, no
training needed, or else it is 10GBASE-RK and training is needed.

Looking at fsl_backplane_config_aneg() you expect phydev->speed to be
set, and from the speed you then kick of either KR autoneg or KX
autoneg. Could you also start the training at this point? Use the
speed to indicate if training is needed?

  [S.H]The training cannot be started at this point, yet, because it's based on
autoneg result, only when both sides autoneg-ed to 10G-KR, then to start
the training.

Shaohui,

look, we want you to convince us why to have a generic backplane
property in the phy binding. You had a bad start by adding new
property values to a property that does not relate to your use
case at all. Your job now really is to give strong reasons _why_
and _what_ a phy driver needs to know about the backplane setup.

Your first approach was to add "10gbase-rk" or "40gbase-foo" but
now you tell us about ANEG. Of what use is the information given
by the property when ANEG tells you something different? E.g.
consider the property tells you "10g-something" but ANEG gives
you "40g"; does the property add any value to your training
decision now?

Besides the driver, generally speaking, "copper + speed" is not enough to 
indicate
it's backplane, for ex. "copper + 1000" does not mean it has to be 1000BASE-KX,
it could be SGMII, hence cannot use KX autoneg.

So, is it copper + speed + backplane? or speed + backplane? And out of
the set of required input, is there anything your _cannot_ determine
from other things, e.g. ANEG?

If it is backplane only, would a boolean property ("backplane-mode")
be enough for the training decision?

If putting backplane property to phy.txt is not good, I can put it to fsl 
specific
binding, like the second patch 2/3 did.

You seem to see vendor specific properties as a place to dump all your
waste you don't want to think about. You fail to give good reasons what
is so special about the backplane setup and now you are telling us that
it is even fsl-specific?

Sebastian

Re: [PATCH v2] wl1251: add sysfs interface for bluetooth coexistence mode configuration

2016-01-22 Thread Pali Rohár

On Thursday 21 January 2016 16:44:33 Kalle Valo wrote:
> Pali Rohár  writes:
> 
> > On Thursday 21 January 2016 15:48:14 Kalle Valo wrote:
> >> Pali Rohár  writes:
> >> 
> >> > On Thursday 14 January 2016 10:16:54 Pavel Machek wrote:
> >> >> On Wed 2016-01-13 23:32:47, Arend van Spriel wrote:
> >> >> > On 12/26/2015 12:45 PM, Pali Rohár wrote:
> >> >> > >Port the bt_coex_mode sysfs interface from wl1251 driver version 
> >> >> > >included
> >> >> > >in the Maemo Fremantle kernel to allow bt-coexistence mode 
> >> >> > >configuration.
> >> >> > >This enables userspace applications to set one of the modes
> >> >> > >WL1251_BT_COEX_OFF, WL1251_BT_COEX_ENABLE and 
> >> >> > >WL1251_BT_COEX_MONOAUDIO.
> >> >> > >The default mode is WL1251_BT_COEX_OFF.
> >> >> > >It should be noted that this driver always enabled bt-coexistence 
> >> >> > >before
> >> >> > >and enabled bt-coexistence directly affects the receiving 
> >> >> > >performance,
> >> >> > >rendering it unusable in some low-signal situations. Especially 
> >> >> > >monitor
> >> >> > >mode is affected very badly with bt-coexistence enabled.
> >> >> > 
> >> >> > So what user-space process will be using this interface. Did you 
> >> >> > consider
> >> >> > adding debugfs interface? In case of monitor mode you could consider
> >> >> > disabling bt-coex from within the driver itself.
> >> >> 
> >> >> This aint no debugging feature.
> >> >
> >> > Right, bt-coex is not for debugging purpose, but for normal usage, when
> >> > user want to use together bluetooth and wifi or just one of those.
> >> 
> >> I think most of other drivers have a debugfs interface for btcoex, I
> >> guess mostly for testing purposes. But this really should be added to
> >> cfg80211.
> >
> > All other TI wireless drivers have "bt_coex_state" sysfs node.
> 
> Then that's a mistake, they shouldn't have that.

But it is there, wl1251 is also TI wireless driver and for last two
years there is no interface to deal with this problem...

So as other drivers do, I'm proposing solution which fix bt coex also
for wl1251 driver need on Nokia N900.

-- 
Pali Rohár
pali.ro...@gmail.com

Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

2016-01-22 Thread Shaohui Xie

>> Looking at fsl_backplane_config_aneg() you expect phydev->speed to be
>> set, and from the speed you then kick of either KR autoneg or KX
>> autoneg. Could you also start the training at this point? Use the
>> speed to indicate if training is needed?
>
>   [S.H]The training cannot be started at this point, yet, because it's based 
> on
> autoneg result, only when both sides autoneg-ed to 10G-KR, then to start
> the training.

Shaohui,

look, we want you to convince us why to have a generic backplane
property in the phy binding. You had a bad start by adding new
property values to a property that does not relate to your use
case at all. Your job now really is to give strong reasons _why_
and _what_ a phy driver needs to know about the backplane setup.

[S.H] The PHY driver needs to know what the backplane mode is, 
1000BASE-KX or 10GBASE-KR, the driver parses the property for
"1000base-kx" or "10gbase-kr", the driver does use the property,
I don't understand why the property is not related to my use case?

Your first approach was to add "10gbase-rk" or "40gbase-foo" but
now you tell us about ANEG. Of what use is the information given
by the property when ANEG tells you something different? E.g.
consider the property tells you "10g-something" but ANEG gives
you "40g"; does the property add any value to your training
decision now?

[S.H] The ANEG is not a gerneric AN, it's special to 10G-KR,  KR AN can only
AN to KR if both sides support KR, it cannot give "40g" or anything different, 
drive needs the property to do speicific init.

> Besides the driver, generally speaking, "copper + speed" is not enough to 
> indicate
> it's backplane, for ex. "copper + 1000" does not mean it has to be 
> 1000BASE-KX,
> it could be SGMII, hence cannot use KX autoneg.

So, is it copper + speed + backplane? or speed + backplane? And out of
the set of required input, is there anything your _cannot_ determine
from other things, e.g. ANEG?

[S.H] Backplane is enough, 1000BASE-KX means : copper + 1000 + KX stuff.
The KR AN is to check whether LP is also KR, if yes, do training, otherwise, do 
nothing.

If it is backplane only, would a boolean property ("backplane-mode")
be enough for the training decision?

[S.H] a boolean property is not enough, there are different backplane modes.

> If putting backplane property to phy.txt is not good, I can put it to fsl 
> specific
> binding, like the second patch 2/3 did.

You seem to see vendor specific properties as a place to dump all your
waste you don't want to think about. You fail to give good reasons what
is so special about the backplane setup and now you are telling us that
it is even fsl-specific?

[S.H] I don't think it's waste, I just thought it might be special to fsl.
Agreed I failed to give good reasons, but I tried hard to. :(

Thanks,
Shaohui

Sebastian

[PATCH] net: simplify napi_synchronize() to avoid warnings

2016-01-22 Thread Arnd Bergmann

The napi_synchronize() function is defined twice: The definition
for SMP builds waits for other CPUs to be done, while the uniprocessor
variant just contains a barrier and ignores its argument.

In the mvneta driver, this leads to a warning about an unused variable
when we lookup the NAPI struct of another CPU and then don't use it:

ethernet/marvell/mvneta.c: In function 'mvneta_percpu_notifier':
ethernet/marvell/mvneta.c:2910:30: error: unused variable 'other_port' 
[-Werror=unused-variable]

There are no other CPUs on a UP build, so that code never runs, but
gcc does not know this.

The nicest solution seems to be to turn the napi_synchronize() helper
into an inline function for the UP case as well, as that leads gcc to
not complain about the argument being unused. Once we do that, we can
also combine the two cases into a single function definition and use
if(IS_ENABLED()) rather than #ifdef to make it look a bit nicer.

The warning first came up in linux-4.4, but I failed to catch it
earlier.

Signed-off-by: Arnd Bergmann 
Fixes: f86428854480 ("net: mvneta: Statically assign queues to CPUs")

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5ac140dcb789..289c2314d766 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -512,7 +512,6 @@ static inline void napi_enable(struct napi_struct *n)
clear_bit(NAPI_STATE_NPSVC, &n->state);
 }
 
-#ifdef CONFIG_SMP
 /**
  * napi_synchronize - wait until NAPI is not running
  * @n: napi context
@@ -523,12 +522,12 @@ static inline void napi_enable(struct napi_struct *n)
  */
 static inline void napi_synchronize(const struct napi_struct *n)
 {
-   while (test_bit(NAPI_STATE_SCHED, &n->state))
-   msleep(1);
+   if (IS_ENABLED(CONFIG_SMP))
+   while (test_bit(NAPI_STATE_SCHED, &n->state))
+   msleep(1);
+   else
+   barrier();
 }
-#else
-# define napi_synchronize(n)   barrier()
-#endif
 
 enum netdev_queue_state_t {
__QUEUE_STATE_DRV_XOFF,

Re: [PATCH v2] wl1251: add sysfs interface for bluetooth coexistence mode configuration

2016-01-22 Thread Kalle Valo

Pali Rohár  writes:

>> >> > Right, bt-coex is not for debugging purpose, but for normal usage, when
>> >> > user want to use together bluetooth and wifi or just one of those.
>> >> 
>> >> I think most of other drivers have a debugfs interface for btcoex, I
>> >> guess mostly for testing purposes. But this really should be added to
>> >> cfg80211.
>> >
>> > All other TI wireless drivers have "bt_coex_state" sysfs node.
>> 
>> Then that's a mistake, they shouldn't have that.
>
> But it is there, wl1251 is also TI wireless driver and for last two
> years there is no interface to deal with this problem...
>
> So as other drivers do, I'm proposing solution which fix bt coex also
> for wl1251 driver need on Nokia N900.

Even if the wlcore sysfs interface fell through the cracks it's no
excuse to add more private interfaces to wireless drivers. We have
cfg80211 and all generic interfaces, like btcoex control, should go
through that subsystem. That way all drivers can share a common
interface and everyone are happy.

What I suggest is that you add this yourself to cfg80211 and mac80211.
It's not that hard and I think you could use NL80211_CMD_SET_POWER_SAVE
as an example.

-- 
Kalle Valo

Re: Optimizing instruction-cache, more packets at each stage

2016-01-22 Thread Jesper Dangaard Brouer

On Thu, 21 Jan 2016 09:48:36 -0800
Eric Dumazet  wrote:

> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
> 
> > Sure, but the receive path is parallelized.  
> 
> This is true for multiqueue processing, assuming you can dedicate many
> cores to process RX.
> 
> >  Improving parallelism has
> > continuously shown to have much more impact than attempting to
> > optimize for cache misses. The primary goal is not to drive 100Gbps
> > with 64 packets from a single CPU. It is one benchmark of many we
> > should look at to measure efficiency of the data path, but I've yet to
> > see any real workload that requires that...
> > 
> > Regardless of anything, we need to load packet headers into CPU cache
> > to do protocol processing. I'm not sure I see how trying to defer that
> > as long as possible helps except in cases where the packet is crossing
> > CPU cache boundaries and can eliminate cache misses completely (not
> > just move them around from one function to another).  
> 
> Note that some user space use multiple core (or hyper threads) to
> implement a pipeline, using a single RX queue.
> 
> One thread can handle one stage (device RX drain) and prefetch data into
> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
> 
> The second thread process packets with headers already in L1/L2

I agree. I've heard experiences where DPDK users use 2 core for RX, and
1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
with full Internet routing table look up.

One of the ideas behind my alf_queue, is that it can be used for
efficiently distributing object (pointers) between threads.
1. because it only transfers the pointers (not touching object), and
2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
Thus, lower in the message passing cost between threads.

> This way, the ~100 ns (or even more if you also consider skb
> allocations) penalty to bring packet headers do not hurt PPS.

I've studied the allocation cost in great detail, thus let me share my
numbers, 100 ns is too high:

Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
The cycles count should be comparable with other CPUs, but that nanosec
measurement is affected by the very high clock freq of this CPU.

Kmem_cache fastpath "recycle" case:
 SLUB => 44 cycles(tsc) 11.205 ns
 SLAB => 96 cycles(tsc) 24.119 ns.

The problem is that real use-cases in the network stack, almost always
hit the slowpath in kmem_cache allocators.

Kmem_cache "slowpath" case:
 SLUB => 117 cycles(tsc) 29.276 ns
 SLAB => 101 cycles(tsc) 25.342 ns

I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
by introducing a bulk API, which amortize the needed sync-mechanisms.

Kmem_cache using bulk API:
 SLUB => 37 cycles(tsc) 9.280 ns
 SLAB => 20 cycles(tsc) 5.035 ns

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread Wei Liu

The comment at the beginning of the file is the canonical source of
licenses for this module. Currently it contains GPL and MIT license. Fix
the code to reflect the reality.

Signed-off-by: Wei Liu 
---
 drivers/net/xen-netback/netback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/netback.c 
b/drivers/net/xen-netback/netback.c
index 61b97c3..2427242 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2192,5 +2192,5 @@ static void __exit netback_fini(void)
 }
 module_exit(netback_fini);
 
-MODULE_LICENSE("Dual BSD/GPL");
+MODULE_LICENSE("Dual MIT/GPL");
 MODULE_ALIAS("xen-backend:vif");
-- 
2.1.4

Re: [Xen-devel] [PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread David Vrabel

On 22/01/16 12:34, Wei Liu wrote:
> The comment at the beginning of the file is the canonical source of
> licenses for this module. Currently it contains GPL and MIT license. Fix
> the code to reflect the reality.

"The MIT license" isn't really a thing.  The closest is the X11
license[1], but this not applicable here either since the text in the
drivers does not refer to X11 trademarks etc.

You can either use "GPL" which would be correct for a Linux kernel
module since the alternate only applies when distributed separately from
Linux ("or, when distributed separately from the Linux kernel or
incorporated into other software packages, subject to the following
license:"); or you can use "GPL and additional rights".

(Or you could just leave it as-is since "Dual BSD/GPL" is close enough.)

David

[1] http://www.gnu.org/licenses/license-list.html#X11License

Re: [Xen-devel] [PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread Wei Liu

On Fri, Jan 22, 2016 at 01:14:24PM +, David Vrabel wrote:
> On 22/01/16 12:34, Wei Liu wrote:
> > The comment at the beginning of the file is the canonical source of
> > licenses for this module. Currently it contains GPL and MIT license. Fix
> > the code to reflect the reality.
> 
> "The MIT license" isn't really a thing.  The closest is the X11
> license[1], but this not applicable here either since the text in the
> drivers does not refer to X11 trademarks etc.
> 

That was referring to the license ident string in Linux.  If MIT license
isn't a thing, why would Linux have it at all?

> You can either use "GPL" which would be correct for a Linux kernel
> module since the alternate only applies when distributed separately from
> Linux ("or, when distributed separately from the Linux kernel or
> incorporated into other software packages, subject to the following
> license:"); or you can use "GPL and additional rights".
> 
> (Or you could just leave it as-is since "Dual BSD/GPL" is close enough.)
> 

No, I don't want to leave it as-is. That's not BSD license.

I can change that to "GPL". That is acceptable to me.

Wei.

> David
> 
> [1] http://www.gnu.org/licenses/license-list.html#X11License
>

[PATCH v2 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread Wei Liu

The comment at the beginning of the file is the canonical source of
licenses for this module. The license used to distribute outside of
Linux kernel is not BSD license but X11 license. Instead of trying to
determine whether X11 license can be mapped into Linux's "MIT" license
ident, simply make the code to use "GPL" only.

Signed-off-by: Wei Liu 
---
 drivers/net/xen-netback/netback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/netback.c 
b/drivers/net/xen-netback/netback.c
index 61b97c3..e9601d1 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2192,5 +2192,5 @@ static void __exit netback_fini(void)
 }
 module_exit(netback_fini);
 
-MODULE_LICENSE("Dual BSD/GPL");
+MODULE_LICENSE("GPL");
 MODULE_ALIAS("xen-backend:vif");
-- 
2.1.4

Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

2016-01-22 Thread Shaohui Xie

Hi,

I'm not sure I explained myself clearly in previous reply, so I guess it's 
worth to
rephrase it.

1000BASE-KX and 10GBASE-KR are backplane modes supported by Freescale's PCS
PHY, but different modes need different hardware settings, not just different 
PHY init
routines, this also needs different serdes lane settings, this is done in 
probe() according
to DTS properties.

Regarding the autoneg part, it's not like normal autoneg which can autoneg to 
different
capabilities, when the PHY is 1000BSE-KX, it can only autoneg to 1000BASE-KX 
only if
LP is 1000BASE-KX,  same is true for 10GBASE-KR. The purpose of KR AN is to 
detect whether
LP is also KR, if yes, do training, if not, do nothing, no any other result the 
KR AN can give.

The reason "copper + speed" is not enough for backplane is because they are not 
precise,
and backplane itself explains what it is, for ex. 1000BASE-KX clearly means the 
media is copper,
speed is 1000, and should follow 1000BASE-KX standard in IEEE802.3.

The reason I mentioned maybe I should put the backplane property in fsl's 
binding is because
the backplane implementation is really vendor specific, it's heavily relay how 
hardware implements
the feature, maybe another vendor's hardware only needs a boolean property for 
driver to tell it 
should work in backplane, hardware can deal with different modes, or even no 
any special 
property needed if the hardware is strong enough to handle any connections, I 
cannot assume.
But I know what fsl's hardware needs to support backplane.

Thank you for your time and reviewing!
Shaohui


From: Shaohui Xie
Sent: Friday, January 22, 2016 6:05 PM
To: Sebastian Hesselbarth; Andrew Lunn
Cc: Florian Fainelli; shh@gmail.com; devicet...@vger.kernel.org; 
netdev@vger.kernel.org; linuxppc-...@lists.ozlabs.org; da...@davemloft.net; 
Shaohui Xie
Subject: Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

>> Looking at fsl_backplane_config_aneg() you expect phydev->speed to be
>> set, and from the speed you then kick of either KR autoneg or KX
>> autoneg. Could you also start the training at this point? Use the
>> speed to indicate if training is needed?
>
>   [S.H]The training cannot be started at this point, yet, because it's based 
> on
> autoneg result, only when both sides autoneg-ed to 10G-KR, then to start
> the training.

Shaohui,

look, we want you to convince us why to have a generic backplane
property in the phy binding. You had a bad start by adding new
property values to a property that does not relate to your use
case at all. Your job now really is to give strong reasons _why_
and _what_ a phy driver needs to know about the backplane setup.

[S.H] The PHY driver needs to know what the backplane mode is,
1000BASE-KX or 10GBASE-KR, the driver parses the property for
"1000base-kx" or "10gbase-kr", the driver does use the property,
I don't understand why the property is not related to my use case?

Your first approach was to add "10gbase-rk" or "40gbase-foo" but
now you tell us about ANEG. Of what use is the information given
by the property when ANEG tells you something different? E.g.
consider the property tells you "10g-something" but ANEG gives
you "40g"; does the property add any value to your training
decision now?

[S.H] The ANEG is not a gerneric AN, it's special to 10G-KR,  KR AN can only
AN to KR if both sides support KR, it cannot give "40g" or anything different,
drive needs the property to do speicific init.

> Besides the driver, generally speaking, "copper + speed" is not enough to 
> indicate
> it's backplane, for ex. "copper + 1000" does not mean it has to be 
> 1000BASE-KX,
> it could be SGMII, hence cannot use KX autoneg.

So, is it copper + speed + backplane? or speed + backplane? And out of
the set of required input, is there anything your _cannot_ determine
from other things, e.g. ANEG?

[S.H] Backplane is enough, 1000BASE-KX means : copper + 1000 + KX stuff.
The KR AN is to check whether LP is also KR, if yes, do training, otherwise, do 
nothing.

If it is backplane only, would a boolean property ("backplane-mode")
be enough for the training decision?

[S.H] a boolean property is not enough, there are different backplane modes.

> If putting backplane property to phy.txt is not good, I can put it to fsl 
> specific
> binding, like the second patch 2/3 did.

You seem to see vendor specific properties as a place to dump all your
waste you don't want to think about. You fail to give good reasons what
is so special about the backplane setup and now you are telling us that
it is even fsl-specific?

[S.H] I don't think it's waste, I just thought it might be special to fsl.
Agreed I failed to give good reasons, but I tried hard to. :(

Thanks,
Shaohui

Sebastian

Re: [Xen-devel] [PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread Ian Campbell

On Fri, 2016-01-22 at 13:49 +, Wei Liu wrote:
> On Fri, Jan 22, 2016 at 01:14:24PM +, David Vrabel wrote:
> > On 22/01/16 12:34, Wei Liu wrote:
> > > The comment at the beginning of the file is the canonical source of
> > > licenses for this module. Currently it contains GPL and MIT license.
> > > Fix
> > > the code to reflect the reality.
> > 
> > "The MIT license" isn't really a thing.  The closest is the X11
> > license[1], but this not applicable here either since the text in the
> > drivers does not refer to X11 trademarks etc.
> > 
> 
> That was referring to the license ident string in Linux.  If MIT license
> isn't a thing, why would Linux have it at all?

The fact what include/linux/license.h:license_is_gpl_compatible includes
"Dual MIT/GPL" as an option seems to suggest that it is enough of a thing
to be validly used as the contents of a MODULE_LICENSE() thing.

It's also in https://opensource.org/licenses/MIT , the fact that it might
be confused for other licenses used by MIT notwithstanding.

FWIW https://en.wikipedia.org/wiki/MIT_License seems to think that the
wording most commonly called the "MIT License" might be the "Expat
license", rather than the "X11 License" which is similar but different.

Ian.

Re: Optimizing instruction-cache, more packets at each stage

2016-01-22 Thread Eric Dumazet

On Fri, 2016-01-22 at 13:33 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 21 Jan 2016 09:48:36 -0800
> Eric Dumazet  wrote:
> 
> > On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
> > 
> > > Sure, but the receive path is parallelized.  
> > 
> > This is true for multiqueue processing, assuming you can dedicate many
> > cores to process RX.
> > 
> > >  Improving parallelism has
> > > continuously shown to have much more impact than attempting to
> > > optimize for cache misses. The primary goal is not to drive 100Gbps
> > > with 64 packets from a single CPU. It is one benchmark of many we
> > > should look at to measure efficiency of the data path, but I've yet to
> > > see any real workload that requires that...
> > > 
> > > Regardless of anything, we need to load packet headers into CPU cache
> > > to do protocol processing. I'm not sure I see how trying to defer that
> > > as long as possible helps except in cases where the packet is crossing
> > > CPU cache boundaries and can eliminate cache misses completely (not
> > > just move them around from one function to another).  
> > 
> > Note that some user space use multiple core (or hyper threads) to
> > implement a pipeline, using a single RX queue.
> > 
> > One thread can handle one stage (device RX drain) and prefetch data into
> > shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
> > 
> > The second thread process packets with headers already in L1/L2
> 
> I agree. I've heard experiences where DPDK users use 2 core for RX, and
> 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
> with full Internet routing table look up.
> 
> One of the ideas behind my alf_queue, is that it can be used for
> efficiently distributing object (pointers) between threads.
> 1. because it only transfers the pointers (not touching object), and
> 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
> Thus, lower in the message passing cost between threads.
> 
> 
> > This way, the ~100 ns (or even more if you also consider skb
> > allocations) penalty to bring packet headers do not hurt PPS.
> 
> I've studied the allocation cost in great detail, thus let me share my
> numbers, 100 ns is too high:
> 
> Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
> The cycles count should be comparable with other CPUs, but that nanosec
> measurement is affected by the very high clock freq of this CPU.
> 
> Kmem_cache fastpath "recycle" case:
>  SLUB => 44 cycles(tsc) 11.205 ns
>  SLAB => 96 cycles(tsc) 24.119 ns.
> 
> The problem is that real use-cases in the network stack, almost always
> hit the slowpath in kmem_cache allocators.
> 
> Kmem_cache "slowpath" case:
>  SLUB => 117 cycles(tsc) 29.276 ns
>  SLAB => 101 cycles(tsc) 25.342 ns
> 
> I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
> by introducing a bulk API, which amortize the needed sync-mechanisms.
> 
> Kmem_cache using bulk API:
>  SLUB => 37 cycles(tsc) 9.280 ns
>  SLAB => 20 cycles(tsc) 5.035 ns


Your numbers are nice, but the reality of most applications is they run
on hosts with ~72 hyperthreads, soon to be ~128 ht.

(Two physical sockets, with their corresponding memory)

The perf numbers show about 100 ns penalty per cache line miss, when all
these threads perform real work and applications are properly tuned,
because it is very rare the working set is all in caches.

In the following real case, we can see these numbers.

$ perf guncore -M miss_lat_rem,miss_lat_loc
#
#Socket0  |Socket1  
|
#
# Load Miss Latency  | Load Miss Latency  | Load Miss Latency  | Load Miss 
Latency  |
# Remote RAM | Local RAM  | Remote RAM | Local RAM  
|
#  ns|  ns|  ns|
  ns|
#
   162.25   130.61   173.74   
116.80
   162.40   130.41   173.33   
116.59
   163.11   132.28   175.90   
117.09
   163.36   132.86   176.69   
117.45
   161.92   130.32   173.20   
117.35
   163.46   130.99   174.80   
117.42
   163.54   130.55   174.09   
117.26
   163.29   129.75   173.84   
117.36
   162.38   130.31   173.44   
117.18
   163.00   130.81   174.47   
117.24

Re: [Xen-devel] [PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread David Vrabel

On 22/01/16 14:15, Ian Campbell wrote:
> On Fri, 2016-01-22 at 13:49 +, Wei Liu wrote:
>> On Fri, Jan 22, 2016 at 01:14:24PM +, David Vrabel wrote:
>>> On 22/01/16 12:34, Wei Liu wrote:
 The comment at the beginning of the file is the canonical source of
 licenses for this module. Currently it contains GPL and MIT license.
 Fix
 the code to reflect the reality.
>>>
>>> "The MIT license" isn't really a thing.  The closest is the X11
>>> license[1], but this not applicable here either since the text in the
>>> drivers does not refer to X11 trademarks etc.
>>>
>>
>> That was referring to the license ident string in Linux.  If MIT license
>> isn't a thing, why would Linux have it at all?
> 
> The fact what include/linux/license.h:license_is_gpl_compatible includes
> "Dual MIT/GPL" as an option seems to suggest that it is enough of a thing
> to be validly used as the contents of a MODULE_LICENSE() thing.

"Dual MIT/GPL" is used exactly once in the source in a file that has no
license text and there is no other documentation.

David

Re: [PATCH 1/3][v2] net: phy: introduce 1000BASE-KX and 10GBASE-KR

2016-01-22 Thread Andrew Lunn

> The reason I mentioned maybe I should put the backplane property in
> fsl's binding is because the backplane implementation is really
> vendor specific, it's heavily relay how hardware implements the
> feature, maybe another vendor's hardware only needs a boolean
> property for driver to tell it should work in backplane, hardware
> can deal with different modes, or even no any special property
> needed if the hardware is strong enough to handle any connections, I
> cannot assume.  But I know what fsl's hardware needs to support
> backplane.

This is the key point really. We don't really care about the Freescale
PCS. We want a generic solution for 1000BASE-KX and 10GBASE-KR,
independent of who makes it, Marvell, Micrel, Broadcom, or Acme.

So, what generic properties are needed for 1000BASE-KX and 10GBASE-KR?
Properties that most/all manufactures are likely to need?

   Andrew

Re: [PATCH] Netlink messages for multicast HW addr programming

2016-01-22 Thread Patrick Ruddy

On Thu, 2016-01-21 at 21:04 +0100, Jiri Pirko wrote:
> Thu, Jan 21, 2016 at 12:47:54PM CET, pru...@brocade.com wrote:
> >Add RTM_NEWADDR and RTM_DELADDR netlink messages to indicate
> >interest in specific multicast hardware addresses. These messages
> >are sent when addressed are added or deleted from the appropriate
> >interface driver.
> >
> >Signed-off-by: Patrick Ruddy 
> >
> >diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> >index c0548d2..a0ebadd 100644
> >--- a/net/core/dev_addr_lists.c
> >+++ b/net/core/dev_addr_lists.c
> >@@ -12,6 +12,7 @@
> >  */
> > 
> > #include 
> >+#include 
> > #include 
> > #include 
> > #include 
> >@@ -661,6 +662,62 @@ out:
> > }
> > EXPORT_SYMBOL(dev_mc_add_excl);
> > 
> >+static int fill_addr(struct sk_buff *skb, struct net_device *dev,
> >+ const unsigned char *addr, int type)
> >+{
> >+struct nlmsghdr *nlh;
> >+struct ifaddrmsg *ifm;
> >+
> >+nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), 0);
> >+if (nlh == NULL)
> >+return -EMSGSIZE;
> >+
> >+ifm = nlmsg_data(nlh);
> >+ifm->ifa_family = AF_UNSPEC;
> >+ifm->ifa_prefixlen = 0;
> >+ifm->ifa_flags = IFA_F_PERMANENT;
> >+ifm->ifa_scope = RT_SCOPE_LINK;
> >+ifm->ifa_index = dev->ifindex;
> >+if (nla_put(skb, IFA_ADDRESS, dev->addr_len, addr))
> >+goto nla_put_failure;
> >+nlmsg_end(skb, nlh);
> 
> How can userspace tell if this is a multicast addr? Also, why not add
> notifications for unicast addrs so we handle both the same?

I am only using this RTM_NEW/DELADDR AF_UNSPEC message for multicast
addresses. I understand that the unicast addresses are notified in the
RTM_NEW/DELLINK.

-pr

Re: [PATCH] Netlink messages for multicast HW addr programming

2016-01-22 Thread Jiri Pirko

Fri, Jan 22, 2016 at 03:51:55PM CET, pru...@brocade.com wrote:
>On Thu, 2016-01-21 at 21:04 +0100, Jiri Pirko wrote:
>> Thu, Jan 21, 2016 at 12:47:54PM CET, pru...@brocade.com wrote:
>> >Add RTM_NEWADDR and RTM_DELADDR netlink messages to indicate
>> >interest in specific multicast hardware addresses. These messages
>> >are sent when addressed are added or deleted from the appropriate
>> >interface driver.
>> >
>> >Signed-off-by: Patrick Ruddy 
>> >
>> >diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
>> >index c0548d2..a0ebadd 100644
>> >--- a/net/core/dev_addr_lists.c
>> >+++ b/net/core/dev_addr_lists.c
>> >@@ -12,6 +12,7 @@
>> >  */
>> > 
>> > #include 
>> >+#include 
>> > #include 
>> > #include 
>> > #include 
>> >@@ -661,6 +662,62 @@ out:
>> > }
>> > EXPORT_SYMBOL(dev_mc_add_excl);
>> > 
>> >+static int fill_addr(struct sk_buff *skb, struct net_device *dev,
>> >+const unsigned char *addr, int type)
>> >+{
>> >+   struct nlmsghdr *nlh;
>> >+   struct ifaddrmsg *ifm;
>> >+
>> >+   nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), 0);
>> >+   if (nlh == NULL)
>> >+   return -EMSGSIZE;
>> >+
>> >+   ifm = nlmsg_data(nlh);
>> >+   ifm->ifa_family = AF_UNSPEC;
>> >+   ifm->ifa_prefixlen = 0;
>> >+   ifm->ifa_flags = IFA_F_PERMANENT;
>> >+   ifm->ifa_scope = RT_SCOPE_LINK;
>> >+   ifm->ifa_index = dev->ifindex;
>> >+   if (nla_put(skb, IFA_ADDRESS, dev->addr_len, addr))
>> >+   goto nla_put_failure;
>> >+   nlmsg_end(skb, nlh);
>> 
>> How can userspace tell if this is a multicast addr? Also, why not add
>> notifications for unicast addrs so we handle both the same?
>
>I am only using this RTM_NEW/DELADDR AF_UNSPEC message for multicast
>addresses. I understand that the unicast addresses are notified in the
>RTM_NEW/DELLINK.

dev->dev_addr is notified by that. UC addresses do not have any
notification now.

Re: [PATCH] Netlink messages for multicast HW addr programming

2016-01-22 Thread Patrick Ruddy

On Fri, 2016-01-22 at 16:05 +0100, Jiri Pirko wrote:
> Fri, Jan 22, 2016 at 03:51:55PM CET, pru...@brocade.com wrote:
> >On Thu, 2016-01-21 at 21:04 +0100, Jiri Pirko wrote:
> >> Thu, Jan 21, 2016 at 12:47:54PM CET, pru...@brocade.com wrote:
> >> >Add RTM_NEWADDR and RTM_DELADDR netlink messages to indicate
> >> >interest in specific multicast hardware addresses. These messages
> >> >are sent when addressed are added or deleted from the appropriate
> >> >interface driver.
> >> >
> >> >Signed-off-by: Patrick Ruddy 
> >> >
> >> >diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> >> >index c0548d2..a0ebadd 100644
> >> >--- a/net/core/dev_addr_lists.c
> >> >+++ b/net/core/dev_addr_lists.c
> >> >@@ -12,6 +12,7 @@
> >> >  */
> >> > 
> >> > #include 
> >> >+#include 
> >> > #include 
> >> > #include 
> >> > #include 
> >> >@@ -661,6 +662,62 @@ out:
> >> > }
> >> > EXPORT_SYMBOL(dev_mc_add_excl);
> >> > 
> >> >+static int fill_addr(struct sk_buff *skb, struct net_device *dev,
> >> >+  const unsigned char *addr, int type)
> >> >+{
> >> >+ struct nlmsghdr *nlh;
> >> >+ struct ifaddrmsg *ifm;
> >> >+
> >> >+ nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), 0);
> >> >+ if (nlh == NULL)
> >> >+ return -EMSGSIZE;
> >> >+
> >> >+ ifm = nlmsg_data(nlh);
> >> >+ ifm->ifa_family = AF_UNSPEC;
> >> >+ ifm->ifa_prefixlen = 0;
> >> >+ ifm->ifa_flags = IFA_F_PERMANENT;
> >> >+ ifm->ifa_scope = RT_SCOPE_LINK;
> >> >+ ifm->ifa_index = dev->ifindex;
> >> >+ if (nla_put(skb, IFA_ADDRESS, dev->addr_len, addr))
> >> >+ goto nla_put_failure;
> >> >+ nlmsg_end(skb, nlh);
> >> 
> >> How can userspace tell if this is a multicast addr? Also, why not add
> >> notifications for unicast addrs so we handle both the same?
> >
> >I am only using this RTM_NEW/DELADDR AF_UNSPEC message for multicast
> >addresses. I understand that the unicast addresses are notified in the
> >RTM_NEW/DELLINK.
> 
> dev->dev_addr is notified by that. UC addresses do not have any
> notification now.
It is the multicast MAC address (ie the multicast equivalent of the
dev_addr) that we are notifying here right? Not L3 addresses.

-pr

Re: [PATCH 23/33] x86/asm/bpf: Create stack frames in bpf_jit.S

2016-01-22 Thread Josh Poimboeuf

On Thu, Jan 21, 2016 at 08:18:46PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 21, 2016 at 09:55:31PM -0600, Josh Poimboeuf wrote:
> > On Thu, Jan 21, 2016 at 06:44:28PM -0800, Alexei Starovoitov wrote:
> > > On Thu, Jan 21, 2016 at 04:49:27PM -0600, Josh Poimboeuf wrote:
> > > > bpf_jit.S has several callable non-leaf functions which don't honor
> > > > CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > > 
> > > > Create a stack frame before the call instructions when
> > > > CONFIG_FRAME_POINTER is enabled.
> > > > 
> > > > Signed-off-by: Josh Poimboeuf 
> > > > Cc: Alexei Starovoitov 
> > > > Cc: netdev@vger.kernel.org
> > > > ---
> > > >  arch/x86/net/bpf_jit.S | 9 +++--
> > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/net/bpf_jit.S b/arch/x86/net/bpf_jit.S
> > > > index eb4a3bd..f2a7faf 100644
> > > > --- a/arch/x86/net/bpf_jit.S
> > > > +++ b/arch/x86/net/bpf_jit.S
> > > > @@ -8,6 +8,7 @@
> > > >   * of the License.
> > > >   */
> > > >  #include 
> > > > +#include 
> > > >  
> > > >  /*
> > > >   * Calling convention :
> > > > @@ -65,16 +66,18 @@ FUNC(sk_load_byte_positive_offset)
> > > >  
> > > >  /* rsi contains offset and can be scratched */
> > > >  #define bpf_slow_path_common(LEN)  \
> > > > +   lea -MAX_BPF_STACK + 32(%rbp), %rdx;\
> > > > +   FRAME_BEGIN;\
> > > > mov %rbx, %rdi; /* arg1 == skb */   \
> > > > push%r9;\
> > > > pushSKBDATA;\
> > > >  /* rsi already has offset */   \
> > > > mov $LEN,%ecx;  /* len */   \
> > > > -   lea - MAX_BPF_STACK + 32(%rbp),%rdx;
> > > > \
> > > > callskb_copy_bits;  \
> > > > test%eax,%eax;  \
> > > > pop SKBDATA;\
> > > > -   pop %r9;
> > > > +   pop %r9;\
> > > > +   FRAME_END
> > > 
> > > I'm not sure what above is doing.
> > > There is already 'push rbp; mov rbp,rsp' at the beginning of generated
> > > code and with above the stack trace will show two function at the same ip?
> > > since there were no calls between them?
> > > I think the stack walker will get even more confused?
> > > Also the JIT of bpf_call insn will emit variable number of push/pop
> > > around the call and I definitely don't want to add extra push rbp
> > > there, since it's the critical path and callee will do its own
> > > push rbp.
> > > Also there are push/pops emitted around div/mod
> > > and there is indirect goto emitted as well for bpf_tail_call
> > > that jumps into different function body without touching
> > > current stack.
> > 
> > Hm, I'm not sure I follow.  Let me try to explain my understanding.
> > 
> > As you mentioned, the generated code sets up the frame pointer.  From
> > emit_prologue():
> > 
> > EMIT1(0x55); /* push rbp */
> > EMIT3(0x48, 0x89, 0xE5); /* mov rbp,rsp */
> > 
> > And then later, do_jit() can generate a call into the functions in
> > bpf_jit.S.  For example:
> > 
> > func = CHOOSE_LOAD_FUNC(imm32, sk_load_word);
> > ...
> > EMIT1_off32(0xE8, jmp_offset); /* call */
> > 
> > So the functions in bpf_jit.S are being called by the generated code.
> > They're not part of the generated code itself.  So they're callees and
> > need to create their own stack frame before they call out to somewhere
> > else.
> > 
> > Or did I miss something?
> 
> yes. all correct.
> This particular patch is ok, since it adds to
> bpf_slow_path_common and as the name says it's slow and rare,
> but wanted to make sure the rest of it is understood.
> 
> > > Also none of the JITed function are dwarf annotated.
> > 
> > But what does that have to do with frame pointers?
> 
> nothing, but then why do you need
> .type name, @function
> annotations in another patch?

Those are ELF function annotations which are needed so that stacktool
can find and analyze all callable functions (and they're a good idea
anyway for the sake of other tooling).

Obviously we can't annotate the JIT functions which have no name, but
that's ok.  As long as they're doing the right thing with respect to
frame pointers, the stack dump code will still be able to see their
frames (and that they're associated with generated code).

> > > I could be missing something. I think either this patch
> > > is not need or you need to teach the tool to ignore
> > > all JITed stuff. I don't think it's practical to annotate
> > > everything. Different JITs do their own magic.
> > > s390 JIT is even more fancy.
> > 
> > Well, but the point of these patches isn't to make the tool happy.  It's
> > really to make sure that runtime stack traces can be made reliable.
> > Maybe I'm missing something but I don't see why JIT code can't honor
> > CONFIG_FRAME_POINTER just like any other

Re: [PATCH net 1/3] sctp: fix the transport dead race check by using atomic_add_unless on refcnt

2016-01-22 Thread Vlad Yasevich

On 01/21/2016 12:49 PM, Xin Long wrote:
> Now when __sctp_lookup_association is running in BH, it will try to
> check if t->dead is set, but meanwhile other CPUs may be freeing this
> transport and this assoc and if it happens that
> __sctp_lookup_association checked t->dead a bit too early, it may think
> that the association is still good while it was already freed.
> 
> So we fix this race by using atomic_add_unless in sctp_transport_hold.
> After we get one transport from hashtable, we will hold it only when
> this transport's refcnt is not 0, so that we can make sure t->asoc
> cannot be freed before we hold the asoc again.

atomic_add_unless() uses atomic_read() to check the value.  Since there
don't appear to be any barriers, what guarantees that the value
read will not have been modified in another thread under a proper lock?


> 
> Note that sctp association is not freed using RCU so we can't use
> atomic_add_unless() with it as it may just be too late for that either.
> 
> Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
> Reported-by: Vlad Yasevich 
> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/structs.h |  2 +-
>  net/sctp/input.c   | 17 +++--
>  net/sctp/transport.c   |  4 ++--
>  3 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 20e7212..344da04 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -955,7 +955,7 @@ void sctp_transport_route(struct sctp_transport *, union 
> sctp_addr *,
>  void sctp_transport_pmtu(struct sctp_transport *, struct sock *sk);
>  void sctp_transport_free(struct sctp_transport *);
>  void sctp_transport_reset_timers(struct sctp_transport *);
> -void sctp_transport_hold(struct sctp_transport *);
> +int sctp_transport_hold(struct sctp_transport *);
>  void sctp_transport_put(struct sctp_transport *);
>  void sctp_transport_update_rto(struct sctp_transport *, __u32);
>  void sctp_transport_raise_cwnd(struct sctp_transport *, __u32, __u32);
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index bf61dfb..49d2cc7 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -935,15 +935,22 @@ static struct sctp_association 
> *__sctp_lookup_association(
>   struct sctp_transport **pt)
>  {
>   struct sctp_transport *t;
> + struct sctp_association *asoc = NULL;
>  
> + rcu_read_lock();
>   t = sctp_addrs_lookup_transport(net, local, peer);
> - if (!t || t->dead)
> - return NULL;
> + if (!t || !sctp_transport_hold(t))
> + goto out;
>  
> - sctp_association_hold(t->asoc);
> + asoc = t->asoc;
> + sctp_association_hold(asoc);

I don't think you can modify the reference count on a transport, let alone
the association outside of a lock.

-vlad

>   *pt = t;
>  
> - return t->asoc;
> + sctp_transport_put(t);
> +
> +out:
> + rcu_read_unlock();
> + return asoc;
>  }
>  
>  /* Look up an association. protected by RCU read lock */
> @@ -955,9 +962,7 @@ struct sctp_association *sctp_lookup_association(struct 
> net *net,
>  {
>   struct sctp_association *asoc;
>  
> - rcu_read_lock();
>   asoc = __sctp_lookup_association(net, laddr, paddr, transportp);
> - rcu_read_unlock();
>  
>   return asoc;
>  }
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index aab9e3f..69f3799 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -296,9 +296,9 @@ void sctp_transport_route(struct sctp_transport 
> *transport,
>  }
>  
>  /* Hold a reference to a transport.  */
> -void sctp_transport_hold(struct sctp_transport *transport)
> +int sctp_transport_hold(struct sctp_transport *transport)
>  {
> - atomic_inc(&transport->refcnt);
> + return atomic_add_unless(&transport->refcnt, 1, 0);
>  }
>  
>  /* Release a reference to a transport and clean up
>

Re: struct pid memory leak

2016-01-22 Thread Eric Dumazet

CC netdev, as it looks some af_unix issue ...

On Fri, 2016-01-22 at 16:08 +0100, Dmitry Vyukov wrote:
> Hello,
> 
> The following program causes struct pid memory leak:
> 
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> long r[37];
> 
> void* thr(void* arg)
> {
>   switch ((long)arg) {
>   case 0:
> r[0] = syscall(SYS_mmap, 0x2000ul, 0xd000ul, 0x3ul, 0x32ul,
>0xul, 0x0ul);
> break;
>   case 1:
> r[1] = syscall(SYS_socketpair, 0x1ul, 0x2ul, 0x0ul, 0x2ffcul, 0,
>0);
> if (r[1] != -1)
>   r[2] = *(uint32_t*)0x20001000;
> break;
>   case 2:
> r[3] =
> syscall(SYS_accept, r[2], 0x2ffbul, 0x2ffful, 0, 0, 0);
> break;
>   case 3:
> r[4] = syscall(SYS_socketpair, 0x1ul, 0x1ul, 0x0ul, 0x20001ff8ul, 0,
>0);
> if (r[4] != -1)
>   r[5] = *(uint32_t*)0x20001ff8;
> if (r[4] != -1)
>   r[6] = *(uint32_t*)0x20001ffc;
> break;
>   case 4:
> memcpy((void*)0x2000bf5c,
>"\xd4\x37\x4c\x81\xff\x25\x00\xf7\x44\x0d\x1a\xe2\x4d\xae"
>"\x17\x36\xb0\xef\x85\xd0\xb6\xa2\x0a\x4c\x29\xf0\x43\x3c"
>"\x2b\xab\xdf\x9f\x3e\x4b\x9c\x1b\xb0\x36\xce\xe7\x14\x2b"
>"\xa4\x33\x47\xd5\x58\x76\x63\x83\x71\xb3\x95\x37\xca\x25"
>"\x93\x3f\x46\xd7\xc0\x8f\x8e\x2a\xcf\x0d\x60\xb7\x62\xac"
>"\xd9\xaf\x6e\x88\x3f\xe0\xbf\x94\xc3\x57\x74\x8d\x22\xed"
>"\x61\x71\x85\x10\x64\x2d\x50\xdf\xae\x9a\xdd\xa2\x5e\x28"
>"\xa3\xf8\x14\xf0\x94\x4a\xac\x82\x45\xed\x85\x7a\xb6\x2b"
>"\xef\xb4\x0b\x78\xb8\x92\x30\xcc\x5d\xcc\x07\xbf\x70\x4e"
>"\x1c\x10\x38\xde\x89\x58\x8b\x87\x97\xc9\x6a\x62\x84\x3b"
>"\xcd\x37\xbb\x8d\x41\x50\x65\x24\xa8\x90\x85\xa7\x51\x32"
>"\x58\xf9\x71\xb3\x0b\xf0\x0f\xe6\xc4\x81",
>164);
> r[8] = syscall(SYS_write, r[5], 0x2000bf5cul, 0xa4ul, 0, 0, 0);
> break;
>   case 5:
> *(uint32_t*)0x2000cb16 = (uint32_t)0x20;
> *(uint32_t*)0x2000cb1a = (uint32_t)0xfffd;
> *(uint64_t*)0x2000cb1e = (uint64_t)0x1;
> *(uint32_t*)0x2000cb26 = (uint32_t)0xd51;
> *(uint32_t*)0x2000cb2a = (uint32_t)0x3;
> *(uint32_t*)0x2000cb2e = (uint32_t)0x9;
> *(uint32_t*)0x2000cb32 = (uint32_t)0x9;
> r[16] = syscall(SYS_write, r[5], 0x2000cb16ul, 0x20ul, 0, 0, 0);
> break;
>   case 6:
> *(uint32_t*)0x20005ffc = (uint32_t)0x7;
> r[18] = syscall(SYS_setsockopt, r[6], 0x1ul, 0x10ul, 0x20005ffcul,
> 0x4ul, 0);
> break;
>   case 7:
> memcpy((void*)0x20003000,
>"\xad\xd4\xf6\xb6\x5d\x21\x41\x96\x29\xc7\x46\x59\xb5\x12"
>"\x13\x1f\xc2\xab\x18\x66\x38\x2f\x01\xd0\x78\x07\x19\xe4"
>"\x2f\xac\xa5\x81\xc9\x01\x6f\x8d\xeb\x2f\x06\x23\xc8\x42"
>"\xf8\x6e\x04\xf6\xcf\x7e\x76\x1a\xb8\xe3\xff\x45\x30\x9b"
>"\x0a\x9a\x0d\x1a\x6d\xfe\x01\x94\xc3\xc6\xfb\xc7\xd2\x7d"
>"\xe3\x5f\xc9\xdb\xa8\xfc\x9a\x0c\xdf\x4a\xf9\x6c\xf5\xcd"
>"\x20\x90\x16\xd6\x2a\xec\x79\xac\x6a\x04\x9d\x92\xd3\x7d"
>"\x2c\xf5\x24\x60\xcc\x57\xb1\x1e\x2a\xf9\x33\x54\x7b\xd8"
>"\x5b\x23\x26\x79\xdb\x89\x72\xf7\x17\xe0\x1c\x1f\x2e\xc0"
>"\x23\x94\xc5\xb1\x7d\xea\x84\xd1\x40\x43\x8a\xc1\x89\xa2"
>"\x72\xd8\x8a\xff\xf7\x30\xc9\x96\x5c\x84\x58\x4f\x7e\x04"
>"\x84\x45\x1b\x83\x51\xb6\x90\x4a\x17\x4e\x95\x09\xb9\x37"
>"\x6e\xe6\xb0\x5e\xb5\x11\xb6\x2f\x06\x75\x31\x57\xda\xc2"
>"\xfe\x5d\x84\x0e\x6a\x29\xb7\xe6\x22\xf2\xc9\x00\xa5\x80"
>"\x0f\x48\x42\x38\x5a\x66\x32\x91\x5a\xe5\x5e\xfe\xce\xc0"
>"\x98\x16\x19\x39\x21\x4b\x60\xe1\xa5\x7a\xba\x62\xd4\x38"
>"\x96\x2d\x79\x09\x30\x2c\x75\x54\x68\xca",
>234);
> r[20] = syscall(SYS_sendto, r[5], 0x20003000ul, 0xeaul, 0x4000ul,
> 0x20003000ul, 0x0ul);
> break;
>   case 8:
> *(uint64_t*)0x2000a000 = (uint64_t)0x2000a000;
> *(uint32_t*)0x2000a008 = (uint32_t)0x1c;
> *(uint64_t*)0x2000a010 = (uint64_t)0x2000ac60;
> *(uint64_t*)0x2000a018 = (uint64_t)0x4;
> *(uint64_t*)0x2000a020 = (uint64_t)0x2000a8b3;
> *(uint64_t*)0x2000a028 = (uint64_t)0x1000;
> *(uint32_t*)0x2000a030 = (uint32_t)0x0;
> *(uint64_t*)0x2000ac60 = (uint64_t)0x2000afb4;
> *(uint64_t*)0x2000ac68 = (uint64_t)0x7a;
> *(uint64_t*)0x2000ac70 = (uint64_t)0x2000affe;
> *(uint64_t*)0x2000ac78 = (uint64_t)0x10;
> *(uint64_t*)0x2000ac80 = (uint64_t)0x2000afe2;
> *(uint64_t*)0x2000ac88 = (uint64_t)0x2f;
> *(uint64_t*)0x2000ac90 = (uint64_t)0x2000afdb;
> *(uint64_t*)0x2000ac98 = (uint64_t)0xd4;
> r[36] =
> syscall(SYS_recvmsg, r[6], 0x2000a000ul, 0x12100ul, 0, 0, 0);
> break;
>   }
>   return 0;
> }
> 
> int main()
> {
>   long i;
>   pthread

Re: [RFC PATCH] ethtool: add IPv6 to the NFC API

2016-01-22 Thread Edward Cree

On 21/01/16 22:48, Alexander Duyck wrote:
> On Thu, Jan 21, 2016 at 11:14 AM, Edward Cree  wrote:
>> +   __be32  spi;
> Is this supposed to be the flow label, or is this the IPSec security
> parameter index?  If it is the flow label you may want to rename it.
> If it is supposed to be the IPSec security parameter index this might
> belong in a different flow definition since it is not actually a part
> of the IPv6 header.
It's the IPSec SPI; I just blindly copied what ethtool_ah_espip4_spec had.
I guess splitting out three different spec structs as per ipv4 would make
this somewhat clearer.
Would the flow label be useful thing to include?  I would have thought it'd
be a bit too short-lived normally.
>> +   __u8tos;
>> +   __u8proto;
>> +};
>> +
> Technically the name of the field for proto is nexthdr.
But will NICs filter on the actual nexthdr, or on the *last* nexthdr in the
chain of IPv6 options?  If the latter, calling it nexthdr might be misleading.
(I've just checked what sfc will do, it filters on the last nexthdr.)

Will spin a v2 shortly.

-Ed

Re: Optimizing instruction-cache, more packets at each stage

2016-01-22 Thread Tom Herbert

On Fri, Jan 22, 2016 at 4:33 AM, Jesper Dangaard Brouer
 wrote:
> On Thu, 21 Jan 2016 09:48:36 -0800
> Eric Dumazet  wrote:
>
>> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
>>
>> > Sure, but the receive path is parallelized.
>>
>> This is true for multiqueue processing, assuming you can dedicate many
>> cores to process RX.
>>
>> >  Improving parallelism has
>> > continuously shown to have much more impact than attempting to
>> > optimize for cache misses. The primary goal is not to drive 100Gbps
>> > with 64 packets from a single CPU. It is one benchmark of many we
>> > should look at to measure efficiency of the data path, but I've yet to
>> > see any real workload that requires that...
>> >
>> > Regardless of anything, we need to load packet headers into CPU cache
>> > to do protocol processing. I'm not sure I see how trying to defer that
>> > as long as possible helps except in cases where the packet is crossing
>> > CPU cache boundaries and can eliminate cache misses completely (not
>> > just move them around from one function to another).
>>
>> Note that some user space use multiple core (or hyper threads) to
>> implement a pipeline, using a single RX queue.
>>
>> One thread can handle one stage (device RX drain) and prefetch data into
>> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
>>
>> The second thread process packets with headers already in L1/L2
>
> I agree. I've heard experiences where DPDK users use 2 core for RX, and
> 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
> with full Internet routing table look up.
>
> One of the ideas behind my alf_queue, is that it can be used for
> efficiently distributing object (pointers) between threads.
> 1. because it only transfers the pointers (not touching object), and
> 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
> Thus, lower in the message passing cost between threads.
>
>
>> This way, the ~100 ns (or even more if you also consider skb
>> allocations) penalty to bring packet headers do not hurt PPS.
>
> I've studied the allocation cost in great detail, thus let me share my
> numbers, 100 ns is too high:
>
> Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
> The cycles count should be comparable with other CPUs, but that nanosec
> measurement is affected by the very high clock freq of this CPU.
>
> Kmem_cache fastpath "recycle" case:
>  SLUB => 44 cycles(tsc) 11.205 ns
>  SLAB => 96 cycles(tsc) 24.119 ns.
>
> The problem is that real use-cases in the network stack, almost always
> hit the slowpath in kmem_cache allocators.
>
> Kmem_cache "slowpath" case:
>  SLUB => 117 cycles(tsc) 29.276 ns
>  SLAB => 101 cycles(tsc) 25.342 ns
>
> I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
> by introducing a bulk API, which amortize the needed sync-mechanisms.
>
> Kmem_cache using bulk API:
>  SLUB => 37 cycles(tsc) 9.280 ns
>  SLAB => 20 cycles(tsc) 5.035 ns
>
Hi Jesper,

I am a little confused. I believe the 100ns hit refers specifically
cache miss on packet headers. Memory object allocation seems like
different problem; the latency might depend on cache misses, but it's
not on packet data (which we seem to assume is always a cache miss).
For the cache miss problem on the packet headers I think we really
need to evaluate whether DDIO adequately solves the it (need more
numbers :) ). As I read it, DDIO is enabled by default since Sandy
Bridge-EP and is transparent to both HW and SW. It seems like we
should have seen some sort of measurable benefit by now...

Tom

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

Re: Optimizing instruction-cache, more packets at each stage

2016-01-22 Thread Jesper Dangaard Brouer

On Fri, 22 Jan 2016 09:07:43 -0800
Tom Herbert  wrote:

> On Fri, Jan 22, 2016 at 4:33 AM, Jesper Dangaard Brouer
>  wrote:
> > On Thu, 21 Jan 2016 09:48:36 -0800
> > Eric Dumazet  wrote:
> >  
> >> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
> >>  
> >> > Sure, but the receive path is parallelized.  
> >>
> >> This is true for multiqueue processing, assuming you can dedicate many
> >> cores to process RX.
> >>  
> >> >  Improving parallelism has
> >> > continuously shown to have much more impact than attempting to
> >> > optimize for cache misses. The primary goal is not to drive 100Gbps
> >> > with 64 packets from a single CPU. It is one benchmark of many we
> >> > should look at to measure efficiency of the data path, but I've yet to
> >> > see any real workload that requires that...
> >> >
> >> > Regardless of anything, we need to load packet headers into CPU cache
> >> > to do protocol processing. I'm not sure I see how trying to defer that
> >> > as long as possible helps except in cases where the packet is crossing
> >> > CPU cache boundaries and can eliminate cache misses completely (not
> >> > just move them around from one function to another).  
> >>
> >> Note that some user space use multiple core (or hyper threads) to
> >> implement a pipeline, using a single RX queue.
> >>
> >> One thread can handle one stage (device RX drain) and prefetch data into
> >> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
> >>
> >> The second thread process packets with headers already in L1/L2  
> >
> > I agree. I've heard experiences where DPDK users use 2 core for RX, and
> > 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
> > with full Internet routing table look up.
> >
> > One of the ideas behind my alf_queue, is that it can be used for
> > efficiently distributing object (pointers) between threads.
> > 1. because it only transfers the pointers (not touching object), and
> > 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
> > Thus, lower in the message passing cost between threads.
> >
> >  
> >> This way, the ~100 ns (or even more if you also consider skb
> >> allocations) penalty to bring packet headers do not hurt PPS.  
> >
> > I've studied the allocation cost in great detail, thus let me share my
> > numbers, 100 ns is too high:
> >
> > Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
> > The cycles count should be comparable with other CPUs, but that nanosec
> > measurement is affected by the very high clock freq of this CPU.
> >
> > Kmem_cache fastpath "recycle" case:
> >  SLUB => 44 cycles(tsc) 11.205 ns
> >  SLAB => 96 cycles(tsc) 24.119 ns.
> >
> > The problem is that real use-cases in the network stack, almost always
> > hit the slowpath in kmem_cache allocators.
> >
> > Kmem_cache "slowpath" case:
> >  SLUB => 117 cycles(tsc) 29.276 ns
> >  SLAB => 101 cycles(tsc) 25.342 ns
> >
> > I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
> > by introducing a bulk API, which amortize the needed sync-mechanisms.
> >
> > Kmem_cache using bulk API:
> >  SLUB => 37 cycles(tsc) 9.280 ns
> >  SLAB => 20 cycles(tsc) 5.035 ns
> >  
> Hi Jesper,
> 
> I am a little confused. I believe the 100ns hit refers specifically
> cache miss on packet headers. 

Sorry, I misread Eric's statement.  You are right.

> Memory object allocation seems like different problem;

Yes, it is, I just misread it, and though we were talking about memory
object alloc overhead. Sorry, for the confusion.


> the latency might depend on cache misses, but it's
> not on packet data (which we seem to assume is always a cache miss).
> For the cache miss problem on the packet headers I think we really
> need to evaluate whether DDIO adequately solves the it (need more
> numbers :) ). As I read it, DDIO is enabled by default since Sandy
> Bridge-EP and is transparent to both HW and SW. It seems like we
> should have seen some sort of measurable benefit by now...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH 23/33] x86/asm/bpf: Create stack frames in bpf_jit.S

2016-01-22 Thread Alexei Starovoitov

On Fri, Jan 22, 2016 at 09:58:04AM -0600, Josh Poimboeuf wrote:
> On Thu, Jan 21, 2016 at 08:18:46PM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 21, 2016 at 09:55:31PM -0600, Josh Poimboeuf wrote:
> > > On Thu, Jan 21, 2016 at 06:44:28PM -0800, Alexei Starovoitov wrote:
> > > > On Thu, Jan 21, 2016 at 04:49:27PM -0600, Josh Poimboeuf wrote:
> > > > > bpf_jit.S has several callable non-leaf functions which don't honor
> > > > > CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > > > 
> > > > > Create a stack frame before the call instructions when
> > > > > CONFIG_FRAME_POINTER is enabled.
> > > > > 
> > > > > Signed-off-by: Josh Poimboeuf 
> > > > > Cc: Alexei Starovoitov 
> > > > > Cc: netdev@vger.kernel.org
> > > > > ---
> > > > >  arch/x86/net/bpf_jit.S | 9 +++--
...
> > > > >  /* rsi contains offset and can be scratched */
> > > > >  #define bpf_slow_path_common(LEN)\
> > > > > + lea -MAX_BPF_STACK + 32(%rbp), %rdx;\
> > > > > + FRAME_BEGIN;\
> > > > >   mov %rbx, %rdi; /* arg1 == skb */   \
> > > > >   push%r9;\
> > > > >   pushSKBDATA;\
> > > > >  /* rsi already has offset */ \
> > > > >   mov $LEN,%ecx;  /* len */   \
> > > > > - lea - MAX_BPF_STACK + 32(%rbp),%rdx;
> > > > > \
> > > > >   callskb_copy_bits;  \
> > > > >   test%eax,%eax;  \
> > > > >   pop SKBDATA;\
> > > > > - pop %r9;
> > > > > + pop %r9;\
> > > > > + FRAME_END
...
> > > Well, but the point of these patches isn't to make the tool happy.  It's
> > > really to make sure that runtime stack traces can be made reliable.
> > > Maybe I'm missing something but I don't see why JIT code can't honor
> > > CONFIG_FRAME_POINTER just like any other code.
> > 
> > It can if there is no performance cost added.
> 
> CONFIG_FRAME_POINTER always adds a small performance cost but as you
> mentioned it only affects the slow path here.  And hopefully we'll soon
> have an in-kernel DWARF unwinder on x86 so we can get rid of the need
> for frame pointers.

ok. fair enough.
Acked-by: Alexei Starovoitov

Re: [PATCH net 1/3] sctp: fix the transport dead race check by using atomic_add_unless on refcnt

2016-01-22 Thread Marcelo Ricardo Leitner

On Fri, Jan 22, 2016 at 11:50:20AM -0500, Vlad Yasevich wrote:
> On 01/21/2016 12:49 PM, Xin Long wrote:
> > Now when __sctp_lookup_association is running in BH, it will try to
> > check if t->dead is set, but meanwhile other CPUs may be freeing this
> > transport and this assoc and if it happens that
> > __sctp_lookup_association checked t->dead a bit too early, it may think
> > that the association is still good while it was already freed.
> > 
> > So we fix this race by using atomic_add_unless in sctp_transport_hold.
> > After we get one transport from hashtable, we will hold it only when
> > this transport's refcnt is not 0, so that we can make sure t->asoc
> > cannot be freed before we hold the asoc again.
> 
> atomic_add_unless() uses atomic_read() to check the value.  Since there
> don't appear to be any barriers, what guarantees that the value
> read will not have been modified in another thread under a proper lock?
> 

atomic_read() is used only as a starting point.  If it got changed in
between, the new current value (return of atomic_cmpxchg) will be used
then.

> > 
> > Note that sctp association is not freed using RCU so we can't use
> > atomic_add_unless() with it as it may just be too late for that either.
> > 
> > Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
> > Reported-by: Vlad Yasevich 
> > Signed-off-by: Xin Long 
> > ---
> >  include/net/sctp/structs.h |  2 +-
> >  net/sctp/input.c   | 17 +++--
> >  net/sctp/transport.c   |  4 ++--
> >  3 files changed, 14 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> > index 20e7212..344da04 100644
> > --- a/include/net/sctp/structs.h
> > +++ b/include/net/sctp/structs.h
> > @@ -955,7 +955,7 @@ void sctp_transport_route(struct sctp_transport *, 
> > union sctp_addr *,
> >  void sctp_transport_pmtu(struct sctp_transport *, struct sock *sk);
> >  void sctp_transport_free(struct sctp_transport *);
> >  void sctp_transport_reset_timers(struct sctp_transport *);
> > -void sctp_transport_hold(struct sctp_transport *);
> > +int sctp_transport_hold(struct sctp_transport *);
> >  void sctp_transport_put(struct sctp_transport *);
> >  void sctp_transport_update_rto(struct sctp_transport *, __u32);
> >  void sctp_transport_raise_cwnd(struct sctp_transport *, __u32, __u32);
> > diff --git a/net/sctp/input.c b/net/sctp/input.c
> > index bf61dfb..49d2cc7 100644
> > --- a/net/sctp/input.c
> > +++ b/net/sctp/input.c
> > @@ -935,15 +935,22 @@ static struct sctp_association 
> > *__sctp_lookup_association(
> > struct sctp_transport **pt)
> >  {
> > struct sctp_transport *t;
> > +   struct sctp_association *asoc = NULL;
> >  
> > +   rcu_read_lock();
> > t = sctp_addrs_lookup_transport(net, local, peer);
> > -   if (!t || t->dead)
> > -   return NULL;
> > +   if (!t || !sctp_transport_hold(t))
> > +   goto out;
> >  
> > -   sctp_association_hold(t->asoc);
> > +   asoc = t->asoc;
> > +   sctp_association_hold(asoc);
> 
> I don't think you can modify the reference count on a transport, let alone
> the association outside of a lock.

The transport memory is not freed, as it's protected by rcu_read_lock(),
so we are safe to use it yet.
atomic_ operations include an embedded lock instruction protecting the
counter itself, there shouldn't be a need to use another lock around it.

And in the code above, as we could grab a hold on the transport, means
the association was not freed yet because transports hold a ref on
assoc. That's why the dance: hold(transport) hold(assoc) put(transport)

  Marcelo

> 
> -vlad
> 
> > *pt = t;
> >  
> > -   return t->asoc;
> > +   sctp_transport_put(t);
> > +
> > +out:
> > +   rcu_read_unlock();
> > +   return asoc;
> >  }
> >  
> >  /* Look up an association. protected by RCU read lock */
> > @@ -955,9 +962,7 @@ struct sctp_association *sctp_lookup_association(struct 
> > net *net,
> >  {
> > struct sctp_association *asoc;
> >  
> > -   rcu_read_lock();
> > asoc = __sctp_lookup_association(net, laddr, paddr, transportp);
> > -   rcu_read_unlock();
> >  
> > return asoc;
> >  }
> > diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> > index aab9e3f..69f3799 100644
> > --- a/net/sctp/transport.c
> > +++ b/net/sctp/transport.c
> > @@ -296,9 +296,9 @@ void sctp_transport_route(struct sctp_transport 
> > *transport,
> >  }
> >  
> >  /* Hold a reference to a transport.  */
> > -void sctp_transport_hold(struct sctp_transport *transport)
> > +int sctp_transport_hold(struct sctp_transport *transport)
> >  {
> > -   atomic_inc(&transport->refcnt);
> > +   return atomic_add_unless(&transport->refcnt, 1, 0);
> >  }
> >  
> >  /* Release a reference to a transport and clean up
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/

Re: [PATCH 31/33] bpf: Add __bpf_prog_run() to stacktool whitelist

2016-01-22 Thread Alexei Starovoitov

On Thu, Jan 21, 2016 at 10:13:02PM -0600, Josh Poimboeuf wrote:
> On Thu, Jan 21, 2016 at 06:55:41PM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 21, 2016 at 04:49:35PM -0600, Josh Poimboeuf wrote:
> > > stacktool reports the following false positive warnings:
> > > 
> > >   stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x5c: sibling call from 
> > > callable instruction with changed frame pointer
> > >   stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x60: function has 
> > > unreachable instruction
> > >   stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x64: function has 
> > > unreachable instruction
> > >   [...]
> > > 
> > > It's confused by the following dynamic jump instruction in
> > > __bpf_prog_run()::
> > > 
> > >   jmp *(%r12,%rax,8)
> > > 
> > > which corresponds to the following line in the C code:
> > > 
> > >   goto *jumptable[insn->code];
> > > 
> > > There's no way for stacktool to deterministically find all possible
> > > branch targets for a dynamic jump, so it can't verify this code.
> > > 
> > > In this case the jumps all stay within the function, and there's nothing
> > > unusual going on related to the stack, so we can whitelist the function.
> > 
> > well, few things are very unusual in this function.
> > did you see what JMP_CALL does? it's a call into a different function,
> > but not like typical indirect call. Will it be ok as well?
> > 
> > In general it's not possible for any tool to identify all possible
> > branch targets. bpf programs can be loaded on the fly and
> > jumping sequence will change.
> > So if this marking says 'don't bother analyzing this function because
> > it does sane stuff' that's probably not the case.
> > If this marking says 'don't bother analyzing, the stack may be crazy
> > from here on' then it's ok.
> 
> So the tool doesn't need to follow all possible call targets.  Instead
> it just verifies that all functions follow the frame pointer convention.
> That way it doesn't matter *which* function is being called because they
> all do the right thing.
> 
> But it *does* need to follow all jump targets, so that it can analyze
> all possible code paths within the function itself.  With a dynamic
> jump, it can't do that.
> 
> So the JMP_CALL is fine, but the goto *jumptable[insn->code] isn't.
> (And BTW that's the only occurrence of such a dynamic jump table in the
> entire kernel.)

Acked-by: Alexei Starovoitov

Re: [PATCH 23/33] x86/asm/bpf: Create stack frames in bpf_jit.S

2016-01-22 Thread Josh Poimboeuf

On Fri, Jan 22, 2016 at 09:18:23AM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 22, 2016 at 09:58:04AM -0600, Josh Poimboeuf wrote:
> > On Thu, Jan 21, 2016 at 08:18:46PM -0800, Alexei Starovoitov wrote:
> > > On Thu, Jan 21, 2016 at 09:55:31PM -0600, Josh Poimboeuf wrote:
> > > > On Thu, Jan 21, 2016 at 06:44:28PM -0800, Alexei Starovoitov wrote:
> > > > > On Thu, Jan 21, 2016 at 04:49:27PM -0600, Josh Poimboeuf wrote:
> > > > > > bpf_jit.S has several callable non-leaf functions which don't honor
> > > > > > CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > > > > 
> > > > > > Create a stack frame before the call instructions when
> > > > > > CONFIG_FRAME_POINTER is enabled.
> > > > > > 
> > > > > > Signed-off-by: Josh Poimboeuf 
> > > > > > Cc: Alexei Starovoitov 
> > > > > > Cc: netdev@vger.kernel.org
> > > > > > ---
> > > > > >  arch/x86/net/bpf_jit.S | 9 +++--
> ...
> > > > > >  /* rsi contains offset and can be scratched */
> > > > > >  #define bpf_slow_path_common(LEN)  \
> > > > > > +   lea -MAX_BPF_STACK + 32(%rbp), %rdx;\
> > > > > > +   FRAME_BEGIN;\
> > > > > > mov %rbx, %rdi; /* arg1 == skb */   \
> > > > > > push%r9;\
> > > > > > pushSKBDATA;\
> > > > > >  /* rsi already has offset */   \
> > > > > > mov $LEN,%ecx;  /* len */   \
> > > > > > -   lea - MAX_BPF_STACK + 32(%rbp),%rdx;
> > > > > > \
> > > > > > callskb_copy_bits;  \
> > > > > > test%eax,%eax;  \
> > > > > > pop SKBDATA;\
> > > > > > -   pop %r9;
> > > > > > +   pop %r9;\
> > > > > > +   FRAME_END
> ...
> > > > Well, but the point of these patches isn't to make the tool happy.  It's
> > > > really to make sure that runtime stack traces can be made reliable.
> > > > Maybe I'm missing something but I don't see why JIT code can't honor
> > > > CONFIG_FRAME_POINTER just like any other code.
> > > 
> > > It can if there is no performance cost added.
> > 
> > CONFIG_FRAME_POINTER always adds a small performance cost but as you
> > mentioned it only affects the slow path here.  And hopefully we'll soon
> > have an in-kernel DWARF unwinder on x86 so we can get rid of the need
> > for frame pointers.
> 
> ok. fair enough.
> Acked-by: Alexei Starovoitov 

Thanks!

Can I assume your ack also applies to the previous patch which adds the
ELF annotations ("x86/asm/bpf: Annotate callable functions")?

-- 
Josh

Re: [PATCH 23/33] x86/asm/bpf: Create stack frames in bpf_jit.S

2016-01-22 Thread Alexei Starovoitov

On Fri, Jan 22, 2016 at 11:36:14AM -0600, Josh Poimboeuf wrote:
> On Fri, Jan 22, 2016 at 09:18:23AM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 22, 2016 at 09:58:04AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Jan 21, 2016 at 08:18:46PM -0800, Alexei Starovoitov wrote:
> > > > On Thu, Jan 21, 2016 at 09:55:31PM -0600, Josh Poimboeuf wrote:
> > > > > On Thu, Jan 21, 2016 at 06:44:28PM -0800, Alexei Starovoitov wrote:
> > > > > > On Thu, Jan 21, 2016 at 04:49:27PM -0600, Josh Poimboeuf wrote:
> > > > > > > bpf_jit.S has several callable non-leaf functions which don't 
> > > > > > > honor
> > > > > > > CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > > > > > 
> > > > > > > Create a stack frame before the call instructions when
> > > > > > > CONFIG_FRAME_POINTER is enabled.
> > > > > > > 
> > > > > > > Signed-off-by: Josh Poimboeuf 
> > > > > > > Cc: Alexei Starovoitov 
> > > > > > > Cc: netdev@vger.kernel.org
> > > > > > > ---
> > > > > > >  arch/x86/net/bpf_jit.S | 9 +++--
> > ...
> > > > > > >  /* rsi contains offset and can be scratched */
> > > > > > >  #define bpf_slow_path_common(LEN)\
> > > > > > > + lea -MAX_BPF_STACK + 32(%rbp), %rdx;\
> > > > > > > + FRAME_BEGIN;\
> > > > > > >   mov %rbx, %rdi; /* arg1 == skb */   \
> > > > > > >   push%r9;\
> > > > > > >   pushSKBDATA;\
> > > > > > >  /* rsi already has offset */ \
> > > > > > >   mov $LEN,%ecx;  /* len */   \
> > > > > > > - lea - MAX_BPF_STACK + 32(%rbp),%rdx;
> > > > > > > \
> > > > > > >   callskb_copy_bits;  \
> > > > > > >   test%eax,%eax;  \
> > > > > > >   pop SKBDATA;\
> > > > > > > - pop %r9;
> > > > > > > + pop %r9;\
> > > > > > > + FRAME_END
> > ...
> > > > > Well, but the point of these patches isn't to make the tool happy.  
> > > > > It's
> > > > > really to make sure that runtime stack traces can be made reliable.
> > > > > Maybe I'm missing something but I don't see why JIT code can't honor
> > > > > CONFIG_FRAME_POINTER just like any other code.
> > > > 
> > > > It can if there is no performance cost added.
> > > 
> > > CONFIG_FRAME_POINTER always adds a small performance cost but as you
> > > mentioned it only affects the slow path here.  And hopefully we'll soon
> > > have an in-kernel DWARF unwinder on x86 so we can get rid of the need
> > > for frame pointers.
> > 
> > ok. fair enough.
> > Acked-by: Alexei Starovoitov 
> 
> Thanks!
> 
> Can I assume your ack also applies to the previous patch which adds the
> ELF annotations ("x86/asm/bpf: Annotate callable functions")?

Yes. Thanks.

Vr headset,vr box,Vr glasses,from the original factory

2016-01-22 Thread Raymond

Dear 
Happy 2016, 
Hottest VR box,for mobile phone,Watch 3D movie and vr video,Play vr games
Price only $6/pc now,only from the arcpeaks factory 
Alibaba 
link,http://arcpeaks.en.alibaba.com/product/60401287002-802634136/2016_hottest_product_virtual_reality_3d_games_and_video_3d_vr_glasses.html
Please feel free to contact me for more details 
Thanks
Best Regards
Ray
arcpeaks.en.alibaba.com
Skype:sixiwenzhi
email:raym...@arcpeaks.com

Re: [PATCH 00/33] Compile-time stack metadata validation

2016-01-22 Thread Chris J Arges

On Thu, Jan 21, 2016 at 04:49:04PM -0600, Josh Poimboeuf wrote:
> This is v16 of the compile-time stack metadata validation patch set,
> along with proposed fixes for most of the warnings it found.  It's based
> on the tip/master branch.
>
Josh,

Looks good, with my config [1] I do still get a few warnings building
linux/linux-next.

Here are the warnings:
$ grep ^stacktool build.log | grep -v staging
stacktool: arch/x86/kvm/vmx.o: vmx_handle_external_intr()+0x67: call without 
frame pointer save/setup
stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: return without 
frame pointer restore
stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x89: duplicate frame 
pointer save
stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x8a: duplicate frame 
pointer setup
stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: frame pointer 
state mismatch
stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x0: frame pointer 
state mismatch
stacktool: fs/reiserfs/ibalance.o: .text: unexpected end of section
stacktool: fs/reiserfs/tail_conversion.o: .text: unexpected end of section

For vmx_handle_external_intr, I'm wondering if ignoring this function is the
best option.

--

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e2951b6..d19dfb2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -8398,6 +8399,7 @@ static void vmx_handle_external_intr(struct kvm_vcpu 
*vcpu)
} else
local_irq_enable();
 }
+STACKTOOL_IGNORE_FUNC(vmx_handle_external_intr);
 
 static bool vmx_has_high_real_mode_segbase(void)
 {

--chris

[1] http://paste.ubuntu.com/14599083/

 
> v15 can be found here:
> 
>   https://lkml.kernel.org/r/cover.1450442274.git.jpoim...@redhat.com
> 
> For more information about the motivation behind this patch set, and
> more details about what it does, see the first patch changelog and
> tools/stacktool/Documentation/stack-validation.txt.
> 
> Patches 1-4 add stacktool and integrate it into the kernel build.
> 
> Patches 5-28 are some proposed fixes for several of the warnings
> reported by stacktool.  They've been compile-tested and boot-tested in a
> VM, but I haven't attempted any meaningful testing for many of them.
> 
> Patches 29-33 add some directories, files, and functions to the
> stacktool whitelist in order to silence false positive warnings.
> 
> v16:
> - fix all allyesconfig warnings, except for staging
> - get rid of STACKTOOL_IGNORE_INSN which is no longer needed
> - remove several whitelists in favor of automatically whitelisting any
>   function with a special instruction like ljmp, lret, or vmrun
> - split up stacktool patch into 3 parts as suggested by Ingo
> - update the global noreturn function list
> - detect noreturn function fallthroughs
> - skip weak functions in noreturn call detection logic
> - add empty function check to noreturn logic
> - allow non-section rela symbols for __ex_table sections
> - support rare switch table case with jmpq *[addr](%rip)
> - don't warn on frame pointer restore without save
> - rearrange patch order a bit
> 
> v15:
> - restructure code for a new cmdline interface "stacktool check" using
>   the new subcommand framework in tools/lib/subcmd
> - fix 32 bit build fail (put __sp at end) in paravirt_types.h patch 10
>   which was reported by 0day
> 
> v14:
> - make tools/include/linux/list.h self-sufficient
> - create FRAME_OFFSET to allow 32-bit code to be able to access function
>   arguments on the stack
> - add FRAME_OFFSET usage in crypto patch 14/24: "Create stack frames in
>   aesni-intel_asm.S"
> - rename "index" -> "idx" to fix build with some compilers
> 
> v13:
> - LDFLAGS order fix from Chris J Arges
> - new warning fix patches from Chris J Arges
> - "--frame-pointer" -> "--check-frame-pointer"
> 
> v12:
> - rename "stackvalidate" -> "stacktool"
> - move from scripts/ to tools/:
>   - makefile rework
>   - make a copy of the x86 insn code (and warn if the code diverges)
>   - use tools/include/linux/list.h
> - move warning macros to a new warn.h file
> - change wording: "stack validation" -> "stack metadata validation"
> 
> v11:
> - attempt to answer the "why" question better in the documentation and
>   commit message
> - s/FP_SAVE/FRAME_BEGIN/ in documentation
> 
> v10:
> - add scripts/mod to directory ignores
> - remove circular dependencies for ignored objects which are built
>   before stackvalidate
> - fix CONFIG_MODVERSIONS incompatibility
> 
> v9:
> - rename FRAME/ENDFRAME -> FRAME_BEGIN/FRAME_END
> - fix jump table issue for when the original instruction is a jump
> - drop paravirt thunk alignment patch
> - add maintainers to CC for proposed warning fixes
> 
> v8:
> - add proposed fixes for warnings
> - fix all memory leaks
> - process ignores earlier and add more ignore checks
> - always assume POPCNT alternative is enabled
> - drop hweight inline as

[RFC PATCH v2] ethtool: add IPv6 to the NFC API

2016-01-22 Thread Edward Cree

Signed-off-by: Edward Cree 
---
changes from v1:
* split out separate spec structs for different flow types
* clarified the proto field in usr_ip6_spec
* changed IP6_USER_FLOW to 0x0e as I noticed there's a gap there

 include/uapi/linux/ethtool.h | 67 
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 57fa390..ad805b9 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -748,6 +748,56 @@ struct ethtool_usrip4_spec {
__u8proto;
 };
 
+/**
+ * struct ethtool_tcpip6_spec - flow specification for TCP/IPv6 etc.
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @psrc: Source port
+ * @pdst: Destination port
+ * @tos: Type-of-service
+ *
+ * This can be used to specify a TCP/IPv6, UDP/IPv6 or SCTP/IPv6 flow.
+ */
+struct ethtool_tcpip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be16  psrc;
+   __be16  pdst;
+   __u8tos;
+};
+
+/**
+ * struct ethtool_ah_espip6_spec - flow specification for IPsec/IPv6
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @spi: Security parameters index
+ * @tos: Type-of-service
+ *
+ * This can be used to specify an IPsec transport or tunnel over IPv6.
+ */
+struct ethtool_ah_espip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be32  spi;
+   __u8tos;
+};
+
+/**
+ * struct ethtool_usrip6_spec - general flow specification for IPv6
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @l4_4_bytes: First 4 bytes of transport (layer 4) header
+ * @tos: Type-of-service
+ * @proto: Transport protocol number (nexthdr after any Extension Headers)
+ */
+struct ethtool_usrip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be32  l4_4_bytes;
+   __u8tos;
+   __u8proto;
+};
+
 union ethtool_flow_union {
struct ethtool_tcpip4_spec  tcp_ip4_spec;
struct ethtool_tcpip4_spec  udp_ip4_spec;
@@ -755,6 +805,12 @@ union ethtool_flow_union {
struct ethtool_ah_espip4_spec   ah_ip4_spec;
struct ethtool_ah_espip4_spec   esp_ip4_spec;
struct ethtool_usrip4_spec  usr_ip4_spec;
+   struct ethtool_tcpip6_spec  tcp_ip6_spec;
+   struct ethtool_tcpip6_spec  udp_ip6_spec;
+   struct ethtool_tcpip6_spec  sctp_ip6_spec;
+   struct ethtool_ah_espip6_spec   ah_ip6_spec;
+   struct ethtool_ah_espip6_spec   esp_ip6_spec;
+   struct ethtool_usrip6_spec  usr_ip6_spec;
struct ethhdr   ether_spec;
__u8hdata[52];
 };
@@ -1367,15 +1423,16 @@ enum ethtool_sfeatures_retval_bits {
 #defineUDP_V4_FLOW 0x02/* hash or spec (udp_ip4_spec) */
 #defineSCTP_V4_FLOW0x03/* hash or spec (sctp_ip4_spec) */
 #defineAH_ESP_V4_FLOW  0x04/* hash only */
-#defineTCP_V6_FLOW 0x05/* hash only */
-#defineUDP_V6_FLOW 0x06/* hash only */
-#defineSCTP_V6_FLOW0x07/* hash only */
+#defineTCP_V6_FLOW 0x05/* hash or spec (tcp_ip6_spec; nfc 
only) */
+#defineUDP_V6_FLOW 0x06/* hash or spec (udp_ip6_spec; nfc 
only) */
+#defineSCTP_V6_FLOW0x07/* hash or spec (sctp_ip6_spec; nfc 
only) */
 #defineAH_ESP_V6_FLOW  0x08/* hash only */
 #defineAH_V4_FLOW  0x09/* hash or spec (ah_ip4_spec) */
 #defineESP_V4_FLOW 0x0a/* hash or spec (esp_ip4_spec) */
-#defineAH_V6_FLOW  0x0b/* hash only */
-#defineESP_V6_FLOW 0x0c/* hash only */
+#defineAH_V6_FLOW  0x0b/* hash or spec (ah_ip6_spec; nfc only) 
*/
+#defineESP_V6_FLOW 0x0c/* hash or spec (esp_ip6_spec; nfc 
only) */
 #defineIP_USER_FLOW0x0d/* spec only (usr_ip4_spec) */
+#defineIP6_USER_FLOW   0x0e/* spec only (usr_ip6_spec; nfc only) */
 #defineIPV4_FLOW   0x10/* hash only */
 #defineIPV6_FLOW   0x11/* hash only */
 #defineETHER_FLOW  0x12/* spec only (ether_spec) */

Re: [PATCH net 1/3] sctp: fix the transport dead race check by using atomic_add_unless on refcnt

2016-01-22 Thread Vlad Yasevich

On 01/22/2016 12:18 PM, Marcelo Ricardo Leitner wrote:
> On Fri, Jan 22, 2016 at 11:50:20AM -0500, Vlad Yasevich wrote:
>> On 01/21/2016 12:49 PM, Xin Long wrote:
>>> Now when __sctp_lookup_association is running in BH, it will try to
>>> check if t->dead is set, but meanwhile other CPUs may be freeing this
>>> transport and this assoc and if it happens that
>>> __sctp_lookup_association checked t->dead a bit too early, it may think
>>> that the association is still good while it was already freed.
>>>
>>> So we fix this race by using atomic_add_unless in sctp_transport_hold.
>>> After we get one transport from hashtable, we will hold it only when
>>> this transport's refcnt is not 0, so that we can make sure t->asoc
>>> cannot be freed before we hold the asoc again.
>>
>> atomic_add_unless() uses atomic_read() to check the value.  Since there
>> don't appear to be any barriers, what guarantees that the value
>> read will not have been modified in another thread under a proper lock?
>>
> 
> atomic_read() is used only as a starting point.  If it got changed in
> between, the new current value (return of atomic_cmpxchg) will be used
> then.
> 
>>>
>>> Note that sctp association is not freed using RCU so we can't use
>>> atomic_add_unless() with it as it may just be too late for that either.
>>>
>>> Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
>>> Reported-by: Vlad Yasevich 
>>> Signed-off-by: Xin Long 
>>> ---
>>>  include/net/sctp/structs.h |  2 +-
>>>  net/sctp/input.c   | 17 +++--
>>>  net/sctp/transport.c   |  4 ++--
>>>  3 files changed, 14 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
>>> index 20e7212..344da04 100644
>>> --- a/include/net/sctp/structs.h
>>> +++ b/include/net/sctp/structs.h
>>> @@ -955,7 +955,7 @@ void sctp_transport_route(struct sctp_transport *, 
>>> union sctp_addr *,
>>>  void sctp_transport_pmtu(struct sctp_transport *, struct sock *sk);
>>>  void sctp_transport_free(struct sctp_transport *);
>>>  void sctp_transport_reset_timers(struct sctp_transport *);
>>> -void sctp_transport_hold(struct sctp_transport *);
>>> +int sctp_transport_hold(struct sctp_transport *);
>>>  void sctp_transport_put(struct sctp_transport *);
>>>  void sctp_transport_update_rto(struct sctp_transport *, __u32);
>>>  void sctp_transport_raise_cwnd(struct sctp_transport *, __u32, __u32);
>>> diff --git a/net/sctp/input.c b/net/sctp/input.c
>>> index bf61dfb..49d2cc7 100644
>>> --- a/net/sctp/input.c
>>> +++ b/net/sctp/input.c
>>> @@ -935,15 +935,22 @@ static struct sctp_association 
>>> *__sctp_lookup_association(
>>> struct sctp_transport **pt)
>>>  {
>>> struct sctp_transport *t;
>>> +   struct sctp_association *asoc = NULL;
>>>  
>>> +   rcu_read_lock();
>>> t = sctp_addrs_lookup_transport(net, local, peer);
>>> -   if (!t || t->dead)
>>> -   return NULL;
>>> +   if (!t || !sctp_transport_hold(t))
>>> +   goto out;
>>>  
>>> -   sctp_association_hold(t->asoc);
>>> +   asoc = t->asoc;
>>> +   sctp_association_hold(asoc);
>>
>> I don't think you can modify the reference count on a transport, let alone
>> the association outside of a lock.
> 
> The transport memory is not freed, as it's protected by rcu_read_lock(),
> so we are safe to use it yet.
> atomic_ operations include an embedded lock instruction protecting the
> counter itself, there shouldn't be a need to use another lock around it.
> 
> And in the code above, as we could grab a hold on the transport, means
> the association was not freed yet because transports hold a ref on
> assoc. That's why the dance: hold(transport) hold(assoc) put(transport)
> 

OK,  I see how that holds together, but I think there might be hole wrt icmp
handling.  Some icmp processes assume transport can't disappear on them, but in
this case that last put(transport) may result in a call to 
sctp_transport_destroy()
and that might be bad.  I am looking at it now.

Thanks
-vlad

>   Marcelo
> 
>>
>> -vlad
>>
>>> *pt = t;
>>>  
>>> -   return t->asoc;
>>> +   sctp_transport_put(t);
>>> +
>>> +out:
>>> +   rcu_read_unlock();
>>> +   return asoc;
>>>  }
>>>  
>>>  /* Look up an association. protected by RCU read lock */
>>> @@ -955,9 +962,7 @@ struct sctp_association *sctp_lookup_association(struct 
>>> net *net,
>>>  {
>>> struct sctp_association *asoc;
>>>  
>>> -   rcu_read_lock();
>>> asoc = __sctp_lookup_association(net, laddr, paddr, transportp);
>>> -   rcu_read_unlock();
>>>  
>>> return asoc;
>>>  }
>>> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
>>> index aab9e3f..69f3799 100644
>>> --- a/net/sctp/transport.c
>>> +++ b/net/sctp/transport.c
>>> @@ -296,9 +296,9 @@ void sctp_transport_route(struct sctp_transport 
>>> *transport,
>>>  }
>>>  
>>>  /* Hold a reference to a transport.  */
>>> -void sctp_transport_hold(struct sctp_transport *transport)
>>> +int sctp_transport_hold(

Re: [RFC PATCH v2] ethtool: add IPv6 to the NFC API

2016-01-22 Thread Alexander Duyck

On Fri, Jan 22, 2016 at 10:04 AM, Edward Cree  wrote:
> Signed-off-by: Edward Cree 
> ---
> changes from v1:
> * split out separate spec structs for different flow types
> * clarified the proto field in usr_ip6_spec
> * changed IP6_USER_FLOW to 0x0e as I noticed there's a gap there
>
>  include/uapi/linux/ethtool.h | 67 
> 
>  1 file changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index 57fa390..ad805b9 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -748,6 +748,56 @@ struct ethtool_usrip4_spec {
> __u8proto;
>  };
>
> +/**
> + * struct ethtool_tcpip6_spec - flow specification for TCP/IPv6 etc.
> + * @ip6src: Source host
> + * @ip6dst: Destination host
> + * @psrc: Source port
> + * @pdst: Destination port
> + * @tos: Type-of-service
> + *
> + * This can be used to specify a TCP/IPv6, UDP/IPv6 or SCTP/IPv6 flow.
> + */
> +struct ethtool_tcpip6_spec {
> +   __be32  ip6src[4];
> +   __be32  ip6dst[4];
> +   __be16  psrc;
> +   __be16  pdst;
> +   __u8tos;
> +};
> +
> +/**
> + * struct ethtool_ah_espip6_spec - flow specification for IPsec/IPv6
> + * @ip6src: Source host
> + * @ip6dst: Destination host
> + * @spi: Security parameters index
> + * @tos: Type-of-service
> + *
> + * This can be used to specify an IPsec transport or tunnel over IPv6.
> + */
> +struct ethtool_ah_espip6_spec {
> +   __be32  ip6src[4];
> +   __be32  ip6dst[4];
> +   __be32  spi;
> +   __u8tos;
> +};
> +
> +/**
> + * struct ethtool_usrip6_spec - general flow specification for IPv6
> + * @ip6src: Source host
> + * @ip6dst: Destination host
> + * @l4_4_bytes: First 4 bytes of transport (layer 4) header
> + * @tos: Type-of-service
> + * @proto: Transport protocol number (nexthdr after any Extension Headers)
> + */
> +struct ethtool_usrip6_spec {
> +   __be32  ip6src[4];
> +   __be32  ip6dst[4];
> +   __be32  l4_4_bytes;
> +   __u8tos;
> +   __u8proto;
> +};
> +

It might be better to refer to this as l4_proto so that it is clear
that this is specifying the protocol of the l4 header that the
l4_4_bytes will be pulled from.

It still might even be useful to add a nexthdr field since it is
possible that there may be NICs out there that don't support parsing
the extension headers.  In such a case they could block setting
protocol and use nexthdr instead.  It provides an indirect way of
communicating if the NIC supports parsing extension headers or not as
the NIC can block adding a filter on one mask being set or the other.

>  union ethtool_flow_union {
> struct ethtool_tcpip4_spec  tcp_ip4_spec;
> struct ethtool_tcpip4_spec  udp_ip4_spec;
> @@ -755,6 +805,12 @@ union ethtool_flow_union {
> struct ethtool_ah_espip4_spec   ah_ip4_spec;
> struct ethtool_ah_espip4_spec   esp_ip4_spec;
> struct ethtool_usrip4_spec  usr_ip4_spec;
> +   struct ethtool_tcpip6_spec  tcp_ip6_spec;
> +   struct ethtool_tcpip6_spec  udp_ip6_spec;
> +   struct ethtool_tcpip6_spec  sctp_ip6_spec;
> +   struct ethtool_ah_espip6_spec   ah_ip6_spec;
> +   struct ethtool_ah_espip6_spec   esp_ip6_spec;
> +   struct ethtool_usrip6_spec  usr_ip6_spec;
> struct ethhdr   ether_spec;
> __u8hdata[52];
>  };
> @@ -1367,15 +1423,16 @@ enum ethtool_sfeatures_retval_bits {
>  #defineUDP_V4_FLOW 0x02/* hash or spec (udp_ip4_spec) */
>  #defineSCTP_V4_FLOW0x03/* hash or spec (sctp_ip4_spec) */
>  #defineAH_ESP_V4_FLOW  0x04/* hash only */
> -#defineTCP_V6_FLOW 0x05/* hash only */
> -#defineUDP_V6_FLOW 0x06/* hash only */
> -#defineSCTP_V6_FLOW0x07/* hash only */
> +#defineTCP_V6_FLOW 0x05/* hash or spec (tcp_ip6_spec; nfc 
> only) */
> +#defineUDP_V6_FLOW 0x06/* hash or spec (udp_ip6_spec; nfc 
> only) */
> +#defineSCTP_V6_FLOW0x07/* hash or spec (sctp_ip6_spec; nfc 
> only) */
>  #defineAH_ESP_V6_FLOW  0x08/* hash only */
>  #defineAH_V4_FLOW  0x09/* hash or spec (ah_ip4_spec) */
>  #defineESP_V4_FLOW 0x0a/* hash or spec (esp_ip4_spec) */
> -#defineAH_V6_FLOW  0x0b/* hash only */
> -#defineESP_V6_FLOW 0x0c/* hash only */
> +#defineAH_V6_FLOW  0x0b/* hash or spec (ah_ip6_spec; nfc 
> only) */
> +#defineESP_V6_FLOW 0x0c/* hash or spec (esp_ip6_spec; nfc 
> only) */
>  #defineIP_USER_FLOW0x0d/* spec only (usr_ip4_spec) */
> +#defineIP6_USER_FLOW   0x0e/* spec only (usr_ip6_spec; nfc only) 
> */
>  #defineIPV4_FLOW   0x10

[RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves

2016-01-22 Thread Jarod Wilson

The network core tries to keep track of dropped packets, but some packets
you wouldn't really call dropped, so much as intentionally ignored, under
certain circumstances. One such case is that of bonding and team device
slaves that are currently inactive. Their respective rx_handler functions
return RX_HANDLER_EXACT (the only places in the kernel that return that),
which ends up tracking into the network core's __netif_receive_skb_core()
function's drop path, with no pt_prev set. On a noisy network, this can
result in a very rapidly incrementing rx_dropped counter, not only on the
inactive slave(s), but also on the master device, such as the following:

Inter-|   Receive|  Transmit
 face |bytespackets errs drop fifo frame compressed multicast|bytes
packets errs drop fifo colls carrier compressed
  p7p1: 14783346  1404300 1404280 0  0  2040  680   
8000 0   0  0
  p7p2: 14805198  140648000 0  0  20340 
  0000 0   0  0
 bond0: 53365248  5327980 4211600 0  0115151 2040   
   24000 0   0  0
lo:5420  54000 0  0 0 5420  
54000 0   0  0
  p5p1: 19292195  1961970 1403680 0  0 56564  680   
8000 0   0  0
  p5p2: 19289707  1961710 1403640 0  0 56547  680   
8000 0   0  0
   em3: 20996626  158214000 0  0   3830 
  0000 0   0  0
   em2: 14065122  138462000 0  0   3100 
  0000 0   0  0
   em1: 14063162  138440000 0  0   3080 
  0000 0   0  0
   em4: 21050830  158729000 0  0   38571662 
469000 0   0  0
   ib0:   0   0000 0  0 00  
 0000 0   0  0

In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
active-backup bond0, and you can see that all three have high drop counts,
with the master bond0 showing a tally of all three.

I know that this was previously discussed some here:

http://www.spinics.net/lists/netdev/msg226341.html

It seems additional counters never came to fruition, but honestly, for
this particular case, I'm not even sure they're warranted, I'd be inclined
to say just silently drop these packets without incrementing a counter. At
least, that's probably what would make someone who has complained loudly
about this issue happy, as they have monitoring tools that are squaking
loudly at any increments to rx_dropped.

CC: "David S. Miller" 
CC: Eric Dumazet 
CC: Jiri Pirko 
CC: Daniel Borkmann 
CC: Tom Herbert 
CC: Jay Vosburgh 
CC: Veaceslav Falico 
CC: Andy Gospodarek 
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8cba3d8..1354c7b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4153,8 +4153,11 @@ ncls:
else
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+   if (deliver_exact)
+   goto inactive; /* bond or team inactive slave */
 drop:
atomic_long_inc(&skb->dev->rx_dropped);
+inactive:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
 * me how you were going to use this. :-)
-- 
1.8.3.1

Re: [PATCH 00/33] Compile-time stack metadata validation

2016-01-22 Thread Josh Poimboeuf

On Fri, Jan 22, 2016 at 11:43:48AM -0600, Chris J Arges wrote:
> On Thu, Jan 21, 2016 at 04:49:04PM -0600, Josh Poimboeuf wrote:
> > This is v16 of the compile-time stack metadata validation patch set,
> > along with proposed fixes for most of the warnings it found.  It's based
> > on the tip/master branch.
> >
> Josh,
> 
> Looks good, with my config [1] I do still get a few warnings building
> linux/linux-next.
> 
> Here are the warnings:
> $ grep ^stacktool build.log | grep -v staging

Thanks for reporting these!

> stacktool: arch/x86/kvm/vmx.o: vmx_handle_external_intr()+0x67: call without 
> frame pointer save/setup

This can be fixed by setting the stack pointer as an output operand for
the inline asm call in vmx_handle_external_intr().

Feel free to submit a patch, or I'll get around to it eventually.

> stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: return 
> without frame pointer restore
> stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x89: duplicate 
> frame pointer save
> stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x8a: duplicate 
> frame pointer setup
> stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: frame pointer 
> state mismatch
> stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x0: frame pointer 
> state mismatch

These are false positives.  Stacktool is confused by the use of a
"noreturn" function which it doesn't know about (__reiserfs_panic).

Unfortunately the only solution I currently have for dealing with global
noreturn functions is to just hard-code a list of them.  So the short
term fix would be to add "__reiserfs_panic" to the global_noreturns list
in tools/stacktool/builtin-check.c.

I'm still trying to figure out a better way to deal with this type of
issue, as it's a pain to have to keep a hard-coded list of noreturn
functions.  Unfortunately that info isn't available in the ELF.

> stacktool: fs/reiserfs/ibalance.o: .text: unexpected end of section
> stacktool: fs/reiserfs/tail_conversion.o: .text: unexpected end of section

For some reason I'm not able to recreate these warnings...  Can you
share one of the .o files?

-- 
Josh

Re: xen-netfront crash when detaching network while some network activity

2016-01-22 Thread Marek Marczykowski-Górecki

On Thu, Jan 21, 2016 at 12:30:48PM +, Joao Martins wrote:
> 
> 
> On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> > On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> >> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> >>> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki 
> >>> wrote:
>  On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki 
>  wrote:
> > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki 
> > wrote:
> >> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> >>> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
>  Hi all,
> 
>  I'm experiencing xen-netfront crash when doing xl network-detach 
>  while
>  some network activity is going on at the same time. It happens only 
>  when
>  domU has more than one vcpu. Not sure if this matters, but the 
>  backend
>  is in another domU (not dom0). I'm using Xen 4.2.2. It happens on 
>  kernel
>  3.9.4 and 4.1-rc1 as well.
> 
>  Steps to reproduce:
>  1. Start the domU with some network interface
>  2. Call there 'ping -f some-IP'
>  3. Call 'xl network-detach NAME 0'
> >>>
> >>> Do you see this all the time or just on occassions?
> >>
> >> Using above procedure - all the time.
> >>
> >>> I tried to reproduce it and couldn't see it. Is your VM an PV or HVM?
> >>
> >> PV, started by libvirt. This may have something to do, the problem didn't
> >> existed on older Xen (4.1) and started by xl. I'm not sure about kernel
> >> version there, but I think I've tried there 3.18 too, which has this
> >> problem.
> >>
> >> But I don't see anything special in domU config file (neither backend
> >> nor frontend) - it may be some libvirt default. If that's really the
> >> cause. Can I (and how) get any useful information about that?
> > 
> > libvirt naturally does some libxl calls, and they may be different.
> > 
> > Any chance you could give me an idea of:
> >  - What commands you use in libvirt?
> >  - Do you use a bond or bridge?
> >  - What version of libvirt you are using?
> > 
> > Thanks!
> > CC-ing Joao just in case he has seen this.
> >>
> Hm, So far I couldn't reproduce the issue with upstream Xen/linux/libvirt, 
> using
> both libvirt or plain xl (both on a bridge setup) and also irrespective of the
> both load and direction of traffic (be it a ping flood, pktgen with min.
> sized packets or iperf).

I've ran the test again, on vanilla 4.4 and collected some info:
 - xenstore dump of frontend (xs-frontend-before.txt)
 - xenstore dump of backend (xs-backend-before.txt)
 - kernel messages (console output) (console.log)
 - kernel config (config-4.4)
 - libvirt config of that domain (netdebug.conf)

Versions:
 - kernel 4.4 (frontend), 4.2.8 (backend)
 - libvirt 1.2.20
 - xen 4.6.0

In backend domain there is no bridge or anything like that - only
routing. The same in frontend - nothing fancy - just IP set on eth0
there.

Steps to reproduce were the same:
 - start frontend domain (virsh create ...)
 - call ping -f
 - xl network-detach NAME 0

Note that the crash doesn't happen with attached patch applied (as noted
in mail on Oct 21), but I have no idea whether is it a proper fix, or
just prevents the crash by a coincidence.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[0.00] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC  
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.4.0-1.pvops.qubes.x86_64 (user@devel-3rdparty) 
(gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #20 SMP Fri Jan 22 
00:39:29 CET 2016
[0.00] Command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 
rd_NO_PLYMOUTH 3 rd.break
[0.00] x86/fpu: Legacy x87 FPU detected.
[0.00] x86/fpu: Using 'lazy' FPU context switches.
[0.00] ACPI in unprivileged domain disabled
[0.00] Released 0 page(s)
[0.00] e820: BIOS-provided physical RAM map:
[0.00] Xen: [mem 0x-0x0009] usable
[0.00] Xen: [mem 0x000a-0x000f] reserved
[0.00] Xen: [mem 0x0010-0xf9ff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] DMI not present or invalid.
[0.00] Hypervisor detected: Xen
[0.00] e820: last_pfn = 0xfa000 max_arch_pfn = 0x4
[0.00] MTRR: Disabled
[0.00] RAMDISK: [mem 0x0203-0x027c6fff]
[0.00] NUMA turned off
[0.00] Faking a node at [mem 0x-0xf9ff]
[0.00] NODE_DATA(0) allocated [mem 0x18837000-0x1

Has RFC4821 been implemented?

2016-01-22 Thread Tom Herbert

This came up on in one of the IETF mailing lists in the context that
PMTUD doesn't really work on the Internet so  PLPMTUD is the way to
go. Looking at the code I only see inet_csk_update_pmtu called from
ICMP PTB error. I am missing something?

Thanks,
Tom

Re: Has RFC4821 been implemented?

2016-01-22 Thread Tom Herbert

Found it: tcp_mtu_probing


On Fri, Jan 22, 2016 at 11:39 AM, Tom Herbert  wrote:
> This came up on in one of the IETF mailing lists in the context that
> PMTUD doesn't really work on the Internet so  PLPMTUD is the way to
> go. Looking at the code I only see inet_csk_update_pmtu called from
> ICMP PTB error. I am missing something?
>
> Thanks,
> Tom

Re: [Xen-devel] [PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread One Thousand Gnomes

> The fact what include/linux/license.h:license_is_gpl_compatible includes
> "Dual MIT/GPL" as an option seems to suggest that it is enough of a thing
> to be validly used as the contents of a MODULE_LICENSE() thing.

Yes. The MIT licence most definitely exists, and people know what it
means.

Also nobody should be changing the licence on anything unless they have
the written permission of all rights holders on record, so it's best to
leave it be 8)

Alan

[PATCH net] sctp: allow setting SCTP_SACK_IMMEDIATELY by the application

2016-01-22 Thread Marcelo Ricardo Leitner

This patch extends commit b93d6471748d ("sctp: implement the sender side
for SACK-IMMEDIATELY extension") as it didn't white list
SCTP_SACK_IMMEDIATELY on sctp_msghdr_parse(), causing it to be
understood as an invalid flag and returning -EINVAL to the application.

Note that the actual handling of the flag is already there in
sctp_datamsg_from_user().

https://tools.ietf.org/html/rfc7053#section-7

Fixes: b93d6471748d ("sctp: implement the sender side for SACK-IMMEDIATELY 
extension")
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/socket.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 
2657b51f627625a4d674041da1f6dd913932b1d7..594ba9db9c2660b6b0803a2a3955ea7e8b3a7a44
 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6638,6 +6638,7 @@ static int sctp_msghdr_parse(const struct msghdr *msg, 
sctp_cmsgs_t *cmsgs)
 
if (cmsgs->srinfo->sinfo_flags &
~(SCTP_UNORDERED | SCTP_ADDR_OVER |
+ SCTP_SACK_IMMEDIATELY |
  SCTP_ABORT | SCTP_EOF))
return -EINVAL;
break;
@@ -6661,6 +6662,7 @@ static int sctp_msghdr_parse(const struct msghdr *msg, 
sctp_cmsgs_t *cmsgs)
 
if (cmsgs->sinfo->snd_flags &
~(SCTP_UNORDERED | SCTP_ADDR_OVER |
+ SCTP_SACK_IMMEDIATELY |
  SCTP_ABORT | SCTP_EOF))
return -EINVAL;
break;
-- 
2.5.0

Re: [PATCH 00/33] Compile-time stack metadata validation

2016-01-22 Thread Chris J Arges

On Fri, Jan 22, 2016 at 01:14:47PM -0600, Josh Poimboeuf wrote:
> On Fri, Jan 22, 2016 at 11:43:48AM -0600, Chris J Arges wrote:
> > On Thu, Jan 21, 2016 at 04:49:04PM -0600, Josh Poimboeuf wrote:
> > > This is v16 of the compile-time stack metadata validation patch set,
> > > along with proposed fixes for most of the warnings it found.  It's based
> > > on the tip/master branch.
> > >
> > Josh,
> > 
> > Looks good, with my config [1] I do still get a few warnings building
> > linux/linux-next.
> > 
> > Here are the warnings:
> > $ grep ^stacktool build.log | grep -v staging
> 
> Thanks for reporting these!
> 
> > stacktool: arch/x86/kvm/vmx.o: vmx_handle_external_intr()+0x67: call 
> > without frame pointer save/setup
> 
> This can be fixed by setting the stack pointer as an output operand for
> the inline asm call in vmx_handle_external_intr().
> 
> Feel free to submit a patch, or I'll get around to it eventually.
> 
> > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: return 
> > without frame pointer restore
> > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x89: duplicate 
> > frame pointer save
> > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x8a: duplicate 
> > frame pointer setup
> > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: frame 
> > pointer state mismatch
> > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x0: frame 
> > pointer state mismatch
> 
> These are false positives.  Stacktool is confused by the use of a
> "noreturn" function which it doesn't know about (__reiserfs_panic).
> 
> Unfortunately the only solution I currently have for dealing with global
> noreturn functions is to just hard-code a list of them.  So the short
> term fix would be to add "__reiserfs_panic" to the global_noreturns list
> in tools/stacktool/builtin-check.c.
> 
> I'm still trying to figure out a better way to deal with this type of
> issue, as it's a pain to have to keep a hard-coded list of noreturn
> functions.  Unfortunately that info isn't available in the ELF.
> 

Josh,
Ok I'll hack on the patches above.

> > stacktool: fs/reiserfs/ibalance.o: .text: unexpected end of section
> > stacktool: fs/reiserfs/tail_conversion.o: .text: unexpected end of section
> 
> For some reason I'm not able to recreate these warnings...  Can you
> share one of the .o files?
> 
> -- 
> Josh
> 

Binaries are here:
http://people.canonical.com/~arges/stacktool/

--chris

net: use-after-free in recvmmsg

2016-01-22 Thread Dmitry Vyukov

Hello,

While running syzkaller fuzzer I've hit the following use-after-free:

==
BUG: KASAN: use-after-free in __sys_recvmmsg+0x6fa/0x7f0 at addr
88003b689ce0
Read of size 8 by task syz-executor/11997
=
BUG sock_inode_cache (Not tainted): kasan: bad access detected
-

INFO: Allocated in sock_alloc_inode+0x1d/0x250 age=125 cpu=1 pid=11960
[<  none  >] ___slab_alloc+0x4c2/0x500 mm/slub.c:2470
[<  none  >] __slab_alloc+0x66/0xc0 mm/slub.c:2499
[< inline >] slab_alloc_node mm/slub.c:2562
[< inline >] slab_alloc mm/slub.c:2604
[<  none  >] kmem_cache_alloc+0x257/0x2d0 mm/slub.c:2609
[<  none  >] sock_alloc_inode+0x1d/0x250 net/socket.c:250
[<  none  >] alloc_inode+0x61/0x180 fs/inode.c:198
[<  none  >] new_inode_pseudo+0x17/0xe0 fs/inode.c:878
[<  none  >] sock_alloc+0x3d/0x260 net/socket.c:541
[<  none  >] __sock_create+0xa7/0x640 net/socket.c:1127
[< inline >] sock_create net/socket.c:1203
[< inline >] SYSC_socketpair net/socket.c:1275
[<  none  >] SyS_socketpair+0x112/0x4e0 net/socket.c:1254
[<  none  >] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185

INFO: Freed in sock_destroy_inode+0x56/0x70 age=25 cpu=1 pid=11960
[<  none  >] __slab_free+0x1fc/0x320 mm/slub.c:2680
[< inline >] slab_free mm/slub.c:2835
[<  none  >] kmem_cache_free+0x2ec/0x370 mm/slub.c:2844
[<  none  >] sock_destroy_inode+0x56/0x70 net/socket.c:280
[<  none  >] destroy_inode+0xc7/0x130 fs/inode.c:255
[<  none  >] evict+0x329/0x500 fs/inode.c:559
[< inline >] iput_final fs/inode.c:1477
[<  none  >] iput+0x45f/0x860 fs/inode.c:1504
[< inline >] dentry_iput fs/dcache.c:358
[<  none  >] __dentry_kill+0x457/0x620 fs/dcache.c:543
[< inline >] dentry_kill fs/dcache.c:587
[<  none  >] dput+0x65b/0x740 fs/dcache.c:796
[<  none  >] __fput+0x42f/0x780 fs/file_table.c:226
[<  none  >] fput+0x15/0x20 fs/file_table.c:244
[<  none  >] task_work_run+0x170/0x210 kernel/task_work.c:115
[< inline >] tracehook_notify_resume include/linux/tracehook.h:191
[<  none  >] exit_to_usermode_loop+0x1d1/0x210
arch/x86/entry/common.c:251
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
[<  none  >] syscall_return_slowpath+0x2ba/0x340
arch/x86/entry/common.c:344
[<  none  >] int_ret_from_sys_call+0x25/0x9f
arch/x86/entry/entry_64.S:281

INFO: Slab 0xeaeda200 objects=22 used=2 fp=0x88003b689cc0
flags=0x1fffc004080
INFO: Object 0x88003b689cc0 @offset=7360 fp=0x88003b68a840
CPU: 3 PID: 11997 Comm: syz-executor Tainted: GB   4.4.0+ #275
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  880038fefb68 82994c8d 88003df06d00
 88003b689cc0 88003b688000 880038fefb98 81755374
 88003df06d00 eaeda200 88003b689cc0 0002

Call Trace:
 [] __asan_report_load8_noabort+0x3e/0x40
mm/kasan/report.c:295
 [] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
 [< inline >] SYSC_recvmmsg net/socket.c:2281
 [] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
 [] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
==

I cannot reproduce it, but looking at __sys_recvmmsg code, it seems
that sock is not necessary live after fput_light:

out_put:
fput_light(sock->file, fput_needed);

if (err == 0)
return datagrams;

if (datagrams != 0) {
/*
 * We may return less entries than requested (vlen) if the
 * sock is non block and there aren't enough datagrams...
 */
if (err != -EAGAIN) {
/*
 * ... or  if recvmsg returns an error after we
 * received some datagrams, where we record the
 * error to return on the next call or if the
 * app asks about it using getsockopt(SO_ERROR).
 */
sock->sk->sk_err = -err;
}

return datagrams;
}

return err;
}

I am on commit 30f05309bde49295e02e45c7e615f73aa4e0ccc2 (Jan 20).
Seems to be added in commit a2e2725541fad72416326798c2d7fa4dafb7d337
(Oct 2009).

Re: [PATCH 00/33] Compile-time stack metadata validation

2016-01-22 Thread Josh Poimboeuf

On Fri, Jan 22, 2016 at 02:40:35PM -0600, Chris J Arges wrote:
> On Fri, Jan 22, 2016 at 01:14:47PM -0600, Josh Poimboeuf wrote:
> > On Fri, Jan 22, 2016 at 11:43:48AM -0600, Chris J Arges wrote:
> > > On Thu, Jan 21, 2016 at 04:49:04PM -0600, Josh Poimboeuf wrote:
> > > > This is v16 of the compile-time stack metadata validation patch set,
> > > > along with proposed fixes for most of the warnings it found.  It's based
> > > > on the tip/master branch.
> > > >
> > > Josh,
> > > 
> > > Looks good, with my config [1] I do still get a few warnings building
> > > linux/linux-next.
> > > 
> > > Here are the warnings:
> > > $ grep ^stacktool build.log | grep -v staging
> > 
> > Thanks for reporting these!
> > 
> > > stacktool: arch/x86/kvm/vmx.o: vmx_handle_external_intr()+0x67: call 
> > > without frame pointer save/setup
> > 
> > This can be fixed by setting the stack pointer as an output operand for
> > the inline asm call in vmx_handle_external_intr().
> > 
> > Feel free to submit a patch, or I'll get around to it eventually.
> > 
> > > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: return 
> > > without frame pointer restore
> > > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x89: duplicate 
> > > frame pointer save
> > > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x8a: duplicate 
> > > frame pointer setup
> > > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x9e: frame 
> > > pointer state mismatch
> > > stacktool: fs/reiserfs/namei.o: set_de_name_and_namelen()+0x0: frame 
> > > pointer state mismatch
> > 
> > These are false positives.  Stacktool is confused by the use of a
> > "noreturn" function which it doesn't know about (__reiserfs_panic).
> > 
> > Unfortunately the only solution I currently have for dealing with global
> > noreturn functions is to just hard-code a list of them.  So the short
> > term fix would be to add "__reiserfs_panic" to the global_noreturns list
> > in tools/stacktool/builtin-check.c.
> > 
> > I'm still trying to figure out a better way to deal with this type of
> > issue, as it's a pain to have to keep a hard-coded list of noreturn
> > functions.  Unfortunately that info isn't available in the ELF.
> > 
> 
> Josh,
> Ok I'll hack on the patches above.
> 
> > > stacktool: fs/reiserfs/ibalance.o: .text: unexpected end of section
> > > stacktool: fs/reiserfs/tail_conversion.o: .text: unexpected end of section
> > 
> > For some reason I'm not able to recreate these warnings...  Can you
> > share one of the .o files?
> 
> Binaries are here:
> http://people.canonical.com/~arges/stacktool/

Thanks, looks like the same __reiserfs_panic() noreturn fix for those.

-- 
Josh

Re: [Xen-devel] [PATCH 1/3] xen-netback: fix license ident used in MODULE_LICENSE

2016-01-22 Thread David Miller

From: One Thousand Gnomes 
Date: Fri, 22 Jan 2016 20:25:21 +

>> The fact what include/linux/license.h:license_is_gpl_compatible includes
>> "Dual MIT/GPL" as an option seems to suggest that it is enough of a thing
>> to be validly used as the contents of a MODULE_LICENSE() thing.
> 
> Yes. The MIT licence most definitely exists, and people know what it
> means.
> 
> Also nobody should be changing the licence on anything unless they have
> the written permission of all rights holders on record, so it's best to
> leave it be 8)

+1

Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves

2016-01-22 Thread Jay Vosburgh

Jarod Wilson  wrote:

>The network core tries to keep track of dropped packets, but some packets
>you wouldn't really call dropped, so much as intentionally ignored, under
>certain circumstances. One such case is that of bonding and team device
>slaves that are currently inactive. Their respective rx_handler functions
>return RX_HANDLER_EXACT (the only places in the kernel that return that),
>which ends up tracking into the network core's __netif_receive_skb_core()
>function's drop path, with no pt_prev set. On a noisy network, this can
>result in a very rapidly incrementing rx_dropped counter, not only on the
>inactive slave(s), but also on the master device, such as the following:
[...]
>In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
>active-backup bond0, and you can see that all three have high drop counts,
>with the master bond0 showing a tally of all three.
>
>I know that this was previously discussed some here:
>
>http://www.spinics.net/lists/netdev/msg226341.html
>
>It seems additional counters never came to fruition, but honestly, for
>this particular case, I'm not even sure they're warranted, I'd be inclined
>to say just silently drop these packets without incrementing a counter. At
>least, that's probably what would make someone who has complained loudly
>about this issue happy, as they have monitoring tools that are squaking
>loudly at any increments to rx_dropped.

I don't think the kernel should silently drop packets; there
should be a counter somewhere.  If a packet is being thrown away
deliberately, it should not just vanish into the screaming void of
space.  Someday someone will try and track down where that packet is
being dropped.

I've had that same conversation with customers who insist on
accounting for every packet drop (from the "any drop is an error"
mindset), so I understand the issue.

Thinking about the prior discussion, the rx_drop_inactive is
still a good idea, but I'd actually today get good use from a
"rx_drop_unforwardable" (or an equivalent but shorter name) counter that
counts every time a packet is dropped due to is_skb_forwardable()
returning false.  __dev_forward_skb does this (and hits rx_dropped), as
does the bridge (and does not count it).

-J

>CC: "David S. Miller" 
>CC: Eric Dumazet 
>CC: Jiri Pirko 
>CC: Daniel Borkmann 
>CC: Tom Herbert 
>CC: Jay Vosburgh 
>CC: Veaceslav Falico 
>CC: Andy Gospodarek 
>CC: netdev@vger.kernel.org
>Signed-off-by: Jarod Wilson 
>---
> net/core/dev.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 8cba3d8..1354c7b 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -4153,8 +4153,11 @@ ncls:
>   else
>   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>   } else {
>+  if (deliver_exact)
>+  goto inactive; /* bond or team inactive slave */
> drop:
>   atomic_long_inc(&skb->dev->rx_dropped);
>+inactive:
>   kfree_skb(skb);
>   /* Jamal, now you will not able to escape explaining
>* me how you were going to use this. :-)
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: net: use-after-free in recvmmsg

2016-01-22 Thread Arnaldo Carvalho de Melo

Em Fri, Jan 22, 2016 at 09:39:53PM +0100, Dmitry Vyukov escreveu:
> While running syzkaller fuzzer I've hit the following use-after-free:


 
> Call Trace:
>  [] __asan_report_load8_noabort+0x3e/0x40
> mm/kasan/report.c:295
>  [] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
>  [< inline >] SYSC_recvmmsg net/socket.c:2281
>  [] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
>  [] entry_SYSCALL_64_fastpath+0x16/0x7a
> arch/x86/entry/entry_64.S:185
> ==
> 
> I cannot reproduce it, but looking at __sys_recvmmsg code, it seems
> that sock is not necessary live after fput_light:
> 
> out_put:
> fput_light(sock->file, fput_needed);
> 
> if (err == 0)
> return datagrams;
> 
> if (datagrams != 0) {
> /*
>  * We may return less entries than requested (vlen) if the
>  * sock is non block and there aren't enough datagrams...
>  */
> if (err != -EAGAIN) {
> /*
>  * ... or  if recvmsg returns an error after we
>  * received some datagrams, where we record the
>  * error to return on the next call or if the
>  * app asks about it using getsockopt(SO_ERROR).
>  */
> sock->sk->sk_err = -err;
> }
> 
> return datagrams;
> }
> 
> return err;
> }
> 
> I am on commit 30f05309bde49295e02e45c7e615f73aa4e0ccc2 (Jan 20).
> Seems to be added in commit a2e2725541fad72416326798c2d7fa4dafb7d337
> (Oct 2009).

Maybe this helps? Compile testing now...


diff --git a/net/socket.c b/net/socket.c
index 91c2de6f5020..03e57ad7ec9f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2240,31 +2240,31 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, 
unsigned int vlen,
cond_resched();
}
 
-out_put:
-   fput_light(sock->file, fput_needed);
-
if (err == 0)
-   return datagrams;
+   goto out_put;
 
-   if (datagrams != 0) {
+   if (datagrams == 0) {
+   datagrams = err;
+   goto out_put;
+   }
+
+   /*
+* We may return less entries than requested (vlen) if the
+* sock is non block and there aren't enough datagrams...
+*/
+   if (err != -EAGAIN) {
/*
-* We may return less entries than requested (vlen) if the
-* sock is non block and there aren't enough datagrams...
+* ... or  if recvmsg returns an error after we
+* received some datagrams, where we record the
+* error to return on the next call or if the
+* app asks about it using getsockopt(SO_ERROR).
 */
-   if (err != -EAGAIN) {
-   /*
-* ... or  if recvmsg returns an error after we
-* received some datagrams, where we record the
-* error to return on the next call or if the
-* app asks about it using getsockopt(SO_ERROR).
-*/
-   sock->sk->sk_err = -err;
-   }
-
-   return datagrams;
+   sock->sk->sk_err = -err;
}
+out_put:
+   fput_light(sock->file, fput_needed);
 
-   return err;
+   return datagrams;
 }
 
 SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,

Re: Has RFC4821 been implemented?

2016-01-22 Thread Hannes Frederic Sowa


Hi,

On 22.01.2016 20:50, Tom Herbert wrote:

Found it: tcp_mtu_probing


Exactly.


On Fri, Jan 22, 2016 at 11:39 AM, Tom Herbert  wrote:

This came up on in one of the IETF mailing lists in the context that
PMTUD doesn't really work on the Internet so  PLPMTUD is the way to
go. Looking at the code I only see inet_csk_update_pmtu called from
ICMP PTB error. I am missing something?


If I recall correctly we don't store the MSS in the inetpeer tcp metrics 
cache neither do we update the MTU on the specific path (update the 
dst_entry). So it is a one connection only thing so far.


Bye,
Hannes

[PATCH net] inet: frag: Always orphan skbs inside ip_defrag()

2016-01-22 Thread Joe Stringer

Later parts of the stack (including fragmentation) expect that there is
never a socket attached to frag in a frag_list, however this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.

While the call could be added to openvswitch to fix this particular
error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.

kernel BUG at net/ipv4/ip_output.c:586!
[...]
Call Trace:
 
 [] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
 [] ovs_fragment+0xcc/0x214 [openvswitch]
 [] ? dst_discard_out+0x20/0x20
 [] ? dst_ifdown+0x80/0x80
 [] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
 [] ? mod_timer_pending+0x65/0x210
 [] ? __lock_acquire+0x3db/0x1b90
 [] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
 [] ? __lock_is_held+0x54/0x70
 [] do_output.isra.29+0xe3/0x1b0 [openvswitch]
 [] do_execute_actions+0xe11/0x11f0 [openvswitch]
 [] ? __lock_is_held+0x54/0x70
 [] ovs_execute_actions+0x32/0xd0 [openvswitch]
 [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
 [] ? __lock_is_held+0x54/0x70
 [] ovs_execute_actions+0xb2/0xd0 [openvswitch]
 [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
 [] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
 [] ovs_vport_receive+0x5d/0xa0 [openvswitch]
 [] ? __lock_acquire+0x3db/0x1b90
 [] ? __lock_acquire+0x3db/0x1b90
 [] ? __lock_acquire+0x3db/0x1b90
 [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
 [] internal_dev_xmit+0x6c/0x140 [openvswitch]
 [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
 [] dev_hard_start_xmit+0x2b9/0x5e0
 [] ? netif_skb_features+0xd1/0x1f0
 [] __dev_queue_xmit+0x800/0x930
 [] ? __dev_queue_xmit+0x50/0x930
 [] ? mark_held_locks+0x71/0x90
 [] ? neigh_resolve_output+0x106/0x220
 [] dev_queue_xmit+0x10/0x20
 [] neigh_resolve_output+0x178/0x220
 [] ? ip_finish_output2+0x1ff/0x590
 [] ip_finish_output2+0x1ff/0x590
 [] ? ip_finish_output2+0x7e/0x590
 [] ip_do_fragment+0x831/0x8a0
 [] ? ip_copy_metadata+0x1b0/0x1b0
 [] ip_fragment.constprop.49+0x43/0x80
 [] ip_finish_output+0x17c/0x340
 [] ? nf_hook_slow+0xe4/0x190
 [] ip_output+0x70/0x110
 [] ? ip_fragment.constprop.49+0x80/0x80
 [] ip_local_out+0x39/0x70
 [] ip_send_skb+0x19/0x40
 [] ip_push_pending_frames+0x33/0x40
 [] icmp_push_reply+0xea/0x120
 [] icmp_reply.constprop.23+0x1ed/0x230
 [] icmp_echo.part.21+0x4e/0x50
 [] ? __lock_is_held+0x54/0x70
 [] ? rcu_read_lock_held+0x5e/0x70
 [] icmp_echo+0x36/0x70
 [] icmp_rcv+0x271/0x450
 [] ip_local_deliver_finish+0x127/0x3a0
 [] ? ip_local_deliver_finish+0x41/0x3a0
 [] ip_local_deliver+0x60/0xd0
 [] ? ip_rcv_finish+0x560/0x560
 [] ip_rcv_finish+0xdd/0x560
 [] ip_rcv+0x283/0x3e0
 [] ? match_held_lock+0x192/0x200
 [] ? inet_del_offload+0x40/0x40
 [] __netif_receive_skb_core+0x392/0xae0
 [] ? process_backlog+0x8e/0x230
 [] ? mark_held_locks+0x71/0x90
 [] __netif_receive_skb+0x18/0x60
 [] process_backlog+0x78/0x230
 [] ? process_backlog+0xdd/0x230
 [] net_rx_action+0x155/0x400
 [] __do_softirq+0xcc/0x420
 [] ? ip_finish_output2+0x217/0x590
 [] do_softirq_own_stack+0x1c/0x30
 
 [] do_softirq+0x4e/0x60
 [] __local_bh_enable_ip+0xa8/0xb0
 [] ip_finish_output2+0x240/0x590
 [] ? ip_do_fragment+0x831/0x8a0
 [] ip_do_fragment+0x831/0x8a0
 [] ? ip_copy_metadata+0x1b0/0x1b0
 [] ip_fragment.constprop.49+0x43/0x80
 [] ip_finish_output+0x17c/0x340
 [] ? nf_hook_slow+0xe4/0x190
 [] ip_output+0x70/0x110
 [] ? ip_fragment.constprop.49+0x80/0x80
 [] ip_local_out+0x39/0x70
 [] ip_send_skb+0x19/0x40
 [] ip_push_pending_frames+0x33/0x40
 [] raw_sendmsg+0x7d3/0xc30
 [] ? __lock_acquire+0x3db/0x1b90
 [] ? inet_sendmsg+0xc7/0x1d0
 [] ? __lock_is_held+0x54/0x70
 [] inet_sendmsg+0x10a/0x1d0
 [] ? inet_sendmsg+0x5/0x1d0
 [] sock_sendmsg+0x38/0x50
 [] ___sys_sendmsg+0x25f/0x270
 [] ? handle_mm_fault+0x8dd/0x1320
 [] ? _raw_spin_unlock+0x27/0x40
 [] ? __do_page_fault+0x1e2/0x460
 [] ? __fget_light+0x66/0x90
 [] __sys_sendmsg+0x42/0x80
 [] SyS_sendmsg+0x12/0x20
 [] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 
00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
RIP  [] ip_do_fragment+0x892/0x8a0
 RSP 

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Joe Stringer 
---
Maybe the corresponding change needs to be proposed for IPv6 too. I haven't
been able to reproduce it there yet though.
---
 net/ipv4/ip_fragment.c  | 1 +
 net/ipv4/netfilter/nf_defrag_ipv4.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 3f00810b7288..187c6fcc3027 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -661,6 +661,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 
user)
struct ipq *qp;
 
IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
+

Re: [PATCH V2 5/5] net: can: ifi: Add IFI CANFD IP support

2016-01-22 Thread Rustad, Mark D


Marek Vasut  wrote:


I see, so adding u32 also here works. I'm starting to wonder if the BIT
macro is really that nice and if I shouldn't just use (1 << n) as usual.


Actually, (1 << n) is not so good either when n is 31 - it can trigger  
overflow warnings since it will be presumed to be a signed value. (1U << n)  
should avoid that problem. Unfortunately, BIT() uses 1UL which produces  
64-bit values on 64-bit arches. The bit ops are kind of a mess in that way.  
It would be nicer if BIT was restricted to int-size values and a new BIT_UL  
or something would produce the long values.


--
Mark Rustad, Networking Division, Intel Corporation


signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [PATCH net] inet: frag: Always orphan skbs inside ip_defrag()

2016-01-22 Thread Eric Dumazet

On Fri, 2016-01-22 at 15:49 -0800, Joe Stringer wrote:
> Later parts of the stack (including fragmentation) expect that there is
> never a socket attached to frag in a frag_list, however this invariant
> was not enforced on all defrag paths. This could lead to the
> BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
> end of this commit message.
> 
> While the call could be added to openvswitch to fix this particular
> error, the head and tail of the frags list are already orphaned
> indirectly inside ip_defrag(), so it seems like the remaining fragments
> should all be orphaned in all circumstances.


Yes, it looks we have a problem, and even IP early demux apparently does
not check if incoming packet is a fragment.

Your patch could also remove some socket leaks in this respect.

I guess we also could add a safety check (ipv4 only, but ipv6 needs care
as well)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index b1209b63381f..99513c829213 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -316,7 +316,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, 
struct sk_buff *skb)
const struct iphdr *iph = ip_hdr(skb);
struct rtable *rt;
 
-   if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) {
+   if (sysctl_ip_early_demux &&
+   !skb_dst(skb) && !skb->sk &&
+   !ip_is_fragment(iph)) {
const struct net_protocol *ipprot;
int protocol = iph->protocol;

Re: [PATCH net] lan78xx: changed to use updated phy-ignore-interrupts

2016-01-22 Thread Florian Fainelli

On 21/01/2016 12:15, woojung@microchip.com wrote:
> Update lan78xx to use patch of commit 4f2aaf7dd95b
> ("Merge branch 'fix-phy-ignore-interrupts'")

Looks fine, just one nit below:

> 
> Signed-off-by: Woojung Huh 
> ---
>  drivers/net/usb/lan78xx.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index 2ed5333..85ca7de 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -36,7 +36,7 @@
>  #define DRIVER_AUTHOR"WOOJUNG HUH "
>  #define DRIVER_DESC  "LAN78XX USB 3.0 Gigabit Ethernet Devices"
>  #define DRIVER_NAME  "lan78xx"
> -#define DRIVER_VERSION   "1.0.1"
> +#define DRIVER_VERSION   "1.0.2"
>  
>  #define TX_TIMEOUT_JIFFIES   (5 * HZ)
>  #define THROTTLE_JIFFIES (HZ / 8)
> @@ -914,6 +914,8 @@ static int lan78xx_link_reset(struct lan78xx_net *dev)
>   ret = lan78xx_write_reg(dev, MAC_CR, buf);
>   if (unlikely(ret < 0))
>   return -EIO;
> +
> + phy_mac_interrupt(phydev, 0);
>   } else if (phydev->link && !dev->link_on) {
>   dev->link_on = true;
>  
> @@ -954,6 +956,7 @@ static int lan78xx_link_reset(struct lan78xx_net *dev)
>  
>   ret = lan78xx_update_flowcontrol(dev, ecmd.duplex, ladv, radv);
>   netif_carrier_on(dev->net);

Do you need this netif_carrier_on() call here? PHYLIB should set that
already.

> + phy_mac_interrupt(phydev, 1);
>   }
>  
>   return ret;
> @@ -1495,7 +1498,6 @@ done:
>  static int lan78xx_mdio_init(struct lan78xx_net *dev)
>  {
>   int ret;
> - int i;
>  
>   dev->mdiobus = mdiobus_alloc();
>   if (!dev->mdiobus) {
> @@ -1511,10 +1513,6 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
>   snprintf(dev->mdiobus->id, MII_BUS_ID_SIZE, "usb-%03d:%03d",
>dev->udev->bus->busnum, dev->udev->devnum);
>  
> - /* handle our own interrupt */
> - for (i = 0; i < PHY_MAX_ADDR; i++)
> - dev->mdiobus->irq[i] = PHY_IGNORE_INTERRUPT;
> -
>   switch (dev->devid & ID_REV_CHIP_ID_MASK_) {
>   case 0x7800:
>   case 0x7850:
> @@ -1558,6 +1556,16 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>   return -EIO;
>   }
>  
> + /* Enable PHY interrupts.
> +  * We handle our own interrupt
> +  */
> + ret = phy_read(phydev, LAN88XX_INT_STS);
> + ret = phy_write(phydev, LAN88XX_INT_MASK,
> + LAN88XX_INT_MASK_MDINTPIN_EN_ |
> + LAN88XX_INT_MASK_LINK_CHANGE_);
> +
> + phydev->irq = PHY_IGNORE_INTERRUPT;
> +
>   ret = phy_connect_direct(dev->net, phydev,
>lan78xx_link_status_change,
>PHY_INTERFACE_MODE_GMII);
> @@ -1580,14 +1588,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> SUPPORTED_Pause | SUPPORTED_Asym_Pause);
>   genphy_config_aneg(phydev);
>  
> - /* Workaround to enable PHY interrupt.
> -  * phy_start_interrupts() is API for requesting and enabling
> -  * PHY interrupt. However, USB-to-Ethernet device can't use
> -  * request_irq() called in phy_start_interrupts().
> -  * Set PHY to PHY_HALTED and call phy_start()
> -  * to make a call to phy_enable_interrupts()
> -  */
> - phy_stop(phydev);
>   phy_start(phydev);
>  
>   netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
>

[PATCH] dump_stack: avoid potential deadlocks

2016-01-22 Thread Eric Dumazet

From: Eric Dumazet 

Some servers experienced fatal deadlocks because of a combination
of bugs, leading to multiple cpus calling dump_stack().

The checksumming bug was fixed in commit 34ae6a1aa054
("ipv6: update skb->csum when CE mark is propagated").

The second problem is a faulty locking in dump_stack()

CPU1 runs in process context and calls dump_stack(), grabs dump_lock.

   CPU2 receives a TCP packet under softirq, grabs socket spinlock, and
   call dump_stack() from netdev_rx_csum_fault().

   dump_stack() spins on atomic_cmpxchg(&dump_lock, -1, 2), since
   dump_lock is owned by CPU1

While dumping its stack, CPU1 is interrupted by a softirq, and happens
to process a packet for the TCP socket locked by CPU2.

CPU1 spins forever in spin_lock() : deadlock

Stack trace on CPU1 looked like :

[306295.402231] NMI backtrace for cpu 1
[306295.402238] RIP: 0010:[]  [] 
_raw_spin_lock+0x25/0x30
...
[306295.402255] Stack:
[306295.402256]  88407f023cb0 a99cbdc3 88407f023ca0 
88012f496bb0
[306295.402266]  aa4dc1f0 8820d94f0dc0 000a 
aa4b4280
[306295.402275]  88407f023ce0 a98a21d0 88407f023cc0 
88407f023ca0
[306295.402284] Call Trace:
[306295.402286]   
[306295.402288] 
[306295.402291]  [] tcp_v6_rcv+0x243/0x620
[306295.402304]  [] ip6_input_finish+0x11f/0x330
[306295.402309]  [] ip6_input+0x38/0x40
[306295.402313]  [] ip6_rcv_finish+0x3c/0x90
[306295.402318]  [] ipv6_rcv+0x2a9/0x500
[306295.402323]  [] process_backlog+0x461/0xaa0
[306295.402332]  [] net_rx_action+0x147/0x430
[306295.402337]  [] __do_softirq+0x167/0x2d0
[306295.402341]  [] call_softirq+0x1c/0x30
[306295.402345]  [] do_softirq+0x3f/0x80
[306295.402350]  [] irq_exit+0x6e/0xc0
[306295.402355]  [] 
smp_call_function_single_interrupt+0x35/0x40
[306295.402360]  [] call_function_single_interrupt+0x6a/0x70
[306295.402361]   
[306295.402364] 
[306295.402376]  [] printk+0x4d/0x4f
[306295.402390]  [] printk_address+0x31/0x33
[306295.402395]  [] print_trace_address+0x33/0x3c
[306295.402408]  [] print_context_stack+0x7f/0x119
[306295.402412]  [] dump_trace+0x26b/0x28e
[306295.402417]  [] show_trace_log_lvl+0x4f/0x5c
[306295.402421]  [] show_stack_log_lvl+0x104/0x113
[306295.402425]  [] show_stack+0x42/0x44
[306295.402429]  [] dump_stack+0x46/0x58
[306295.402434]  [] netdev_rx_csum_fault+0x38/0x3c
[306295.402439]  [] __skb_checksum_complete_head+0x6e/0x80
[306295.402444]  [] __skb_checksum_complete+0x11/0x20
[306295.402449]  [] tcp_rcv_established+0x2bd5/0x2fd0
[306295.402468]  [] tcp_v6_do_rcv+0x13c/0x620
[306295.402477]  [] sk_backlog_rcv+0x15/0x30
[306295.402482]  [] release_sock+0xd2/0x150
[306295.402486]  [] tcp_recvmsg+0x1c1/0xfc0
[306295.402491]  [] inet_recvmsg+0x7d/0x90
[306295.402495]  [] sock_recvmsg+0xaf/0xe0
[306295.402505]  [] ___sys_recvmsg+0x111/0x3b0
[306295.402528]  [] SyS_recvmsg+0x5c/0xb0
[306295.402532]  [] system_call_fastpath+0x16/0x1b


Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()")
Signed-off-by: Eric Dumazet 
Cc: Alex Thorlton 
---
 lib/dump_stack.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/dump_stack.c b/lib/dump_stack.c
index 6745c6230db3..c30d07e99dba 100644
--- a/lib/dump_stack.c
+++ b/lib/dump_stack.c
@@ -25,6 +25,7 @@ static atomic_t dump_lock = ATOMIC_INIT(-1);
 
 asmlinkage __visible void dump_stack(void)
 {
+   unsigned long flags;
int was_locked;
int old;
int cpu;
@@ -33,9 +34,8 @@ asmlinkage __visible void dump_stack(void)
 * Permit this cpu to perform nested stack dumps while serialising
 * against other CPUs
 */
-   preempt_disable();
-
 retry:
+   local_irq_save(flags);
cpu = smp_processor_id();
old = atomic_cmpxchg(&dump_lock, -1, cpu);
if (old == -1) {
@@ -43,6 +43,7 @@ retry:
} else if (old == cpu) {
was_locked = 1;
} else {
+   local_irq_restore(flags);
cpu_relax();
goto retry;
}
@@ -52,7 +53,7 @@ retry:
if (!was_locked)
atomic_set(&dump_lock, -1);
 
-   preempt_enable();
+   local_irq_restore(flags);
 }
 #else
 asmlinkage __visible void dump_stack(void)

Re: [PATCH] Netlink messages for multicast HW addr programming

2016-01-22 Thread Jiri Pirko

Fri, Jan 22, 2016 at 04:49:36PM CET, pru...@brocade.com wrote:
>On Fri, 2016-01-22 at 16:05 +0100, Jiri Pirko wrote:
>> Fri, Jan 22, 2016 at 03:51:55PM CET, pru...@brocade.com wrote:
>> >On Thu, 2016-01-21 at 21:04 +0100, Jiri Pirko wrote:
>> >> Thu, Jan 21, 2016 at 12:47:54PM CET, pru...@brocade.com wrote:
>> >> >Add RTM_NEWADDR and RTM_DELADDR netlink messages to indicate
>> >> >interest in specific multicast hardware addresses. These messages
>> >> >are sent when addressed are added or deleted from the appropriate
>> >> >interface driver.
>> >> >
>> >> >Signed-off-by: Patrick Ruddy 
>> >> >
>> >> >diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
>> >> >index c0548d2..a0ebadd 100644
>> >> >--- a/net/core/dev_addr_lists.c
>> >> >+++ b/net/core/dev_addr_lists.c
>> >> >@@ -12,6 +12,7 @@
>> >> >  */
>> >> > 
>> >> > #include 
>> >> >+#include 
>> >> > #include 
>> >> > #include 
>> >> > #include 
>> >> >@@ -661,6 +662,62 @@ out:
>> >> > }
>> >> > EXPORT_SYMBOL(dev_mc_add_excl);
>> >> > 
>> >> >+static int fill_addr(struct sk_buff *skb, struct net_device *dev,
>> >> >+ const unsigned char *addr, int type)
>> >> >+{
>> >> >+struct nlmsghdr *nlh;
>> >> >+struct ifaddrmsg *ifm;
>> >> >+
>> >> >+nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), 0);
>> >> >+if (nlh == NULL)
>> >> >+return -EMSGSIZE;
>> >> >+
>> >> >+ifm = nlmsg_data(nlh);
>> >> >+ifm->ifa_family = AF_UNSPEC;
>> >> >+ifm->ifa_prefixlen = 0;
>> >> >+ifm->ifa_flags = IFA_F_PERMANENT;
>> >> >+ifm->ifa_scope = RT_SCOPE_LINK;
>> >> >+ifm->ifa_index = dev->ifindex;
>> >> >+if (nla_put(skb, IFA_ADDRESS, dev->addr_len, addr))
>> >> >+goto nla_put_failure;
>> >> >+nlmsg_end(skb, nlh);
>> >> 
>> >> How can userspace tell if this is a multicast addr? Also, why not add
>> >> notifications for unicast addrs so we handle both the same?
>> >
>> >I am only using this RTM_NEW/DELADDR AF_UNSPEC message for multicast
>> >addresses. I understand that the unicast addresses are notified in the
>> >RTM_NEW/DELLINK.
>> 
>> dev->dev_addr is notified by that. UC addresses do not have any
>> notification now.
>It is the multicast MAC address (ie the multicast equivalent of the
>dev_addr) that we are notifying here right? Not L3 addresses.

I know :)

For example if kernel code calls dev_uc_add or dev_uc_del, thise
addresses are not notified to the userspace. All I'm saying is, if you
add notification for dev_mc_add and dev_mc_del, add it for uc variants
as well. Then you have to distinguish mc and uc in the netlink message.

>
>-pr
>

55 matches

Mail list logo