date:20171117

Re: [PATCH 2/4] btrfs: make open_ctree error injectable

2017-11-17 Thread Ingo Molnar


* Josef Bacik  wrote:

> From: Josef Bacik 
> 
> This allows us to do error injection with BPF for open_ctree.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/disk-io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index dfdab849037b..c6b4e1f07072 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "hash.h"
> @@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
>   goto fail_block_groups;
>   goto retry_root_backup;
>  }
> +BPF_ALLOW_ERROR_INJECTION(open_ctree);

Ok, this looks a lot better - except the random header inclusion dependency: if a
facility is in the BPF_*() namespace then it should include  and not
a random asm/* header...

With that detail fixed:

  Acked-by: Ingo Molnar 

for the whole series.

Thanks,

Ingo

Re: [PATCH net] tcp: when scheduling TLP, time of RTO should account for current ACK

2017-11-17 Thread Soheil Hassas Yeganeh

On Fri, Nov 17, 2017 at 9:06 PM, Neal Cardwell  wrote:
>
> Fix the TLP scheduling logic so that when scheduling a TLP probe, we
> ensure that the estimated time at which an RTO would fire accounts for
> the fact that ACKs indicating forward progress should push back RTO
> times.
>
> After the following fix:
>
> df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
>
> we had an unintentional behavior change in the following kind of
> scenario: suppose the RTT variance has been very low recently. Then
> suppose we send out a flight of N packets and our RTT is 100ms:
>
> t=0: send a flight of N packets
> t=100ms: receive an ACK for N-1 packets
>
> The response before df92c8394e6e was:
>   -> schedule a TLP for now + RTO_interval
>
> The response after df92c8394e6e is:
>   -> schedule a TLP for t=0 + RTO_interval
>
> Since RTO_interval = srtt + RTT_variance, this means that we have
> scheduled a TLP timer at a point in the future that only accounts for
> RTT_variance. If the RTT_variance term is small, this means that the
> timer fires soon.
>
> Before df92c8394e6e this would not happen, because in that code, when
> we receive an ACK for a prefix of flight, we did:
>
> 1) Near the top of tcp_ack(), switch from TLP timer to RTO
>at write_queue_head->packet_tx_time + RTO_interval:
> if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>tcp_rearm_rto(sk);
>
> 2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
> if (flag & FLAG_ACKED) {
>tcp_rearm_rto(sk);
>
> 3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
>to TLP at now + RTO_interval:
> if (icsk->icsk_pending == ICSK_TIME_RETRANS)
>tcp_schedule_loss_probe(sk);
>
> In df92c8394e6e we removed that 3-phase dance, and instead directly
> set the TLP timer once: we set the TLP timer in cases like this to
> write_queue_head->packet_tx_time + RTO_interval. So if the RTT
> variance is small, then this means that this is setting the TLP timer
> to fire quite soon. This means if the ACK for the tail of the flight
> takes longer than an RTT to arrive (often due to delayed ACKs), then
> the TLP timer fires too quickly.
>
> Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 

Nice fix. Thank you, Neal!
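
The timing in the quoted scenario can be checked with a little arithmetic. The sketch below is a user-space model, not kernel code: the function names are hypothetical, all times are in milliseconds, and RTO_interval uses the simplified srtt + RTT_variance formula from the description above.

```c
#include <stdint.h>

/* Simplified RTO interval as written in the patch description:
 * RTO_interval = srtt + RTT_variance (milliseconds). */
static uint32_t rto_interval_ms(uint32_t srtt_ms, uint32_t rttvar_ms)
{
	return srtt_ms + rttvar_ms;
}

/* Delay from "now" until the timer fires when the deadline is anchored at
 * the transmit time of the head of the write queue (the behaviour after
 * df92c8394e6e). Returns 0 if the deadline is already in the past. */
static uint32_t tlp_delay_from_head_tx_ms(uint32_t now_ms, uint32_t head_tx_ms,
					  uint32_t rto_ms)
{
	uint32_t deadline = head_tx_ms + rto_ms;

	return deadline > now_ms ? deadline - now_ms : 0;
}

/* Delay from "now" when the ACK pushes the RTO estimate back to
 * now + RTO_interval (the behaviour before df92c8394e6e). */
static uint32_t tlp_delay_from_now_ms(uint32_t rto_ms)
{
	return rto_ms;
}
```

Assuming srtt = 100 ms and RTT_variance = 10 ms (illustrative numbers), a flight sent at t=0 and an ACK at t=100ms give tlp_delay_from_head_tx_ms(100, 0, 110) == 10, so the timer fires only 10 ms after the ACK, while tlp_delay_from_now_ms(110) leaves a full 110 ms for the tail ACK to arrive.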

Re: [PATCH 2/2] kbuild: remove all dummy assignments to obj-

2017-11-17 Thread Masahiro Yamada

2017-11-08 1:31 GMT+09:00 Masahiro Yamada :
> Now kbuild core scripts create empty built-in.o where necessary.
> Remove "obj- := dummy.o" tricks.
>
> Signed-off-by: Masahiro Yamada 
> ---
>

Applied to linux-kbuild/kbuild.

-- 
Best Regards
Masahiro Yamada

[PATCH net] net: ena: fix race condition between device reset and link up setup

2017-11-17 Thread netanel

From: Netanel Belgazal 

In rare cases, the ena driver resets and restarts the device,
for example, when a misbehaving application causes a
transmit timeout.

The first step in the reset procedure is to stop the Tx traffic by
calling ena_carrier_off().

Just after the driver has started the device reset procedure, the device
may send an asynchronous notification (via AENQ) to the driver
that there was a link change (to link-up state).
This link change is mapped to a call to netif_carrier_on() which
re-activates the Tx queues, violating the assumption of no tx traffic
until device reset is completed, as the reset task might still be in
the process of queues initialization, leading to an access to
uninitialized memory.
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 11 +--
 drivers/net/ethernet/amazon/ena/ena_netdev.h |  3 ++-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 5417e4da64ca..988d0383b4e7 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2579,6 +2579,7 @@ static int ena_restore_device(struct ena_adapter *adapter)
bool wd_state;
int rc;
 
+   set_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags);
rc = ena_device_init(ena_dev, adapter->pdev, &get_feat_ctx, &wd_state);
if (rc) {
dev_err(&pdev->dev, "Can not initialize device\n");
@@ -2592,6 +2593,11 @@ static int ena_restore_device(struct ena_adapter *adapter)
goto err_device_destroy;
}
 
+   clear_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags);
+   /* Make sure we don't have a race with AENQ Links state handler */
+   if (test_bit(ENA_FLAG_LINK_UP, &adapter->flags))
+   netif_carrier_on(adapter->netdev);
+
rc = ena_enable_msix_and_set_admin_interrupts(adapter,
  adapter->num_queues);
if (rc) {
@@ -2618,7 +2624,7 @@ static int ena_restore_device(struct ena_adapter *adapter)
ena_com_admin_destroy(ena_dev);
 err:
clear_bit(ENA_FLAG_DEVICE_RUNNING, &adapter->flags);
-
+   clear_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags);
dev_err(&pdev->dev,
"Reset attempt failed. Can not reset the device\n");
 
@@ -3495,7 +3501,8 @@ static void ena_update_on_link_change(void *adapter_data,
if (status) {
netdev_dbg(adapter->netdev, "%s\n", __func__);
set_bit(ENA_FLAG_LINK_UP, &adapter->flags);
-   netif_carrier_on(adapter->netdev);
+   if (!test_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags))
+   netif_carrier_on(adapter->netdev);
} else {
clear_bit(ENA_FLAG_LINK_UP, &adapter->flags);
netif_carrier_off(adapter->netdev);
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index ed8bd0a579c4..3bbc003871de 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -272,7 +272,8 @@ enum ena_flags_t {
ENA_FLAG_DEV_UP,
ENA_FLAG_LINK_UP,
ENA_FLAG_MSIX_ENABLED,
-   ENA_FLAG_TRIGGER_RESET
+   ENA_FLAG_TRIGGER_RESET,
+   ENA_FLAG_ONGOING_RESET
 };
 
 /* adapter specific private data structure */
-- 
2.7.3.AMZN
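
The flag handshake in this patch can be modelled in user space. The following is a minimal sketch, assuming C11 atomics in place of the kernel's set_bit()/test_bit() helpers; the struct, the flag names and the model_* functions are hypothetical stand-ins, and the carrier-off step and the setting of the reset flag are collapsed into one helper for brevity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's flag bits. */
enum { FLAG_LINK_UP = 1u << 0, FLAG_ONGOING_RESET = 1u << 1 };

struct model_adapter {
	atomic_uint flags;
	atomic_bool carrier_on;	/* models the netif carrier state */
};

/* AENQ link-change handler: always record the link state, but do not turn
 * the carrier on while a reset is in progress. */
static void model_link_change(struct model_adapter *a, bool link_up)
{
	if (link_up) {
		atomic_fetch_or(&a->flags, FLAG_LINK_UP);
		if (!(atomic_load(&a->flags) & FLAG_ONGOING_RESET))
			atomic_store(&a->carrier_on, true);
	} else {
		atomic_fetch_and(&a->flags, ~(unsigned)FLAG_LINK_UP);
		atomic_store(&a->carrier_on, false);
	}
}

/* Start of reset: stop Tx and mark the reset as ongoing. */
static void model_reset_begin(struct model_adapter *a)
{
	atomic_store(&a->carrier_on, false);
	atomic_fetch_or(&a->flags, FLAG_ONGOING_RESET);
}

/* End of reset: clear the marker, then replay a link-up event that
 * arrived while the reset was running. */
static void model_reset_end(struct model_adapter *a)
{
	atomic_fetch_and(&a->flags, ~(unsigned)FLAG_ONGOING_RESET);
	if (atomic_load(&a->flags) & FLAG_LINK_UP)
		atomic_store(&a->carrier_on, true);
}
```

In this model a link-up notification delivered between model_reset_begin() and model_reset_end() leaves the carrier off, and the carrier is turned on only once the reset has finished.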

Re: Fwd: FW: [PATCH 18/31] nds32: Library functions

2017-11-17 Thread Al Viro

On Tue, Nov 14, 2017 at 12:47:04PM +0800, Vincent Chen wrote:

> Thanks
> So, I should keep the area that we've copied into instead of zeroing
> the area even if an unexpected exception happens. Right?

Yes.  Here's what's required: if raw_copy_{from,to}_user(from, to, size)
returns n, we want
* 0 <= n <= size
* no bytes outside of to[0 .. size - n - 1] modified
* all bytes in that range replaced with corresponding bytes of range
from[0 .. size - n - 1]
* non-zero return values should happen only when some loads (in case
of raw_copy_from_user()) or stores (in case of raw_copy_to_user()) had failed.
If everything could have been read and written, we must copy everything.
* return value should be equal to size only if no load or no store
had been possible.  In all other cases you need to copy at least something.
You don't have to squeeze all bytes that can be copied (you can, of course,
but it's not required).
* you should not assume that failing load guarantees that subsequent
loads further into the same page will keep failing; normally they will, but
relying upon that is asking for trouble.  Several architectures had bugs
of that sort, with varying amounts of nastiness happening when e.g. write(2)
raced with mprotect(2) from another thread...

For almost any architecture these should be more or less parallel to memcpy();
the only exception I know of is the situation when cross-address-space copy
has timing very different from that for normal load+store.  s390 is that
way - there's considerable overhead of setting up such copying, and you really
want it done in bigger chunks than would be optimal for memcpy().  uml is
similar.  However, generally it's memcpy tweaked to deal with exceptions.
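
The return-value rules listed above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: model_raw_copy() and its fault_at parameter are hypothetical, the argument order follows memcpy(), and a real implementation would rely on exception tables rather than a simulated fault offset.

```c
#include <stddef.h>

/* User-space model of the contract. "fault_at" is a hypothetical stand-in
 * for the offset of the first inaccessible byte; pass a value >= size to
 * model a copy that never faults. */
static size_t model_raw_copy(unsigned char *to, const unsigned char *from,
			     size_t size, size_t fault_at)
{
	size_t done = 0;

	/* Copy byte by byte until the simulated fault or the end. */
	while (done < size && done < fault_at) {
		to[done] = from[done];
		done++;
	}

	/* Return the number of bytes NOT copied. The tail of "to" is left
	 * untouched rather than zeroed. */
	return size - done;
}
```

With size 8 and a fault at offset 5 the model returns 3, the first five destination bytes match the source, and the last three keep their previous contents.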

[PATCH v2 11/13] nubus: Rename struct nubus_dev

2017-11-17 Thread Finn Thain

It is misleading to call a functional resource a "device". In adopting
the Linux Driver Model, struct nubus_board will embed a struct device.
This will compound the problem because drivers will bind with boards,
not with functional resources.

Rename struct nubus_dev as struct nubus_rsrc. "Functional resource" is
the vendor's terminology so this helps to avoid confusion.

Cc: Bartlomiej Zolnierkiewicz 
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 drivers/net/ethernet/8390/mac8390.c |  26 
 drivers/net/ethernet/natsemi/macsonic.c |  22 +++
 drivers/nubus/nubus.c   | 105 
 drivers/nubus/proc.c|  15 ++---
 drivers/video/fbdev/macfb.c |   2 +-
 include/linux/nubus.h   |  30 +
 6 files changed, 98 insertions(+), 102 deletions(-)

diff --git a/drivers/net/ethernet/8390/mac8390.c b/drivers/net/ethernet/8390/mac8390.c
index 9497f18eaba0..929ff6419621 100644
--- a/drivers/net/ethernet/8390/mac8390.c
+++ b/drivers/net/ethernet/8390/mac8390.c
@@ -123,7 +123,8 @@ enum mac8390_access {
 };
 
 extern int mac8390_memtest(struct net_device *dev);
-static int mac8390_initdev(struct net_device *dev, struct nubus_dev *ndev,
+static int mac8390_initdev(struct net_device *dev,
+  struct nubus_rsrc *ndev,
   enum mac8390_type type);
 
 static int mac8390_open(struct net_device *dev);
@@ -169,11 +170,11 @@ static void word_memcpy_tocard(unsigned long tp, const void *fp, int count);
 static void word_memcpy_fromcard(void *tp, unsigned long fp, int count);
 static u32 mac8390_msg_enable;
 
-static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev)
+static enum mac8390_type __init mac8390_ident(struct nubus_rsrc *fres)
 {
-   switch (dev->dr_sw) {
+   switch (fres->dr_sw) {
case NUBUS_DRSW_3COM:
-   switch (dev->dr_hw) {
+   switch (fres->dr_hw) {
case NUBUS_DRHW_APPLE_SONIC_NB:
case NUBUS_DRHW_APPLE_SONIC_LC:
case NUBUS_DRHW_SONNET:
@@ -184,7 +185,7 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev)
break;
 
case NUBUS_DRSW_APPLE:
-   switch (dev->dr_hw) {
+   switch (fres->dr_hw) {
case NUBUS_DRHW_ASANTE_LC:
return MAC8390_NONE;
case NUBUS_DRHW_CABLETRON:
@@ -201,7 +202,7 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev)
case NUBUS_DRSW_TECHWORKS:
case NUBUS_DRSW_DAYNA2:
case NUBUS_DRSW_DAYNA_LC:
-   if (dev->dr_hw == NUBUS_DRHW_CABLETRON)
+   if (fres->dr_hw == NUBUS_DRHW_CABLETRON)
return MAC8390_CABLETRON;
else
return MAC8390_APPLE;
@@ -212,7 +213,7 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev)
break;
 
case NUBUS_DRSW_KINETICS:
-   switch (dev->dr_hw) {
+   switch (fres->dr_hw) {
case NUBUS_DRHW_INTERLAN:
return MAC8390_INTERLAN;
default:
@@ -225,8 +226,8 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev)
 * These correspond to Dayna Sonic cards
 * which use the macsonic driver
 */
-   if (dev->dr_hw == NUBUS_DRHW_SMC9194 ||
-   dev->dr_hw == NUBUS_DRHW_INTERLAN)
+   if (fres->dr_hw == NUBUS_DRHW_SMC9194 ||
+   fres->dr_hw == NUBUS_DRHW_INTERLAN)
return MAC8390_NONE;
else
return MAC8390_DAYNA;
@@ -289,7 +290,8 @@ static int __init mac8390_memsize(unsigned long membase)
return i * 0x1000;
 }
 
-static bool __init mac8390_init(struct net_device *dev, struct nubus_dev *ndev,
+static bool __init mac8390_init(struct net_device *dev,
+   struct nubus_rsrc *ndev,
enum mac8390_type cardtype)
 {
struct nubus_dir dir;
@@ -394,7 +396,7 @@ static bool __init mac8390_init(struct net_device *dev, struct nubus_dev *ndev,
 struct net_device * __init mac8390_probe(int unit)
 {
struct net_device *dev;
-   struct nubus_dev *ndev = NULL;
+   struct nubus_rsrc *ndev = NULL;
int err = -ENODEV;
struct ei_device *ei_local;
 
@@ -489,7 +491,7 @@ static const struct net_device_ops mac8390_netdev_ops = {
 };
 
 static int __init mac8390_initdev(struct net_device *dev,
- struct nubus_dev *ndev,
+ struct nubus_rsrc *ndev,
  enum mac8390_type type)
 {
static u32 fwrd4_offsets[16] = {
diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c
index a42433fb6949..14

[PATCH v2 12/13] nubus: Add expansion_type values for various Mac models

2017-11-17 Thread Finn Thain

Add an expansion slot attribute to allow drivers to properly handle
cards like Comm Slot cards and PDS cards without declaration ROMs.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 arch/m68k/include/asm/macintosh.h   |   9 ++-
 arch/m68k/mac/config.c  | 110 +---
 drivers/net/ethernet/natsemi/macsonic.c |   8 +--
 3 files changed, 54 insertions(+), 73 deletions(-)

diff --git a/arch/m68k/include/asm/macintosh.h b/arch/m68k/include/asm/macintosh.h
index f42c27400dbc..9b840c03ebb7 100644
--- a/arch/m68k/include/asm/macintosh.h
+++ b/arch/m68k/include/asm/macintosh.h
@@ -33,7 +33,7 @@ struct mac_model
char ide_type;
char scc_type;
char ether_type;
-   char nubus_type;
+   char expansion_type;
char floppy_type;
 };
 
@@ -73,8 +73,11 @@ struct mac_model
 #define MAC_ETHER_SONIC1
 #define MAC_ETHER_MACE 2
 
-#define MAC_NO_NUBUS   0
-#define MAC_NUBUS  1
+#define MAC_EXP_NONE   0
+#define MAC_EXP_PDS1 /* Accepts only a PDS card */
+#define MAC_EXP_NUBUS  2 /* Accepts only NuBus card(s) */
+#define MAC_EXP_PDS_NUBUS  3 /* Accepts PDS card and/or NuBus card(s) */
+#define MAC_EXP_PDS_COMM   4 /* Accepts PDS card or Comm Slot card */
 
 #define MAC_FLOPPY_IWM 0
 #define MAC_FLOPPY_SWIM_ADDR1  1
diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c
index 16cd5cea5207..d3d435248a24 100644
--- a/arch/m68k/mac/config.c
+++ b/arch/m68k/mac/config.c
@@ -212,7 +212,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_II,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_NUBUS,
.floppy_type= MAC_FLOPPY_IWM,
},
 
@@ -227,7 +227,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_II,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_NUBUS,
.floppy_type= MAC_FLOPPY_IWM,
}, {
.ident  = MAC_MODEL_IIX,
@@ -236,7 +236,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_II,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_NUBUS,
.floppy_type= MAC_FLOPPY_SWIM_ADDR2,
}, {
.ident  = MAC_MODEL_IICX,
@@ -245,7 +245,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_II,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_NUBUS,
.floppy_type= MAC_FLOPPY_SWIM_ADDR2,
}, {
.ident  = MAC_MODEL_SE30,
@@ -254,7 +254,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_II,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_PDS,
.floppy_type= MAC_FLOPPY_SWIM_ADDR2,
},
 
@@ -272,7 +272,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_IICI,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_NUBUS,
.floppy_type= MAC_FLOPPY_SWIM_ADDR2,
}, {
.ident  = MAC_MODEL_IIFX,
@@ -281,7 +281,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_IICI,
.scsi_type  = MAC_SCSI_IIFX,
.scc_type   = MAC_SCC_IOP,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_PDS_NUBUS,
.floppy_type= MAC_FLOPPY_SWIM_IOP,
}, {
.ident  = MAC_MODEL_IISI,
@@ -290,7 +290,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_IICI,
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
-   .nubus_type = MAC_NUBUS,
+   .expansion_type = MAC_EXP_PDS_NUBUS,
.floppy_type= MAC_FLOPPY_SWIM_ADDR2,
}, {
.ident  = MAC_MODEL_IIVI,
@@ -299,7 +299,7 @@ static struct mac_model mac_data_table[] = {
.via_type   = MAC_VIA_IICI,
.scsi_type  = MAC_SCSI_LC,
.scc_type   = MAC_

[PATCH net] tcp: when scheduling TLP, time of RTO should account for current ACK

2017-11-17 Thread Neal Cardwell

Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e6e was:
  -> schedule a TLP for now + RTO_interval

The response after df92c8394e6e is:
  -> schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:

1) Near the top of tcp_ack(), switch from TLP timer to RTO
   at write_queue_head->packet_tx_time + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
   tcp_rearm_rto(sk);

2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
if (flag & FLAG_ACKED) {
   tcp_rearm_rto(sk);

3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
   to TLP at now + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
   tcp_schedule_loss_probe(sk);

In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.

Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 2 +-
 net/ipv4/tcp_input.c  | 2 +-
 net/ipv4/tcp_output.c | 8 +---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578195d4..4e09398009c1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -539,7 +539,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
-bool tcp_schedule_loss_probe(struct sock *sk);
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto);
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 const struct sk_buff *next_skb);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d392fb..f31de422b37f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2964,7 +2964,7 @@ void tcp_rearm_rto(struct sock *sk)
 /* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
 static void tcp_set_xmit_timer(struct sock *sk)
 {
-   if (!tcp_schedule_loss_probe(sk))
+   if (!tcp_schedule_loss_probe(sk, true))
tcp_rearm_rto(sk);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 540b7d92cc70..a4d214c7b506 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2391,7 +2391,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
/* Send one loss probe per tail loss episode. */
if (push_one != 2)
-   tcp_schedule_loss_probe(sk);
+   tcp_schedule_loss_probe(sk, false);
is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
tcp_cwnd_validate(sk, is_cwnd_limited);
return false;
@@ -2399,7 +2399,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
return !tp->packets_out && !tcp_write_queue_empty(sk);
 }
 
-bool tcp_schedule_loss_probe(struct sock *sk)
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
@@ -2440,7 +2440,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
}
 
/* If the RTO formula yields an earlier time, then use that time. */
-   rto_delta_us = tcp_rto_delta_us(sk);  /* How far in future is RTO? */
+   rto_delta_us = advancing_rto ?
+   jiffies_to_usecs(inet_csk(sk)->icsk_rto) :
+   tcp_rto_delta_us(sk);  /* How far in future is RTO? */
if (rto_delta_us > 0)
timeout = min_t(u32, tim

[GIT] Networking

2017-11-17 Thread David Miller


1) Revert regression inducing change to the IPSEC template resolver,
   from Steffen Klassert.

2) Peeloffs can cause the wrong sk to be woken up in SCTP, fix from
   Xin Long.

3) Min packet MTU size is wrong in cpsw driver, from Grygorii Strashko.

4) Fix build failure in netfilter ctnetlink, from Arnd Bergmann.

5) ISDN hisax driver checks pnp_irq() for errors incorrectly, from
   Arvind Yadav.

6) Fix fealnx driver build failure on MIPS, from Huacai Chen.

7) Fix info leak in SCTP, the scope_id of socket addresses is not
   always filled in.  From Eric W. Biederman.

8) MTU inheritance between physical function and representor fix
   in nfp driver, from Dirk van der Merwe.

9) Fix memory leak in rsi driver, from Colin Ian King.

10) Fix expiration and generation ID handling of cached ipv4
redirect routes, from Xin Long.

Please pull, thanks a lot!

The following changes since commit 6363b3f3ac5be096d08c8c504128befa0c033529:

  Merge tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmi (2017-11-15 15:12:28 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 461ee7f3286dd50be4726606819c4228bc485a17:

  net: usb: hso.c: remove unneeded DRIVER_LICENSE #define (2017-11-18 10:37:00 +0900)


Ahmed Abdelsalam (1):
  ipv6: sr: update the struct ipv6_sr_hdr

Arnd Bergmann (1):
  netfilter: add ifdef around ctnetlink_proto_size

Arvind Yadav (12):
  isdn: hisax: Fix pnp_irq's error checking for setup_asuscom
  isdn: hisax: Fix pnp_irq's error checking for avm_pnp_setup
  isdn: hisax: Fix pnp_irq's error checking for setup_diva_isapnp
  isdn: hisax: Fix pnp_irq's error checking for setup_elsa_isapnp
  isdn: hisax: Fix pnp_irq's error checking for setup_hfcsx
  isdn: hisax: Fix pnp_irq's error checking for setup_hfcs
  isdn: hisax: Handle return value of pnp_irq and pnp_port_start
  isdn: hisax: Fix pnp_irq's error checking for setup_isurf
  isdn: hisax: Fix pnp_irq's error checking for setup_ix1micro
  isdn: hisax: Fix pnp_irq's error checking for setup_niccy
  isdn: hisax: Fix pnp_irq's error checking for setup_sedlbauer_isapnp
  isdn: hisax: Fix pnp_irq's error checking for setup_teles3

Colin Ian King (2):
  qed: use kzalloc instead of kmalloc and memset
  rsi: fix memory leak on buf and usb_reg_buf

David S. Miller (3):
  Merge branch 'isdn-hisax-Fix-pnp_irq-error-checking'
  Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
  Merge branch 'nfp-flower-fixes-and-typo-in-ethtool-stats-name'

Desnes Augusto Nunes do Rosario (1):
  ibmvnic: fix dma_mapping_error call

Dirk van der Merwe (1):
  nfp: inherit the max_mtu from the PF netdev

Eric W. Biederman (1):
  net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Girish Moodalbail (1):
  ipvlan: NULL pointer dereference panic in ipvlan_port_destroy

Greg Kroah-Hartman (1):
  net: usb: hso.c: remove unneeded DRIVER_LICENSE #define

Grygorii Strashko (1):
  net: ethernet: ti: cpsw: fix min eth packet size

Herbert Xu (1):
  xfrm: Copy policy family in clone_policy

Huacai Chen (1):
  fealnx: Fix building error on MIPS

Joel Stanley (1):
  virto_net: remove empty file 'virtio_net.'

John Hurley (2):
  nfp: register flower reprs for egress dev offload
  nfp: remove false positive offloads in flower vxlan

Jon Maloy (1):
  tipc: enforce valid ratio between skb truesize and contents

Michal Kubecek (1):
  genetlink: fix genlmsg_nlhdr()

Pieter Jansen van Vuuren (2):
  nfp: fix flower offload metadata flag usage
  nfp: fix vlan receive MAC statistics typo

Steffen Klassert (1):
  Revert "xfrm: Fix stack-out-of-bounds read in xfrm_state_find."

Tim Hansen (1):
  net/netlabel: Add list_next_rcu() in rcu_dereference().

Vitaly Kuznetsov (1):
  hv_netvsc: preserve hw_features on mtu/channels/ringparam changes

Xin Long (6):
  sctp: do not free asoc when it is already dead in sctp_sendmsg
  sctp: use the right sk after waking up from wait_buf sleep
  sctp: check stream reset info len before making reconf chunk
  sctp: set frag_point in sctp_setsockopt_maxseg correctly
  route: update fnhe_expires for redirect when the fnhe exists
  route: also update fnhe_genid when updating a route cache

 drivers/isdn/hisax/asuscom.c |   2 +-
 drivers/isdn/hisax/avm_pci.c |   2 +-
 drivers/isdn/hisax/diva.c|   2 +-
 drivers/isdn/hisax/elsa.c|   2 +-
 drivers/isdn/hisax/hfc_sx.c  |   2 +-
 drivers/isdn/hisax/hfcscard.c|   2 +-
 drivers/isdn/hisax/hisax_fcpcipnp.c  |   2 +
 drivers/isdn/hisax/isurf.c   |   2 +-
 drivers/isdn/hisax/ix1_micro.c

Re: [PATCH 7/8] net: ovs: remove unused hardirq.h

2017-11-17 Thread Yang Shi

It looks like the email address of Pravin in the MAINTAINERS file is obsolete;
resending to the right address.


Yang


On 11/17/17 3:02 PM, Yang Shi wrote:

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by openvswitch at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Pravin Shelar 
Cc: "David S. Miller" 
Cc: d...@openvswitch.org
---
  net/openvswitch/vport-internal_dev.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 04a3128..2f47c65 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -16,7 +16,6 @@
   * 02110-1301, USA
   */
  
-#include 

  #include 
  #include 
  #include

Re: iproute2: make ip route list to search by metric too

2017-11-17 Thread Alexander Zubkov

Hello again,

Things turned out to be not so hard. Please take a look at the attached patch.
The only thing I'm not sure about is whether RTA_PRIORITY is enough, because the
print_route function also prints "metric" in some situations involving
RTA_METRICS, which I haven't managed to understand.

On Fri, Nov 17, 2017 at 1:40 AM, Alexander Zubkov  wrote:
> Hello all,
>
> Currently routes in the Linux routing table have these "key" fields:
> prefix, tos, table, metric (as far as I know). I.e. we cannot have two
> different routes with the same set of these fields. The "ip route list"
> command accepts all but one of those fields: we cannot
> pass a metric to it, which is inconvenient. I would like to ask whether this
> behaviour can be changed by someone. We can even filter on "secondary" fields,
> for example type, dev or via, but unfortunately not on metric.
> Sorry, I cannot provide patches. It has been a long time since I wrote code. I
> tried to trace it, and as far as I can see it parses the arguments and fills
> some structures, but beyond that my attempts to understand it failed.
> I opened the bug: https://bugzilla.kernel.org/show_bug.cgi?id=197897,
> but it was pointed out to me that this mailing list is a better place for
> this question.
>
> --
> Alexander Zubkov
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -126,6 +126,8 @@ static struct
 	int oif, oifmask;
 	int mark, markmask;
 	int realm, realmmask;
+	int have_metric;
+	__u32 metric;
 	inet_prefix rprefsrc;
 	inet_prefix rvia;
 	inet_prefix rdst;
@@ -288,6 +290,14 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 		if ((mark ^ filter.mark) & filter.markmask)
 			return 0;
 	}
+	if (filter.have_metric) {
+		__u32 metric = 0;
+
+		if (tb[RTA_PRIORITY])
+			metric = rta_getattr_u32(tb[RTA_PRIORITY]);
+		if (filter.metric != metric)
+			return 0;
+	}
 	if (filter.flushb &&
 	r->rtm_family == AF_INET6 &&
 	r->rtm_dst_len == 0 &&
@@ -1518,6 +1528,16 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action)
 			if (get_unsigned(&mark, *argv, 0))
 invarg("invalid mark value", *argv);
 			filter.markmask = -1;
+		} else if (matches(*argv, "metric") == 0 ||
+		   matches(*argv, "priority") == 0 ||
+		   strcmp(*argv, "preference") == 0) {
+			__u32 metric;
+
+			NEXT_ARG();
+			if (get_u32(&metric, *argv, 0))
+invarg("\"metric\" value is invalid\n", *argv);
+			filter.metric = metric;
+			filter.have_metric = 1;
 		} else if (strcmp(*argv, "via") == 0) {
 			int family;
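
The matching rule added to filter_nlmsg() above can be reduced to a small predicate. The sketch below is a stand-alone model, not iproute2 code: struct metric_filter, metric_filter_match() and the has_priority flag are hypothetical names, with has_priority standing in for "tb[RTA_PRIORITY] is present".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the two fields the patch adds to the
 * filter state. */
struct metric_filter {
	bool have_metric;
	uint32_t metric;
};

/* Returns true if a route passes the metric filter. An absent
 * RTA_PRIORITY attribute is treated as metric 0, as in the patch. */
static bool metric_filter_match(const struct metric_filter *f,
				bool has_priority, uint32_t priority)
{
	uint32_t metric = has_priority ? priority : 0;

	if (!f->have_metric)
		return true;	/* no metric requested: accept every route */
	return f->metric == metric;
}
```

One consequence of this rule is that asking for metric 0 also selects routes that carry no RTA_PRIORITY attribute at all.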

Re: [PATCH net] net: accept UFO datagrams from tuntap and packet

2017-11-17 Thread David Miller

From: Willem de Bruijn 
Date: Fri, 17 Nov 2017 17:59:13 -0500

> Tuntap and similar devices can inject GSO packets. Accept type
> VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
> 
> Processes are expected to use feature negotiation such as TUNSETOFFLOAD
> to detect supported offload types and refrain from injecting other
> packets. This process breaks down with live migration: guest kernels
> do not renegotiate flags, so destination hosts need to expose all
> features that the source host does.
> 
> Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
> This patch introduces nearly(*) no new code to simplify verification.
> It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
> insertion and software UFO segmentation.

This looks good, one minor nit:

> @@ -2369,6 +2369,10 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
>   features |= NETIF_F_TSO6;
>   arg &= ~(TUN_F_TSO4|TUN_F_TSO6);
>   }
> +
> + if (arg & TUN_F_UFO) {
> + arg &= ~TUN_F_UFO;
> + }

This can be just simply "arg &= ~TUN_F_UFO;"?  If anything the curly braces
should be removed for a single statement basic block.

Thanks for working so hard on fixing this.
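
The nit above rests on the guarded and unguarded forms being equivalent, which a quick user-space check can confirm. This is a minimal sketch: the MODEL_TUN_F_UFO value is an assumption for illustration (the real TUN_F_UFO comes from <linux/if_tun.h>), and the function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed bit value for illustration only. */
#define MODEL_TUN_F_UFO 0x10UL

/* Form used in the patch: test the bit, then clear it. */
static unsigned long clear_ufo_guarded(unsigned long arg)
{
	if (arg & MODEL_TUN_F_UFO) {
		arg &= ~MODEL_TUN_F_UFO;
	}
	return arg;
}

/* Form suggested in the review: clear the bit unconditionally. */
static unsigned long clear_ufo_plain(unsigned long arg)
{
	arg &= ~MODEL_TUN_F_UFO;
	return arg;
}

/* Check that both forms agree for every value in [0, limit). */
static bool clear_forms_agree(unsigned long limit)
{
	for (unsigned long arg = 0; arg < limit; arg++)
		if (clear_ufo_guarded(arg) != clear_ufo_plain(arg))
			return false;
	return true;
}
```

Clearing a bit that is already clear is a no-op, so the test adds nothing and the single statement is enough.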

Re: [PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define

2017-11-17 Thread David Miller

From: Greg Kroah-Hartman 
Date: Fri, 17 Nov 2017 15:19:39 +0100

> There is no need to #define the license of the driver, just put it in
> the MODULE_LICENSE() line directly as a text string.
> 
> This allows tools that check that the module license matches the source
> code license to work properly, as there is no need to unwind the
> unneeded dereference.
> 
> Cc: "David S. Miller" 
> Cc: Andreas Kemnade 
> Cc: Johan Hovold 
> Reported-by: Philippe Ombredanne 
> Signed-off-by: Greg Kroah-Hartman 

Applied.

Re: [PATCH net v2 1/1] ipvlan: NULL pointer dereference panic in ipvlan_port_destroy

2017-11-17 Thread David Miller

From: Girish Moodalbail 
Date: Thu, 16 Nov 2017 23:16:17 -0800

> When the call to register_netdevice() (called from ipvlan_link_new()) fails,
> we call ipvlan_uninit() (through ndo_uninit()) to destroy the ipvlan
> port. After returning unsuccessfully from register_netdevice() we go
> ahead and call ipvlan_port_destroy() again, which causes a NULL pointer
> dereference panic. Fix the issue by making the ipvlan_init() and
> ipvlan_uninit() calls symmetric.
> 
> The ipvlan port will now be created inside ipvlan_init() and will be
> destroyed in ipvlan_uninit().
> 
> Fixes: 2ad7bf363841 (ipvlan: Initial check-in of the IPVLAN driver)
> Signed-off-by: Girish Moodalbail 

Applied.

Re: [PATCH] [net] ibmvnic: fix dma_mapping_error call

2017-11-17 Thread David Miller

From: Desnes Augusto Nunes do Rosario 
Date: Fri, 17 Nov 2017 09:09:04 -0200

> This patch fixes the dma_mapping_error call to use the correct dma_addr
> which is inside the ibmvnic_vpd struct. Moreover, it fixes an uninitialized
> warning regarding a local dma_addr variable which is not used anymore.
> 
> Fixes: 4e6759be28e4 ("ibmvnic: Feature implementation of VPD for the ibmvnic driver")
> Reported-by: Stephen Rothwell 
> Signed-off-by: Desnes A. Nunes do Rosario 

Applied.

Re: [PATCH] net/netlabel: Add list_next_rcu() in rcu_dereference().

2017-11-17 Thread David Miller

From: Tim Hansen 
Date: Thu, 16 Nov 2017 12:03:34 -0500

> Add list_next_rcu() for fetching the next list entry in rcu_dereference() safely.
> 
> Found with sparse in linux-next tree on tag next-20171116.
> 
> Signed-off-by: Tim Hansen 

Applied.

Re: [PATCH net] route: update fnhe_expires for redirect when the fnhe exists

2017-11-17 Thread David Miller

From: Xin Long 
Date: Fri, 17 Nov 2017 14:27:06 +0800

> Now when creating a fnhe for a redirect, fnhe_expires is set for this
> new route cache entry. But when updating an existing one, it is not.
> As a result, this fnhe never expires.
> 
> Paolo already noticed this before; in Jianlin's test case, it became
> even worse:
> 
> On "ip route flush cache", the old fnhe is not removed; only its
> members are cleared. When a redirect comes again, this fnhe is found
> and updated, but it never expires because fnhe_expires is not
> set.
> 
> So fix it by simply updating fnhe_expires even when it's for a redirect.
> 
> Fixes: aee06da6726d ("ipv4: use seqlock for nh_exceptions")
> Reported-by: Jianlin Shi 
> Acked-by: Hannes Frederic Sowa 
> Signed-off-by: Xin Long 

Applied.

Re: [PATCH net] route: also update fnhe_genid when updating a route cache

2017-11-17 Thread David Miller

From: Xin Long 
Date: Fri, 17 Nov 2017 14:27:18 +0800

> Now after "ip route flush cache", every fnhe has fnhe_genid != genid.
> If a redirect/pmtu icmp packet then comes in and the old fnhe is found,
> all of its members except fnhe_genid are updated.
> 
> The next time a route lookup tries to rebind this fnhe to
> the new dst, the fnhe is flushed because fnhe_genid != genid. As a
> result, this redirect/pmtu icmp packet is never actually applied.
> 
> This patch also resets fnhe_genid when updating a route cache entry.
> 
> Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
> Acked-by: Hannes Frederic Sowa 
> Signed-off-by: Xin Long 

Applied.

Re: [PATCH net] sctp: set frag_point in sctp_setsockopt_maxseg correctly

2017-11-17 Thread David Miller

From: Xin Long 
Date: Fri, 17 Nov 2017 14:11:11 +0800

> Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with
> val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect.
> 
> val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT.
> Then in sctp_datamsg_from_user(), when its value is greater than cookie
> echo len and trying to bundle with cookie echo chunk, the first_len will
> overflow.
> 
> The worst case is when its value is equal to the cookie echo len: first_len
> becomes 0 and the fragmenting code later goes into a dead loop. In Hangbin's
> syzkaller testing env, oom was even triggered due to consecutive memory
> allocation in that loop.
> 
> Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, it should
> deduct the data header for frag_point or user_frag check.
> 
> This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting
> the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when
> setting frag_point via sockopt. It also improves sctp_setsockopt_maxseg
> codes.
> 
> Suggested-by: Marcelo Ricardo Leitner 
> Reported-by: Hangbin Liu 
> Signed-off-by: Xin Long 

Applied.
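
The bounds described in the commit message above can be sketched as a standalone range check. This is only an illustration under stated assumptions: the DEMO_* constants are placeholder values picked for the sketch, not taken from the kernel's SCTP headers, and demo_frag_point_ok() is a hypothetical helper, not the function the patch adds.

```c
#include <assert.h>

/* Illustrative values only; the real ones live in the kernel's SCTP headers. */
#define DEMO_MINSEGMENT    512u   /* stands in for SCTP_DEFAULT_MINSEGMENT */
#define DEMO_MAX_CHUNK_LEN 65532u /* stands in for SCTP_MAX_CHUNK_LEN */
#define DEMO_SCTPHDR_LEN   12u    /* stands in for sizeof(struct sctphdr) */
#define DEMO_DATAHDR_LEN   16u    /* stands in for the data chunk header size */

/* Return 1 if val is an acceptable fragmentation point, 0 otherwise. */
static int demo_frag_point_ok(unsigned int val)
{
	/* Lower bound: minimum segment minus the SCTP and data headers. */
	unsigned int min_len = DEMO_MINSEGMENT - DEMO_SCTPHDR_LEN - DEMO_DATAHDR_LEN;
	/* Upper bound: a whole chunk minus the data header. */
	unsigned int max_len = DEMO_MAX_CHUNK_LEN - DEMO_DATAHDR_LEN;

	return val >= min_len && val <= max_len;
}

/* The old lower bound of 8 is rejected; both limits are inclusive. */
static void demo_check_frag_point(void)
{
	assert(!demo_frag_point_ok(8));
	assert(demo_frag_point_ok(484));
	assert(demo_frag_point_ok(65516));
}
```

With these placeholder values the accepted range is 484 through 65516 inclusive.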

Re: [PATCH] qed: fix unnecessary call to memset cocci warnings

2017-11-17 Thread David Miller

From: Vasyl Gomonovych 
Date: Thu, 16 Nov 2017 23:04:08 +0100

> Use kzalloc rather than kmalloc followed by memset with 0
> 
> drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING:
> kzalloc should be used for dcbx_info, instead of kmalloc/memset
> Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci
> 
> Signed-off-by: Vasyl Gomonovych 

This patch doesn't even apply.

Re: JOIN_ANYCAST breakage w. "net: ipv6: put host and anycast routes on device with address"

2017-11-17 Thread David Ahern

On 11/14/17 10:36 AM, Florian Westphal wrote:
> Hi David
> 
> This test program no longer works with 4.14
> (recvfrom: Resource temporarily unavailable)
> 
> after reverting commit
> 4832c30d5458387ff2533ff66fbde26ad8bb5a2d
> (net: ipv6: put host and anycast routes on device with address)
> 
> it will work again ("OK").
> 
> Could you please have a look at this?
> 

This restores the previous behavior:

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 05eb7bc36156..1c29d9bcedc3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1019,7 +1019,7 @@ static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt)
 {
struct net_device *dev = rt->dst.dev;

-   if (rt->rt6i_flags & RTF_LOCAL) {
+   if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) {
/* for copies of local routes, dst->dev needs to be the
 * device if it is a master device, the master device if
 * device is enslaved, and the loopback as the default

Re: [ftrace-bpf 1/5] add BPF_PROG_TYPE_FTRACE to bpf

2017-11-17 Thread Alexei Starovoitov

On Mon, Nov 13, 2017 at 12:06:17AM -0800, peng yu wrote:
> > 1. anything bpf related has to go via net-next tree.
> I found there is a net-next git repo:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> I will use this repo for the further bpf-ftrace patch set.
> 
> > 2.
> > this obviously breaks ABI. New types can only be added to the end.
> Sure, I will add the new type at the end.
> 
> > 3.
> > this won't even compile, since ftrace_regs is only added in the patch 4.
> It could compile, as the ftrace_regs related code is inside the
> "#ifdef FTRACE_BPF_FILTER" macro, if this macro is not defined, no
> ftrace_regs related code would be compiled.
> 
> 
> > Since bpf program will see ftrace_regs as an input it becomes
> > abi, so has to be defined in uapi/linux/bpf_ftrace.h or similar.
> > We need to think through how to make it generic across archs
> > instead of defining ftrace_regs for each arch.
> I'm not sure whether I fully understand your meaning. Like kprobe,
> ftrace-bpf needs to get a function's parameters and check them. So
> it won't be ABI stable, and it has to depend on the architecture
> implementation. I can create a header file like uapi/linux/bpf_ftrace.h,
> but I noticed that kprobe doesn't have such a header file; if I'm
> wrong, please let me know. As for making it generic across archs, I
> know kprobe uses pt_regs as its parameter, and pt_regs is defined per
> arch, so I can't see how bpf-ftrace can get a generic interface across
> archs if it needs to check a function's parameters. If I misunderstand
> anything, please let me know.

all of ftrace are called at function entry and calling convention
is fixed per architecture, so we can make a generic and stable
struct bpf_ftrace_args {
  __u64 arg1, arg2, .. arg5;
};
save_mcount_regs doesn't care what order the regs are stored
so the same stack space can be used to keep bpf_ftrace_args
and used in restore_mcount_regs.
I'd also make it depend on DYNAMIC_FTRACE_WITH_REGS to avoid
dealing with obscure corner cases.
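
As a rough userspace sketch of that idea, the argument block could be filled from the saved registers as follows. The struct name follows the proposal above; struct demo_regs and demo_fill_args() are hypothetical helpers for illustration, and the register order assumes the x86-64 System V calling convention (rdi, rsi, rdx, rcx, r8 for the first five integer arguments).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the registers saved at function entry on x86-64. */
struct demo_regs {
	uint64_t rdi, rsi, rdx, rcx, r8;
};

/* Arch-independent argument block, following the struct proposed above. */
struct bpf_ftrace_args {
	uint64_t arg1, arg2, arg3, arg4, arg5;
};

/* Map the first five integer argument registers onto arg1..arg5. */
static void demo_fill_args(const struct demo_regs *regs,
			   struct bpf_ftrace_args *args)
{
	args->arg1 = regs->rdi;
	args->arg2 = regs->rsi;
	args->arg3 = regs->rdx;
	args->arg4 = regs->rcx;
	args->arg5 = regs->r8;
}

/* A program would then read args->argN without knowing the register names. */
static void demo_check_args(void)
{
	struct demo_regs regs = { 10, 20, 30, 40, 50 };
	struct bpf_ftrace_args args;

	demo_fill_args(&regs, &args);
	assert(args.arg1 == 10 && args.arg5 == 50);
}
```

Another architecture would only need its own version of the fill step; the layout seen by the program stays the same.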

> 
> > 4.
> > the patch 2/3 takes an approach of passing FD integer value in text form
> > to the kernel. That approach was discussed years ago and rejected.
> > It has to use binary interface like perf_event + ioctl.
> > See RFC patches where we're extending perf_event_open syscall to
> > support binary access to kprobe/uprobe.
> > imo binary interface to ftrace is pre-requisite to ftrace+bpf work.
> > We've had too many issues with text based kprobe api to repeat
> > the same mistake here.
> I notice the kprobe-bpf prog is set through the PERF_EVENT_IOC_SET_BPF
> ioctl, I may try to see whether I can reuse this interface, or if it
> is not suitable, I will try to define a new binary interface.
> 
> > 5.
> > patch 4 hacks save_mcount_regs asm to pass ctx pointer in %rcx
> > whereas it's only used in ftrace_graph_caller which doesn't seem right.
> > It points out to another issue that such ftrace+bpf integration
> > is only done for ftrace_graph_caller without extensibility in mind.
> > If we do ftrace+bpf I'd rather see generic framework that applies
> > to all of ftrace instead of single feature of it.
> It is a hard problem. The ftrace framework has lots of tracers,
> function tracer and function graph tracer use the 'gcc -pg' directly,
> other tracers use tracepoint, I should spend more time to find a
> suitable solution.

since all of ftrace goes through the same function entry point
it should be possible to have one generic bpf filter interface
suitable for all tracers that ftrace supports.

[PATCH 5/8] crypto: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by crypto at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: linux-cry...@vger.kernel.org
---
 crypto/ablk_helper.c | 1 -
 crypto/blkcipher.c   | 1 -
 crypto/mcryptd.c | 1 -
 3 files changed, 3 deletions(-)

diff --git a/crypto/ablk_helper.c b/crypto/ablk_helper.c
index 1441f07..ee52660 100644
--- a/crypto/ablk_helper.c
+++ b/crypto/ablk_helper.c
@@ -28,7 +28,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index 6c43a0a..01c0d4a 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/crypto/mcryptd.c b/crypto/mcryptd.c
index 4e64726..9fa362c 100644
--- a/crypto/mcryptd.c
+++ b/crypto/mcryptd.c
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MCRYPTD_MAX_CPU_QLEN 100
 #define MCRYPTD_BATCH 9
-- 
1.8.3.1

[PATCH 2/8] fs: pstore: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by pstore at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Kees Cook 
Cc: Anton Vorontsov 
Cc: Colin Cross 
Cc: Tony Luck 
---
 fs/pstore/platform.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 2b21d18..25dcef4 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -41,7 +41,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
-- 
1.8.3.1

[PATCH 3/8] fs: btrfs: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
---
 fs/btrfs/extent_map.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2e348fb..cced7f1 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -2,7 +2,6 @@
 #include 
 #include 
 #include 
-#include 
 #include "ctree.h"
 #include "extent_map.h"
 #include "compression.h"
-- 
1.8.3.1

[PATCH 7/8] net: ovs: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by openvswitch at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Pravin Shelar 
Cc: "David S. Miller" 
Cc: d...@openvswitch.org
---
 net/openvswitch/vport-internal_dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 04a3128..2f47c65 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -16,7 +16,6 @@
  * 02110-1301, USA
  */
 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1

[PATCH 4/8] vfs: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by vfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Alexander Viro 
---
 fs/dcache.c | 1 -
 fs/file_table.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..9340e8c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f5..dab099e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -23,7 +23,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1

[PATCH 8/8] net: tipc: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by TIPC at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Jon Maloy 
Cc: Ying Xue 
Cc: "David S. Miller" 
---
 net/tipc/core.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index 5cc5398..099e072 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -49,7 +49,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1

[PATCH 6/8] net: caif: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by caif at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Dmitry Tarnyagin 
Cc: "David S. Miller" 
---
 net/caif/cfpkt_skbuff.c | 1 -
 net/caif/chnl_net.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/net/caif/cfpkt_skbuff.c b/net/caif/cfpkt_skbuff.c
index 71b6ab2..38c2b7a 100644
--- a/net/caif/cfpkt_skbuff.c
+++ b/net/caif/cfpkt_skbuff.c
@@ -8,7 +8,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 
diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c
index 922ac1d..53ecda1 100644
--- a/net/caif/chnl_net.c
+++ b/net/caif/chnl_net.c
@@ -8,7 +8,6 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ":%s(): " fmt, __func__
 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1

[PATCH 1/8] mm: kmemleak: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by kmemleak at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
---
 mm/kmemleak.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 7780cd8..25b977f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1

[iproute2 PATCH] man: tc-flower: add explanation for hw_tc option

2017-11-17 Thread Amritha Nambiar

Add details explaining the hw_tc option.

Signed-off-by: Amritha Nambiar 
---
 man/man8/tc-flower.8 |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index be46f02..fd9098e 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -10,7 +10,10 @@ flower \- flow based traffic control filter
 .B action
 .IR ACTION_SPEC " ] [ "
 .B classid
-.IR CLASSID " ]"
+.IR CLASSID " ] [ "
+.B hw_tc
+.IR TCID " ]"
+
 
 .ti -8
 .IR MATCH_LIST " := [ " MATCH_LIST " ] " MATCH
@@ -77,6 +80,10 @@ is in the form
 .BR X : Y ", while " X " and " Y
 are interpreted as numbers in hexadecimal format.
 .TP
+.BI hw_tc " TCID"
+Specify a hardware traffic class to pass matching packets on to. TCID is in the
+range 0 through 15.
+.TP
 .BI indev " ifname"
 Match on incoming interface name. Obviously this makes sense only for forwarded
 flows.

[iproute2 PATCH] man: tc-mqprio: add documentation for new offload options

2017-11-17 Thread Amritha Nambiar

This patch adds documentation for additional offload modes and
associated parameters in tc-mqprio.

Signed-off-by: Amritha Nambiar 
---
 man/man8/tc-mqprio.8 |   60 +-
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/man/man8/tc-mqprio.8 b/man/man8/tc-mqprio.8
index 0e1d305..a1bedd3 100644
--- a/man/man8/tc-mqprio.8
+++ b/man/man8/tc-mqprio.8
@@ -16,7 +16,17 @@ P0 P1 P2...
 count1@offset1 count2@offset2 ...
 .B ] [ hw
 1|0
-.B ]
+.B ] [ mode
+dcb|channel]
+.B ] [ shaper
+dcb|
+.B [ bw_rlimit
+.B min_rate
+min_rate1 min_rate2 ...
+.B max_rate
+max_rate1 max_rate2 ...
+.B ]]
+
 
 .SH DESCRIPTION
 The MQPRIO qdisc is a simple queuing discipline that allows mapping
@@ -36,14 +46,16 @@ and
 By default these parameters are configured by the hardware
 driver to match the hardware QOS structures.
 
-Enabled hardware can provide hardware QOS with the ability to steer
-traffic flows to designated traffic classes provided by this qdisc.
-Configuring the hardware based QOS mechanism is outside the scope of
-this qdisc. Tools such as
-.B lldpad
-and
-.B ethtool
-exist to provide this functionality. Also further qdiscs may be added
+.B Channel
+mode supports full offload of the mqprio options, the traffic classes, the queue
+configurations and QOS attributes to the hardware. Enabled hardware can provide
+hardware QOS with the ability to steer traffic flows to designated traffic
+classes provided by this qdisc. Hardware based QOS is configured using the
+.B shaper
+parameter.
+.B bw_rlimit
+with minimum and maximum bandwidth rates can be used for setting
+transmission rates on each traffic class. Also further qdiscs may be added
 to the classes of MQPRIO to create more complex configurations.
 
 .SH ALGORITHM
@@ -104,9 +116,35 @@ contiguous range of queues.
 hw
 Set to
 .B 1
-to use hardware QOS defaults. Set to
+to support hardware offload. Set to
 .B 0
-to override hardware defaults with user specified values.
+to configure user specified values in software only.
+
+.TP
+mode
+Set to
+.B channel
+for full use of the mqprio options. Use
+.B dcb
+to offload only TC values and use hardware QOS defaults. Supported with 'hw'
+set to 1 only.
+
+.TP
+shaper
+Use
+.B bw_rlimit
+to set bandwidth rate limits for a traffic class. Use
+.B dcb
+for hardware QOS defaults. Supported with 'hw' set to 1 only.
+
+.TP
+min_rate
+Minimum value of bandwidth rate limit for a traffic class.
+
+.TP
+max_rate
+Maximum value of bandwidth rate limit for a traffic class.
+
 
 .SH AUTHORS
 John Fastabend,

Re: regression: UFO removal breaks kvm live migration

2017-11-17 Thread Willem de Bruijn

On Fri, Nov 17, 2017 at 9:48 AM, Willem de Bruijn
 wrote:
>>> Okay, I will send a patch to reinstate UFO for this use case (only). There
>>> is some related work in tap_handle_frame and packet_direct_xmit to
>>> segment directly in the device. I will be traveling the next few days, so
>>> it won't be in time for 4.14 (but can go in stable later, of course).
>>
>> I'm finishing up and running some tests. The majority of the patch is a
>> straightforward partial revert of the patchset, so while fairly large for a
>> patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
>> thoroughly tested code. Notably absent are the protocol layer and
>> hardware support (NETIF_F_UFO) portions.
>>
>> The only open issue is whether to rely on existing skb_gso_segment
>> processing in the transmit path from validate_xmit_skb or to add new
>> skb_gso_segment calls directly to tun_get_user, tap_get_user and
>> pf_packet. Tun has to loop around four different ways of injecting
>> packets into the device. Something like the below snippet.
>>
>> More conservative is to introduce no completely new code and rely on
>> validate_xmit_skb, but that means having to protect the entire stack
>> against skbs with SKB_GSO_UDP, so also bringing back some
>> checksum and fragment handling snippets in gre_gso_segment,
>> __skb_udp_tunnel_segment, act_csum and openvswitch.
>
> Come to think of it, as this patch does not bring back NETIF_F_UFO
> support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be
> excluded.
>
> Then this is probably the simpler and more obviously correct approach.

Sent: http://patchwork.ozlabs.org/patch/839168/

[PATCH net] net: accept UFO datagrams from tuntap and packet

2017-11-17 Thread Willem de Bruijn

From: Willem de Bruijn 

Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216fa8 ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32, which is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Link: 
http://lkml.kernel.org/r/
Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek 
Signed-off-by: Willem de Bruijn 
---
 drivers/net/tap.c   |  2 +-
 drivers/net/tun.c   |  4 ++
 include/linux/netdev_features.h |  4 +-
 include/linux/netdevice.h   |  1 +
 include/linux/skbuff.h  |  2 +
 include/linux/virtio_net.h  |  5 ++-
 include/net/ipv6.h  |  1 +
 net/core/dev.c  |  3 +-
 net/ipv4/af_inet.c  | 12 +-
 net/ipv4/udp_offload.c  | 49 ++--
 net/ipv6/output_core.c  | 31 +++
 net/ipv6/udp_offload.c  | 85 +++--
 net/openvswitch/datapath.c  | 14 +++
 net/openvswitch/flow.c  |  6 ++-
 net/sched/act_csum.c|  6 +++
 15 files changed, 211 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index b13890953ebb..e9489b88407c 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1077,7 +1077,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-   TUN_F_TSO_ECN))
+   TUN_F_TSO_ECN | TUN_F_UFO))
return -EINVAL;
 
rtnl_lock();
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6bb1e604aadd..a33385d8ac65 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2369,6 +2369,10 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
features |= NETIF_F_TSO6;
arg &= ~(TUN_F_TSO4|TUN_F_TSO6);
}
+
+   if (arg & TUN_F_UFO) {
+   arg &= ~TUN_F_UFO;
+   }
}
 
/* This gives the user a way to test for new features in future by
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index dc8b4896b77b..b1b0ca7ccb2b 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -54,8 +54,9 @@ enum {
NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
NETIF_F_GSO_SCTP_BIT,   /* ... SCTP fragmentation */
NETIF_F_GSO_ESP_BIT,/* ... ESP with TSO */
+   NETIF_F_GSO_UDP_BIT,/* ... UFO, deprecated except tuntap */
/**/NETIF_F_GSO_LAST =  /* last bit, see GSO_MASK */
-   NETIF_F_GSO_ESP_BIT,
+   NETIF_F_GSO_UDP_BIT,
 
NETIF_F_FCOE_CRC_BIT,   /* FCoE CRC32 */
NETIF_F_SCTP_CRC_BIT,   /* SCTP checksum offload */
@@ -132,6 +133,7 @@ enum {
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_GSO_SCTP   __NETIF_F(GSO_SCTP)
 #define NETIF_F_GSO_ESP__NETIF_F(GSO_ESP)
+#define NETIF_F_GSO_UDP__NETIF_F(GSO_UDP)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX__NETIF_F(HW_VLAN_STAG_TX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6b274bfe489f..ef789e1d679e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4140,6 +4140,7 @@ static inline bool net_gso_ok(netdev_features_t features, 
int gso_ty

Re: [LTP] [RFC] [PATCH] netns: Fix race in virtual interface bringup

2017-11-17 Thread Dan Rue

Alexey, Li, thank you for your suggestions.

On Fri, Nov 17, 2017 at 03:08:20PM +0300, Alexey Kodanev wrote:
> On 11/17/2017 09:09 AM, Li Wang wrote:
> > Hi Dan,
> >
> > On Fri, Nov 10, 2017 at 4:38 AM, Dan Rue  wrote:
> >> Symptoms (+ command, error):
> >> netns_comm_ip_ipv6_ioctl:
> >> + ip netns exec tst_net_ns1 ping6 -q -c2 -I veth1 fd00::2
> >> connect: Cannot assign requested address
> >>
> >> netns_comm_ip_ipv6_netlink:
> >> + ip netns exec tst_net_ns0 ping6 -q -c2 -I veth0 fd00::3
> >> connect: Cannot assign requested address
> >>
> >> netns_comm_ns_exec_ipv6_ioctl:
> >> + ns_exec 6689 net ping6 -q -c2 -I veth0 fd00::3
> >> connect: Cannot assign requested address
> >>
> >> netns_comm_ns_exec_ipv6_netlin:
> >> + ns_exec 6891 net ping6 -q -c2 -I veth0 fd00::3
> >> connect: Cannot assign requested address
> >>
> >> The error is coming from ping6, which is trying to get an IP address for
> >> veth0 (due to -I veth0), but cannot. Waiting for two seconds fixes the
> >> test in my testcases. 1 second is not long enough.
> >>
> >> dmesg shows the following during the test:
> >>
> >> [Nov 7 15:39] LTP: starting netns_comm_ip_ipv6_ioctl (netns_comm.sh ip ipv6 ioctl)
> >> [  +0.302401] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
> >> [  +0.048059] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 
> It's quite strange that the veth interface needs 2 seconds to become
> operational when it is up in less than 0.3s according to dmesg, yet
> you said that even 1 sec is not enough... Are you sure that the IPv6
> address is not in tentative state and that the DAD process is actually
> disabled? I'm asking because you don't have it disabled in the script:
> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3

Investigating further, the dmesg output is reporting on the status of
the link between veth0 and veth1, not the veth0 interface itself. That
is, the first dmesg message comes from "ip netns exec tst_net_ns0
ifconfig veth0 up" and the second comes from "ip netns exec tst_net_ns1
ifconfig veth1 up". This explains why we see .3s in dmesg but a 2 second
sleep being required. There is not actually anything in dmesg that is
helpful here.

Regarding dad (duplicate address detection), we have seen similar issues
on low power ARM64 boards and IPv4. Anyway, I tried disabling dad on the
interface and it did not make a difference.

> 
> >>
> >> Signed-off-by: Dan Rue 
> >> ---
> >>
> >> We've periodically hit this problem across many arm64 kernels and boards,
> >> and it seems to be caused by "ping6" running before the virtual interface
> >> is actually ready. "sleep 2" works around the issue and proves that it is
> >> a race condition, but I would prefer something faster and deterministic.
> >> Please suggest a better implementation.
> > Just FYI:
> >
> > I'm not good at network things, but one method I copied from ltp/numa
> > test is to split the '2s' into many smaller pieces of time.
> >
> > Something like:
> >
> > --- a/testcases/kernel/containers/netns/netns_helper.sh
> > +++ b/testcases/kernel/containers/netns/netns_helper.sh
> > @@ -240,6 +240,22 @@ netns_ip_setup()
> > tst_brkm TBROK "unable to add device veth1 to the separate network namespace"
> >  }
> >
> > +wait_for_set_ip()
> > +{
> > +   local dev=$1
> > +   local retries=200
> > +
> > +   while [ $retries -gt 0 ]; do
> > +   dmesg -c | grep -q "IPv6: ADDRCONF(NETDEV_CHANGE): $dev: link becomes ready"
> 
> 
> What about "grep -q up /sys/class/net/$dev/operstate && break"?

Since dmesg will not help, I explored /sys as proposed.

operstate shows "up", and ping6 still fails.
carrier shows "1" (up), and ping6 still fails.
dormant shows "0" (interface is not dormant), and ping6 still fails.
flags shows "0x1003" before and after a 2s sleep (they don't change)

So it seems there is nothing in dmesg, or /sys that can help here.

Dan

> 
> Thanks,
> Alexey
> 
> 
> > +   if [ $? -eq 0 ]; then
> > +   break
> > +   fi
> > +
> > +   retries=$((retries-1))
> > +   tst_sleep 10ms
> > +   done
> > +}
> > +
> >  ##
> >  # Enables virtual ethernet devices and assigns IP addresses for both
> >  # of them (IPv4/IPv6 variant is decided by netns_setup() function).
> > @@ -285,6 +301,9 @@ netns_set_ip()
> > tst_brkm TBROK "enabling veth1 device failed"
> > ;;
> > esac
> > +
> > +   wait_for_set_ip veth0
> > +   wait_for_set_ip veth1
> >  }
> >
> >  netns_ns_exec_cleanup()
> >
> >> Also, is it correct that "ifconfig veth0 up" returns before the interface
> >> is actually ready?
> >>
> >> See also this isolated test script:
> >> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3
> >>
> >>  testcases/kernel/containers/netns/netns_helper.sh | 1 +
> >>  1 file changed, 1

Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-17 Thread Stephen Hemminger

On Sat, 18 Nov 2017 02:13:38 +0530
Nishanth Devarajan  wrote:

> diff --git a/tc/tc_util.h b/tc/tc_util.h
> index 583a21a..7b7420a 100644
> --- a/tc/tc_util.h
> +++ b/tc/tc_util.h
> @@ -24,14 +24,14 @@ struct qdisc_util {
>   struct  qdisc_util *next;
>   const char *id;
>   int (*parse_qopt)(struct qdisc_util *qu, int argc,
> -   char **argv, struct nlmsghdr *n);
> +   char **argv, struct nlmsghdr *n, char *dev);

One more nit...
Since parsing queue options should not modify the device name, that should
be const char *.

[PATCH iproute2] tc: cleanup qdisc arg parsing

2017-11-17 Thread Stephen Hemminger

The qdisc arg parsing has a magic limit of 16 for class, which is not required
by the kernel. Also, the limit of 16 for the device name is really IFNAMSIZ.

Signed-off-by: Stephen Hemminger 
---
 tc/tc_qdisc.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index fcb75f29128e..1066ae05a4b5 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -49,8 +50,7 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv)
struct tc_sizespec  szopts;
__u16   *data;
} stab = {};
-   char  d[16] = {};
-   char  k[16] = {};
+   char  d[IFNAMSIZ] = {};
struct {
struct nlmsghdr n;
struct tcmsg t;
@@ -89,8 +89,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv)
return -1;
}
req.t.tcm_parent = TC_H_CLSACT;
-   strncpy(k, "clsact", sizeof(k) - 1);
-   q = get_qdisc_kind(k);
+
+   q = get_qdisc_kind("clsact");
req.t.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
NEXT_ARG_FWD();
break;
@@ -100,8 +100,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv)
return -1;
}
req.t.tcm_parent = TC_H_INGRESS;
-   strncpy(k, "ingress", sizeof(k) - 1);
-   q = get_qdisc_kind(k);
+
+   q = get_qdisc_kind("ingress");
req.t.tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
NEXT_ARG_FWD();
break;
@@ -124,26 +124,23 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv)
} else if (matches(*argv, "help") == 0) {
usage();
} else {
-   strncpy(k, *argv, sizeof(k)-1);
-
-   q = get_qdisc_kind(k);
+   q = get_qdisc_kind(*argv);
argc--; argv++;
break;
}
argc--; argv++;
}
 
-   if (k[0])
-   addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1);
if (est.ewma_log)
addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est));
 
if (q) {
+   addattr_l(&req.n, sizeof(req), TCA_KIND, q->id, strlen(q->id) + 1);
if (q->parse_qopt) {
if (q->parse_qopt(q, argc, argv, &req.n))
return 1;
} else if (argc) {
-   fprintf(stderr, "qdisc '%s' does not support option parsing\n", k);
+   fprintf(stderr, "qdisc '%s' does not support option parsing\n", q->id);
return -1;
}
} else {
-- 
2.11.0

Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-17 Thread Stephen Hemminger

On Sat, 18 Nov 2017 02:13:38 +0530
Nishanth Devarajan  wrote:

> + result = strtoul(buf, &endp, 0);
> +
> + if (*endp || buf == endp) {
> + fprintf(stderr, "value \"%s\" in file %s is not a number\n",
> + buf, fname);
> + goto out;
> + }
> +
> + if (result == ULONG_MAX && errno == ERANGE) {
> + fprintf(stderr, "strtoul %s: %s", fname, strerror(errno));
> + goto out;
> + }

Since an unknown speed value is represented as "-1", I think you need to
change this API to take a signed value (i.e. use strtol).
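
A minimal sketch of what signed parsing could look like, assuming the
property text ends with an optional newline as sysfs files usually do.
The function name parse_signed_prop() is hypothetical and is not the code
from the patch under review.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of iproute2): parse a signed value such
 * as "-1\n" read from a property file.
 * Returns 0 on success and stores the result in *value, -1 on error.
 */
int parse_signed_prop(const char *buf, long *value)
{
	char *endp;
	long result;

	errno = 0;	/* strtol() only sets errno on failure, so clear it first */
	result = strtol(buf, &endp, 0);

	/* No digits were consumed at all. */
	if (endp == buf)
		return -1;

	/* Accept one trailing newline, reject any other trailing text. */
	if (*endp == '\n')
		endp++;
	if (*endp != '\0')
		return -1;

	/* Out-of-range input is reported through ERANGE. */
	if (errno == ERANGE)
		return -1;

	*value = result;
	return 0;
}
```

With this shape a caller can tell "-1" (speed unknown) apart from a parse
failure, which an unsigned conversion cannot express directly.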

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai

On 17.11.2017 21:52, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 19:31, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> On 15.11.2017 12:51, Kirill Tkhai wrote:
>>>>> On 15.11.2017 06:19, Eric W. Biederman wrote:
>>>>>> Kirill Tkhai  writes:
>>>>>>
>>>>>>> On 14.11.2017 21:39, Cong Wang wrote:
>>>>>>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  wrote:
>>>>>>>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>>>>>>>
>>>>>>>>> get_user_ns(user_ns);
>>>>>>>>>
>>>>>>>>> -   rv = mutex_lock_killable(&net_mutex);
>>>>>>>>> +   rv = down_read_killable(&net_sem);
>>>>>>>>> if (rv < 0) {
>>>>>>>>> net_free(net);
>>>>>>>>> dec_net_namespaces(ucounts);
>>>>>>>>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>>>>>>> list_add_tail_rcu(&net->list, &net_namespace_list);
>>>>>>>>> rtnl_unlock();
>>>>>>>>> }
>>>>>>>>> -   mutex_unlock(&net_mutex);
>>>>>>>>> +   up_read(&net_sem);
>>>>>>>>> if (rv < 0) {
>>>>>>>>> dec_net_namespaces(ucounts);
>>>>>>>>> put_user_ns(user_ns);
>>>>>>>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>>>> list_replace_init(&cleanup_list, &net_kill_list);
>>>>>>>>> spin_unlock_irq(&cleanup_list_lock);
>>>>>>>>>
>>>>>>>>> -   mutex_lock(&net_mutex);
>>>>>>>>> +   down_read(&net_sem);
>>>>>>>>>
>>>>>>>>> /* Don't let anyone else find us. */
>>>>>>>>> rtnl_lock();
>>>>>>>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>>>>> ops_free_list(ops, &net_exit_list);
>>>>>>>>>
>>>>>>>>> -   mutex_unlock(&net_mutex);
>>>>>>>>> +   up_read(&net_sem);
>>>>>>>>
>>>>>>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>>>>>>> given that ops_exit_list() is called on error path of setup_net() too,
>>>>>>>> it means ops->exit() now could run concurrently if it doesn't have its
>>>>>>>> own lock. Not sure if this breaks any existing user.
>>>>>>>
>>>>>>> Yes, there will be possible concurrent ops->init() for a net namespace,
>>>>>>> and ops->exit() for another one. I hadn't found pernet operations, which
>>>>>>> have a problem with that. If they exist, they are hidden and not clear
>>>>>>> seen.
>>>>>>> The pernet operations in general do not touch someone else's memory.
>>>>>>> If suddenly there is one, KASAN should show it after a while.
>>>>>>
>>>>>> Certainly the use of hash tables shared between multiple network
>>>>>> namespaces would count.  I don't rembmer how many of these we have but
>>>>>> there used to be quite a few.
>>>>>
>>>>> Could you please provide an example of hash tables, you mean?
>>>>
>>>> Ah, I see, it's dccp_hashinfo etc.
>>
>> JFI, I've checked dccp_hashinfo, and it seems to be safe.
>>
>>>
>>> The big one used to be the route cache.  With resizable hash tables
>>> things may be getting better in that regard.
>>
>> I've checked some fib-related things, and wasn't able to find that.
>> Excuse me, could you please clarify, if it's an assumption, or
>> there is exactly a problem hash table, you know? Could you please
>> point it me more exactly, if it's so.
> 
> Two things.
> 1) Hash tables are one case I know where we access data from multiple
>    network namespaces.  As such it can not be asserted that is no
>    possibility for problems.
> 
> 2) The responsible way to handle this is one patch for each set of
>    methods explaining why those methods are safe to run in parallel.
> 
>    That ensures there is opportunity for review and people are going
>    slowly enough that they actually look at these issues.
> 
> The reason I want to see this broken up is that at 200ish sets of
> methods it is too much to review all at once.

OK, it's possible to split the changes into 400 patches, but there is
a problem with tristate (not compiled, module, built-in) drivers.
Git bisect won't work anyway. Please see the description of the problem
in the cover message "[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore"
that I sent today.
 
> I completely agree that odds are that this can be made safe and that it
> is most likely already safe in practically every instance.  My guess
> would be that if there are problems that need to be addressed they
> happen in one or two places and we need to find them.  If possible I
> don't want to find them after the code has shipped in a stable release.

Kirill

[PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-17 Thread Nishanth Devarajan

This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.

Adding this functionality requires passing the specified device string to
each class/qdisc which changes the prototype for a couple of functions: the
.parse_qopt and .parse_copt interfaces. The device string is a required
parameter for tc-qdisc and tc-class, and when not specified, the kernel
returns ENODEV. In this patch, if the user tries to specify a bandwidth
percentage without naming the device, we return an error from userspace.

v2:
* Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c,
to make it accessible to tc. 

v3:
* Modified and moved int parse_percent() from tc/q_netem.c to lib/utils.c for
use in tc.

* Changed a couple of variable names in int parse_percent_rate().

* Handled showing an error message when the device speed is unknown.

* Updated the man page to warn users that when specifying rates in %, tc only
uses the current device speed and does not recalculate if it changes afterwards.

When a property (like device speed) is unknown, read_prop() assumes that
if the property file can be opened but not read, the property is unknown.

Signed-off-by: Nishanth Devarajan

---
 include/utils.h |  2 ++
 ip/iptuntap.c   | 32 ---
 lib/utils.c | 68 +
 man/man8/tc.8   |  5 -
 tc/q_atm.c  |  2 +-
 tc/q_cbq.c  | 25 -
 tc/q_choke.c|  9 ++--
 tc/q_clsact.c   |  2 +-
 tc/q_codel.c|  2 +-
 tc/q_drr.c  |  4 ++--
 tc/q_dsmark.c   |  4 ++--
 tc/q_fifo.c |  2 +-
 tc/q_fq.c   | 16 +++---
 tc/q_fq_codel.c |  2 +-
 tc/q_gred.c |  9 ++--
 tc/q_hfsc.c | 45 +-
 tc/q_hhf.c  |  2 +-
 tc/q_htb.c  | 18 +++
 tc/q_ingress.c  |  2 +-
 tc/q_mqprio.c   |  2 +-
 tc/q_multiq.c   |  2 +-
 tc/q_netem.c| 23 ++-
 tc/q_pie.c  |  2 +-
 tc/q_prio.c |  2 +-
 tc/q_qfq.c  |  4 ++--
 tc/q_red.c  |  9 ++--
 tc/q_rr.c   |  2 +-
 tc/q_sfb.c  |  2 +-
 tc/q_sfq.c  |  2 +-
 tc/q_tbf.c  | 16 +++---
 tc/tc.c |  2 +-
 tc/tc_class.c   |  2 +-
 tc/tc_qdisc.c   |  2 +-
 tc/tc_util.c| 63 
 tc/tc_util.h|  7 --
 35 files changed, 283 insertions(+), 110 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 3d91c50..9377266 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -87,6 +87,8 @@ int get_prefix(inet_prefix *dst, char *arg, int family);
 int mask2bits(__u32 netmask);
 int get_addr_ila(__u64 *val, const char *arg);
 
+int read_prop(const char *dev, char *prop, long *value);
+int parse_percent(double *val, const char *str);
 int get_hex(char c);
 int get_integer(int *val, const char *arg, int base);
 int get_unsigned(unsigned *val, const char *arg, int base);
diff --git a/ip/iptuntap.c b/ip/iptuntap.c
index b46e452..09f2be2 100644
--- a/ip/iptuntap.c
+++ b/ip/iptuntap.c
@@ -223,38 +223,6 @@ static int do_del(int argc, char **argv)
return tap_del_ioctl(&ifr);
 }
 
-static int read_prop(char *dev, char *prop, long *value)
-{
-   char fname[IFNAMSIZ+25], buf[80], *endp;
-   ssize_t len;
-   int fd;
-   long result;
-
-   sprintf(fname, "/sys/class/net/%s/%s", dev, prop);
-   fd = open(fname, O_RDONLY);
-   if (fd < 0) {
-   if (strcmp(prop, "tun_flags"))
-   fprintf(stderr, "open %s: %s\n", fname,
-   strerror(errno));
-   return -1;
-   }
-   len = read(fd, buf, sizeof(buf)-1);
-   close(fd);
-   if (len < 0) {
-   fprintf(stderr, "read %s: %s", fname, strerror(errno));
-   return -1;
-   }
-
-   buf[len] = 0;
-   result = strtol(buf, &endp, 0);
-   if (*endp != '\n') {
-   fprintf(stderr, "Failed to parse %s\n", fname);
-   return -1;
-   }
-   *value = result;
-   return 0;
-}
-
 static void print_flags(long flags)
 {
if (flags & IFF_TUN)
diff --git a/lib/utils.c b/lib/utils.c
index 4f2fa28..9d5ba2a 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -39,6 +39,74 @@
 int resolve_hosts;
 int timestamp_short;
 
+int read_prop(const char *dev, char *prop, long *value)
+{
+   char fname[128], buf[80], *endp, *nl;
+   FILE *fp;
+   long result;
+   int ret;
+
+   ret = snprintf(fname, sizeof(fname), "/sys/class/net/%s/%s",
+   dev, prop);
+
+   if (ret <= 0 || ret >= sizeof(fname)) {
+   fprintf(stderr, "could not build pathname for property\n");
+   return -1;
+   }
+
+   fp = fopen(fname, "r");
+   if (fp == NULL) {
+   fprintf(stderr, "fopen %s: %s\n", fname, strerror(errno));
+   return -1;
+   }
+
+   if (!fgets(buf, size
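
The patch text is cut off at this point in the archive. The conversion
that the description talks about, a percentage of the interface speed
turned into an absolute rate, can be sketched as below. This is not the
code from the patch: percent_to_rate_bps() is a hypothetical name, and the
sketch assumes the speed comes from /sys/class/net/<dev>/speed, which
reports Mbit/s and uses -1 for an unknown speed.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a percentage of link speed into a rate in
 * bits per second. speed_mbit is the value read from the device's sysfs
 * "speed" property (Mbit/s); zero or a negative value means unknown.
 * Returns 0 on success, -1 if the inputs cannot produce a valid rate.
 */
int percent_to_rate_bps(double percent, long speed_mbit, uint64_t *rate_bps)
{
	if (speed_mbit <= 0)
		return -1;	/* unknown device speed */
	if (percent < 0.0 || percent > 100.0)
		return -1;	/* not a sensible share of the link */

	*rate_bps = (uint64_t)(percent / 100.0 * (double)speed_mbit * 1000000.0);
	return 0;
}
```

As the description notes, a rate computed this way reflects the speed at
the time of the call and is not updated if the link speed changes later.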

Re: [pull request][net V2 0/5] Mellanox, mlx5 fixes 2017-11-08

2017-11-17 Thread Saeed Mahameed

On Sat, Nov 11, 2017 at 2:42 AM, David Miller  wrote:
> From: Saeed Mahameed 
> Date: Fri, 10 Nov 2017 15:50:15 +0900
>
>> The following series includes some fixes for mlx5 core and ethernet
>> driver.
>>
>> Sorry for the late submission but as you can see I have some very
>> critical fixes below that I would like merged into this RC.
>>
>> Please pull and let me know if there is any problem.
>
> Pulled.
>
>> For -stable:
>> ('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13
>> ('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13
>> ('net/mlx5: Cancel health poll before sending panic teardown command') kernels >= 4.13
>
> That FPGA change doesn't appear in this pull request.
>

Sorry about that, I had to drop it as you see in "V1->V2" log, but
forgot to remove it from the -stable list.

[PATCH] usbnet: ipheth: fix potential null pointer dereference in ipheth_carrier_set

2017-11-17 Thread Gustavo A. R. Silva

_dev_ is being dereferenced before it is null checked, hence there
is a potential null pointer dereference.

Fix this by moving the pointer dereference after _dev_ has been null
checked.

Addresses-Coverity-ID: 1462020
Fixes: bb1b40c7cb86 ("usbnet: ipheth: prevent TX queue timeouts when device not ready")
Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/usb/ipheth.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
index ca71f6c..7275761 100644
--- a/drivers/net/usb/ipheth.c
+++ b/drivers/net/usb/ipheth.c
@@ -291,12 +291,15 @@ static void ipheth_sndbulk_callback(struct urb *urb)
 
 static int ipheth_carrier_set(struct ipheth_device *dev)
 {
-   struct usb_device *udev = dev->udev;
+   struct usb_device *udev;
int retval;
+
if (!dev)
return 0;
if (!dev->confirmed_pairing)
return 0;
+
+   udev = dev->udev;
retval = usb_control_msg(udev,
usb_rcvctrlpipe(udev, IPHETH_CTRL_ENDP),
IPHETH_CMD_CARRIER_CHECK, /* request */
-- 
2.7.4
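
The ordering fixed by the patch above (check the pointer first,
dereference it afterwards) can be shown in a standalone sketch. The
structures and the function below are hypothetical stand-ins, not the
real ipheth definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real definitions. */
struct fake_udev {
	int id;
};

struct fake_dev {
	struct fake_udev *udev;
	int confirmed_pairing;
};

/*
 * Returns the udev id when the device is usable, 0 otherwise. The pointer
 * is only dereferenced after it has been checked.
 */
int carrier_check_id(struct fake_dev *dev)
{
	struct fake_udev *udev;	/* declared, but not yet read from dev */

	if (!dev)
		return 0;
	if (!dev->confirmed_pairing)
		return 0;

	udev = dev->udev;	/* safe: dev is known to be non-NULL here */
	return udev ? udev->id : 0;
}
```

Initializing udev from dev->udev at the declaration would read through
dev before the NULL check runs, which is the pattern the patch removes.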

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai

On 17.11.2017 21:54, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 19:29, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>>>> Kirill Tkhai  writes:
>>>>>
>>>>>> Curently mutex is used to protect pernet operations list. It makes
>>>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>>>> which was used on the time of ->init, even after net namespace is
>>>>>> unlinked from net_namespace_list.
>>>>>>
>>>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>>>> from net_namespace_list():
>>>>>>
>>>>>> Destroy net_ns:
>>>>>> cleanup_net()
>>>>>>   mutex_lock(&net_mutex)
>>>>>>   list_del_rcu(&net->list)
>>>>>>   synchronize_rcu()  <--- Sleep there for ages
>>>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>> ops_exit_list(ops, &net_exit_list)
>>>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>> ops_free_list(ops, &net_exit_list)
>>>>>>   mutex_unlock(&net_mutex)
>>>>>>
>>>>>> This primitive is not fast, especially on the systems with many
>>>>>> processors
>>>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>>>> cleanup_net() is waiting for RCU grace period, creation of new net
>>>>>> namespaces
>>>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>>>>
>>>>>> Create net_ns:
>>>>>> copy_net_ns()
>>>>>>   mutex_lock_killable(&net_mutex)<--- Sleep there for ages
>>>>>>
>>>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>>>> will require down_read() locking only, while down_write() will be used
>>>>>> for changing pernet_list.
>>>>>>
>>>>>> This gives signify performance increase, like you may see below. There
>>>>>> is measured sequential net namespace creation in a cycle, in single
>>>>>> thread, without other tasks (single user mode):
>>>>>>
>>>>>> 1)int main(int argc, char *argv[])
>>>>>> {
>>>>>> unsigned nr;
>>>>>> if (argc < 2) {
>>>>>> fprintf(stderr, "Provide nr iterations arg\n");
>>>>>> return 1;
>>>>>> }
>>>>>> nr = atoi(argv[1]);
>>>>>> while (nr-- > 0) {
>>>>>> if (unshare(CLONE_NEWNET)) {
>>>>>> perror("Can't unshare");
>>>>>> return 1;
>>>>>> }
>>>>>> }
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> Origin, 10 unshare():
>>>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>>>
>>>>>> Patched, 10 unshare():
>>>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>>>
>>>>>> 2)for i in {1..1}; do unshare -n bash -c exit; done
>>>>>>
>>>>>> Origin:
>>>>>> real 1m24,190s
>>>>>> user 0m6,225s
>>>>>> sys 0m15,132s
>>>>>>
>>>>>> Patched:
>>>>>> real 0m18,235s   (4.6 times faster)
>>>>>> user 0m4,544s
>>>>>> sys 0m13,796s
>>>>>>
>>>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add
>>>>>> down_read_killable()"
>>>>>> from Linus tree (not in net-next yet).
>>>>>
>>>>> Using a rwsem to protect the list of operations makes sense.
>>>>>
>>>>> That should allow removing the sing
>>>>>
>>>>> I am not wild about taking a the rwsem down_write in
>>>>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>>>>> goes from being a mild hack to being a pretty bad hack and something
>>>>> else that can kill the parallelism you are seeking it add.
>>>>>
>>>>> There are about 204 instances of struct pernet_operations.  That is a
>>>>> lot of code to have carefully audited to ensure it can in parallel all
>>>>> at once.  The existence of the exit_batch method, net_ns_barrier,
>>>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>>>> to the fact that there are data structures accessed by multiple network
>>>>> namespaces.
>>>>>
>>>>> My preference would be to:
>>>>>
>>>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>>>   register and unregister, and maybe net_ns_barrier and
>>>>>   rtnl_link_unregister.
>>>>>
>>>>> - Factor out struct pernet_ops out of struct pernet_operations.  With
>>>>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>>>>   being embedded an anonymous member of the old struct pernet_operations.
>>>>>
>>>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>>>   pernet_ops, that don't take net_mutex.  Have them order the
>>>>>   pernet_list as:
>>>>>
>>>>>   pernet_sys
>>>>>   pernet_subsys
>>>>>   pernet_device
>>>>>   pernet_dev
>>>>>
>>>>>   With the chunk in the middle taking the net_mutex.
>>>>
>>>> I think this approach will work. Thanks for the suggestion. Some more
>>>> thoughts to the plan below.
>>>>
>>>> The only difficult thing there will be to choose the right order
>>>> to move ops from pernet_subsys to per

Re: [PATCH v10 5/8] ARM: dts: sunxi: Restore EMAC changes (boards)

2017-11-17 Thread Philipp Rossak


Hey,
Sorry for bringing this up again.
Isn't the alias ethernet0 = &emac; missing for some boards?

Best,
Philipp

(Sorry for sending this to some people more than once! My Thunderbird
sent the mails as HTML and they didn't reach the mailing lists. I hope it
works now :) )


On 31.10.2017 09:19, Corentin Labbe wrote:

The original dwmac-sun8i DT bindings had some issues with how to handle the
integrated PHY and were reverted in the last RC of 4.13.
But now we have a solution, so we need to bring back what was reverted.

This patch restores the DT changes for all boards using dwmac-sun8i.
This partially reverts commit fe45174b72ae ("arm: dts: sunxi: Revert EMAC changes")

Signed-off-by: Corentin Labbe 
Acked-by: Florian Fainelli 
---
  arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts |  9 +
  arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts   | 19 +++
  arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts | 19 +++
  arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts |  7 +++
  arch/arm/boot/dts/sun8i-h3-orangepi-2.dts |  8 
  arch/arm/boot/dts/sun8i-h3-orangepi-one.dts   |  8 
  arch/arm/boot/dts/sun8i-h3-orangepi-pc-plus.dts   |  5 +
  arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts|  8 
  arch/arm/boot/dts/sun8i-h3-orangepi-plus.dts  | 22 ++
  arch/arm/boot/dts/sun8i-h3-orangepi-plus2e.dts| 16 
  10 files changed, 121 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
index b1502df7b509..6713d0f2b3f4 100644
--- a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
+++ b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
@@ -56,6 +56,8 @@
  
  	aliases {

serial0 = &uart0;
+   /* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */
+   ethernet0 = &emac;
ethernet1 = &xr819;
};
  
@@ -102,6 +104,13 @@

status = "okay";
  };
  
+&emac {

+   phy-handle = <&int_mii_phy>;
+   phy-mode = "mii";
+   allwinner,leds-active-low;
+   status = "okay";
+};
+
  &mmc0 {
pinctrl-names = "default";
pinctrl-0 = <&mmc0_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
index e1dba9ffa94b..f2292deaa590 100644
--- a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
+++ b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
@@ -52,6 +52,7 @@
compatible = "sinovoip,bpi-m2-plus", "allwinner,sun8i-h3";
  
  	aliases {

+   ethernet0 = &emac;
serial0 = &uart0;
serial1 = &uart1;
};
@@ -111,6 +112,24 @@
status = "okay";
  };
  
+&emac {

+   pinctrl-names = "default";
+   pinctrl-0 = <&emac_rgmii_pins>;
+   phy-supply = <&reg_gmac_3v3>;
+   phy-handle = <&ext_rgmii_phy>;
+   phy-mode = "rgmii";
+
+   allwinner,leds-active-low;
+   status = "okay";
+};
+
+&external_mdio {
+   ext_rgmii_phy: ethernet-phy@1 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <0>;
+   };
+};
+
  &ir {
pinctrl-names = "default";
pinctrl-0 = <&ir_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
index 73766d38ee6c..cfb96da3cfef 100644
--- a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
+++ b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
@@ -66,6 +66,25 @@
status = "okay";
  };
  
+&emac {

+   pinctrl-names = "default";
+   pinctrl-0 = <&emac_rgmii_pins>;
+   phy-supply = <&reg_gmac_3v3>;
+   phy-handle = <&ext_rgmii_phy>;
+   phy-mode = "rgmii";
+
+   allwinner,leds-active-low;
+
+   status = "okay";
+};
+
+&external_mdio {
+   ext_rgmii_phy: ethernet-phy@1 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <7>;
+   };
+};
+
  &ir {
pinctrl-names = "default";
pinctrl-0 = <&ir_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
index 8d2cc6e9a03f..78f6c24952dd 100644
--- a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
+++ b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
@@ -46,3 +46,10 @@
model = "FriendlyARM NanoPi NEO";
compatible = "friendlyarm,nanopi-neo", "allwinner,sun8i-h3";
  };
+
+&emac {
+   phy-handle = <&int_mii_phy>;
+   phy-mode = "mii";
+   allwinner,leds-active-low;
+   status = "okay";
+};
diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
index 1bf51802f5aa..b20be95b49d5 100644
--- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
+++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
@@ -54,6 +54,7 @@
aliases {
serial0 = &uart0;
/* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */
+   ethernet0 = &emac;
ethernet1 = &rtl81

Re: [patch net-next RFC v2 00/11] Add support for resource abstraction

2017-11-17 Thread David Ahern

On 11/14/17 9:18 AM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Arkadi says:
> 
> Many of the ASIC's internal resources are limited and are shared between
> several hardware procedures. For example, unified hash-based memory can
> be used for many lookup purposes, like FDB and LPM. In many cases the user
> can provide a partitioning scheme for such a resource in order to perform
> fine tuning for his application. In many cases after setting the
> partitioning of the resource driver reload is needed. This patchset add
> support for hot reset of the driver.
> 
> Such an abstraction can be coupled with devlink's dpipe interface, which
> models the ASIC's pipeline as an graph of match/action tables. By modeling
> the hardware resource object, and by coupling it to several dpipe tables,
> further visibility can be achieved in order to debug ASIC-wide issues.
> 
> The proposed interface will provide the user the ability to understand the
> limitations of the hardware, and receive notification regarding its occupancy.
> Furthermore, monitoring the resource occupancy can be done in real-time and
> can be useful in many cases.
> 
> Userspace part prototype can be found at https://github.com/arkadis/iproute2/
> at resource_dev branch.
> 

Now that my firmware problem is fixed, I installed a build with this
patch set. Trying to run devlink to split a port hangs:

$ devlink port split swp1 count 4


[  615.373359] INFO: task devlink:804 blocked for more than 120 seconds.
[  615.379934]   Tainted: GW   4.14.0+ #38
[  615.385238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  615.393111] devlink D0   804771 0x0080
[  615.393115] Call Trace:
[  615.393126]  __schedule+0x1de/0x690
[  615.393130]  schedule+0x36/0x80
[  615.393139]  schedule_preempt_disabled+0xe/0x10
[  615.393146]  __mutex_lock.isra.4+0x211/0x530
[  615.393152]  __mutex_lock_slowpath+0x13/0x20
[  615.393155]  ? __mutex_lock_slowpath+0x13/0x20
[  615.393158]  mutex_lock+0x2f/0x40
[  615.393164]  devlink_port_unregister+0x29/0x60 [devlink]
[  615.393169]  mlxsw_core_port_fini+0x25/0x50 [mlxsw_core]
[  615.393179]  mlxsw_sp_port_remove+0xf0/0x100 [mlxsw_spectrum]
[  615.393186]  mlxsw_sp_port_split+0xdc/0x260 [mlxsw_spectrum]
[  615.393193]  ? _cond_resched+0x19/0x30
[  615.393200]  mlxsw_devlink_port_split+0x36/0x50 [mlxsw_core]
[  615.393206]  devlink_nl_cmd_port_split_doit+0x42/0x50 [devlink]
[  615.393212]  genl_family_rcv_msg+0x1c9/0x390
[  615.393217]  genl_rcv_msg+0x4c/0xa0
[  615.393220]  ? _cond_resched+0x19/0x30
[  615.393228]  ? genl_family_rcv_msg+0x390/0x390
[  615.393232]  netlink_rcv_skb+0xec/0x120
[  615.393235]  genl_rcv+0x28/0x40
[  615.393239]  netlink_unicast+0x170/0x230
[  615.393244]  netlink_sendmsg+0x28e/0x370
[  615.393251]  SYSC_sendto+0x10e/0x1b0
[  615.393258]  ? __audit_syscall_entry+0xc1/0x110
[  615.393261]  ? syscall_trace_enter+0x1c6/0x2d0
[  615.393264]  ? __do_page_fault+0x231/0x4b0
[  615.393268]  SyS_sendto+0xe/0x10
[  615.393272]  do_syscall_64+0x60/0x1f0
[  615.393277]  entry_SYSCALL64_slow_path+0x25/0x25
[  615.393280] RIP: 0033:0x7f4ef43c16f3
[  615.393284] RSP: 002b:7fffb907fbc8 EFLAGS: 0246 ORIG_RAX:
002c
[  615.393287] RAX: ffda RBX: 013660e0 RCX:
7f4ef43c16f3
[  615.393290] RDX: 0040 RSI: 01366110 RDI:
0003
[  615.393291] RBP:  R08: 7f4ef4686d80 R09:
000c
[  615.393292] R10:  R11: 0246 R12:

[  615.393296] R13: 0004 R14:  R15:

Re: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2017-11-17 Thread Andrew Lunn

> I really need to monitor the DSA discussion to better contribute to its 
> success.
> I just found out the DSA API set_addr was removed last month due to not
> everybody is using it.  It cited the Marvell switch was the only switch using 
> that
> API and found a new way to program the MAC address.  But looking at that
> driver I found it simply uses a randomized MAC address.
> 
> For big switch with many ports where the main function is forwarding that MAC
> address may not matter.  For small switch with 2 ports it acts more like an 
> Ethernet
> controller where the switch is mainly used for daisy chaining in a ring 
> network the MAC
> address can be used in feature like source address filtering.

Hi Tristram

The MAC address set by set_addr was only used for pause
frames. Nothing else. So a random address is fine.

The switch itself should not be sending any other frames.

Andrew

RE: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2017-11-17 Thread Tristram.Ha

> On Thu, Nov 16, 2017 at 06:41:24PM -0800, tristram...@microchip.com
> wrote:
> > From: Tristram Ha 
> >
> > This series of patches is to modify the original KSZ9477 DSA driver so
> > that other KSZ switch drivers can be added and use the common code.
> 
> Hi Tristram
> 
> http://vger.kernel.org/~davem/net-next.html
> 
> It is better to send an RFC patchset while netdev is closed and not
> send it to David. He will shout at you otherwise.

Noted.

I really need to monitor the DSA discussion to better contribute to its success.
I just found out that the DSA API set_addr was removed last month because not
everybody was using it.  The commit said the Marvell switch was the only switch
using that API and that it had found a new way to program the MAC address.  But
looking at that driver, I found it simply uses a randomized MAC address.

For a big switch with many ports, where the main function is forwarding, that MAC
address may not matter.  A small switch with 2 ports acts more like an Ethernet
controller: the switch is mainly used for daisy chaining in a ring network, and
the MAC address can be used in features like source address filtering.

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Eric W. Biederman

Kirill Tkhai  writes:

> On 15.11.2017 19:29, Eric W. Biederman wrote:
>> Kirill Tkhai  writes:
>> 
>>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>>> Kirill Tkhai  writes:
>>>>
>>>>> Curently mutex is used to protect pernet operations list. It makes
>>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>>> which was used on the time of ->init, even after net namespace is
>>>>> unlinked from net_namespace_list.
>>>>>
>>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>>> from net_namespace_list():
>>>>>
>>>>> Destroy net_ns:
>>>>> cleanup_net()
>>>>>   mutex_lock(&net_mutex)
>>>>>   list_del_rcu(&net->list)
>>>>>   synchronize_rcu()  <--- Sleep there for ages
>>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>> ops_exit_list(ops, &net_exit_list)
>>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>> ops_free_list(ops, &net_exit_list)
>>>>>   mutex_unlock(&net_mutex)
>>>>>
>>>>> This primitive is not fast, especially on the systems with many processors
>>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>>> cleanup_net() is waiting for RCU grace period, creation of new net
>>>>> namespaces
>>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>>>
>>>>> Create net_ns:
>>>>> copy_net_ns()
>>>>>   mutex_lock_killable(&net_mutex)<--- Sleep there for ages
>>>>>
>>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>>> will require down_read() locking only, while down_write() will be used
>>>>> for changing pernet_list.
>>>>>
>>>>> This gives signify performance increase, like you may see below. There
>>>>> is measured sequential net namespace creation in a cycle, in single
>>>>> thread, without other tasks (single user mode):
>>>>>
>>>>> 1)int main(int argc, char *argv[])
>>>>> {
>>>>> unsigned nr;
>>>>> if (argc < 2) {
>>>>> fprintf(stderr, "Provide nr iterations arg\n");
>>>>> return 1;
>>>>> }
>>>>> nr = atoi(argv[1]);
>>>>> while (nr-- > 0) {
>>>>> if (unshare(CLONE_NEWNET)) {
>>>>> perror("Can't unshare");
>>>>> return 1;
>>>>> }
>>>>> }
>>>>> return 0;
>>>>> }
>>>>>
>>>>> Origin, 10 unshare():
>>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>>
>>>>> Patched, 10 unshare():
>>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>>
>>>>> 2)for i in {1..1}; do unshare -n bash -c exit; done
>>>>>
>>>>> Origin:
>>>>> real 1m24,190s
>>>>> user 0m6,225s
>>>>> sys 0m15,132s
>>>>>
>>>>> Patched:
>>>>> real 0m18,235s   (4.6 times faster)
>>>>> user 0m4,544s
>>>>> sys 0m13,796s
>>>>>
>>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add
>>>>> down_read_killable()"
>>>>> from Linus tree (not in net-next yet).
>>>>
>>>> Using a rwsem to protect the list of operations makes sense.
>>>>
>>>> That should allow removing the sing
>>>>
>>>> I am not wild about taking a the rwsem down_write in
>>>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>>>> goes from being a mild hack to being a pretty bad hack and something
>>>> else that can kill the parallelism you are seeking it add.
>>>>
>>>> There are about 204 instances of struct pernet_operations.  That is a
>>>> lot of code to have carefully audited to ensure it can in parallel all
>>>> at once.  The existence of the exit_batch method, net_ns_barrier,
>>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>>> to the fact that there are data structures accessed by multiple network
>>>> namespaces.
>>>>
>>>> My preference would be to:
>>>>
>>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>>   register and unregister, and maybe net_ns_barrier and
>>>>   rtnl_link_unregister.
>>>>
>>>> - Factor out struct pernet_ops out of struct pernet_operations.  With
>>>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>>>   being embedded an anonymous member of the old struct pernet_operations.
>>>>
>>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>>   pernet_ops, that don't take net_mutex.  Have them order the
>>>>   pernet_list as:
>>>>
>>>>   pernet_sys
>>>>   pernet_subsys
>>>>   pernet_device
>>>>   pernet_dev
>>>>
>>>>   With the chunk in the middle taking the net_mutex.
>>>
>>> I think this approach will work. Thanks for the suggestion. Some more
>>> thoughts to the plan below.
>>>
>>> The only difficult thing there will be to choose the right order
>>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>>> to pernet_dev one by one.
>>>
>>> This is rather easy in case of tristate drivers, as modules may be loaded
>>> at any time, and the only important o

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Eric W. Biederman

Kirill Tkhai  writes:

> On 15.11.2017 19:31, Eric W. Biederman wrote:
>> Kirill Tkhai  writes:
>> 
>>> On 15.11.2017 12:51, Kirill Tkhai wrote:
 On 15.11.2017 06:19, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> On 14.11.2017 21:39, Cong Wang wrote:
>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  
>>> wrote:
 @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,

 get_user_ns(user_ns);

 -   rv = mutex_lock_killable(&net_mutex);
 +   rv = down_read_killable(&net_sem);
 if (rv < 0) {
 net_free(net);
 dec_net_namespaces(ucounts);
 @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
 list_add_tail_rcu(&net->list, &net_namespace_list);
 rtnl_unlock();
 }
 -   mutex_unlock(&net_mutex);
 +   up_read(&net_sem);
 if (rv < 0) {
 dec_net_namespaces(ucounts);
 put_user_ns(user_ns);
 @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
 list_replace_init(&cleanup_list, &net_kill_list);
 spin_unlock_irq(&cleanup_list_lock);

 -   mutex_lock(&net_mutex);
 +   down_read(&net_sem);

 /* Don't let anyone else find us. */
 rtnl_lock();
 @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
 list_for_each_entry_reverse(ops, &pernet_list, list)
 ops_free_list(ops, &net_exit_list);

 -   mutex_unlock(&net_mutex);
 +   up_read(&net_sem);
>>>
>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>> given that ops_exit_list() is called on error path of setup_net() too,
>>> it means ops->exit() now could run concurrently if it doesn't have its
>>> own lock. Not sure if this breaks any existing user.
>>
>> Yes, ops->init() for one net namespace may now run concurrently with
>> ops->exit() for another one. I haven't found any pernet operations that
>> have a problem with that. If they exist, they are hidden and not easy
>> to spot.
>> The pernet operations in general do not touch someone else's memory.
>> If suddenly there is one, KASAN should show it after a while.
>
> Certainly the use of hash tables shared between multiple network
> namespaces would count.  I don't remember how many of these we have but
> there used to be quite a few.

 Could you please provide an example of the hash tables you mean?
>>>
>>> Ah, I see, it's dccp_hashinfo etc.
>
> JFI, I've checked dccp_hashinfo, and it seems to be safe.
>
>> 
>> The big one used to be the route cache.  With resizable hash tables
>> things may be getting better in that regard.
>
> I've checked some fib-related things, and wasn't able to find that.
> Could you please clarify whether this is an assumption, or whether you
> know of a specific problematic hash table? If so, could you please
> point me to it more exactly?

Two things.
1) Hash tables are one case I know where we access data from multiple
   network namespaces.  As such it cannot be asserted that there is no
   possibility of problems.

2) The responsible way to handle this is one patch for each set of
   methods explaining why those methods are safe to run in parallel.

   That ensures there is opportunity for review and people are going
   slowly enough that they actually look at these issues.

The reason I want to see this broken up is that at 200ish sets of
methods it is too much to review all at once.

I completely agree that odds are that this can be made safe and that it
is most likely already safe in practically every instance.  My guess
would be that if there are problems that need to be addressed they
happen in one or two places and we need to find them.  If possible I
don't want to find them after the code has shipped in a stable release.

Eric

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-17 Thread Willy Tarreau

Hi Andrew,

On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote:
> > Usually it's better to apply LRU or random here in my opinion, as the
> > new entry is much more likely to be needed than older ones by definition.
> 
> Hi Willy
> 
> I think this depends on why you need to discard. If it is normal
> operation and the limits are simply too low, i would agree.
> 
> If however it is a DoS, throwing away the new entries makes sense,
> leaving the old ones which are more likely to be useful.
> 
> Most of the talk in this thread has been about limits for DoS
> prevention...

Sure but my point is that it can kick in on regular traffic and in
this case it can be catastrophic. That's only what bothers me. If
we have an unlimited default value with this algorithm I'm fine
because nobody will get caught by accident with a bridge suddenly
replicating high traffic on all ports because an unknown limit was
reached. That's the principle of least surprise.

I know that when fighting DoSes there's never any universally good
solution and one has to make tradeoffs. I'm perfectly fine with this.

Cheers,
Willy

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai

On 15.11.2017 19:31, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 12:51, Kirill Tkhai wrote:
>>> On 15.11.2017 06:19, Eric W. Biederman wrote:
 Kirill Tkhai  writes:

> On 14.11.2017 21:39, Cong Wang wrote:
>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  
>> wrote:
>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>
>>> get_user_ns(user_ns);
>>>
>>> -   rv = mutex_lock_killable(&net_mutex);
>>> +   rv = down_read_killable(&net_sem);
>>> if (rv < 0) {
>>> net_free(net);
>>> dec_net_namespaces(ucounts);
>>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>> list_add_tail_rcu(&net->list, &net_namespace_list);
>>> rtnl_unlock();
>>> }
>>> -   mutex_unlock(&net_mutex);
>>> +   up_read(&net_sem);
>>> if (rv < 0) {
>>> dec_net_namespaces(ucounts);
>>> put_user_ns(user_ns);
>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>> list_replace_init(&cleanup_list, &net_kill_list);
>>> spin_unlock_irq(&cleanup_list_lock);
>>>
>>> -   mutex_lock(&net_mutex);
>>> +   down_read(&net_sem);
>>>
>>> /* Don't let anyone else find us. */
>>> rtnl_lock();
>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>> ops_free_list(ops, &net_exit_list);
>>>
>>> -   mutex_unlock(&net_mutex);
>>> +   up_read(&net_sem);
>>
>> After your patch setup_net() could run concurrently with cleanup_net(),
>> given that ops_exit_list() is called on error path of setup_net() too,
>> it means ops->exit() now could run concurrently if it doesn't have its
>> own lock. Not sure if this breaks any existing user.
>
> Yes, ops->init() for one net namespace may now run concurrently with
> ops->exit() for another one. I haven't found any pernet operations that
> have a problem with that. If they exist, they are hidden and not easy
> to spot.
> The pernet operations in general do not touch someone else's memory.
> If suddenly there is one, KASAN should show it after a while.

 Certainly the use of hash tables shared between multiple network
 namespaces would count.  I don't remember how many of these we have but
 there used to be quite a few.
>>>
>>> Could you please provide an example of the hash tables you mean?
>>
>> Ah, I see, it's dccp_hashinfo etc.

JFI, I've checked dccp_hashinfo, and it seems to be safe.

> 
> The big one used to be the route cache.  With resizable hash tables
> things may be getting better in that regard.

I've checked some fib-related things, and wasn't able to find that.
Could you please clarify whether this is an assumption, or whether you
know of a specific problematic hash table? If so, could you please
point me to it more exactly?

[PATCH RFC 05/25] net: Add primitives to update heads of pernet_list sublists

2017-11-17 Thread Kirill Tkhai

Currently we have the first_device pointer and the device and subsys
sublists. The next patches introduce one more sublist, so move the
head-update logic, which would otherwise be repeated, into helper
primitives.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a8ea580885d9..1d9712973695 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -939,6 +939,18 @@ static void __unregister_pernet_operations(struct 
pernet_operations *ops)
 
 static DEFINE_IDA(net_generic_ids);
 
+#define update_first_on_add(first, delim, added)   \
+   do {\
+   if (first == delim) \
+   first = added;  \
+   } while (0)
+
+#define update_first_on_del(first, to_delete)  \
+   do {\
+   if (first == to_delete) \
+   first = (to_delete)->next;  \
+   } while (0)
+
 static int register_pernet_operations(struct list_head *list,
  struct pernet_operations *ops)
 {
@@ -1045,8 +1057,8 @@ int register_pernet_device(struct pernet_operations *ops)
int error;
down_write(&net_sem);
error = register_pernet_operations(&pernet_list, ops);
-   if (!error && (first_device == &pernet_list))
-   first_device = &ops->list;
+   if (!error)
+   update_first_on_add(first_device, &pernet_list, &ops->list);
up_write(&net_sem);
return error;
 }
@@ -1064,8 +1076,7 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
 void unregister_pernet_device(struct pernet_operations *ops)
 {
down_write(&net_sem);
-   if (&ops->list == first_device)
-   first_device = first_device->next;
+   update_first_on_del(first_device, &ops->list);
unregister_pernet_operations(ops);
up_write(&net_sem);
 }

[PATCH RFC 04/25] net: Move mutex_unlock() in cleanup_net() up

2017-11-17 Thread Kirill Tkhai

net_sem protects pernet_list from changing, while
ops_free_list() does a simple kfree(), which can't
race with other pernet_operations callbacks.

So we may release net_mutex earlier than before.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2254b1639209..a8ea580885d9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -489,11 +489,12 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list);
 
+   mutex_unlock(&net_mutex);
+
/* Free the net generic variables */
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list);
 
-   mutex_unlock(&net_mutex);
up_read(&net_sem);
 
/* Ensure there are no outstanding rcu callbacks using this

[PATCH RFC 06/25] net: Add pernet sys and registration functions

2017-11-17 Thread Kirill Tkhai

This is a new sublist of pernet_list, which will live ahead
of the already existing ones. The resulting order is:

sys, subsys, device.

It's intended for subsystems whose pernet_operations may execute
in parallel with any other pernet_operations. Going forward,
we will move all subsys there step by step, adding the necessary
small synchronization locks where they are needed. After all subsys
are moved to sys, we'll kill the subsys list, and all current
subsys will no longer require net_mutex and will be able
to init and exit in parallel with others.

Then we'll add a dev sublist ahead of device, and repeat
the cycle.

Suggested-by: Eric W. Biederman 
Signed-off-by: Kirill Tkhai 
---
 include/net/net_namespace.h |2 +
 net/core/net_namespace.c|   75 ++-
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 10f99dafd5ac..2cde5f766ec6 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -324,6 +324,8 @@ struct pernet_operations {
  * device which caused kernel oops, and panics during network
  * namespace cleanup.   So please don't get this wrong.
  */
+int register_pernet_sys(struct pernet_operations *);
+void unregister_pernet_sys(struct pernet_operations *);
 int register_pernet_subsys(struct pernet_operations *);
 void unregister_pernet_subsys(struct pernet_operations *);
 int register_pernet_device(struct pernet_operations *);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1d9712973695..f4f4aaa5ce1f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -24,10 +24,24 @@
 #include 
 
 /*
- * Our network namespace constructor/destructor lists
+ * Our network namespace constructor/destructor lists
+ * one by one linked in pernet_list. They are (in order
+ * of linking): sys, subsys, device.
+ *
+ * The methods from sys for a network namespace may be
+ * called in parallel with any method from any list
+ * for another net namespace.
+ *
+ * The methods from subsys and device can't be called
+ * in parallel with a method from subsys or device.
+ *
+ * When all subsys pernet_operations are moved to sys
+ * sublist, we'll kill subsys sublist, and create dev
+ * ahead of device sublist, and repeat the cycle.
  */
 
 static LIST_HEAD(pernet_list);
+static struct list_head *first_subsys = &pernet_list;
 static struct list_head *first_device = &pernet_list;
 DEFINE_MUTEX(net_mutex);
 
@@ -987,6 +1001,57 @@ static void unregister_pernet_operations(struct 
pernet_operations *ops)
ida_remove(&net_generic_ids, *ops->id);
 }
 
+/**
+ *  register_pernet_sys - register a network namespace system
+ * @ops:  pernet operations structure for the system
+ *
+ * Register a subsystem which has init and exit functions
+ * that are called when network namespaces are created and
+ * destroyed respectively.
+ *
+ * When registered all network namespace init functions are
+ * called for every existing network namespace.  Allowing kernel
+ * modules to have a race free view of the set of network namespaces.
+ *
+ * When a new network namespace is created all of the init
+ * methods are called in the order in which they were registered.
+ *
+ * When a network namespace is destroyed all of the exit methods
+ * are called in the reverse of the order with which they were
+ * registered.
+ */
+int register_pernet_sys(struct pernet_operations *ops)
+{
+   int error;
+   down_write(&net_sem);
+   if (first_subsys != first_device) {
+   panic("Pernet %ps registered out of order.\n"
+ "There is already %ps.\n", ops,
+ list_entry(first_subsys, struct pernet_operations, list));
+   }
+   error =  register_pernet_operations(first_subsys, ops);
+   up_write(&net_sem);
+   return error;
+}
+EXPORT_SYMBOL_GPL(register_pernet_sys);
+
+/**
+ *  unregister_pernet_sys - unregister a network namespace system
+ * @ops: pernet operations structure to manipulate
+ *
+ * Remove the pernet operations structure from the list to be
+ * used when network namespaces are created or destroyed.  In
+ * addition run the exit method for all existing network
+ * namespaces.
+ */
+void unregister_pernet_sys(struct pernet_operations *ops)
+{
+   down_write(&net_sem);
+   unregister_pernet_operations(ops);
+   up_write(&net_sem);
+}
+EXPORT_SYMBOL_GPL(unregister_pernet_sys);
+
 /**
  *  register_pernet_subsys - register a network namespace subsystem
  * @ops:  pernet operations structure for the subsystem
@@ -1011,6 +1076,8 @@ int register_pernet_subsys(struct pernet_operations *ops)
int error;
down_write(&net_sem);
error =  register_pernet_operations(first_device, ops);
+   if (!error)
+   update_first_on_add(first_subsys, first_device, &ops->list);
up_write(&net_sem);
return error;

[PATCH RFC 03/25] net: Introduce net_sem for protection of pernet_list

2017-11-17 Thread Kirill Tkhai

Currently a mutex is used to protect the pernet operations list. It makes
cleanup_net() execute the ->exit methods of the same operations set
that was used at the time of ->init, even after the net namespace is
unlinked from net_namespace_list.

But the problem is that synchronize_rcu() is needed after the net is removed
from net_namespace_list:

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors
and/or when preemptible RCU is enabled in the config. So, for the whole time
cleanup_net() is waiting for the RCU grace period, creation of new net
namespaces is not possible; the tasks doing it sleep on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)<--- Sleep there for ages

I observed 20-30 second hangs of "unshare -n" on an ordinary 8-CPU laptop
with preemptible RCU enabled.

The solution is to convert net_mutex to an rw_semaphore and add small locks
to the very small number of pernet_operations that really need them. Then
the pernet_operations::init/::exit methods, which modify the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list.

This gives a significant performance increase, as you can see here:
https://www.spinics.net/lists/netdev/msg467095.html

It's a 4.6 times performance increase on a one-thread test.
The increase for multi-thread tests may be close to 4.6 multiplied
by the number of threads.

This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore,
describes the variables it protects, and uses it where appropriate.
net_mutex is still present, and the next patches will kick it out step by step.

Signed-off-by: Kirill Tkhai 
---
 include/linux/rtnetlink.h |1 +
 net/core/net_namespace.c  |   37 +
 net/core/rtnetlink.c  |4 ++--
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2032ce2eb20b..f640fc87fe1d 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -35,6 +35,7 @@ extern int rtnl_is_locked(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
 extern struct mutex net_mutex;
+extern struct rw_semaphore net_sem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2e512965bf42..2254b1639209 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -41,6 +41,11 @@ struct net init_net = {
 EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
+/*
+ * net_sem: protects: pernet_list, net_generic_ids,
+ * init_net_initialized and first_* pointers.
+ */
+DECLARE_RWSEM(net_sem);
 
 #define MIN_PERNET_OPS_ID  \
((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
@@ -411,12 +416,16 @@ struct net *copy_net_ns(unsigned long flags,
net->ucounts = ucounts;
get_user_ns(user_ns);
 
-   rv = mutex_lock_killable(&net_mutex);
+   rv = down_read_killable(&net_sem);
if (rv < 0)
goto put_userns;
-
+   rv = mutex_lock_killable(&net_mutex);
+   if (rv < 0)
+   goto up_read;
rv = setup_net(net, user_ns);
mutex_unlock(&net_mutex);
+up_read:
+   up_read(&net_sem);
if (rv < 0) {
 put_userns:
put_user_ns(user_ns);
@@ -443,6 +452,7 @@ static void cleanup_net(struct work_struct *work)
list_replace_init(&cleanup_list, &net_kill_list);
spin_unlock_irq(&cleanup_list_lock);
 
+   down_read(&net_sem);
mutex_lock(&net_mutex);
 
/* Don't let anyone else find us. */
@@ -484,6 +494,7 @@ static void cleanup_net(struct work_struct *work)
ops_free_list(ops, &net_exit_list);
 
mutex_unlock(&net_mutex);
+   up_read(&net_sem);
 
/* Ensure there are no outstanding rcu callbacks using this
 * network namespace.
@@ -510,8 +521,10 @@ static void cleanup_net(struct work_struct *work)
  */
 void net_ns_barrier(void)
 {
+   down_write(&net_sem);
mutex_lock(&net_mutex);
mutex_unlock(&net_mutex);
+   up_write(&net_sem);
 }
 EXPORT_SYMBOL(net_ns_barrier);
 
@@ -838,12 +851,12 @@ static int __init net_ns_init(void)
 
rcu_assign_pointer(init_net.gen, ng);
 
-   mutex_lock(&net_mutex);
+   down_write(&net_sem);
if (setup_net(&init_net, &init_user_ns))
panic("Could not setup the initial network namespace");
 
init_net_initialized = true;
-   mutex_unlock(&net_mutex);
+   up_write(&net_sem);
 
register_pernet_

[PATCH RFC 11/25] net: Move netfilter_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

Since net/socket.o is the first linked file in net/Makefile,
its core initcalls execute first. netfilter_net_ops
is executed right after sysctl_pernet_ops.

The methods netfilter_net_init() and netfilter_net_exit()
initialize net::nf::hooks and change the net-related proc
directory of a net. Other pernet_operations are not
interested in foreign net::nf::hooks or proc entries,
so it's safe to move netfilter_net_ops to the pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 52cd2901a097..2bed28281b67 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -606,7 +606,7 @@ int __init netfilter_init(void)
 {
int ret;
 
-   ret = register_pernet_subsys(&netfilter_net_ops);
+   ret = register_pernet_sys(&netfilter_net_ops);
if (ret < 0)
goto err;

[PATCH RFC 10/25] net: Move sysctl_pernet_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

This patch starts to convert pernet_subsys, registered
from core initcalls.

Since net/socket.o is the first linked file in net/Makefile,
its core initcalls execute first. sysctl_pernet_ops is
the first pernet_subsys registered from sock_init(), so
it goes ahead of the others registered via core_initcall().

Methods sysctl_net_init() and sysctl_net_exit() initialize
net::sysctls of a namespace.

pernet_operations::init()/exit() methods from the rest
of the list do not touch net::sysctls of strangers,
so it's safe to execute sysctl_pernet_ops's methods
in parallel with any other pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/sysctl_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 9aed6fe1bf1a..1b91db88e54a 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -103,7 +103,7 @@ __init int net_sysctl_init(void)
net_header = register_sysctl("net", empty);
if (!net_header)
goto out;
-   ret = register_pernet_subsys(&sysctl_pernet_ops);
+   ret = register_pernet_sys(&sysctl_pernet_ops);
if (ret)
goto out1;
 out:

[PATCH RFC 16/25] net: Move rtnetlink_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

rtnetlink_net_ops are added in the same core initcall
as netlink_net_ops, so they have to be added right
after netlink_net_ops.

rtnetlink_net_init() and rtnetlink_net_exit()
create and destroy a netlink socket. It looks like
other pernet_operations are not interested in
a foreign net::rtnl, so rtnetlink_net_ops may be
safely moved to the pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/core/rtnetlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index cb06d43c4230..d9cf13554e4d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4503,7 +4503,7 @@ void __init rtnetlink_init(void)
for (i = 0; i < ARRAY_SIZE(rtnl_msg_handlers_ref); i++)
refcount_set(&rtnl_msg_handlers_ref[i], 1);
 
-   if (register_pernet_subsys(&rtnetlink_net_ops))
+   if (register_pernet_sys(&rtnetlink_net_ops))
panic("rtnetlink_init: cannot initialize rtnetlink\n");
 
register_netdevice_notifier(&rtnetlink_dev_notifier);

[PATCH RFC 17/25] net: Move audit_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

This patch starts to convert pernet_subsys, registered
from postcore initcalls.

These pernet_operations are in the ./kernel directory, and
there is only one more postcore one, in ./lib. So, audit_net_ops
has to go first.

audit_net_init() creates a netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list is not interested
in the socket, so we move audit_net_ops to the pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 kernel/audit.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 227db99b0f19..bb4626d7e712 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1549,7 +1549,7 @@ static int __init audit_init(void)
 
pr_info("initializing netlink subsys (%s)\n",
audit_default ? "enabled" : "disabled");
-   register_pernet_subsys(&audit_net_ops);
+   register_pernet_sys(&audit_net_ops);
 
audit_initialized = AUDIT_INITIALIZED;

[PATCH RFC 18/25] net: Move uevent_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

These pernet_operations, created via postcore_initcall(),
are registered from the ./lib directory, and they have
to go right after audit_net_ops.

uevent_net_init() and uevent_net_exit() create and
destroy a netlink socket, and these actions are serialized
in the netlink code.

Parallel execution with other pernet_operations
makes the socket disappear earlier from uevent_sock_list
on ->exit. As userspace can't be interested in broadcast
messages of a dying net, and, as far as I can see, no one in
the kernel listens to them, we may safely move uevent_net_ops
to the pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 lib/kobject_uevent.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index c3e84edc47c9..84c9d85477cc 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -647,7 +647,7 @@ static struct pernet_operations uevent_net_ops = {
 
 static int __init kobject_uevent_init(void)
 {
-   return register_pernet_subsys(&uevent_net_ops);
+   return register_pernet_sys(&uevent_net_ops);
 }

[PATCH RFC 14/25] net: Move net_defaults_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

According to net/core/Makefile, net/core/net_namespace.o
core initcalls execute right after net/core/sock.o.

net_defaults_ops introduces only the net_defaults_init_net method,
which acts on net::core::sysctl_somaxconn, and the rest of the
pernet_subsys and pernet_device lists is not interested in it.
So, move it to pernet_sys.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2e8295aa7003..7fc9d44c1817 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -371,7 +371,7 @@ static struct pernet_operations net_defaults_ops = {
 
 static __init int net_defaults_init(void)
 {
-   if (register_pernet_subsys(&net_defaults_ops))
+   if (register_pernet_sys(&net_defaults_ops))
panic("Cannot initialize net default settings");
 
return 0;

[PATCH RFC 24/25] net: Move wext_pernet_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

These pernet_operations initialize and purge the net::wext_nlevents
queue, which is not touched by foreign pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/wireless/wext-core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c
index 6cdb054484d6..2103c2a003ed 100644
--- a/net/wireless/wext-core.c
+++ b/net/wireless/wext-core.c
@@ -394,7 +394,7 @@ static struct pernet_operations wext_pernet_ops = {
 
 static int __init wireless_nlevent_init(void)
 {
-   int err = register_pernet_subsys(&wext_pernet_ops);
+   int err = register_pernet_sys(&wext_pernet_ops);
 
if (err)
return err;

[PATCH RFC 25/25] net: Move sysctl_core_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

These pernet_operations register and destroy a sysctl
directory, which is of no interest to foreign
pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/sysctl_net_core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde4cfcc..0dab679b33fa 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -525,7 +525,7 @@ static __net_initdata struct pernet_operations 
sysctl_core_ops = {
 static __init int sysctl_core_init(void)
 {
register_net_sysctl(&init_net, "net/core", net_core_table);
-   return register_pernet_subsys(&sysctl_core_ops);
+   return register_pernet_sys(&sysctl_core_ops);
 }
 
 fs_initcall(sysctl_core_init);

[PATCH RFC 23/25] net: Move genl_pernet_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

These pernet_operations create and destroy net::genl_sock.
Foreign pernet_operations don't touch it.

Signed-off-by: Kirill Tkhai 
---
 net/netlink/genetlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index d444daf1ac04..da7ab3dd5609 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1045,7 +1045,7 @@ static int __init genl_init(void)
if (err < 0)
goto problem;
 
-   err = register_pernet_subsys(&genl_pernet_ops);
+   err = register_pernet_sys(&genl_pernet_ops);
if (err)
goto problem;

[PATCH RFC 20/25] net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list

2017-11-17 Thread Kirill Tkhai

net/core/dev.o is linked after net/core/sock.o.

There are:
1) dev_proc_ops and dev_mc_net_ops, which create and destroy
pernet proc files and are not interested in other net namespaces;
2) netdev_net_ops, which creates a pernet hash, which is not
touched by other pernet_operations.

So, move them to the pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/core/dev.c|2 +-
 net/core/net-procfs.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ee29f4f5fa9..b90a503a9e1a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8787,7 +8787,7 @@ static int __init net_dev_init(void)
 
INIT_LIST_HEAD(&offload_base);
 
-   if (register_pernet_subsys(&netdev_net_ops))
+   if (register_pernet_sys(&netdev_net_ops))
goto out;
 
/*
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 615ccab55f38..46096219d574 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -413,8 +413,8 @@ static struct pernet_operations __net_initdata 
dev_mc_net_ops = {
 
 int __init dev_proc_init(void)
 {
-   int ret = register_pernet_subsys(&dev_proc_ops);
+   int ret = register_pernet_sys(&dev_proc_ops);
if (!ret)
-   return register_pernet_subsys(&dev_mc_net_ops);
+   return register_pernet_sys(&dev_mc_net_ops);
return ret;
 }

[PATCH RFC 22/25] net: Move subsys_initcall() registered pernet_operations from net/sched to pernet_sys list

2017-11-17 Thread Kirill Tkhai

psched_net_ops only creates and destroys a /proc entry,
and is safe to execute in parallel with any foreign
pernet_operations.

tcf_action_net_ops initializes and destroys tcf_action_net::egdev_ht,
which is not touched by foreign pernet_operations.

So, move them to the pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/sched/act_api.c |2 +-
 net/sched/sch_api.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4d33a50a8a6d..f1de2146e6e0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1470,7 +1470,7 @@ static int __init tc_action_init(void)
 {
int err;
 
-   err = register_pernet_subsys(&tcf_action_net_ops);
+   err = register_pernet_sys(&tcf_action_net_ops);
if (err)
return err;
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b6c4f536876b..68938ca4bbe1 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -2008,7 +2008,7 @@ static int __init pktsched_init(void)
 {
int err;
 
-   err = register_pernet_subsys(&psched_net_ops);
+   err = register_pernet_sys(&psched_net_ops);
if (err) {
pr_err("pktsched_init: "
   "cannot initialize per netns operations\n");

[PATCH RFC 21/25] net: Move fib_* pernet_operations, registered via subsys_initcall(), to pernet_sys list

2017-11-17 Thread Kirill Tkhai

Both of them create and initialize lists, which are not touched
by any foreign pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/fib_notifier.c |2 +-
 net/core/fib_rules.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 0c048bdeb016..782a1475a32e 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -175,7 +175,7 @@ static struct pernet_operations fib_notifier_net_ops = {
 
 static int __init fib_notifier_init(void)
 {
-   return register_pernet_subsys(&fib_notifier_net_ops);
+   return register_pernet_sys(&fib_notifier_net_ops);
 }
 
 subsys_initcall(fib_notifier_init);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 98e1066c3d55..b2706c18f0f3 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -1039,7 +1039,7 @@ static int __init fib_rules_init(void)
rtnl_register(PF_UNSPEC, RTM_DELRULE, fib_nl_delrule, NULL, 0);
rtnl_register(PF_UNSPEC, RTM_GETRULE, NULL, fib_nl_dumprule, 0);
 
-   err = register_pernet_subsys(&fib_rules_net_ops);
+   err = register_pernet_sys(&fib_rules_net_ops);
if (err < 0)
goto fail;

[PATCH RFC 19/25] net: Move proto_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

This patch starts to convert pernet_subsys, registered
from subsys initcalls.

According to net/Makefile and net/core/Makefile, this
is the first executed subsys_initcall() that registers
a pernet_subsys.

It seems safe to execute in parallel with others,
as it only creates/destroys a proc entry, which
nobody else is interested in.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index be050b044699..ed12e115458b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3349,7 +3349,7 @@ static __net_initdata struct pernet_operations 
proto_net_ops = {
 
 static int __init proto_init(void)
 {
-   return register_pernet_subsys(&proto_net_ops);
+   return register_pernet_sys(&proto_net_ops);
 }
 
 subsys_initcall(proto_init);

[PATCH RFC 13/25] net: Move net_inuse_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

net/core/sock.o is the first linked file in net/core/Makefile,
so its core initcall executes first in the directory.

net_inuse_ops methods expose statistics in /proc.
Nothing on the rest of the pernet_subsys or pernet_device lists
touches net::core::inuse. So, it's safe to move
net_inuse_ops to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 13719af7b4e3..be050b044699 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3081,7 +3081,7 @@ static struct pernet_operations net_inuse_ops = {
 
 static __init int net_inuse_init(void)
 {
-   if (register_pernet_subsys(&net_inuse_ops))
+   if (register_pernet_sys(&net_inuse_ops))
panic("Cannot initialize net inuse counters");
 
return 0;

[PATCH RFC 15/25] net: Move netlink_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

According to net/core/Makefile, net/core/af_netlink.o
core initcalls execute right after net/core/net_namespace.o.

The methods of netlink_net_ops create and destroy the "netlink"
file, which is of no interest to foreign pernet_operations.
So, netlink_net_ops may safely be moved to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/netlink/af_netlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b9e0ee4e22f5..a4f1f5222b79 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2735,7 +2735,7 @@ static int __init netlink_proto_init(void)
netlink_add_usersock_entry();
 
sock_register(&netlink_family_ops);
-   register_pernet_subsys(&netlink_net_ops);
+   register_pernet_sys(&netlink_net_ops);
/* The netlink device handler may be needed early. */
rtnetlink_init();
 out:

[PATCH RFC 09/25] net: Move net_ns_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

This patch starts to convert pernet_subsys, registered
from pure initcalls.

Since net_ns_init() is the only pure initcall in the net subsystem,
and there are no early initcalls, the pernet subsys it registers
is the first in the pernet_operations list. So, we start with it.

The net_ns_ops methods net_ns_net_init()/net_ns_net_exit() use only
ida_simple_* functions, which do not need synchronization.

So it's safe to execute them in parallel with any other
pernet_operations, and thus we convert net_ns_ops to pernet_sys type.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7aec8c1afe50..2e8295aa7003 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -899,7 +899,7 @@ static int __init net_ns_init(void)
init_net_initialized = true;
up_write(&net_sem);
 
-   register_pernet_subsys(&net_ns_ops);
+   register_pernet_sys(&net_ns_ops);
 
rtnl_register(PF_UNSPEC, RTM_NEWNSID, rtnl_net_newid, NULL,
  RTNL_FLAG_DOIT_UNLOCKED);

[PATCH RFC 12/25] net: Move nf_log_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

nf_log_net_ops is registered from the same initcall
as netfilter_net_ops, so it has to be moved right
after netfilter_net_ops.

The ops would have had a problem with parallel execution
if it had been possible to release init_net.
But it's not, and the rest is safe. There is a
memory allocation, which nobody else is interested in,
and a sysctl registration. So, we move it to pernet_sys
list.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nf_log.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 8bb152a7cca4..08868afad813 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -582,5 +582,5 @@ static struct pernet_operations nf_log_net_ops = {
 
 int __init netfilter_log_init(void)
 {
-   return register_pernet_subsys(&nf_log_net_ops);
+   return register_pernet_sys(&nf_log_net_ops);
 }

[PATCH RFC 08/25] net: Move proc_net_ns_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai

This patch starts to convert pernet_subsys, registered
before initcalls.

Since proc_net_ns_ops is a pernet_subsys registered
from:

start_kernel()->proc_root_init()->proc_net_init(),

and there is no pernet_subsys that is registered
earlier, we start from it.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
register pernet net->proc_net and ->proc_net_stat, and
constructors and destructors of other pernet_operations
are not interested in a foreign net's proc_net and proc_net_stat.
Proc filesystem primitives are synchronized on proc_subdir_lock.

So, it's safe to move proc_net_ns_ops to pernet_sys list
and execute its methods in parallel with other pernet
operations.

Signed-off-by: Kirill Tkhai 
---
 fs/proc/proc_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index a2bf369c923d..5eb52765eeab 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -243,5 +243,5 @@ int __init proc_net_init(void)
 {
proc_symlink("net", NULL, "self/net");
 
-   return register_pernet_subsys(&proc_net_ns_ops);
+   return register_pernet_sys(&proc_net_ns_ops);
 }

[PATCH RFC 07/25] net: Make sys sublist pernet_operations executed out of net_mutex

2017-11-17 Thread Kirill Tkhai

Move net_mutex to setup_net() and cleanup_net(), and
do not hold it, while sys sublist methods are executed.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   44 +++-
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f4f4aaa5ce1f..7aec8c1afe50 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int 
id, void *data)
 {
struct net_generic *ng, *old_ng;
 
-   BUG_ON(!mutex_is_locked(&net_mutex));
+   BUG_ON(!rwsem_is_locked(&net_sem));
BUG_ON(id < MIN_PERNET_OPS_ID);
 
old_ng = rcu_dereference_protected(net->gen,
-  lockdep_is_held(&net_mutex));
+  lockdep_is_held(&net_sem));
if (old_ng->s.len > id) {
old_ng->ptr[id] = data;
return 0;
@@ -300,6 +300,7 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
 {
/* Must be called with net_mutex held */
const struct pernet_operations *ops, *saved_ops;
+   bool locked = false;
int error = 0;
LIST_HEAD(net_exit_list);
 
@@ -311,14 +312,34 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
spin_lock_init(&net->nsid_lock);
 
list_for_each_entry(ops, &pernet_list, list) {
+   if (&ops->list == first_subsys) {
+   BUG_ON(locked);
+   error = mutex_lock_killable(&net_mutex);
+   if (error)
+   goto out_undo;
+   locked = true;
+   }
+
error = ops_init(ops, net);
if (error < 0)
goto out_undo;
}
+
+   if (!locked) {
+   /*
+* This may happen only on early boot, so we don't
+* care about possibility to interrupt the locking.
+*/
+   mutex_lock(&net_mutex);
+   locked = true;
+   }
+
rtnl_lock();
list_add_tail_rcu(&net->list, &net_namespace_list);
rtnl_unlock();
 out:
+   if (locked)
+   mutex_unlock(&net_mutex);
return error;
 
 out_undo:
@@ -433,12 +454,7 @@ struct net *copy_net_ns(unsigned long flags,
rv = down_read_killable(&net_sem);
if (rv < 0)
goto put_userns;
-   rv = mutex_lock_killable(&net_mutex);
-   if (rv < 0)
-   goto up_read;
rv = setup_net(net, user_ns);
-   mutex_unlock(&net_mutex);
-up_read:
up_read(&net_sem);
if (rv < 0) {
 put_userns:
@@ -460,6 +476,7 @@ static void cleanup_net(struct work_struct *work)
struct net *net, *tmp;
struct list_head net_kill_list;
LIST_HEAD(net_exit_list);
+   bool locked;
 
/* Atomically snapshot the list of namespaces to cleanup */
spin_lock_irq(&cleanup_list_lock);
@@ -468,6 +485,7 @@ static void cleanup_net(struct work_struct *work)
 
down_read(&net_sem);
mutex_lock(&net_mutex);
+   locked = true;
 
/* Don't let anyone else find us. */
rtnl_lock();
@@ -500,10 +518,18 @@ static void cleanup_net(struct work_struct *work)
synchronize_rcu();
 
/* Run all of the network namespace exit methods */
-   list_for_each_entry_reverse(ops, &pernet_list, list)
+   list_for_each_entry_reverse(ops, &pernet_list, list) {
ops_exit_list(ops, &net_exit_list);
 
-   mutex_unlock(&net_mutex);
+   if (&ops->list == first_subsys) {
+   BUG_ON(!locked);
+   mutex_unlock(&net_mutex);
+   locked = false;
+   }
+   }
+
+   if (locked)
+   mutex_unlock(&net_mutex);
 
/* Free the net generic variables */
list_for_each_entry_reverse(ops, &pernet_list, list)

[PATCH RFC 02/25] net: Cleanup copy_net_ns()

2017-11-17 Thread Kirill Tkhai

Line up destructor actions in the reverse order
of constructors. Next patches will add more actions,
and it will be convenient to have this
order in place.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7ecf71050ffa..2e512965bf42 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -404,27 +404,25 @@ struct net *copy_net_ns(unsigned long flags,
 
net = net_alloc();
if (!net) {
-   dec_net_namespaces(ucounts);
-   return ERR_PTR(-ENOMEM);
+   rv = -ENOMEM;
+   goto dec_ucounts;
}
-
+   refcount_set(&net->passive, 1);
+   net->ucounts = ucounts;
get_user_ns(user_ns);
 
rv = mutex_lock_killable(&net_mutex);
-   if (rv < 0) {
-   net_free(net);
-   dec_net_namespaces(ucounts);
-   put_user_ns(user_ns);
-   return ERR_PTR(rv);
-   }
+   if (rv < 0)
+   goto put_userns;
 
-   net->ucounts = ucounts;
rv = setup_net(net, user_ns);
mutex_unlock(&net_mutex);
if (rv < 0) {
-   dec_net_namespaces(ucounts);
+put_userns:
put_user_ns(user_ns);
net_drop_ns(net);
+dec_ucounts:
+   dec_net_namespaces(ucounts);
return ERR_PTR(rv);
}
return net;

[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore

2017-11-17 Thread Kirill Tkhai

Hi,

this is a continuation of the discussion from here:

https://lkml.org/lkml/2017/11/14/298

The plan has changed a little bit, so I'd be happy to hear
people's comments before I dive into all 400+ pernet subsys
and devices.

The patch set adds a pernet sys list ahead of subsys and device,
and it's used for pernet_operations which may be executed
in parallel with any other pernet_operations methods. Also,
some high-priority ops are converted (up to those registered using
postcore_initcall(), and some subsys_initcall()) in order
of appearance. The sequence in setup_net() is the following:

1) execute all the callbacks from pernet_sys list
2) lock net_mutex
3) execute all the callbacks from pernet_subsys list
4) execute all the callbacks from pernet_device list
5) unlock net_mutex
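
The sequence above can be modeled in user space roughly as follows. This
is only an illustrative sketch, not the kernel code: pernet_kind,
pernet_op, demo_init() and setup_net_sketch() are made-up names, a plain
boolean stands in for net_mutex, and the error-undo path and the
early-boot case are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three pernet sublists. */
enum pernet_kind { PERNET_SYS, PERNET_SUBSYS, PERNET_DEVICE };

struct pernet_op {
	enum pernet_kind kind;
	int (*init)(void);	/* returns 0 on success, negative on error */
};

static bool net_mutex_held;	/* models net_mutex; not a real lock */
static int inits_run_unlocked;
static int inits_run_locked;

/* Example callback: records whether it ran under the modeled mutex. */
static int demo_init(void)
{
	if (net_mutex_held)
		inits_run_locked++;
	else
		inits_run_unlocked++;
	return 0;
}

/*
 * Walk one ordered list: sys entries first, then subsys, then device.
 * The "mutex" is taken at the first non-sys entry and dropped at the end.
 */
static int setup_net_sketch(const struct pernet_op *ops, size_t n)
{
	bool locked = false;
	int error = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ops[i].kind != PERNET_SYS && !locked) {
			net_mutex_held = true;
			locked = true;
		}
		/* sys entries must all precede the locked region */
		assert(ops[i].kind != PERNET_SYS || !locked);

		error = ops[i].init();
		if (error < 0)
			break;
	}

	if (locked)
		net_mutex_held = false;
	return error;
}
```

With two sys entries followed by one subsys and one device entry, the
first two callbacks run without the modeled mutex and the last two run
with it held.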

So far there have been no pernet_operations requiring additional
synchronization, but I've bumped into another problem.
The problem is that some drivers may be compiled either as modules
or as part of the kernel image. They register pernet_operations
from device_initcall(), for example. This initcall executes
at a different time compared to built-in-only
drivers.

Imagine we have a tristate driverA and a boolean driverB.
driverA registers pernet_subsys from subsys_initcall().
driverB registers pernet_subsys from fs_initcall().
So, here we have two cases:

driverA is module      driverA is built-in
-----------------      -------------------
register driverB ops   register driverA ops
register driverA ops   register driverB ops

So, the order is different. When converting drivers one by one,
it's impossible to keep the order correct for all .config
states, because of the above. So, bisect won't work.
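
The ordering difference between the two cases can be illustrated with a
small model. This is a sketch under assumptions: struct driver,
effective_level() and registers_first() are invented for illustration,
and LEVEL_MODULE_LOAD is an arbitrary value that only expresses "after
all boot-time initcalls".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Boot-time initcall levels; module loading is modeled as one later step. */
enum { LEVEL_SUBSYS = 4, LEVEL_FS = 5, LEVEL_MODULE_LOAD = 100 };

struct driver {
	const char *name;
	int initcall_level;	/* level used when the driver is built in */
	bool is_module;		/* true: init runs at module load time */
};

/* When a driver is a module, its initcall runs when the module is loaded. */
static int effective_level(const struct driver *d)
{
	assert(d != NULL);
	return d->is_module ? LEVEL_MODULE_LOAD : d->initcall_level;
}

/* Return whichever of the two drivers registers its ops first. */
static const struct driver *registers_first(const struct driver *a,
					    const struct driver *b)
{
	return effective_level(a) <= effective_level(b) ? a : b;
}
```

For driverA at LEVEL_SUBSYS and a built-in driverB at LEVEL_FS, the
result flips from driverB to driverA depending only on driverA's
is_module flag, which is the .config dependency described above.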

And it seems it's just the same as converting pernet_operations
from all the files in alphabetical file order. What do you
think about this? (Note, the patches have no such problem
at the moment, as they all touch in-kernel early core drivers).

Maybe there are other comments on the code.
---

Kirill Tkhai (25):
  net: Assign net to net_namespace_list in setup_net()
  net: Cleanup copy_net_ns()
  net: Introduce net_sem for protection of pernet_list
  net: Move mutex_unlock() in cleanup_net() up
  net: Add primitives to update heads of pernet_list sublists
  net: Add pernet sys and registration functions
  net: Make sys sublist pernet_operations executed out of net_mutex
  net: Move proc_net_ns_ops to pernet_sys list
  net: Move net_ns_ops to pernet_sys list
  net: Move sysctl_pernet_ops to pernet_sys list
  net: Move netfilter_net_ops to pernet_sys list
  net: Move nf_log_net_ops to pernet_sys list
  net: Move net_inuse_ops to pernet_sys list
  net: Move net_defaults_ops to pernet_sys list
  net: Move netlink_net_ops to pernet_sys list
  net: Move rtnetlink_net_ops to pernet_sys list
  net: Move audit_net_ops to pernet_sys list
  net: Move uevent_net_ops to pernet_sys list
  net: Move proto_net_ops to pernet_sys list
  net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list
  net: Move fib_* pernet_operations, registered via subsys_initcall(), to 
pernet_sys list
  net: Move subsys_initcall() registered pernet_operations from net/sched 
to pernet_sys list
  net: Move genl_pernet_ops to pernet_sys list
  net: Move wext_pernet_ops to pernet_sys list
  net: Move sysctl_core_ops to pernet_sys list


 fs/proc/proc_net.c  |2 
 include/linux/rtnetlink.h   |1 
 include/net/net_namespace.h |2 
 kernel/audit.c  |2 
 lib/kobject_uevent.c|2 
 net/core/dev.c  |2 
 net/core/fib_notifier.c |2 
 net/core/fib_rules.c|2 
 net/core/net-procfs.c   |4 -
 net/core/net_namespace.c|  203 +--
 net/core/rtnetlink.c|6 +
 net/core/sock.c |4 -
 net/core/sysctl_net_core.c  |2 
 net/netfilter/core.c|2 
 net/netfilter/nf_log.c  |2 
 net/netlink/af_netlink.c|2 
 net/netlink/genetlink.c |2 
 net/sched/act_api.c |2 
 net/sched/sch_api.c |2 
 net/sysctl_net.c|2 
 net/wireless/wext-core.c|2 
 21 files changed, 183 insertions(+), 67 deletions(-)

--
Signed-off-by: Kirill Tkhai

[PATCH RFC 01/25] net: Assign net to net_namespace_list in setup_net()

2017-11-17 Thread Kirill Tkhai

This patch merges two repeating pieces of code into one,
and they will live in setup_net() now.

It acts as a cleanup, even though the init_net_initialized
assignment is now reordered with the linking of net.
This variable is needed for proc_net_init(), called from:

start_kernel()->proc_root_init()->proc_net_init(),

which can't race with net_ns_init(), called from
initcall.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b797832565d3..7ecf71050ffa 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,6 +296,9 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
if (error < 0)
goto out_undo;
}
+   rtnl_lock();
+   list_add_tail_rcu(&net->list, &net_namespace_list);
+   rtnl_unlock();
 out:
return error;
 
@@ -417,11 +420,6 @@ struct net *copy_net_ns(unsigned long flags,
 
net->ucounts = ucounts;
rv = setup_net(net, user_ns);
-   if (rv == 0) {
-   rtnl_lock();
-   list_add_tail_rcu(&net->list, &net_namespace_list);
-   rtnl_unlock();
-   }
mutex_unlock(&net_mutex);
if (rv < 0) {
dec_net_namespaces(ucounts);
@@ -847,11 +845,6 @@ static int __init net_ns_init(void)
panic("Could not setup the initial network namespace");
 
init_net_initialized = true;
-
-   rtnl_lock();
-   list_add_tail_rcu(&init_net.list, &net_namespace_list);
-   rtnl_unlock();
-
mutex_unlock(&net_mutex);
 
register_pernet_subsys(&net_ns_ops);

[PATCH net-next 2/2] net-next: copy user configured flowlabel to reset packet

2017-11-17 Thread Shaohua Li

From: Shaohua Li 

A reset packet doesn't use the user-configured flowlabel; instead, it
always uses 0. This makes the flowlabel inconsistent. The tw sock already
records flowlabel info, so we can use it directly.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1a5802..9b678cd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -901,6 +901,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct 
sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
int oif = 0;
+   u8 tclass = 0;
+   __be32 flowlabel = 0;
 
if (th->rst)
return;
@@ -954,7 +956,21 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
trace_tcp_send_reset(sk, skb);
}
 
-   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+   if (sk) {
+   if (sk_fullsock(sk)) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   tclass = np->tclass;
+   flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+   } else {
+   struct inet_timewait_sock *tw = inet_twsk(sk);
+
+   tclass = tw->tw_tclass;
+   flowlabel = cpu_to_be32(tw->tw_flowlabel);
+   }
+   }
+   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+   tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.5

[PATCH net-next 0/2] net: fix flowlabel inconsistency in reset packet

2017-11-17 Thread Shaohua Li

From: Shaohua Li 

Hi,

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, so our router
doesn't correctly close the tcp connection. We are using flowlabel to do
load balancing. Routers in the path maintain connection state. So if the
flow label changes, the packet is routed through a different router. In
this case, the old router doesn't get the reset packet to close the tcp
connection.

The reason is that a normal packet gets the skb->hash from sk->sk_txhash,
which is generated randomly. ip6_make_flowlabel then uses the hash to
create a flowlabel. The reset packet doesn't get assigned a hash, so the
flowlabel is calculated with flowi6.

The patches fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet

 include/net/sock.h| 18 --
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 36 ++--
 10 files changed, 56 insertions(+), 32 deletions(-)

-- 
2.9.5

[PATCH net-next 1/2] net-next: use five-tuple hash for sk_txhash

2017-11-17 Thread Shaohua Li

From: Shaohua Li 

We are using sk_txhash to calculate flowlabel, but sk_txhash isn't
always available, for example, in inet_timewait_sock. This causes a
problem for the reset packet, which will have a different flowlabel. As a
result, our router doesn't correctly close the tcp connection. We are
using flowlabel to do load balancing. Routers in the path maintain
connection state. So if the flow label changes, the packet is routed
through a different router. In this case, the old router doesn't get the
reset packet to close the tcp connection.

Per Tom's suggestion, we switch back to a five-tuple hash, so we can
reconstruct the correct flowlabel for the reset packet.
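
A minimal user-space sketch of why a five-tuple hash helps: the label can
be recomputed from the connection identity alone, without any per-socket
random state. Everything here is an assumption made for illustration.
struct five_tuple, toy_tuple_hash() and flowlabel_from_hash() are
hypothetical, addresses are shortened to 32 bits, and the mixing function
is a toy, not the kernel's flow hash or ip6_make_flowlabel().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu	/* an IPv6 flow label is 20 bits */

/* Simplified connection identity; real IPv6 addresses are 128 bits. */
struct five_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t proto;
};

/* Toy FNV-1a style mix over the tuple fields; NOT the kernel's flow hash. */
static uint32_t toy_tuple_hash(const struct five_tuple *t)
{
	const uint32_t words[] = {
		t->saddr, t->daddr,
		((uint32_t)t->sport << 16) | t->dport,
		t->proto,
	};
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < sizeof(words) / sizeof(words[0]); i++) {
		h ^= words[i];
		h *= 16777619u;
	}
	return h ? h : 1;	/* never return 0, like the "v ?: 1" idiom */
}

/* Keep only the low 20 bits of the hash as the flow label. */
static uint32_t flowlabel_from_hash(uint32_t hash)
{
	uint32_t label = hash & FLOWLABEL_MASK;

	assert(label <= FLOWLABEL_MASK);
	return label;
}
```

Because the hash depends only on the tuple, a data packet and a later
reset packet for the same connection end up with the same label in this
model.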

In most places, we already have the flowi info, so we directly use it
to build sk_txhash. For synack, we do this after the route search. At that
time, we have the flowi info ready, so we don't need to create the flowi
info again.

I don't change sk_rethink_txhash() though; it still uses a random hash,
which is the whole point of selecting a different path after negative
routing advice.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 include/net/sock.h| 18 --
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 18 +-
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index f8715c5..85a6192 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1732,22 +1732,12 @@ static inline kuid_t sock_net_uid(const struct net 
*net, const struct sock *sk)
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }
 
-static inline u32 net_tx_rndhash(void)
-{
-   u32 v = prandom_u32();
-
-   return v ?: 1;
-}
-
-static inline void sk_set_txhash(struct sock *sk)
-{
-   sk->sk_txhash = net_tx_rndhash();
-}
-
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-   if (sk->sk_txhash)
-   sk_set_txhash(sk);
+   if (sk->sk_txhash) {
+   u32 v = prandom_u32();
+   sk->sk_txhash = v ?: 1;
+   }
 }
 
 static inline struct dst_entry *
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578..8d68fde 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
 __u16 *mss);
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-  const struct request_sock *req);
+  struct request_sock *req);
u32 (*init_seq)(const struct sk_buff *skb);
u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..ed9ccb7 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr 
*uaddr, int addr_len
inet->inet_daddr = fl4->daddr;
inet->inet_dport = usin->sin_port;
sk->sk_state = TCP_ESTABLISHED;
-   sk_set_txhash(sk);
+   sk->sk_txhash = get_hash_from_flowi4(fl4);
inet->inet_id = jiffies;
 
sk_dst_set(sk, &rt->dst);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index fda37f2..76f1cf6 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct 
sk_buff *skb)
treq->rcv_isn   = ntohl(th->seq) - 1;
treq->snt_isn   = cookie;
treq->ts_off= 0;
-   treq->txhash= net_tx_rndhash();
req->mss= mss;
ireq->ir_num= ntohs(th->dest);
ireq->ir_rmt_port   = th->source;
@@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct 
sk_buff *skb)
   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
security_req_classify_flow(req, flowi4_to_flowi(&fl4));
+
+   treq->txhash = get_hash_from_flowi4(&fl4);
+
rt = ip_route_output_key(sock_net(sk), &fl4);
if (IS_ERR(rt)) {
reqsk_free(req);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d..92b4a10 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6289,7 +6289,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
}
 
tcp_rsk(req)->snt_isn = isn;
-   tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
if (!want_cookie) {
tcp_reqsk_record_syn(sk, req, skb);
diff --git a/net/ipv4

Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-17 Thread Yonghong Song




On 11/17/17 9:25 AM, Oleg Nesterov wrote:

On 11/15, Yonghong Song wrote:


v3 -> v4:
   . Revert most of v3 change as 32bit emulation is not really working
 on x86_64 platform as among other issues, function emulate_push_stack()
 needs to account for 32bit app on 64bit platform.
 A separate effort is ongoing to address this issue.


Reviewed-by: Oleg Nesterov 



Please test your patch with the fix below, in this particular case the
TIF_IA32 check should be fine. Although this is not what we really want,
we should probably use user_64bit_mode(regs) which checks ->cs. But this
needs more changes and doesn't solve other problems (get_unmapped_area)
so I still can't decide what should we do right now...


I tested the below change with my patch. On x86_64, both 64-bit and 32-bit
programs are uprobe-emulated properly. On x86_32, however, there is a
compilation error like below:


In function ‘check_copy_size’,
inlined from ‘copy_to_user’ at 
/home/yhs/work/tip/include/linux/uaccess.h:154:6,
inlined from ‘emulate_push_stack.isra.9’ at 
/home/yhs/work/tip/arch/x86/kernel/uprobes.c:535:6:
/home/yhs/work/tip/include/linux/thread_info.h:139:4: error: call to 
‘__bad_copy_from’ declared with attribute error: copy source size is too 
small

__bad_copy_from();

Basically, test_thread_flag(TIF_IA32) returns 0 on x86_32 system.



Oleg.

--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -516,7 +516,7 @@ struct uprobe_xol_ops {
  
  static inline int sizeof_long(void)

  {
-   return in_ia32_syscall() ? 4 : 8;
+   return test_thread_flag(TIF_IA32) ? 4 : 8;
  }
  
  static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)

[PATCH 3/4] bpf: add a bpf_override_function helper

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with its kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_function
helper.  This will modify the probed caller's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.
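
The effect described above can be modeled in plain user-space C. This is
only a conceptual sketch, not the BPF helper or the x86 implementation:
struct fake_regs, probed_function(), fake_override() and
call_with_probe() are hypothetical names, and a boolean stands in for
redirecting the PC to a function that simply returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for pt_regs: only a return-value slot and a "skip" marker. */
struct fake_regs {
	long ax;		/* models the return-value register */
	bool skip_original;	/* models pointing the PC at a bare "ret" */
};

static int original_calls;

/* The function under test; without injection it succeeds. */
static long probed_function(void)
{
	original_calls++;
	return 0;
}

/* Models the override: set the return value and bypass the function body. */
static void fake_override(struct fake_regs *regs, long rc)
{
	assert(regs != NULL);
	regs->ax = rc;
	regs->skip_original = true;
}

/* Models a call site with a probe at function entry. */
static long call_with_probe(bool inject, long rc)
{
	struct fake_regs regs = { 0, false };

	if (inject)
		fake_override(&regs, rc);
	if (!regs.skip_original)
		regs.ax = probed_function();
	return regs.ax;
}
```

In this model the caller sees the injected error code while the body of
the probed function never runs, which is the property that makes the
approach usable for targeted error injection.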

Acked-by: Alexei Starovoitov 
Signed-off-by: Josef Bacik 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h|  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h   |  3 ++-
 include/linux/trace_events.h |  1 +
 include/uapi/linux/bpf.h |  7 -
 kernel/bpf/core.c|  3 +++
 kernel/bpf/verifier.c|  2 ++
 kernel/events/core.c |  7 +
 kernel/trace/Kconfig | 11 
 kernel/trace/bpf_trace.c | 38 +++
 kernel/trace/trace_kprobe.c  | 55 +++-
 kernel/trace/trace_probe.h   | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_KPROBE_OVERRIDE
+   bool
+
 config HAVE_NMI
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
+   select HAVE_KPROBE_OVERRIDE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
return regs->ax;
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+{
+   regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode.  On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "   ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+   regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? */
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   kprobe_override:1; /* Do we override a kprobe? */
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..be8bd5a8efaa 100644
--- a/include/linux/trace_events.h
+++ b/include/linu

[PATCH 4/4] samples/bpf: add a test for bpf_override_return

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

This adds a basic test for bpf_override_return to verify it works.  We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.
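
For readers wondering how the sample's "rc = -12" relates to -ENOMEM, here is a
small user-space sketch.  It is not part of the patch; the helper names
(errno_to_rc, rc_to_ret, test_exit_code) are made up for illustration.  It only
shows the value round-trip through an unsigned long and the "mount failure means
test success" exit logic the sample uses.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: encode a negative errno as an unsigned long,
 * the way the sample program stores -12 in 'rc'. */
static unsigned long errno_to_rc(int err)
{
	assert(err > 0);
	return (unsigned long)-(long)err;
}

/* Hypothetical helper: the signed value a caller of the overridden
 * function would observe. */
static long rc_to_ret(unsigned long rc)
{
	return (long)rc;
}

/* Mirrors the sample's exit logic: a non-zero mount status (mount
 * failed) maps to exit code 0, i.e. the injection worked. */
static int test_exit_code(int mount_status)
{
	return mount_status ? 0 : 1;
}

/* Quick self-check of the round-trip and the exit logic. */
static void override_value_self_check(void)
{
	assert(rc_to_ret(errno_to_rc(ENOMEM)) == -ENOMEM);
	assert(test_exit_code(1) == 0);
	assert(test_exit_code(0) == 1);
}
```

On Linux ENOMEM is 12, which is why the literal -12 in the BPF program shows up
as an out-of-memory failure from mount.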

Acked-by: Alexei Starovoitov 
Signed-off-by: Josef Bacik 
---
 samples/bpf/Makefile  |  4 
 samples/bpf/test_override_return.sh   | 15 +++
 samples/bpf/tracex7_kern.c| 16 
 samples/bpf/tracex7_user.c| 28 
 tools/include/uapi/linux/bpf.h|  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   unsigned long rc = -12;
+
+   bpf_override_return(ctx, rc);
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+   FILE *f;
+   char filename[256];
+   char command[256];
+   int ret;
+
+   snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+   if (load_bpf_file(filename)) {
+   printf("%s", bpf_log_buf);
+   return 1;
+   }
+
+   snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+   f = popen(command, "r");
+   ret = pclose(f);
+
+   return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
FN(xdp_adjust_meta),\
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
-   FN(getsockopt),
+   FN(getsockopt), \
+   FN(override_return),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/to

[PATCH 1/4] add infrastructure for tagging functions as error injectable

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

Using BPF we can override kprobed functions and return arbitrary
values.  Obviously this can be a bit unsafe, so make this feature opt-in
for functions.  Simply tag a function with BPF_ALLOW_ERROR_INJECTION() in
order to give BPF access to that function for error injection purposes.
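
As a rough user-space illustration of the opt-in tagging idea (not the kernel
code itself): each tagged function drops its address into a dedicated section,
and a lookup walks that section.  The names ALLOW_INJECTION, demo_inject_list
and within_inject_list are hypothetical, and the sketch assumes a GNU-style ELF
toolchain, where the linker provides __start_/__stop_ symbols for sections
whose names are valid C identifiers.

```c
#include <assert.h>

/* Hypothetical user-space analogue of the tagging macro: store the
 * function's address in a dedicated section. */
#define ALLOW_INJECTION(fname)						\
	static unsigned long						\
	__attribute__((used, section("demo_inject_list")))		\
	_eil_addr_##fname = (unsigned long)fname

/* Section bounds, assumed to be supplied by the linker (GNU ld, lld). */
extern unsigned long __start_demo_inject_list[];
extern unsigned long __stop_demo_inject_list[];

static int may_fail(void)      { return 0; }
static int must_not_fail(void) { return 1; }

/* Only may_fail opts in. */
ALLOW_INJECTION(may_fail);

/* Returns 1 if addr was tagged, 0 otherwise. */
static int within_inject_list(unsigned long addr)
{
	unsigned long *p;

	assert(addr != 0);
	for (p = __start_demo_inject_list; p < __stop_demo_inject_list; p++)
		if (*p == addr)
			return 1;
	return 0;
}

/* Quick self-check: tagged function found, untagged one rejected. */
static void inject_list_self_check(void)
{
	assert(within_inject_list((unsigned long)may_fail) == 1);
	assert(within_inject_list((unsigned long)must_not_fail) == 0);
}
```

The point of the design is that nothing is overridable by default; a function
has to appear in the list before an override is permitted.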

Signed-off-by: Josef Bacik 
---
 arch/x86/include/asm/asm.h|   6 ++
 include/asm-generic/kprobes.h |   9 +++
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/kprobes.h   |   1 +
 include/linux/module.h|   5 ++
 kernel/kprobes.c  | 163 ++
 kernel/module.c   |   6 +-
 7 files changed, 199 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..340f4cc43255 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -85,6 +85,12 @@
_ASM_PTR (entry);   \
.popsection
 
+# define _ASM_KPROBE_ERROR_INJECT(entry)   \
+   .pushsection "_kprobe_error_inject_list","aw" ; \
+   _ASM_ALIGN ;\
+   _ASM_PTR (entry);   \
+   .popsection
+
 .macro ALIGN_DESTINATION
/* check for bad alignment of destination */
movl %edi,%ecx
diff --git a/include/asm-generic/kprobes.h b/include/asm-generic/kprobes.h
index 57af9f21d148..f96c4de5d7b0 100644
--- a/include/asm-generic/kprobes.h
+++ b/include/asm-generic/kprobes.h
@@ -22,4 +22,13 @@ static unsigned long __used  \
 #endif
 #endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)   \
+static unsigned long __used\
+   __attribute__((__section__("_kprobe_error_inject_list")))   \
+   _eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+
 #endif /* _ASM_GENERIC_KPROBES_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST() . = ALIGN(8);  \
+   VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
+   KEEP(*(_kprobe_error_inject_list))  \
+   VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS() . = ALIGN(8);  \
VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
@@ -560,6 +569,7 @@
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
KPROBE_BLACKLIST()  \
+   ERROR_INJECT_LIST() \
MEM_DISCARD(init.rodata)\
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
 extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);
 
 struct kprobe_insn_cache {
struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
ctor_fn_t *ctors;
unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+   unsigned int num_kprobe_ei_funcs;
+   unsigned long *kprobe_ei_funcs;
+#endif
 } cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..7afadf07b34e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash)
return &(kretprobe_table_locks[hash].lock);
 }
 
+/* List of symbols that can be overridden for error injection. */
+static LIST_HEAD(kprobe_error_injection_list);
+static DEFIN

[PATCH 2/4] btrfs: make open_ctree error injectable

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..c6b4e1f07072 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "hash.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
goto fail_block_groups;
goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-- 
2.7.5

[PATCH 0/4][v6] Add the ability to do BPF directed error injection

2017-11-17 Thread Josef Bacik

I've reworked this to be opt-in only as per Ingo and Alexei.  Still needs to go
through Dave because of the bpf bits, but I need tracing guys to weigh in and
sign off on my approach please.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniel's suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subsystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproducible results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actually do the error injection part.
It is very simple: we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,

Josef

Re: Bisected 4.14 Regression: IPsec transport mode breakage

2017-11-17 Thread Kevin Locke

On Fri, 2017-11-17 at 11:03 +0100, Steffen Klassert wrote:
> On Wed, Nov 15, 2017 at 09:46:19AM -0700, Kevin Locke wrote:
>> I have bisected the issue to commit c9f3f813d462.  I have attached the
>> client ipsec.conf as well as the syslog during the connection attempt
>> for both c9f3f813d462 (bad) and cf3796675174 (good).
> 
> The offending commit is already reverted in the 'net' tree
> and will be available in mainline soon.

Great, thank you!  I tested davem/net#94802151894d and can confirm
that it works and fixes the issue for me.  Thanks again.

-- 
Cheers,  |  ke...@kevinlocke.name| XMPP: ke...@kevinlocke.name
Kevin|  https://kevinlocke.name  | IRC:   kevinoid on freenode

Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-17 Thread Oleg Nesterov

On 11/15, Yonghong Song wrote:
>
> v3 -> v4:
>   . Revert most of v3 change as 32bit emulation is not really working
> on x86_64 platform as among other issues, function emulate_push_stack()
> needs to account for 32bit app on 64bit platform.
> A separate effort is ongoing to address this issue.

Reviewed-by: Oleg Nesterov 

Please test your patch with the fix below; in this particular case the
TIF_IA32 check should be fine. This is not what we really want, though:
we should probably use user_64bit_mode(regs), which checks ->cs. But that
needs more changes and doesn't solve other problems (get_unmapped_area),
so I still can't decide what we should do right now...

Oleg.

--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -516,7 +516,7 @@ struct uprobe_xol_ops {

 static inline int sizeof_long(void)
 {
-   return in_ia32_syscall() ? 4 : 8;
+   return test_thread_flag(TIF_IA32) ? 4 : 8;
 }

 static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai

On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> Curently mutex is used to protect pernet operations list. It makes
>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>> which was used on the time of ->init, even after net namespace is
>>>> unlinked from net_namespace_list.
>>>>
>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>> from net_namespace_list():
>>>>
>>>> Destroy net_ns:
>>>> cleanup_net()
>>>>   mutex_lock(&net_mutex)
>>>>   list_del_rcu(&net->list)
>>>>   synchronize_rcu()  <--- Sleep there for ages
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_exit_list(ops, &net_exit_list)
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_free_list(ops, &net_exit_list)
>>>>   mutex_unlock(&net_mutex)
>>>>
>>>> This primitive is not fast, especially on the systems with many processors
>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>>
>>>> Create net_ns:
>>>> copy_net_ns()
>>>>   mutex_lock_killable(&net_mutex)<--- Sleep there for ages
>>>>
>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>> will require down_read() locking only, while down_write() will be used
>>>> for changing pernet_list.
>>>>
>>>> This gives signify performance increase, like you may see below. There
>>>> is measured sequential net namespace creation in a cycle, in single
>>>> thread, without other tasks (single user mode):
>>>>
>>>> 1)int main(int argc, char *argv[])
>>>> {
>>>>     unsigned nr;
>>>>     if (argc < 2) {
>>>>         fprintf(stderr, "Provide nr iterations arg\n");
>>>>         return 1;
>>>>     }
>>>>     nr = atoi(argv[1]);
>>>>     while (nr-- > 0) {
>>>>         if (unshare(CLONE_NEWNET)) {
>>>>             perror("Can't unshare");
>>>>             return 1;
>>>>         }
>>>>     }
>>>>     return 0;
>>>> }
>>>>
>>>> Origin, 10 unshare():
>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>
>>>> Patched, 10 unshare():
>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>
>>>> 2)for i in {1..1}; do unshare -n bash -c exit; done
>>>>
>>>> Origin:
>>>> real 1m24,190s
>>>> user 0m6,225s
>>>> sys 0m15,132s
>>>>
>>>> Patched:
>>>> real 0m18,235s   (4.6 times faster)
>>>> user 0m4,544s
>>>> sys 0m13,796s
>>>>
>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>>> from Linus tree (not in net-next yet).
>>>
>>> Using a rwsem to protect the list of operations makes sense.
>>>
>>> That should allow removing the sing
>>>
>>> I am not wild about taking the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking to add.
>>>
>>> There are about 204 instances of struct pernet_operations.  That is a
>>> lot of code to have carefully audited to ensure it can run in parallel all
>>> at once.  The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>> to the fact that there are data structures accessed by multiple network
>>> namespaces.
>>>
>>> My preference would be to:
>>>
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>   register and unregister, and maybe net_ns_barrier and
>>>   rtnl_link_unregister.
>>>
>>> - Factor out struct pernet_ops out of struct pernet_operations.  With
>>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>>   being embedded as an anonymous member of the old struct pernet_operations.
>>>
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>   pernet_ops, that don't take net_mutex.  Have them order the
>>>   pernet_list as:
>>>
>>>   pernet_sys
>>>   pernet_subsys
>>>   pernet_device
>>>   pernet_dev
>>>
>>>   With the chunk in the middle taking the net_mutex.
>>
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts to the plan below.
>>
>> The only difficult thing there will be to choose the right order
>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>> to pernet_dev one by one.
>>
>> This is rather easy in case of tristate drivers, as modules may be loaded
>> at any time, and the only important order is the dependencies between them.
>> So, it's possible to start from a module that has no dependencies,
>> and move it to pernet_sys, and then continue with

[PATCH 1/2] gre6: use log_ecn_error module parameter in ip6_tnl_rcv()

2017-11-17 Thread Alexey Kodanev

After commit 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call
common GRE functions") the log_ecn_error module parameter is no longer
used anywhere in the module; previously it was used in ip6gre_rcv().

Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 59c121b..5d6bee0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -461,7 +461,7 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi)
  &ipv6h->saddr, &ipv6h->daddr, tpi->key,
  tpi->proto);
if (tunnel) {
-   ip6_tnl_rcv(tunnel, skb, tpi, NULL, false);
+   ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
 
return PACKET_RCVD;
}
-- 
1.8.3.1

[PATCH 2/2] ip6_tunnel: pass tun_dst arg from ip6_tnl_rcv() to __ip6_tnl_rcv()

2017-11-17 Thread Alexey Kodanev

Otherwise the tun_dst argument is unused there. Currently, ip6_tnl_rcv()
is invoked with tun_dst set to NULL, so there are no actual functional
changes introduced in this patch.

Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a1c2444..bc050e8 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -869,7 +869,7 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb,
struct metadata_dst *tun_dst,
bool log_ecn_err)
 {
-   return __ip6_tnl_rcv(t, skb, tpi, NULL, ip6ip6_dscp_ecn_decapsulate,
+   return __ip6_tnl_rcv(t, skb, tpi, tun_dst, ip6ip6_dscp_ecn_decapsulate,
 log_ecn_err);
 }
 EXPORT_SYMBOL(ip6_tnl_rcv);
-- 
1.8.3.1

Re: [PATCH] qed: fix unnecessary call to memset cocci warnings

2017-11-17 Thread Andy Shevchenko

On Fri, Nov 17, 2017 at 12:04 AM, Vasyl Gomonovych  wrote:
> Use kzalloc rather than kmalloc followed by memset with 0
>
> drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING:
> kzalloc should be used for dcbx_info, instead of kmalloc/memset
> Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

While this looks okay per se now, it would be good if you put a version
on the patch and added a changelog to it.

I think there is no need to resend this one; this is just for your information.

Reviewed-by: Andy Shevchenko 
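
For anyone unfamiliar with the pattern the cocci script flags, a user-space
analogue may help.  This is only a sketch: calloc() stands in for kzalloc(),
malloc() plus memset() for the kmalloc()/memset() pair, and struct demo_info is
a made-up type.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Made-up structure standing in for the driver's info struct. */
struct demo_info {
	int a;
	long b;
	char name[16];
};

/* Two-step form: allocate, then clear. */
static struct demo_info *alloc_then_clear(void)
{
	struct demo_info *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	memset(p, 0, sizeof(*p));
	return p;
}

/* One-step form: the allocator hands back zeroed memory. */
static struct demo_info *alloc_zeroed(void)
{
	return calloc(1, sizeof(struct demo_info));
}

/* Returns 1 if both forms yield byte-identical, all-zero objects. */
static int demo_equivalent(void)
{
	struct demo_info *a = alloc_then_clear();
	struct demo_info *b = alloc_zeroed();
	int same = a && b && memcmp(a, b, sizeof(*a)) == 0;

	free(a);
	free(b);
	return same;
}

/* Quick self-check. */
static void demo_alloc_self_check(void)
{
	assert(demo_equivalent() == 1);
}
```

The result is the same either way; the one-step form is just shorter and
leaves no window where the object exists but is not yet cleared.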

> Signed-off-by: Vasyl Gomonovych 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> index 8f6ccc0c39e5..cc9e0dfcee48 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> @@ -1277,11 +1277,10 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct qed_hwfn *hwfn,
>  {
> struct qed_dcbx_get *dcbx_info;
>
> -   dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC);
> +   dcbx_info = kzalloc(sizeof(*dcbx_info), GFP_ATOMIC);
> if (!dcbx_info)
> return NULL;
>
> -   memset(dcbx_info, 0, sizeof(*dcbx_info));
> if (qed_dcbx_query_params(hwfn, dcbx_info, type)) {
> kfree(dcbx_info);
> return NULL;
> --
> 1.9.1
>



-- 
With Best Regards,
Andy Shevchenko

Re: [PATCH net] sctp: report SCTP_ERROR_INV_STRM as cpu endian

2017-11-17 Thread Marcelo Ricardo Leitner

On Fri, Nov 17, 2017 at 02:15:02PM +0800, Xin Long wrote:
> rfc6458 demands the send_error in SCTP_SEND_FAILED_EVENT should
> be in cpu endian, while SCTP_ERROR_INV_STRM is in big endian.
> 
> This issue is there since very beginning, Eric noticed it by
> running 'make C=2 M=net/sctp/'.
> 
> This patch is to convert it before reporting it.

Unfortunately we can't fix this, as it would break UAPI. It would
break applications that are currently matching on the current value.
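
To illustrate what applications see today versus what the patch would report,
here is a small user-space sketch.  The cause code value 1 is assumed purely
for illustration, and htons()/ntohs() stand in for the kernel's
cpu_to_be16()/be16_to_cpu().

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Assumed for illustration: cause code 1, stored in big endian. */
#define DEMO_ERROR_INV_STRM_BE htons(0x0001)

/* What an application sees when the big-endian value is reported as-is. */
static uint16_t reported_raw(void)
{
	return DEMO_ERROR_INV_STRM_BE;
}

/* What it would see after conversion to host byte order. */
static uint16_t reported_converted(void)
{
	return ntohs(DEMO_ERROR_INV_STRM_BE);
}

/* Returns 1 on little-endian hosts, 0 on big-endian ones. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}

/* Quick self-check: the two values differ only on little endian. */
static void endian_self_check(void)
{
	assert(reported_converted() == 1);
	assert((reported_raw() != reported_converted()) ==
	       host_is_little_endian());
}
```

On a little-endian machine an application matching on the raw value is
comparing against 0x0100, so switching to the converted value would silently
stop matching.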

> 
> Reported-by: Eric Dumazet 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/stream.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index a11db21..f86ceee 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -64,7 +64,7 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
>*/
>  
>   /* Mark as failed send. */
> - sctp_chunk_fail(ch, SCTP_ERROR_INV_STRM);
> + sctp_chunk_fail(ch, be16_to_cpu(SCTP_ERROR_INV_STRM));
>   if (asoc->peer.prsctp_capable &&
>   SCTP_PR_PRIO_ENABLED(ch->sinfo.sinfo_flags))
>   asoc->sent_cnt_removable--;
> -- 
> 2.1.0
>

Re: [PATCH net] sctp: set frag_point in sctp_setsockopt_maxseg correctly

2017-11-17 Thread Marcelo Ricardo Leitner

On Fri, Nov 17, 2017 at 02:11:11PM +0800, Xin Long wrote:
> Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with
> val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect.
> 
> val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT.
> Then in sctp_datamsg_from_user(), when it's value is greater than cookie
> echo len and trying to bundle with cookie echo chunk, the first_len will
> overflow.
> 
> The worse case is when it's value is equal as cookie echo len, first_len
> becomes 0, it will go into a dead loop for fragment later on. In Hangbin
> syzkaller testing env, oom was even triggered due to consecutive memory
> allocation in that loop.
> 
> Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, it should
> deduct the data header for frag_point or user_frag check.
> 
> This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting
> the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when
> setting frag_point via sockopt. It also improves sctp_setsockopt_maxseg
> codes.
> 
> Suggested-by: Marcelo Ricardo Leitner 
> Reported-by: Hangbin Liu 
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 
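
A quick user-space sketch of the new range check, for anyone who wants to see
the arithmetic.  All the constants below are assumptions for illustration (an
IPv4 header without options and typical header sizes); the real values come
from the kernel headers, and maxseg_valid is a made-up name.

```c
#include <assert.h>

/* Illustrative values only; not taken from the patch. */
enum {
	DEMO_MINSEGMENT    = 512,	/* assumed SCTP_DEFAULT_MINSEGMENT */
	DEMO_MAX_CHUNK_LEN = 65532,	/* assumed SCTP_MAX_CHUNK_LEN */
	DEMO_NET_HDR       = 20,	/* assumed IPv4 header, no options */
	DEMO_SCTP_HDR      = 12,	/* assumed sizeof(struct sctphdr) */
	DEMO_DATA_HDR      = 16,	/* assumed sizeof(struct sctp_data_chunk) */
};

/* Returns 1 if val would be accepted.  Zero means "use the default"
 * and skips the range check, as in the patch. */
static int maxseg_valid(int val)
{
	int min_len, max_len;

	if (val == 0)
		return 1;

	min_len = DEMO_MINSEGMENT - DEMO_NET_HDR;
	min_len -= DEMO_SCTP_HDR + DEMO_DATA_HDR;
	max_len = DEMO_MAX_CHUNK_LEN - DEMO_DATA_HDR;

	return val >= min_len && val <= max_len;
}

/* Quick self-check around both boundaries. */
static void maxseg_self_check(void)
{
	assert(maxseg_valid(0));
	assert(!maxseg_valid(8));	/* accepted by the old check */
	assert(maxseg_valid(464));
	assert(!maxseg_valid(463));
	assert(maxseg_valid(65516));
	assert(!maxseg_valid(65517));
}
```

With these assumed sizes the accepted range is 464..65516, so the old lower
bound of 8 is rejected.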

> ---
>  include/net/sctp/sctp.h |  3 ++-
>  net/sctp/socket.c   | 29 +++--
>  2 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index d7d8cba..749a428 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -444,7 +444,8 @@ static inline int sctp_frag_point(const struct sctp_association *asoc, int pmtu)
>   if (asoc->user_frag)
>   frag = min_t(int, frag, asoc->user_frag);
>  
> - frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN));
> + frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN -
> + sizeof(struct sctp_data_chunk)));
>  
>   return frag;
>  }
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 4c0a772..3204a9b 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -3140,9 +3140,9 @@ static int sctp_setsockopt_mappedv4(struct sock *sk, char __user *optval, unsign
>   */
>  static int sctp_setsockopt_maxseg(struct sock *sk, char __user *optval, unsigned int optlen)
>  {
> + struct sctp_sock *sp = sctp_sk(sk);
>   struct sctp_assoc_value params;
>   struct sctp_association *asoc;
> - struct sctp_sock *sp = sctp_sk(sk);
>   int val;
>  
>   if (optlen == sizeof(int)) {
> @@ -3158,26 +3158,35 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char __user *optval, unsigned
>   if (copy_from_user(&params, optval, optlen))
>   return -EFAULT;
>   val = params.assoc_value;
> - } else
> + } else {
>   return -EINVAL;
> + }
>  
> - if ((val != 0) && ((val < 8) || (val > SCTP_MAX_CHUNK_LEN)))
> - return -EINVAL;
> + if (val) {
> + int min_len, max_len;
>  
> - asoc = sctp_id2assoc(sk, params.assoc_id);
> - if (!asoc && params.assoc_id && sctp_style(sk, UDP))
> - return -EINVAL;
> + min_len = SCTP_DEFAULT_MINSEGMENT - sp->pf->af->net_header_len;
> + min_len -= sizeof(struct sctphdr) +
> +sizeof(struct sctp_data_chunk);
> +
> + max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
>  
> + if (val < min_len || val > max_len)
> + return -EINVAL;
> + }
> +
> + asoc = sctp_id2assoc(sk, params.assoc_id);
>   if (asoc) {
>   if (val == 0) {
> - val = asoc->pathmtu;
> - val -= sp->pf->af->net_header_len;
> + val = asoc->pathmtu - sp->pf->af->net_header_len;
>   val -= sizeof(struct sctphdr) +
> - sizeof(struct sctp_data_chunk);
> +sizeof(struct sctp_data_chunk);
>   }
>   asoc->user_frag = val;
>   asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
>   } else {
> + if (params.assoc_id && sctp_style(sk, UDP))
> + return -EINVAL;
>   sp->user_frag = val;
>   }
>  
> -- 
> 2.1.0
>

Re: [PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define

2017-11-17 Thread Philippe Ombredanne

On Fri, Nov 17, 2017 at 3:19 PM, Greg Kroah-Hartman
 wrote:
> There is no need to #define the license of the driver, just put it in
> the MODULE_LICENSE() line directly as a text string.
>
> This allows tools that check that the module license matches the source
> code license to work properly, as there is no need to unwind the
> unneeded dereference.
>
> Cc: "David S. Miller" 
> Cc: Andreas Kemnade 
> Cc: Johan Hovold 
> Reported-by: Philippe Ombredanne 
> Signed-off-by: Greg Kroah-Hartman 


Reviewed-by: Philippe Ombredanne 
-- 
Cordially
Philippe Ombredanne

Re: regression: UFO removal breaks kvm live migration

2017-11-17 Thread Willem de Bruijn

>> Okay, I will send a patch to reinstate UFO for this use case (only). There
>> is some related work in tap_handle_frame and packet_direct_xmit to
>> segment directly in the device. I will be traveling the next few days, so
>> it won't be in time for 4.14 (but can go in stable later, of course).
>
> I'm finishing up and running some tests. The majority of the patch is a
> straightforward partial revert of the patchset, so while fairly large for a
> patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
> thoroughly tested code. Notably absent are the protocol layer and
> hardware support (NETIF_F_UFO) portions.
>
> The only open issue is whether to rely on existing skb_gso_segment
> processing in the transmit path from validate_xmit_skb or to add new
> skb_gso_segment calls directly to tun_get_user, tap_get_user and
> pf_packet. Tun has to loop around four different ways of injecting
> packets into the device. Something like the below snippet.
>
> More conservative is to introduce no completely new code and rely on
> validate_xmit_skb, but that means having to protect the entire stack
> against skbs with SKB_GSO_UDP, so also bringing back some
> checksum and fragment handling snippets in gre_gso_segment,
> __skb_udp_tunnel_segment, act_csum and openvswitch.

Come to think of it, as this patch does not bring back NETIF_F_UFO
support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be
excluded.

Then this is probably the simpler and more obviously correct approach.
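
For readers new to the thread, the underlying constraint can be sketched as a subset check on offload bits: a guest keeps using whatever it negotiated before migration, so the destination has to accept all of it. This is a simplified model under stated assumptions, not the actual tuntap or virtio code. The OFFLOAD_* names and values are hypothetical and are not the real TUN_F_* constants.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical offload bits for illustration; not the real TUN_F_* values. */
enum {
    OFFLOAD_CSUM = 1u << 0,
    OFFLOAD_TSO4 = 1u << 1,
    OFFLOAD_UFO  = 1u << 2,
};

/* Offloads the guest negotiated that the destination host does not accept. */
static uint32_t missing_offloads(uint32_t guest_negotiated, uint32_t host_accepts)
{
    return guest_negotiated & ~host_accepts;
}

/* Migration is only safe if nothing the guest relies on is missing. */
static bool can_migrate(uint32_t guest_negotiated, uint32_t host_accepts)
{
    return missing_offloads(guest_negotiated, host_accepts) == 0;
}
```

In this model a host that dropped the UFO bit rejects a guest that negotiated it, which is why the thread settles on accepting UFO again for this use case.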

Re: [PATCH] sfp: Add support for DWDM SFP modules

2017-11-17 Thread David Miller

From: Russell King - ARM Linux 
Date: Fri, 17 Nov 2017 09:52:10 +

> I already have a stack of patches for phy, phylink and sfp that I
> need to send, including documentation patches which Florian has
> already found very useful and helpful.  I had assumed that net-next
> was already closed, being almost a week into the merge window.

Yes it is.

Thanks for the info, I'll mark this 'deferred' in patchwork.  Please
have this respun and posted once net-next is opened back up and
the various issues have been sorted out.

Thank you.

Re: regression: UFO removal breaks kvm live migration

2017-11-17 Thread Willem de Bruijn

On Fri, Nov 10, 2017 at 12:32 AM, Willem de Bruijn
 wrote:
> On Wed, Nov 8, 2017 at 9:53 PM, Jason Wang  wrote:
>>
>>
>> On 2017年11月08日 20:32, David Miller wrote:
>>>
>>> From: Jason Wang 
>>> Date: Wed, 8 Nov 2017 17:25:48 +0900
>>>
>>>> On 2017年11月08日 17:08, Willem de Bruijn wrote:
>>>>>
>>>>> That won't help in the short term. I'm still reading up to see if
>>>>> there are
>>>>> any other options besides reimplement or advertise-but-drop, such as
>>>>> an implicit trigger that would make the guest renegotiate. It's
>>>>> unlikely, but
>>>>> worth a look..
>>>>
>>>> Yes, this looks hard. And even if we can manage to do this, it looks
>>>> an overkill since it will impact all guest after migration.
>>>
>>> Like Willem I would much prefer "advertise-but-drop" if it works.
>>
>>
>> This makes migration work but all guest UFO traffic will stall.
>>
>>>
>>> In the long term feature renegotiation triggers are a must.
>>>
>>> There is no way for us to remove features otherwise.
>>
>>
>> We can remove if we don't break userspace(guest).
>>
>>> In my opinion
>>> this will even make migrations more powerful.
>>
>>
>> But this does not help for guest running old version of kernel which still
>> think UFO work.
>
> Indeed, if we have to support live migration of arbitrary old guests
> without any expectations on hypervisor version either, features can
> simply never be reverted, even if a negotiation interface exists.
>
> At least for upcoming features and devices, guest code should not
> have this expectation, but from the start allow renegotiation such as
> CTRL_GUEST_OFFLOADS [1] based on a host trigger. But for
> tuntap TUNSETOFFLOAD it seems that ship has sailed.
>
> Okay, I will send a patch to reinstate UFO for this use case (only). There
> is some related work in tap_handle_frame and packet_direct_xmit to
> segment directly in the device. I will be traveling the next few days, so
> it won't be in time for 4.14 (but can go in stable later, of course).

I'm finishing up and running some tests. The majority of the patch is a
straightforward partial revert of the patchset, so while fairly large for a
patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
thoroughly tested code. Notably absent are the protocol layer and
hardware support (NETIF_F_UFO) portions.

The only open issue is whether to rely on existing skb_gso_segment
processing in the transmit path from validate_xmit_skb or to add new
skb_gso_segment calls directly to tun_get_user, tap_get_user and
pf_packet. Tun has to loop around four different ways of injecting
packets into the device. Something like the below snippet.

More conservative is to introduce no completely new code and rely on
validate_xmit_skb, but that means having to protect the entire stack
against skbs with SKB_GSO_UDP, so also bringing back some
checksum and fragment handling snippets in gre_gso_segment,
__skb_udp_tunnel_segment, act_csum and openvswitch.

A third option is to send the conservative approach to net, then
in net-next follow up with a patch to plug the SKB_GSO_UDP
directly in the devices and revert the tunnel/act/openvswitch stanzas.
I'm leaning towards that approach.

@@ -1380,7 +1380,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
                             int noblock, bool more)
 {
     struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
-    struct sk_buff *skb;
+    struct sk_buff *skb, *segs = NULL;
     size_t total_len = iov_iter_count(from);
     size_t len = total_len, align = tun->align, linear;
     struct virtio_net_hdr gso = { 0 };
@@ -1552,12 +1552,33 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
     }

     rxhash = __skb_get_hash_symmetric(skb);
+
+    if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) {
+        skb_push(skb, ETH_HLEN);
+        segs = __skb_gso_segment(skb, netif_skb_features(skb), false);
+
+        if (IS_ERR(segs)) {
+            kfree_skb(skb);
+            return PTR_ERR(segs);
+        }
+
+        if (segs) {
+            consume_skb(skb);
+            skb = segs;
+        }
+again:
+        skb_pull(skb, ETH_HLEN);
+        segs = skb->next;
+        skb->next = NULL;
+    }
+
 #ifndef CONFIG_4KSTACKS
-    tun_rx_batched(tun, tfile, skb, more);
+    tun_rx_batched(tun, tfile, skb, more || segs);
 #else
     netif_rx_ni(skb);
 #endif

+    if (segs) {
+        skb = segs;
+        goto again;
+    }
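
The goto-based loop in the snippet above can be read as a plain walk over the segment list: remember the rest of the list, detach the current segment, hand it over, then continue with the remainder. Below is a user-space sketch of just that control flow. struct pkt, deliver_fn, deliver_segments() and record_pkt() are hypothetical stand-ins for struct sk_buff and tun_rx_batched(), not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the `next` link matters here. */
struct pkt {
    struct pkt *next;
    int id;
};

/* Hypothetical delivery hook; `more` signals that further packets follow. */
typedef void (*deliver_fn)(struct pkt *p, int more, void *ctx);

/*
 * Deliver every segment of a NULL-terminated list one at a time.
 * Each node is unlinked before it is handed over, mirroring the
 * `segs = skb->next; skb->next = NULL;` step in the snippet.
 * Returns the number of segments delivered.
 */
static int deliver_segments(struct pkt *head, int more, deliver_fn deliver, void *ctx)
{
    int count = 0;

    assert(deliver != NULL);

    while (head) {
        struct pkt *rest = head->next;   /* remember the remainder */

        head->next = NULL;               /* detach the current segment */
        deliver(head, more || rest != NULL, ctx);
        count++;
        head = rest;
    }
    return count;
}

/* Example hook: record ids in delivery order and the last `more` flag seen. */
struct record {
    int ids[8];
    int n;
    int last_more;
};

static void record_pkt(struct pkt *p, int more, void *ctx)
{
    struct record *r = ctx;

    if (r->n < 8)
        r->ids[r->n++] = p->id;
    r->last_more = more;
}
```

As in the snippet, the batching hint stays set while segments remain, so only the final segment is delivered with the caller's original `more` value.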
