Re: [PATCH 2/4] btrfs: make open_ctree error injectable
* Josef Bacik wrote: > From: Josef Bacik > > This allows us to do error injection with BPF for open_ctree. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/disk-io.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index dfdab849037b..c6b4e1f07072 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -31,6 +31,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "disk-io.h" > #include "hash.h" > @@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb, > goto fail_block_groups; > goto retry_root_backup; > } > +BPF_ALLOW_ERROR_INJECTION(open_ctree); Ok, this looks a lot better - except the random header inclusion dependency: if a facility is in the BPF_*() namespace then it should include and not a random asm/* header... With that detail fixed: Acked-by: Ingo Molnar for the whole series. Thanks, Ingo
Re: [PATCH net] tcp: when scheduling TLP, time of RTO should account for current ACK
On Fri, Nov 17, 2017 at 9:06 PM, Neal Cardwell wrote:
>
> Fix the TLP scheduling logic so that when scheduling a TLP probe, we
> ensure that the estimated time at which an RTO would fire accounts for
> the fact that ACKs indicating forward progress should push back RTO
> times.
>
> After the following fix:
>
> df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
>
> we had an unintentional behavior change in the following kind of
> scenario: suppose the RTT variance has been very low recently. Then
> suppose we send out a flight of N packets and our RTT is 100ms:
>
> t=0: send a flight of N packets
> t=100ms: receive an ACK for N-1 packets
>
> The response before df92c8394e6e was:
>   -> schedule a TLP for now + RTO_interval
>
> The response after df92c8394e6e is:
>   -> schedule a TLP for t=0 + RTO_interval
>
> Since RTO_interval = srtt + RTT_variance, this means that we have
> scheduled a TLP timer at a point in the future that only accounts for
> RTT_variance. If the RTT_variance term is small, this means that the
> timer fires soon.
>
> Before df92c8394e6e this would not happen, because in that code, when
> we receive an ACK for a prefix of flight, we did:
>
> 1) Near the top of tcp_ack(), switch from TLP timer to RTO
>    at write_queue_head->packet_tx_time + RTO_interval:
>      if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>         tcp_rearm_rto(sk);
>
> 2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
>      if (flag & FLAG_ACKED) {
>         tcp_rearm_rto(sk);
>
> 3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
>    to TLP at now + RTO_interval:
>      if (icsk->icsk_pending == ICSK_TIME_RETRANS)
>         tcp_schedule_loss_probe(sk);
>
> In df92c8394e6e we removed that 3-phase dance, and instead directly
> set the TLP timer once: we set the TLP timer in cases like this to
> write_queue_head->packet_tx_time + RTO_interval. So if the RTT
> variance is small, then this means that this is setting the TLP timer
> to fire quite soon. This means if the ACK for the tail of the flight
> takes longer than an RTT to arrive (often due to delayed ACKs), then
> the TLP timer fires too quickly.
>
> Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data
> ACKed/SACKed")
> Signed-off-by: Neal Cardwell
> Signed-off-by: Yuchung Cheng
> Signed-off-by: Eric Dumazet

Acked-by: Soheil Hassas Yeganeh

Nice fix. Thank you, Neal!
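The timing argument in the quoted commit message can be checked with a little arithmetic. The sketch below is a userspace model, not the kernel code: the function name model_rto_delta_us() and its parameters are made up for illustration, and it uses the simplified RTO_interval = srtt + RTT_variance from the message rather than the kernel's actual RTO computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of "how far in the future is the RTO?", in
 * microseconds. This is not the kernel's tcp_rto_delta_us(); every
 * name here is an assumption made for illustration.
 *
 * advancing_rto == true:  the ACK made forward progress, so the RTO is
 *                         measured from now (delta = rto_us).
 * advancing_rto == false: the RTO is measured from the transmit time
 *                         of the packet at the head of the write queue.
 */
int64_t model_rto_delta_us(bool advancing_rto, int64_t now_us,
			   int64_t head_tx_time_us, int64_t rto_us)
{
	assert(rto_us >= 0);
	if (advancing_rto)
		return rto_us;
	return head_tx_time_us + rto_us - now_us;
}
```

Taking srtt = 100 ms and an assumed RTT_variance of 5 ms (so RTO_interval = 105 ms), an ACK arriving at t = 100 ms for a flight sent at t = 0 leaves 5 ms until the timer when measured from the head packet's transmit time, and 105 ms when measured from now. That gap is the "fires too quickly" case the message describes.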
Re: [PATCH 2/2] kbuild: remove all dummy assignments to obj-
2017-11-08 1:31 GMT+09:00 Masahiro Yamada : > Now kbuild core scripts create empty built-in.o where necessary. > Remove "obj- := dummy.o" tricks. > > Signed-off-by: Masahiro Yamada > --- > Applied to linux-kbuild/kbuild. -- Best Regards Masahiro Yamada
[PATCH net] net: ena: fix race condition between device reset and link up setup
From: Netanel Belgazal

In rare cases, the ena driver resets and restarts the device, for
example when a misbehaving application causes a transmit timeout.

The first step in the reset procedure is to stop the Tx traffic by
calling ena_carrier_off().

After the driver has just started the device reset procedure, the
device may happen to send an asynchronous notification (via AENQ) to
the driver that there was a link change (to link-up state).
This link change is mapped to a call to netif_carrier_on(), which
re-activates the Tx queues, violating the assumption of no Tx traffic
until the device reset is completed, as the reset task might still be
in the process of queue initialization, leading to an access to
uninitialized memory.
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 11 +--
 drivers/net/ethernet/amazon/ena/ena_netdev.h |  3 ++-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 5417e4da64ca..988d0383b4e7 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2579,6 +2579,7 @@ static int ena_restore_device(struct ena_adapter *adapter)
 	bool wd_state;
 	int rc;
 
+	set_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags);
 	rc = ena_device_init(ena_dev, adapter->pdev, &get_feat_ctx, &wd_state);
 	if (rc) {
 		dev_err(&pdev->dev, "Can not initialize device\n");
@@ -2592,6 +2593,11 @@ static int ena_restore_device(struct ena_adapter *adapter)
 		goto err_device_destroy;
 	}
 
+	clear_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags);
+	/* Make sure we don't have a race with AENQ Links state handler */
+	if (test_bit(ENA_FLAG_LINK_UP, &adapter->flags))
+		netif_carrier_on(adapter->netdev);
+
 	rc = ena_enable_msix_and_set_admin_interrupts(adapter,
 						      adapter->num_queues);
 	if (rc) {
@@ -2618,7 +2624,7 @@ static int ena_restore_device(struct ena_adapter *adapter)
 	ena_com_admin_destroy(ena_dev);
 err:
 	clear_bit(ENA_FLAG_DEVICE_RUNNING, &adapter->flags);
-
+	clear_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags);
 	dev_err(&pdev->dev,
 		"Reset attempt failed. Can not reset the device\n");
 
@@ -3495,7 +3501,8 @@ static void ena_update_on_link_change(void *adapter_data,
 	if (status) {
 		netdev_dbg(adapter->netdev, "%s\n", __func__);
 		set_bit(ENA_FLAG_LINK_UP, &adapter->flags);
-		netif_carrier_on(adapter->netdev);
+		if (!test_bit(ENA_FLAG_ONGOING_RESET, &adapter->flags))
+			netif_carrier_on(adapter->netdev);
 	} else {
 		clear_bit(ENA_FLAG_LINK_UP, &adapter->flags);
 		netif_carrier_off(adapter->netdev);
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index ed8bd0a579c4..3bbc003871de 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -272,7 +272,8 @@ enum ena_flags_t {
 	ENA_FLAG_DEV_UP,
 	ENA_FLAG_LINK_UP,
 	ENA_FLAG_MSIX_ENABLED,
-	ENA_FLAG_TRIGGER_RESET
+	ENA_FLAG_TRIGGER_RESET,
+	ENA_FLAG_ONGOING_RESET
 };
 
 /* adapter specific private data structure */
--
2.7.3.AMZN
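The flag handshake in this patch can be sketched as a small single-threaded state model. This is only an illustration of the ordering, assuming a struct and function names (model_adapter, model_reset_begin() and so on) that do not exist in the driver; it does not model the atomic bit operations or the real concurrency between the reset task and the AENQ handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver state touched by the patch. */
struct model_adapter {
	bool link_up;       /* stands in for ENA_FLAG_LINK_UP */
	bool ongoing_reset; /* stands in for ENA_FLAG_ONGOING_RESET */
	bool carrier_on;    /* stands in for the netif carrier state */
};

/* Reset starts: Tx is stopped and the reset is marked as in progress. */
void model_reset_begin(struct model_adapter *a)
{
	a->carrier_on = false;
	a->ongoing_reset = true;
}

/* Link notification: record link-up, but defer carrier-on during reset. */
void model_link_change(struct model_adapter *a, bool status)
{
	if (status) {
		a->link_up = true;
		if (!a->ongoing_reset)
			a->carrier_on = true;
	} else {
		a->link_up = false;
		a->carrier_on = false;
	}
}

/* Reset ends: clear the flag, then apply any link-up that was deferred. */
void model_reset_finish(struct model_adapter *a)
{
	assert(a->ongoing_reset);
	a->ongoing_reset = false;
	if (a->link_up)
		a->carrier_on = true;
}
```

In this model a link-up notification that arrives between model_reset_begin() and model_reset_finish() is remembered in link_up but does not turn the carrier on until the reset has finished, which is the behavior the commit message asks for.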
Re: Fwd: FW: [PATCH 18/31] nds32: Library functions
On Tue, Nov 14, 2017 at 12:47:04PM +0800, Vincent Chen wrote:

> Thanks
> So, I should keep the area that we've copied into instead of zeroing
> the area even if unpredicted exception is happened. Right?

Yes. Here's what's required: if raw_copy_{from,to}_user(from, to, size)
returns n, we want
	* 0 <= n <= size
	* no bytes outside of to[0 .. size - n - 1] modified
	* all bytes in that range replaced with corresponding bytes
of range from[0 .. size - n - 1]
	* non-zero return values should happen only when some loads
(in case of raw_copy_from_user()) or stores (in case of
raw_copy_to_user()) had failed. If everything could have been read and
written, we must copy everything.
	* return value should be equal to size only if no load or no
store had been possible. In all other cases you need to copy at least
something. You don't have to squeeze all bytes that can be copied
(you can, of course, but it's not required).
	* you should not assume that failing load guarantees that
subsequent loads further into the same page will keep failing; normally
they will, but relying upon that is asking for trouble. Several
architectures had bugs of that sort, with varying amounts of nastiness
happening when e.g. write(2) raced with mprotect(2) from another thread...

For almost any architecture these should be more or less parallel to
memcpy(); the only exception I know of is the situation when
cross-address-space copy has timing very different from that for normal
load+store. s390 is that way - there's considerable overhead of setting
such copying, and you really want it done in bigger chunks than would be
optimal for memcpy(). uml is similar. However, generally it's memcpy
tweaked to deal with exceptions.
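The return-value rules above can be illustrated with a userspace model. This is a sketch under stated assumptions, not an implementation of raw_copy_{from,to}_user(): model_raw_copy() and its fault_at parameter are hypothetical, the argument order (to, from, size) is chosen for the example, and a "fault" is simulated as the first source offset that cannot be read.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the contract: copy bytes one at a time until
 * either everything is copied or the simulated fault offset is hit,
 * then return the number of bytes NOT copied.
 *
 * Properties that follow from the loop:
 *   - the return value n satisfies 0 <= n <= size
 *   - only to[0 .. size - n - 1] is written
 *   - n == 0 when nothing faults, n == size only when the very first
 *     byte cannot be copied
 */
size_t model_raw_copy(unsigned char *to, const unsigned char *from,
		      size_t size, size_t fault_at)
{
	size_t copied = 0;

	assert(to != NULL && from != NULL);
	while (copied < size && copied < fault_at) {
		to[copied] = from[copied];
		copied++;
	}
	return size - copied;
}
```

A real architecture implementation would normally copy in larger units than one byte; the model only shows what the caller is entitled to assume about the destination buffer and the return value.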
[PATCH v2 11/13] nubus: Rename struct nubus_dev
It is misleading to call a functional resource a "device". In adopting the Linux Driver Model, struct nubus_board will embed a struct device. This will compound the problem because drivers will bind with boards, not with functional resources. Rename struct nubus_dev as struct nubus_rsrc. "Functional resource" is the vendor's terminology so this helps to avoid confusion. Cc: Bartlomiej Zolnierkiewicz Tested-by: Stan Johnson Signed-off-by: Finn Thain --- drivers/net/ethernet/8390/mac8390.c | 26 drivers/net/ethernet/natsemi/macsonic.c | 22 +++ drivers/nubus/nubus.c | 105 drivers/nubus/proc.c| 15 ++--- drivers/video/fbdev/macfb.c | 2 +- include/linux/nubus.h | 30 + 6 files changed, 98 insertions(+), 102 deletions(-) diff --git a/drivers/net/ethernet/8390/mac8390.c b/drivers/net/ethernet/8390/mac8390.c index 9497f18eaba0..929ff6419621 100644 --- a/drivers/net/ethernet/8390/mac8390.c +++ b/drivers/net/ethernet/8390/mac8390.c @@ -123,7 +123,8 @@ enum mac8390_access { }; extern int mac8390_memtest(struct net_device *dev); -static int mac8390_initdev(struct net_device *dev, struct nubus_dev *ndev, +static int mac8390_initdev(struct net_device *dev, + struct nubus_rsrc *ndev, enum mac8390_type type); static int mac8390_open(struct net_device *dev); @@ -169,11 +170,11 @@ static void word_memcpy_tocard(unsigned long tp, const void *fp, int count); static void word_memcpy_fromcard(void *tp, unsigned long fp, int count); static u32 mac8390_msg_enable; -static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev) +static enum mac8390_type __init mac8390_ident(struct nubus_rsrc *fres) { - switch (dev->dr_sw) { + switch (fres->dr_sw) { case NUBUS_DRSW_3COM: - switch (dev->dr_hw) { + switch (fres->dr_hw) { case NUBUS_DRHW_APPLE_SONIC_NB: case NUBUS_DRHW_APPLE_SONIC_LC: case NUBUS_DRHW_SONNET: @@ -184,7 +185,7 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev) break; case NUBUS_DRSW_APPLE: - switch (dev->dr_hw) { + switch (fres->dr_hw) { case 
NUBUS_DRHW_ASANTE_LC: return MAC8390_NONE; case NUBUS_DRHW_CABLETRON: @@ -201,7 +202,7 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev) case NUBUS_DRSW_TECHWORKS: case NUBUS_DRSW_DAYNA2: case NUBUS_DRSW_DAYNA_LC: - if (dev->dr_hw == NUBUS_DRHW_CABLETRON) + if (fres->dr_hw == NUBUS_DRHW_CABLETRON) return MAC8390_CABLETRON; else return MAC8390_APPLE; @@ -212,7 +213,7 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev) break; case NUBUS_DRSW_KINETICS: - switch (dev->dr_hw) { + switch (fres->dr_hw) { case NUBUS_DRHW_INTERLAN: return MAC8390_INTERLAN; default: @@ -225,8 +226,8 @@ static enum mac8390_type __init mac8390_ident(struct nubus_dev *dev) * These correspond to Dayna Sonic cards * which use the macsonic driver */ - if (dev->dr_hw == NUBUS_DRHW_SMC9194 || - dev->dr_hw == NUBUS_DRHW_INTERLAN) + if (fres->dr_hw == NUBUS_DRHW_SMC9194 || + fres->dr_hw == NUBUS_DRHW_INTERLAN) return MAC8390_NONE; else return MAC8390_DAYNA; @@ -289,7 +290,8 @@ static int __init mac8390_memsize(unsigned long membase) return i * 0x1000; } -static bool __init mac8390_init(struct net_device *dev, struct nubus_dev *ndev, +static bool __init mac8390_init(struct net_device *dev, + struct nubus_rsrc *ndev, enum mac8390_type cardtype) { struct nubus_dir dir; @@ -394,7 +396,7 @@ static bool __init mac8390_init(struct net_device *dev, struct nubus_dev *ndev, struct net_device * __init mac8390_probe(int unit) { struct net_device *dev; - struct nubus_dev *ndev = NULL; + struct nubus_rsrc *ndev = NULL; int err = -ENODEV; struct ei_device *ei_local; @@ -489,7 +491,7 @@ static const struct net_device_ops mac8390_netdev_ops = { }; static int __init mac8390_initdev(struct net_device *dev, - struct nubus_dev *ndev, + struct nubus_rsrc *ndev, enum mac8390_type type) { static u32 fwrd4_offsets[16] = { diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c index a42433fb6949..14
[PATCH v2 12/13] nubus: Add expansion_type values for various Mac models
Add an expansion slot attribute to allow drivers to properly handle cards like Comm Slot cards and PDS cards without declaration ROMs. Tested-by: Stan Johnson Signed-off-by: Finn Thain --- arch/m68k/include/asm/macintosh.h | 9 ++- arch/m68k/mac/config.c | 110 +--- drivers/net/ethernet/natsemi/macsonic.c | 8 +-- 3 files changed, 54 insertions(+), 73 deletions(-) diff --git a/arch/m68k/include/asm/macintosh.h b/arch/m68k/include/asm/macintosh.h index f42c27400dbc..9b840c03ebb7 100644 --- a/arch/m68k/include/asm/macintosh.h +++ b/arch/m68k/include/asm/macintosh.h @@ -33,7 +33,7 @@ struct mac_model char ide_type; char scc_type; char ether_type; - char nubus_type; + char expansion_type; char floppy_type; }; @@ -73,8 +73,11 @@ struct mac_model #define MAC_ETHER_SONIC1 #define MAC_ETHER_MACE 2 -#define MAC_NO_NUBUS 0 -#define MAC_NUBUS 1 +#define MAC_EXP_NONE 0 +#define MAC_EXP_PDS1 /* Accepts only a PDS card */ +#define MAC_EXP_NUBUS 2 /* Accepts only NuBus card(s) */ +#define MAC_EXP_PDS_NUBUS 3 /* Accepts PDS card and/or NuBus card(s) */ +#define MAC_EXP_PDS_COMM 4 /* Accepts PDS card or Comm Slot card */ #define MAC_FLOPPY_IWM 0 #define MAC_FLOPPY_SWIM_ADDR1 1 diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c index 16cd5cea5207..d3d435248a24 100644 --- a/arch/m68k/mac/config.c +++ b/arch/m68k/mac/config.c @@ -212,7 +212,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_II, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_NUBUS, .floppy_type= MAC_FLOPPY_IWM, }, @@ -227,7 +227,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_II, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_NUBUS, .floppy_type= MAC_FLOPPY_IWM, }, { .ident = MAC_MODEL_IIX, @@ -236,7 +236,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_II, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + 
.expansion_type = MAC_EXP_NUBUS, .floppy_type= MAC_FLOPPY_SWIM_ADDR2, }, { .ident = MAC_MODEL_IICX, @@ -245,7 +245,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_II, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_NUBUS, .floppy_type= MAC_FLOPPY_SWIM_ADDR2, }, { .ident = MAC_MODEL_SE30, @@ -254,7 +254,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_II, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_PDS, .floppy_type= MAC_FLOPPY_SWIM_ADDR2, }, @@ -272,7 +272,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_IICI, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_NUBUS, .floppy_type= MAC_FLOPPY_SWIM_ADDR2, }, { .ident = MAC_MODEL_IIFX, @@ -281,7 +281,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_IICI, .scsi_type = MAC_SCSI_IIFX, .scc_type = MAC_SCC_IOP, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_PDS_NUBUS, .floppy_type= MAC_FLOPPY_SWIM_IOP, }, { .ident = MAC_MODEL_IISI, @@ -290,7 +290,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_IICI, .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, - .nubus_type = MAC_NUBUS, + .expansion_type = MAC_EXP_PDS_NUBUS, .floppy_type= MAC_FLOPPY_SWIM_ADDR2, }, { .ident = MAC_MODEL_IIVI, @@ -299,7 +299,7 @@ static struct mac_model mac_data_table[] = { .via_type = MAC_VIA_IICI, .scsi_type = MAC_SCSI_LC, .scc_type = MAC_
[PATCH net] tcp: when scheduling TLP, time of RTO should account for current ACK
Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e6e was:
  -> schedule a TLP for now + RTO_interval

The response after df92c8394e6e is:
  -> schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:

1) Near the top of tcp_ack(), switch from TLP timer to RTO
   at write_queue_head->packet_tx_time + RTO_interval:
     if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
        tcp_rearm_rto(sk);

2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
     if (flag & FLAG_ACKED) {
        tcp_rearm_rto(sk);

3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
   to TLP at now + RTO_interval:
     if (icsk->icsk_pending == ICSK_TIME_RETRANS)
        tcp_schedule_loss_probe(sk);

In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.

Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
---
 include/net/tcp.h     | 2 +-
 net/ipv4/tcp_input.c  | 2 +-
 net/ipv4/tcp_output.c | 8 +---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578195d4..4e09398009c1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -539,7 +539,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
-bool tcp_schedule_loss_probe(struct sock *sk);
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto);
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 			     const struct sk_buff *next_skb);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d392fb..f31de422b37f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2964,7 +2964,7 @@ void tcp_rearm_rto(struct sock *sk)
 /* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
 static void tcp_set_xmit_timer(struct sock *sk)
 {
-	if (!tcp_schedule_loss_probe(sk))
+	if (!tcp_schedule_loss_probe(sk, true))
 		tcp_rearm_rto(sk);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 540b7d92cc70..a4d214c7b506 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2391,7 +2391,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
-			tcp_schedule_loss_probe(sk);
+			tcp_schedule_loss_probe(sk, false);
 		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
@@ -2399,7 +2399,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	return !tp->packets_out && !tcp_write_queue_empty(sk);
 }
 
-bool tcp_schedule_loss_probe(struct sock *sk)
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -2440,7 +2440,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 	}
 
 	/* If the RTO formula yields an earlier time, then use that time. */
-	rto_delta_us = tcp_rto_delta_us(sk);  /* How far in future is RTO? */
+	rto_delta_us = advancing_rto ?
+			jiffies_to_usecs(inet_csk(sk)->icsk_rto) :
+			tcp_rto_delta_us(sk);  /* How far in future is RTO? */
 	if (rto_delta_us > 0)
 		timeout = min_t(u32, tim
[GIT] Networking
1) Revert regression inducing change to the IPSEC template resolver,
   from Steffen Klassert.

2) Peeloffs can cause the wrong sk to be woken up in SCTP, fix from
   Xin Long.

3) Min packet MTU size is wrong in cpsw driver, from Grygorii
   Strashko.

4) Fix build failure in netfilter ctnetlink, from Arnd Bergmann.

5) ISDN hisax driver checks pnp_irq() for errors incorrectly, from
   Arvind Yadav.

6) Fix fealnx driver build failure on MIPS, from Huacai Chen.

7) Fix info leak in SCTP, the scope_id of socket addresses is not
   always filled in. From Eric W. Biederman.

8) MTU inheritance between physical function and representor fix in
   nfp driver, from Dirk van der Merwe.

9) Fix memory leak in rsi driver, from Colin Ian King.

10) Fix expiration and generation ID handling of cached ipv4 redirect
    routes, from Xin Long.

Please pull, thanks a lot!

The following changes since commit 6363b3f3ac5be096d08c8c504128befa0c033529:

  Merge tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmi (2017-11-15 15:12:28 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

for you to fetch changes up to 461ee7f3286dd50be4726606819c4228bc485a17:

  net: usb: hso.c: remove unneeded DRIVER_LICENSE #define (2017-11-18 10:37:00 +0900)

Ahmed Abdelsalam (1):
      ipv6: sr: update the struct ipv6_sr_hdr

Arnd Bergmann (1):
      netfilter: add ifdef around ctnetlink_proto_size

Arvind Yadav (12):
      isdn: hisax: Fix pnp_irq's error checking for setup_asuscom
      isdn: hisax: Fix pnp_irq's error checking for avm_pnp_setup
      isdn: hisax: Fix pnp_irq's error checking for setup_diva_isapnp
      isdn: hisax: Fix pnp_irq's error checking for setup_elsa_isapnp
      isdn: hisax: Fix pnp_irq's error checking for setup_hfcsx
      isdn: hisax: Fix pnp_irq's error checking for setup_hfcs
      isdn: hisax: Handle return value of pnp_irq and pnp_port_start
      isdn: hisax: Fix pnp_irq's error checking for setup_isurf
      isdn: hisax: Fix pnp_irq's error checking for setup_ix1micro
      isdn: hisax: Fix pnp_irq's error checking for setup_niccy
      isdn: hisax: Fix pnp_irq's error checking for setup_sedlbauer_isapnp
      isdn: hisax: Fix pnp_irq's error checking for setup_teles3

Colin Ian King (2):
      qed: use kzalloc instead of kmalloc and memset
      rsi: fix memory leak on buf and usb_reg_buf

David S. Miller (3):
      Merge branch 'isdn-hisax-Fix-pnp_irq-error-checking'
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      Merge branch 'nfp-flower-fixes-and-typo-in-ethtool-stats-name'

Desnes Augusto Nunes do Rosario (1):
      ibmvnic: fix dma_mapping_error call

Dirk van der Merwe (1):
      nfp: inherit the max_mtu from the PF netdev

Eric W. Biederman (1):
      net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Girish Moodalbail (1):
      ipvlan: NULL pointer dereference panic in ipvlan_port_destroy

Greg Kroah-Hartman (1):
      net: usb: hso.c: remove unneeded DRIVER_LICENSE #define

Grygorii Strashko (1):
      net: ethernet: ti: cpsw: fix min eth packet size

Herbert Xu (1):
      xfrm: Copy policy family in clone_policy

Huacai Chen (1):
      fealnx: Fix building error on MIPS

Joel Stanley (1):
      virto_net: remove empty file 'virtio_net.'

John Hurley (2):
      nfp: register flower reprs for egress dev offload
      nfp: remove false positive offloads in flower vxlan

Jon Maloy (1):
      tipc: enforce valid ratio between skb truesize and contents

Michal Kubecek (1):
      genetlink: fix genlmsg_nlhdr()

Pieter Jansen van Vuuren (2):
      nfp: fix flower offload metadata flag usage
      nfp: fix vlan receive MAC statistics typo

Steffen Klassert (1):
      Revert "xfrm: Fix stack-out-of-bounds read in xfrm_state_find."

Tim Hansen (1):
      net/netlabel: Add list_next_rcu() in rcu_dereference().

Vitaly Kuznetsov (1):
      hv_netvsc: preserve hw_features on mtu/channels/ringparam changes

Xin Long (6):
      sctp: do not free asoc when it is already dead in sctp_sendmsg
      sctp: use the right sk after waking up from wait_buf sleep
      sctp: check stream reset info len before making reconf chunk
      sctp: set frag_point in sctp_setsockopt_maxseg correctly
      route: update fnhe_expires for redirect when the fnhe exists
      route: also update fnhe_genid when updating a route cache

 drivers/isdn/hisax/asuscom.c        | 2 +-
 drivers/isdn/hisax/avm_pci.c        | 2 +-
 drivers/isdn/hisax/diva.c           | 2 +-
 drivers/isdn/hisax/elsa.c           | 2 +-
 drivers/isdn/hisax/hfc_sx.c         | 2 +-
 drivers/isdn/hisax/hfcscard.c       | 2 +-
 drivers/isdn/hisax/hisax_fcpcipnp.c | 2 +
 drivers/isdn/hisax/isurf.c          | 2 +-
 drivers/isdn/hisax/ix1_micro.c
Re: [PATCH 7/8] net: ovs: remove unused hardirq.h
It looks like the email address of Pravin in the MAINTAINERS file is
obsolete, so I am resending this to the right address.

Yang

On 11/17/17 3:02 PM, Yang Shi wrote:
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by openvswitch at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi
Cc: Pravin Shelar
Cc: "David S. Miller"
Cc: d...@openvswitch.org
---
 net/openvswitch/vport-internal_dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 04a3128..2f47c65 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -16,7 +16,6 @@
  * 02110-1301, USA
  */
 
-#include
 #include
 #include
 #include
Re: iproute2: make ip route list to search by metric too
Hello again,

Things turned out to be not so hard. Please take a look at the attached
patch. The only thing I'm not sure about is whether RTA_PRIORITY is
enough, because the print_route function also prints "metric" in some
situations with RTA_METRICS, which I haven't managed to understand.

On Fri, Nov 17, 2017 at 1:40 AM, Alexander Zubkov wrote:
> Hello all,
>
> Currently routes in the Linux routing table have these "key" fields:
> prefix, tos, table, metric (as I know). I.e. we cannot have two
> different routes with the same set of these fields. And "ip route list"
> command can be provided with all but one of those fields. We cannot
> pass metric to it and this is inconvenient. I ask if this behaviour
> can be changed by someone. We can even use "secondary" fields, for
> example type, dev or via, but not metric unfortunately.
> Sorry, I can not provide patches. I have written code long time ago. I
> tried to trace it, but as I see it parses arguments and fills some
> structures. And then my tries to understand failed.
> I opened the bug: https://bugzilla.kernel.org/show_bug.cgi?id=197897,
> but I was pointed out that this mailing list is a better place for
> this question.
>
> --
> Alexander Zubkov

--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -126,6 +126,8 @@ static struct
 	int oif, oifmask;
 	int mark, markmask;
 	int realm, realmmask;
+	int have_metric;
+	__u32 metric;
 	inet_prefix rprefsrc;
 	inet_prefix rvia;
 	inet_prefix rdst;
@@ -288,6 +290,14 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 		if ((mark ^ filter.mark) & filter.markmask)
 			return 0;
 	}
+	if (filter.have_metric) {
+		__u32 metric = 0;
+
+		if (tb[RTA_PRIORITY])
+			metric = rta_getattr_u32(tb[RTA_PRIORITY]);
+		if (filter.metric != metric)
+			return 0;
+	}
 	if (filter.flushb &&
 	    r->rtm_family == AF_INET6 &&
 	    r->rtm_dst_len == 0 &&
@@ -1518,6 +1528,16 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action)
 			if (get_unsigned(&mark, *argv, 0))
 				invarg("invalid mark value", *argv);
 			filter.markmask = -1;
+		} else if (matches(*argv, "metric") == 0 ||
+			   matches(*argv, "priority") == 0 ||
+			   strcmp(*argv, "preference") == 0) {
+			__u32 metric;
+
+			NEXT_ARG();
+			if (get_u32(&metric, *argv, 0))
+				invarg("\"metric\" value is invalid\n", *argv);
+			filter.metric = metric;
+			filter.have_metric = 1;
 		} else if (strcmp(*argv, "via") == 0) {
 			int family;
 
Re: [PATCH net] net: accept UFO datagrams from tuntap and packet
From: Willem de Bruijn Date: Fri, 17 Nov 2017 17:59:13 -0500 > Tuntap and similar devices can inject GSO packets. Accept type > VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively. > > Processes are expected to use feature negotiation such as TUNSETOFFLOAD > to detect supported offload types and refrain from injecting other > packets. This process breaks down with live migration: guest kernels > do not renegotiate flags, so destination hosts need to expose all > features that the source host does. > > Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677. > This patch introduces nearly(*) no new code to simplify verification. > It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP > insertion and software UFO segmentation. This looks good, one minor nit: > @@ -2369,6 +2369,10 @@ static int set_offload(struct tun_struct *tun, > unsigned long arg) > features |= NETIF_F_TSO6; > arg &= ~(TUN_F_TSO4|TUN_F_TSO6); > } > + > + if (arg & TUN_F_UFO) { > + arg &= ~TUN_F_UFO; > + } This can be just simply "arg &= ~TUN_F_UFO;"? If anything the curly braces should be removed for a single statement basic block. Thanks for working so hard on fixing this.
Re: [PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define
From: Greg Kroah-Hartman Date: Fri, 17 Nov 2017 15:19:39 +0100 > There is no need to #define the license of the driver, just put it in > the MODULE_LICENSE() line directly as a text string. > > This allows tools that check that the module license matches the source > code license to work properly, as there is no need to unwind the > unneeded dereference. > > Cc: "David S. Miller" > Cc: Andreas Kemnade > Cc: Johan Hovold > Reported-by: Philippe Ombredanne > Signed-off-by: Greg Kroah-Hartman Applied.
Re: [PATCH net v2 1/1] ipvlan: NULL pointer dereference panic in ipvlan_port_destroy
From: Girish Moodalbail Date: Thu, 16 Nov 2017 23:16:17 -0800 > When call to register_netdevice() (called from ipvlan_link_new()) fails, > we call ipvlan_uninit() (through ndo_uninit()) to destroy the ipvlan > port. After returning unsuccessfully from register_netdevice() we go > ahead and call ipvlan_port_destroy() again which causes NULL pointer > dereference panic. Fix the issue by making ipvlan_init() and > ipvlan_uninit() call symmetric. > > The ipvlan port will now be created inside ipvlan_init() and will be > destroyed in ipvlan_uninit(). > > Fixes: 2ad7bf363841 (ipvlan: Initial check-in of the IPVLAN driver) > Signed-off-by: Girish Moodalbail Applied.
Re: [PATCH] [net] ibmvnic: fix dma_mapping_error call
From: Desnes Augusto Nunes do Rosario Date: Fri, 17 Nov 2017 09:09:04 -0200 > This patch fixes the dma_mapping_error call to use the correct dma_addr > which is inside the ibmvnic_vpd struct. Moreover, it fixes an uninitialized > warning regarding a local dma_addr variable which is not used anymore. > > Fixes: 4e6759be28e4 ("ibmvnic: Feature implementation of VPD for the ibmvnic > driver") > Reported-by: Stephen Rothwell > Signed-off-by: Desnes A. Nunes do Rosario Applied.
Re: [PATCH] net/netlabel: Add list_next_rcu() in rcu_dereference().
From: Tim Hansen
Date: Thu, 16 Nov 2017 12:03:34 -0500

> Add list_next_rcu() for fetching next list in rcu_dereference safely.
>
> Found with sparse in linux-next tree on tag next-20171116.
>
> Signed-off-by: Tim Hansen

Applied.
Re: [PATCH net] route: update fnhe_expires for redirect when the fnhe exists
From: Xin Long
Date: Fri, 17 Nov 2017 14:27:06 +0800

> Now when creating fnhe for redirect, it sets fnhe_expires for this
> new route cache. But when updating the existing one, it doesn't do it.
> It will cause this fnhe never to be expired.
>
> Paolo already noticed it before, in Jianlin's test case, it became
> even worse:
>
> When ip route flush cache, the old fnhe is not to be removed, but
> only clean its members. When redirect comes again, this fnhe will
> be found and updated, but never be expired due to fnhe_expires not
> being set.
>
> So fix it by simply updating fnhe_expires even if it's for redirect.
>
> Fixes: aee06da6726d ("ipv4: use seqlock for nh_exceptions")
> Reported-by: Jianlin Shi
> Acked-by: Hannes Frederic Sowa
> Signed-off-by: Xin Long

Applied.
Re: [PATCH net] route: also update fnhe_genid when updating a route cache
From: Xin Long
Date: Fri, 17 Nov 2017 14:27:18 +0800

> Now when ip route flush cache, it turns out all fnhe_genid != genid.
> If a redirect/pmtu icmp packet comes and the old fnhe is found and all
> its members but fnhe_genid will be updated.
>
> Then next time when it looks up route and tries to rebind this fnhe to
> the new dst, the fnhe will be flushed due to fnhe_genid != genid. It
> causes this redirect/pmtu icmp packet actually not to be applied.
>
> This patch is to also reset fnhe_genid when updating a route cache.
>
> Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
> Acked-by: Hannes Frederic Sowa
> Signed-off-by: Xin Long

Applied.
Re: [PATCH net] sctp: set frag_point in sctp_setsockopt_maxseg correctly
From: Xin Long
Date: Fri, 17 Nov 2017 14:11:11 +0800

> Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with
> val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect.
>
> val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT.
> Then in sctp_datamsg_from_user(), when its value is greater than cookie
> echo len and trying to bundle with cookie echo chunk, the first_len will
> overflow.
>
> The worse case is when its value is equal to cookie echo len, first_len
> becomes 0, it will go into a dead loop for fragment later on. In Hangbin
> syzkaller testing env, oom was even triggered due to consecutive memory
> allocation in that loop.
>
> Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, it should
> deduct the data header for frag_point or user_frag check.
>
> This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting
> the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when
> setting frag_point via sockopt. It also improves sctp_setsockopt_maxseg
> codes.
>
> Suggested-by: Marcelo Ricardo Leitner
> Reported-by: Hangbin Liu
> Signed-off-by: Xin Long

Applied.
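For readers following along, the range check described in the commit message above can be sketched in user-space C roughly as follows. This is only an illustration, not the code from the patch: the EX_* constants and the helper name ex_maxseg_valid() are hypothetical, and the numeric values (512-byte minimum segment, 12-byte SCTP common header, 16-byte DATA chunk header, 65532-byte maximum chunk) are assumptions that should be checked against the kernel's SCTP headers.

```c
#include <stdbool.h>

/* Illustrative values only; the real definitions live in the kernel's
 * SCTP headers and may differ from what is assumed here. */
#define EX_DEFAULT_MINSEGMENT 512u   /* assumed minimum segment size */
#define EX_MAX_CHUNK_LEN      65532u /* assumed max size of a whole chunk */
#define EX_SCTPHDR_LEN        12u    /* assumed SCTP common header size */
#define EX_DATAHDR_LEN        16u    /* assumed DATA chunk header size */

/* Hypothetical helper: returns true if val is an acceptable
 * frag_point/user_frag under the bounds the commit message describes. */
bool ex_maxseg_valid(unsigned int val)
{
	/* Lower bound: minimum segment minus the SCTP and DATA headers. */
	unsigned int min_len = EX_DEFAULT_MINSEGMENT - EX_SCTPHDR_LEN - EX_DATAHDR_LEN;
	/* Upper bound: a whole chunk minus the DATA header. */
	unsigned int max_len = EX_MAX_CHUNK_LEN - EX_DATAHDR_LEN;

	return val >= min_len && val <= max_len;
}
```

With these assumed sizes, the old lower bound of 8 would be rejected, which is the case the commit message says led to the first_len overflow and the endless fragmentation loop.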
Re: [PATCH] qed: fix unnecessary call to memset cocci warnings
From: Vasyl Gomonovych
Date: Thu, 16 Nov 2017 23:04:08 +0100

> Use kzalloc rather than kmalloc followed by memset with 0
>
> drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING:
> kzalloc should be used for dcbx_info, instead of kmalloc/memset
> Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci
>
> Signed-off-by: Vasyl Gomonovych

This patch doesn't even apply.
Re: JOIN_ANYCAST breakage w. "net: ipv6: put host and anycast routes on device with address"
On 11/14/17 10:36 AM, Florian Westphal wrote: > Hi David > > This test program no longer works with 4.14 > (recvfrom: Resource temporarily unavailable) > > after reverting commit > 4832c30d5458387ff2533ff66fbde26ad8bb5a2d > (net: ipv6: put host and anycast routes on device with address) > > it will work again ("OK"). > > Could you please have a look at this? > This restores the previous behavior: diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 05eb7bc36156..1c29d9bcedc3 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1019,7 +1019,7 @@ static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt) { struct net_device *dev = rt->dst.dev; - if (rt->rt6i_flags & RTF_LOCAL) { + if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) { /* for copies of local routes, dst->dev needs to be the * device if it is a master device, the master device if * device is enslaved, and the loopback as the default
Re: [ftrace-bpf 1/5] add BPF_PROG_TYPE_FTRACE to bpf
On Mon, Nov 13, 2017 at 12:06:17AM -0800, peng yu wrote: > > 1. anything bpf related has to go via net-next tree. > I found there is a net-next git repo: > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git > I will use this repo for the further bpf-ftrace patch set. > > > 2. > > this obviously breaks ABI. New types can only be added to the end. > Sure, I will add the new type at the end. > > > 3. > > this won't even compile, since ftrace_regs is only added in the patch 4. > It could compile, as the ftrace_regs related code is inside the > "#ifdef FTRACE_BPF_FILTER" macro, if this macro is not defined, no > ftrace_regs related code would be compiled. > > > > Since bpf program will see ftrace_regs as an input it becomes > > abi, so has to be defined in uapi/linux/bpf_ftrace.h or similar. > > We need to think through how to make it generic across archs > > instead of defining ftrace_regs for each arch. > I'm not sure whether I'm fully understand your meaning. Like kprobe, > the ftrace-bpf need to get a function's parameters and check them. So > it won't be abi stable, and it should depend on architecture > implement. I can create a header file like uapi/linux/bpf_ftrace.h, > but I noticed that kprobe doesn't have such a header file, if I'm > wrong, please let me know. And about make it generic across archs, I > know kprobe use pt_regs as parameter, the pt_regs is defined on each > arch, so I can't see how bpf-ftrace can get a generic interface across > archs if it need to check function's parameters. If I misunderstand > anything, please let me know. all of ftrace are called at function entry and calling convention is fixed per architecture, so we can make a generic and stable struct bpf_ftrace_args { __u64 arg1, arg2, .. arg5; }; save_mcount_regs doesn't care what order the regs are stored so the same stack space can be used to keep bpf_ftrace_args and used in restore_mcount_regs. 
I'd also make it depend on DYNAMIC_FTRACE_WITH_REGS to avoid dealing with obscure corner cases. > > > 4. > > the patch 2/3 takes an approach of passing FD integer value in text form > > to the kernel. That approach was discussed years ago and rejected. > > It has to use binary interface like perf_event + ioctl. > > See RFC patches where we're extending perf_event_open syscall to > > support binary access to kprobe/uprobe. > > imo binary interface to ftrace is pre-requisite to ftrace+bpf work. > > We've had too many issues with text based kprobe api to repeat > > the same mistake here. > I notice the kprobe-bpf prog is set through the PERF_EVENT_IOC_SET_BPF > ioctl, I may try to see whether I can reuse this interface, or if it > is not suitable, I will try to define a new binary interface. > > > 5. > > patch 4 hacks save_mcount_regs asm to pass ctx pointer in %rcx > > whereas it's only used in ftrace_graph_caller which doesn't seem right. > > It points out to another issue that such ftrace+bpf integration > > is only done for ftrace_graph_caller without extensibility in mind. > > If we do ftrace+bpf I'd rather see generic framework that applies > > to all of ftrace instead of single feature of it. > It is a hard problem. The ftrace framework has lots of tracers, > function tracer and function graph tracer use the 'gcc -pg' directly, > other tracers use tracepoint, I should spend more time to find a > suitable solution. since all of ftrace goes through the same function entry point it should be possible to have one generic bpf filter interface suitable for all tracers that ftrace supports.
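To make the suggestion in the reply above concrete, here is a rough user-space sketch of a generic argument block filled from architecture-specific registers. It is not from any posted patch: the ex_* names are hypothetical, uint64_t stands in for the kernel's __u64, and the register set shown is only the x86-64 case, assuming the System V AMD64 calling convention (first five integer arguments in rdi, rsi, rdx, rcx, r8).

```c
#include <stdint.h>

/* Hypothetical generic argument block, modelled on the
 * struct bpf_ftrace_args proposed in the reply above. */
struct ex_bpf_ftrace_args {
	uint64_t arg1, arg2, arg3, arg4, arg5;
};

/* Hypothetical subset of the x86-64 registers saved at function entry. */
struct ex_x86_64_entry_regs {
	uint64_t di, si, dx, cx, r8;
};

/* Copy the first five integer arguments into the arch-independent
 * layout. Each architecture would supply its own mapping; only the
 * struct seen by the BPF program stays the same. */
void ex_fill_args(struct ex_bpf_ftrace_args *a,
		  const struct ex_x86_64_entry_regs *r)
{
	a->arg1 = r->di;
	a->arg2 = r->si;
	a->arg3 = r->dx;
	a->arg4 = r->cx;
	a->arg5 = r->r8;
}
```

The point of such a layout is that a BPF program would only ever read arg1..arg5, so it would not need to know which registers a given architecture uses.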
[PATCH 5/8] crypto: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by crypto at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Herbert Xu Cc: "David S. Miller" Cc: linux-cry...@vger.kernel.org --- crypto/ablk_helper.c | 1 - crypto/blkcipher.c | 1 - crypto/mcryptd.c | 1 - 3 files changed, 3 deletions(-) diff --git a/crypto/ablk_helper.c b/crypto/ablk_helper.c index 1441f07..ee52660 100644 --- a/crypto/ablk_helper.c +++ b/crypto/ablk_helper.c @@ -28,7 +28,6 @@ #include #include #include -#include #include #include #include diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c index 6c43a0a..01c0d4a 100644 --- a/crypto/blkcipher.c +++ b/crypto/blkcipher.c @@ -18,7 +18,6 @@ #include #include #include -#include #include #include #include diff --git a/crypto/mcryptd.c b/crypto/mcryptd.c index 4e64726..9fa362c 100644 --- a/crypto/mcryptd.c +++ b/crypto/mcryptd.c @@ -26,7 +26,6 @@ #include #include #include -#include #define MCRYPTD_MAX_CPU_QLEN 100 #define MCRYPTD_BATCH 9 -- 1.8.3.1
[PATCH 2/8] fs: pstore: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by pstore at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Kees Cook Cc: Anton Vorontsov Cc: Colin Cross Cc: Tony Luck --- fs/pstore/platform.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c index 2b21d18..25dcef4 100644 --- a/fs/pstore/platform.c +++ b/fs/pstore/platform.c @@ -41,7 +41,6 @@ #include #include #include -#include #include #include -- 1.8.3.1
[PATCH 3/8] fs: btrfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by btrfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org --- fs/btrfs/extent_map.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 2e348fb..cced7f1 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -2,7 +2,6 @@ #include #include #include -#include #include "ctree.h" #include "extent_map.h" #include "compression.h" -- 1.8.3.1
[PATCH 7/8] net: ovs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by openvswitch at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Pravin Shelar Cc: "David S. Miller" Cc: d...@openvswitch.org --- net/openvswitch/vport-internal_dev.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 04a3128..2f47c65 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -16,7 +16,6 @@ * 02110-1301, USA */ -#include #include #include #include -- 1.8.3.1
[PATCH 4/8] vfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by vfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Alexander Viro --- fs/dcache.c | 1 - fs/file_table.c | 1 - 2 files changed, 2 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index f901413..9340e8c 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -32,7 +32,6 @@ #include #include #include -#include #include #include #include diff --git a/fs/file_table.c b/fs/file_table.c index 61517f5..dab099e 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -23,7 +23,6 @@ #include #include #include -#include #include #include #include -- 1.8.3.1
[PATCH 8/8] net: tipc: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by TIPC at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Jon Maloy Cc: Ying Xue Cc: "David S. Miller" --- net/tipc/core.h | 1 - 1 file changed, 1 deletion(-) diff --git a/net/tipc/core.h b/net/tipc/core.h index 5cc5398..099e072 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -49,7 +49,6 @@ #include #include #include -#include #include #include #include -- 1.8.3.1
[PATCH 6/8] net: caif: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by caif at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Dmitry Tarnyagin Cc: "David S. Miller" --- net/caif/cfpkt_skbuff.c | 1 - net/caif/chnl_net.c | 1 - 2 files changed, 2 deletions(-) diff --git a/net/caif/cfpkt_skbuff.c b/net/caif/cfpkt_skbuff.c index 71b6ab2..38c2b7a 100644 --- a/net/caif/cfpkt_skbuff.c +++ b/net/caif/cfpkt_skbuff.c @@ -8,7 +8,6 @@ #include #include -#include #include #include diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c index 922ac1d..53ecda1 100644 --- a/net/caif/chnl_net.c +++ b/net/caif/chnl_net.c @@ -8,7 +8,6 @@ #define pr_fmt(fmt) KBUILD_MODNAME ":%s(): " fmt, __func__ #include -#include #include #include #include -- 1.8.3.1
[PATCH 1/8] mm: kmemleak: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by kmemleak at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Michal Hocko Cc: Andrew Morton Cc: Matthew Wilcox --- mm/kmemleak.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 7780cd8..25b977f 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -91,7 +91,6 @@ #include #include #include -#include #include #include #include -- 1.8.3.1
[iproute2 PATCH] man: tc-flower: add explanation for hw_tc option
Add details explaining the hw_tc option. Signed-off-by: Amritha Nambiar --- man/man8/tc-flower.8 |9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8 index be46f02..fd9098e 100644 --- a/man/man8/tc-flower.8 +++ b/man/man8/tc-flower.8 @@ -10,7 +10,10 @@ flower \- flow based traffic control filter .B action .IR ACTION_SPEC " ] [ " .B classid -.IR CLASSID " ]" +.IR CLASSID " ] [ " +.B hw_tc +.IR TCID " ]" + .ti -8 .IR MATCH_LIST " := [ " MATCH_LIST " ] " MATCH @@ -77,6 +80,10 @@ is in the form .BR X : Y ", while " X " and " Y are interpreted as numbers in hexadecimal format. .TP +.BI hw_tc " TCID" +Specify a hardware traffic class to pass matching packets on to. TCID is in the +range 0 through 15. +.TP .BI indev " ifname" Match on incoming interface name. Obviously this makes sense only for forwarded flows.
[iproute2 PATCH] man: tc-mqprio: add documentation for new offload options
This patch adds documentation for additional offload modes and associated parameters in tc-mqprio. Signed-off-by: Amritha Nambiar --- man/man8/tc-mqprio.8 | 60 +- 1 file changed, 49 insertions(+), 11 deletions(-) diff --git a/man/man8/tc-mqprio.8 b/man/man8/tc-mqprio.8 index 0e1d305..a1bedd3 100644 --- a/man/man8/tc-mqprio.8 +++ b/man/man8/tc-mqprio.8 @@ -16,7 +16,17 @@ P0 P1 P2... count1@offset1 count2@offset2 ... .B ] [ hw 1|0 -.B ] +.B ] [ mode +dcb|channel] +.B ] [ shaper +dcb| +.B [ bw_rlimit +.B min_rate +min_rate1 min_rate2 ... +.B max_rate +max_rate1 max_rate2 ... +.B ]] + .SH DESCRIPTION The MQPRIO qdisc is a simple queuing discipline that allows mapping @@ -36,14 +46,16 @@ and By default these parameters are configured by the hardware driver to match the hardware QOS structures. -Enabled hardware can provide hardware QOS with the ability to steer -traffic flows to designated traffic classes provided by this qdisc. -Configuring the hardware based QOS mechanism is outside the scope of -this qdisc. Tools such as -.B lldpad -and -.B ethtool -exist to provide this functionality. Also further qdiscs may be added +.B Channel +mode supports full offload of the mqprio options, the traffic classes, the queue +configurations and QOS attributes to the hardware. Enabled hardware can provide +hardware QOS with the ability to steer traffic flows to designated traffic +classes provided by this qdisc. Hardware based QOS is configured using the +.B shaper +parameter. +.B bw_rlimit +with minimum and maximum bandwidth rates can be used for setting +transmission rates on each traffic class. Also further qdiscs may be added to the classes of MQPRIO to create more complex configurations. .SH ALGORITHM @@ -104,9 +116,35 @@ contiguous range of queues. hw Set to .B 1 -to use hardware QOS defaults. Set to +to support hardware offload. Set to .B 0 -to override hardware defaults with user specified values. +to configure user specified values in software only. 
+ +.TP +mode +Set to +.B channel +for full use of the mqprio options. Use +.B dcb +to offload only TC values and use hardware QOS defaults. Supported with 'hw' +set to 1 only. + +.TP +shaper +Use +.B bw_rlimit +to set bandwidth rate limits for a traffic class. Use +.B dcb +for hardware QOS defaults. Supported with 'hw' set to 1 only. + +.TP +min_rate +Minimum value of bandwidth rate limit for a traffic class. + +.TP +max_rate +Maximum value of bandwidth rate limit for a traffic class. + .SH AUTHORS John Fastabend,
Re: regression: UFO removal breaks kvm live migration
On Fri, Nov 17, 2017 at 9:48 AM, Willem de Bruijn wrote: >>> Okay, I will send a patch to reinstate UFO for this use case (only). There >>> is some related work in tap_handle_frame and packet_direct_xmit to >>> segment directly in the device. I will be traveling the next few days, so >>> it won't be in time for 4.14 (but can go in stable later, of course). >> >> I'm finishing up and running some tests. The majority of the patch is a >> straightforward partial revert of the patchset, so while fairly large for a >> patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all >> thoroughly tested code. Notably absent are the protocol layer and >> hardware support (NETIF_F_UFO) portions. >> >> The only open issue is whether to rely on existing skb_gso_segment >> processing in the transmit path from validate_xmit_skb or to add new >> skb_gso_segment calls directly to tun_get_user, tap_get_user and >> pf_packet. Tun has to loop around four different ways of injecting >> packets into the device. Something like the below snippet. >> >> More conservative is to introduce no completely new code and rely on >> validate_xmit_skb, but that means having to protect the entire stack >> against skbs with SKB_GSO_UDP, so also bringing back some >> checksum and fragment handling snippets in gre_gso_segment, >> __skb_udp_tunnel_segment, act_csum and openvswitch. > > Come to think of it, as this patch does not bring back NETIF_F_UFO > support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be > excluded. > > Then this is probably the simpler and more obviously correct approach. Sent: http://patchwork.ozlabs.org/patch/839168/
[PATCH net] net: accept UFO datagrams from tuntap and packet
From: Willem de Bruijn Tuntap and similar devices can inject GSO packets. Accept type VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively. Processes are expected to use feature negotiation such as TUNSETOFFLOAD to detect supported offload types and refrain from injecting other packets. This process breaks down with live migration: guest kernels do not renegotiate flags, so destination hosts need to expose all features that the source host does. Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677. This patch introduces nearly(*) no new code to simplify verification. It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP insertion and software UFO segmentation. It does not reinstate protocol stack support, hardware offload (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception of VIRTIO_NET_HDR_GSO_UDP packets in tuntap. To support SKB_GSO_UDP reappearing in the stack, also reinstate logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD by squashing in commit 939912216fa8 ("net: skb_needs_check() removes CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1 ("net: avoid skb_warn_bad_offload false positives on UFO"). (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id, ipv6_proxy_select_ident is changed to return a __be32, which is assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted at the end of the enum to minimize code churn. 
Link: http://lkml.kernel.org/r/ Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.") Reported-by: Michal Kubecek Signed-off-by: Willem de Bruijn --- drivers/net/tap.c | 2 +- drivers/net/tun.c | 4 ++ include/linux/netdev_features.h | 4 +- include/linux/netdevice.h | 1 + include/linux/skbuff.h | 2 + include/linux/virtio_net.h | 5 ++- include/net/ipv6.h | 1 + net/core/dev.c | 3 +- net/ipv4/af_inet.c | 12 +- net/ipv4/udp_offload.c | 49 ++-- net/ipv6/output_core.c | 31 +++ net/ipv6/udp_offload.c | 85 +++-- net/openvswitch/datapath.c | 14 +++ net/openvswitch/flow.c | 6 ++- net/sched/act_csum.c| 6 +++ 15 files changed, 211 insertions(+), 14 deletions(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index b13890953ebb..e9489b88407c 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1077,7 +1077,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd, case TUNSETOFFLOAD: /* let the user check for future flags */ if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | - TUN_F_TSO_ECN)) + TUN_F_TSO_ECN | TUN_F_UFO)) return -EINVAL; rtnl_lock(); diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 6bb1e604aadd..a33385d8ac65 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2369,6 +2369,10 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) features |= NETIF_F_TSO6; arg &= ~(TUN_F_TSO4|TUN_F_TSO6); } + + if (arg & TUN_F_UFO) { + arg &= ~TUN_F_UFO; + } } /* This gives the user a way to test for new features in future by diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index dc8b4896b77b..b1b0ca7ccb2b 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -54,8 +54,9 @@ enum { NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */ NETIF_F_GSO_SCTP_BIT, /* ... SCTP fragmentation */ NETIF_F_GSO_ESP_BIT,/* ... ESP with TSO */ + NETIF_F_GSO_UDP_BIT,/* ... 
UFO, deprecated except tuntap */ /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */ - NETIF_F_GSO_ESP_BIT, + NETIF_F_GSO_UDP_BIT, NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */ NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */ @@ -132,6 +133,7 @@ enum { #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM) #define NETIF_F_GSO_SCTP __NETIF_F(GSO_SCTP) #define NETIF_F_GSO_ESP__NETIF_F(GSO_ESP) +#define NETIF_F_GSO_UDP__NETIF_F(GSO_UDP) #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER) #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX) #define NETIF_F_HW_VLAN_STAG_TX__NETIF_F(HW_VLAN_STAG_TX) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 6b274bfe489f..ef789e1d679e 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -4140,6 +4140,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_ty
Re: [LTP] [RFC] [PATCH] netns: Fix race in virtual interface bringup
Alexey, Li, thank you for your suggestions. On Fri, Nov 17, 2017 at 03:08:20PM +0300, Alexey Kodanev wrote: > On 11/17/2017 09:09 AM, Li Wang wrote: > > Hi Dan, > > > > On Fri, Nov 10, 2017 at 4:38 AM, Dan Rue wrote: > >> Symptoms (+ command, error): > >> netns_comm_ip_ipv6_ioctl: > >> + ip netns exec tst_net_ns1 ping6 -q -c2 -I veth1 fd00::2 > >> connect: Cannot assign requested address > >> > >> netns_comm_ip_ipv6_netlink: > >> + ip netns exec tst_net_ns0 ping6 -q -c2 -I veth0 fd00::3 > >> connect: Cannot assign requested address > >> > >> netns_comm_ns_exec_ipv6_ioctl: > >> + ns_exec 6689 net ping6 -q -c2 -I veth0 fd00::3 > >> connect: Cannot assign requested address > >> > >> netns_comm_ns_exec_ipv6_netlin: > >> + ns_exec 6891 net ping6 -q -c2 -I veth0 fd00::3 > >> connect: Cannot assign requested address > >> > >> The error is coming from ping6, which is trying to get an IP address for > >> veth0 (due to -I veth0), but cannot. Waiting for two seconds fixes the > >> test in my testcases. 1 second is not long enough. > >> > >> dmesg shows the following during the test: > >> > >> [Nov 7 15:39] LTP: starting netns_comm_ip_ipv6_ioctl (netns_comm.sh ip > >> ipv6 ioctl) > >> [ +0.302401] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready > >> [ +0.048059] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready > > It's quite strange that veth interface needs 2 seconds to become > operational and it is up in less than 0.3s according to dmesg, but > you said that it's not enough even 1 sec... Are you sure that IPv6 > address not in tentative state and dad process actually disabled? > I'm asking because you don't have it disabled in the script: > https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3 Investigating further, the dmesg output is reporting on the status of the link between veth0 and veth1, not the veth0 interface itself. 
That is, the first dmesg message comes from "ip netns exec tst_net_ns0 ifconfig veth0 up" and the second comes from "ip netns exec tst_net_ns1 ifconfig veth1 up". This explains why we see .3s in dmesg but a 2 second sleep being required. There is not actually anything in dmesg that is helpful here. Regarding dad (duplicate address detection), we have seen similar issues on low power ARM64 boards and IPv4. Anyway, I tried disabling dad on the interface and it did not make a difference. > > >> > >> Signed-off-by: Dan Rue > >> --- > >> > >> We've periodically hit this problem across many arm64 kernels and boards, > >> and > >> it seems to be caused by "ping6" running before the virtual interface is > >> actually ready. "sleep 2" works around the issue and proves that it is a > >> race > >> condition, but I would prefer something faster and deterministic. Please > >> suggest a better implementation. > > Just FYI: > > > > I'm not good at network things, but one method I copied from ltp/numa > > test is to split the '2s' into many smaller pieces of time. > > > > which something like: > > > > --- a/testcases/kernel/containers/netns/netns_helper.sh > > +++ b/testcases/kernel/containers/netns/netns_helper.sh > > @@ -240,6 +240,22 @@ netns_ip_setup() > > tst_brkm TBROK "unable to add device veth1 to the > > separate network namespace" > > } > > > > +wait_for_set_ip() > > +{ > > + local dev=$1 > > + local retries=200 > > + > > + while [ $retries -gt 0 ]; do > > + dmesg -c | grep -q "IPv6: ADDRCONF(NETDEV_CHANGE): > > $dev: link becomes ready" > > > What about "grep -q up /sys/class/net/$dev/operstate && break"? Since dmesg will not help, I explored /sys as proposed. operstate shows "up", and ping6 still fails. carrier shows "1" (up), and ping6 still fails. dormant shows "0" (interface is not dormant), and ping6 still fails. flags shows "0x1003" before and after a 2s sleep (they don't change) So it seems there is nothing in dmesg, or /sys that can help here. 
Dan > > Thanks, > Alexey > > > > + if [ $? -eq 0 ]; then > > + break > > + fi > > + > > + retries=$((retries-1)) > > + tst_sleep 10ms > > + done > > +} > > + > > ## > > # Enables virtual ethernet devices and assigns IP addresses for both > > # of them (IPv4/IPv6 variant is decided by netns_setup() function). > > @@ -285,6 +301,9 @@ netns_set_ip() > > tst_brkm TBROK "enabling veth1 device failed" > > ;; > > esac > > + > > + wait_for_set_ip veth0 > > + wait_for_set_ip veth1 > > } > > > > netns_ns_exec_cleanup() > > > >> Also, is it correct that "ifconfig veth0 up" returns before the interface > >> is > >> actually ready? > >> > >> See also this isolated test script: > >> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3 > >> > >> testcases/kernel/containers/netns/netns_helper.sh | 1 + > >> 1 file changed, 1
Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.
On Sat, 18 Nov 2017 02:13:38 +0530 Nishanth Devarajan wrote:

> diff --git a/tc/tc_util.h b/tc/tc_util.h
> index 583a21a..7b7420a 100644
> --- a/tc/tc_util.h
> +++ b/tc/tc_util.h
> @@ -24,14 +24,14 @@ struct qdisc_util {
>  	struct qdisc_util *next;
>  	const char *id;
>  	int (*parse_qopt)(struct qdisc_util *qu, int argc,
> -			  char **argv, struct nlmsghdr *n);
> +			  char **argv, struct nlmsghdr *n, char *dev);

One more nit... Since parsing queue options should not modify
the device name, that should be const char *.
[PATCH iproute2] tc: cleanup qdisc arg parsing
The qdisc arg parsing has magic limit of 16 for class which is not required by kernel. Also the limit of 16 for device name is really IFNAMSIZ. Signed-off-by: Stephen Hemminger --- tc/tc_qdisc.c | 21 + 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c index fcb75f29128e..1066ae05a4b5 100644 --- a/tc/tc_qdisc.c +++ b/tc/tc_qdisc.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -49,8 +50,7 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) struct tc_sizespec szopts; __u16 *data; } stab = {}; - char d[16] = {}; - char k[16] = {}; + char d[IFNAMSIZ] = {}; struct { struct nlmsghdr n; struct tcmsgt; @@ -89,8 +89,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) return -1; } req.t.tcm_parent = TC_H_CLSACT; - strncpy(k, "clsact", sizeof(k) - 1); - q = get_qdisc_kind(k); + + q = get_qdisc_kind("clsact"); req.t.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0); NEXT_ARG_FWD(); break; @@ -100,8 +100,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) return -1; } req.t.tcm_parent = TC_H_INGRESS; - strncpy(k, "ingress", sizeof(k) - 1); - q = get_qdisc_kind(k); + + q = get_qdisc_kind("ingress"); req.t.tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0); NEXT_ARG_FWD(); break; @@ -124,26 +124,23 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) } else if (matches(*argv, "help") == 0) { usage(); } else { - strncpy(k, *argv, sizeof(k)-1); - - q = get_qdisc_kind(k); + q = get_qdisc_kind(*argv); argc--; argv++; break; } argc--; argv++; } - if (k[0]) - addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1); if (est.ewma_log) addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est)); if (q) { + addattr_l(&req.n, sizeof(req), TCA_KIND, q->id, strlen(q->id) + 1); if (q->parse_qopt) { if (q->parse_qopt(q, argc, argv, &req.n)) return 1; } else if (argc) { - fprintf(stderr, "qdisc '%s' does not support 
option parsing\n", k); + fprintf(stderr, "qdisc '%s' does not support option parsing\n", q->id); return -1; } } else { -- 2.11.0
Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.
On Sat, 18 Nov 2017 02:13:38 +0530 Nishanth Devarajan wrote:

> +	result = strtoul(buf, &endp, 0);
> +
> +	if (*endp || buf == endp) {
> +		fprintf(stderr, "value \"%s\" in file %s is not a number\n",
> +			buf, fname);
> +		goto out;
> +	}
> +
> +	if (result == ULONG_MAX && errno == ERANGE) {
> +		fprintf(stderr, "strtoul %s: %s", fname, strerror(errno));
> +		goto out;
> +	}

Since an unknown speed value is represented as "-1", I think you
need to change this API to take a signed value (i.e. use strtol).
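A minimal sketch of what the signed variant suggested above might look like, using strtol so that "-1" parses as a valid value. The helper name ex_parse_long() is hypothetical and not part of iproute2, and accepting a single trailing newline is an assumption about how the value is read from the file, not something taken from the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a signed number from buf (base
 * auto-detected, as with the strtoul call in the quoted patch).
 * Returns 0 and stores the value in *out on success, -1 on error. */
int ex_parse_long(const char *buf, long *out)
{
	char *endp;
	long result;

	errno = 0;	/* strtol only sets errno on failure */
	result = strtol(buf, &endp, 0);

	/* Assumption: tolerate one trailing newline after the digits. */
	if (endp != buf && *endp == '\n')
		endp++;

	if (buf == endp || *endp != '\0')
		return -1;	/* no digits, or trailing garbage */

	if ((result == LONG_MAX || result == LONG_MIN) && errno == ERANGE)
		return -1;	/* value does not fit in a long */

	*out = result;
	return 0;
}
```

A caller would then treat a parsed value of -1 as "speed unknown" instead of reporting it as a parse failure.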
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 17.11.2017 21:52, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 19:31, Eric W. Biederman wrote: >>> Kirill Tkhai writes: >>> On 15.11.2017 12:51, Kirill Tkhai wrote: > On 15.11.2017 06:19, Eric W. Biederman wrote: >> Kirill Tkhai writes: >> >>> On 14.11.2017 21:39, Cong Wang wrote: On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai wrote: > @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags, > > get_user_ns(user_ns); > > - rv = mutex_lock_killable(&net_mutex); > + rv = down_read_killable(&net_sem); > if (rv < 0) { > net_free(net); > dec_net_namespaces(ucounts); > @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags, > list_add_tail_rcu(&net->list, &net_namespace_list); > rtnl_unlock(); > } > - mutex_unlock(&net_mutex); > + up_read(&net_sem); > if (rv < 0) { > dec_net_namespaces(ucounts); > put_user_ns(user_ns); > @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work) > list_replace_init(&cleanup_list, &net_kill_list); > spin_unlock_irq(&cleanup_list_lock); > > - mutex_lock(&net_mutex); > + down_read(&net_sem); > > /* Don't let anyone else find us. */ > rtnl_lock(); > @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work) > list_for_each_entry_reverse(ops, &pernet_list, list) > ops_free_list(ops, &net_exit_list); > > - mutex_unlock(&net_mutex); > + up_read(&net_sem); After your patch setup_net() could run concurrently with cleanup_net(), given that ops_exit_list() is called on error path of setup_net() too, it means ops->exit() now could run concurrently if it doesn't have its own lock. Not sure if this breaks any existing user. >>> >>> Yes, there will be possible concurrent ops->init() for a net namespace, >>> and ops->exit() for another one. I hadn't found pernet operations, which >>> have a problem with that. If they exist, they are hidden and not clear >>> seen. >>> The pernet operations in general do not touch someone else's memory. 
>>> If suddenly there is one, KASAN should show it after a while. >> >> Certainly the use of hash tables shared between multiple network >> namespaces would count. I don't rembmer how many of these we have but >> there used to be quite a few. > > Could you please provide an example of hash tables, you mean? Ah, I see, it's dccp_hashinfo etc. >> >> JFI, I've checked dccp_hashinfo, and it seems to be safe. >> >>> >>> The big one used to be the route cache. With resizable hash tables >>> things may be getting better in that regard. >> >> I've checked some fib-related things, and wasn't able to find that. >> Excuse me, could you please clarify, if it's an assumption, or >> there is exactly a problem hash table, you know? Could you please >> point it me more exactly, if it's so. > > Two things. > 1) Hash tables are one case I know where we access data from multiple >network namespaces. As such it can not be asserted that is no >possibility for problems. > > 2) The responsible way to handle this is one patch for each set of >methods explaining why those methods are safe to run in parallel. > >That ensures there is opportunity for review and people are going >slowly enough that they actually look at these issues. > > The reason I want to see this broken up is that at 200ish sets of > methods it is too much to review all at once. Ok, it's possible to split the changes in 400 patches, but there is a problem with three-state (no compile, module, built-in) drivers. Git bisect won't work anyway. Please see the description of the problem in cover message "[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore" I sent today. > I completely agree that odds are that this can be made safe and that it > is mostly likely already safe in practically every instance.My guess > would be that if there are problems that need to be addressed they > happen in one or two places and we need to find them. If possible I > don't want to find them after the code has shipped in a stable release. 
Kirill
[PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.
This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.

Adding this functionality requires passing the specified device string to
each class/qdisc, which changes the prototype for a couple of functions:
the .parse_qopt and .parse_copt interfaces. The device string is a required
parameter for tc-qdisc and tc-class, and when not specified, the kernel
returns ENODEV. In this patch, if the user tries to specify a bandwidth
percentage without naming the device, we return an error from userspace.

v2:
* Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c,
  to make it accessible to tc.

v3:
* Modified and moved int parse_percent() from tc/q_netem.c to lib/utils.c
  for use in tc.
* Changed a couple of variable names in int parse_percent_rate().
* Handled showing an error message when the device speed is unknown.
* Updated the man page to warn users that when specifying rates in %, tc
  only uses the current device speed and does not recalculate if it
  changes afterwards.

read_prop() treats a property file that can be opened but not read as
meaning that the property (such as device speed) is unknown.
Signed-off by: Nishanth Devarajan --- include/utils.h | 2 ++ ip/iptuntap.c | 32 --- lib/utils.c | 68 + man/man8/tc.8 | 5 - tc/q_atm.c | 2 +- tc/q_cbq.c | 25 - tc/q_choke.c| 9 ++-- tc/q_clsact.c | 2 +- tc/q_codel.c| 2 +- tc/q_drr.c | 4 ++-- tc/q_dsmark.c | 4 ++-- tc/q_fifo.c | 2 +- tc/q_fq.c | 16 +++--- tc/q_fq_codel.c | 2 +- tc/q_gred.c | 9 ++-- tc/q_hfsc.c | 45 +- tc/q_hhf.c | 2 +- tc/q_htb.c | 18 +++ tc/q_ingress.c | 2 +- tc/q_mqprio.c | 2 +- tc/q_multiq.c | 2 +- tc/q_netem.c| 23 ++- tc/q_pie.c | 2 +- tc/q_prio.c | 2 +- tc/q_qfq.c | 4 ++-- tc/q_red.c | 9 ++-- tc/q_rr.c | 2 +- tc/q_sfb.c | 2 +- tc/q_sfq.c | 2 +- tc/q_tbf.c | 16 +++--- tc/tc.c | 2 +- tc/tc_class.c | 2 +- tc/tc_qdisc.c | 2 +- tc/tc_util.c| 63 tc/tc_util.h| 7 -- 35 files changed, 283 insertions(+), 110 deletions(-) diff --git a/include/utils.h b/include/utils.h index 3d91c50..9377266 100644 --- a/include/utils.h +++ b/include/utils.h @@ -87,6 +87,8 @@ int get_prefix(inet_prefix *dst, char *arg, int family); int mask2bits(__u32 netmask); int get_addr_ila(__u64 *val, const char *arg); +int read_prop(const char *dev, char *prop, long *value); +int parse_percent(double *val, const char *str); int get_hex(char c); int get_integer(int *val, const char *arg, int base); int get_unsigned(unsigned *val, const char *arg, int base); diff --git a/ip/iptuntap.c b/ip/iptuntap.c index b46e452..09f2be2 100644 --- a/ip/iptuntap.c +++ b/ip/iptuntap.c @@ -223,38 +223,6 @@ static int do_del(int argc, char **argv) return tap_del_ioctl(&ifr); } -static int read_prop(char *dev, char *prop, long *value) -{ - char fname[IFNAMSIZ+25], buf[80], *endp; - ssize_t len; - int fd; - long result; - - sprintf(fname, "/sys/class/net/%s/%s", dev, prop); - fd = open(fname, O_RDONLY); - if (fd < 0) { - if (strcmp(prop, "tun_flags")) - fprintf(stderr, "open %s: %s\n", fname, - strerror(errno)); - return -1; - } - len = read(fd, buf, sizeof(buf)-1); - close(fd); - if (len < 0) { - fprintf(stderr, "read %s: %s", fname, strerror(errno)); - 
return -1; - } - - buf[len] = 0; - result = strtol(buf, &endp, 0); - if (*endp != '\n') { - fprintf(stderr, "Failed to parse %s\n", fname); - return -1; - } - *value = result; - return 0; -} - static void print_flags(long flags) { if (flags & IFF_TUN) diff --git a/lib/utils.c b/lib/utils.c index 4f2fa28..9d5ba2a 100644 --- a/lib/utils.c +++ b/lib/utils.c @@ -39,6 +39,74 @@ int resolve_hosts; int timestamp_short; +int read_prop(const char *dev, char *prop, long *value) +{ + char fname[128], buf[80], *endp, *nl; + FILE *fp; + long result; + int ret; + + ret = snprintf(fname, sizeof(fname), "/sys/class/net/%s/%s", + dev, prop); + + if (ret <= 0 || ret >= sizeof(fname)) { + fprintf(stderr, "could not build pathname for property\n"); + return -1; + } + + fp = fopen(fname, "r"); + if (fp == NULL) { + fprintf(stderr, "fopen %s: %s\n", fname, strerror(errno)); + return -1; + } + + if (!fgets(buf, size
Re: [pull request][net V2 0/5] Mellanox, mlx5 fixes 2017-11-08
On Sat, Nov 11, 2017 at 2:42 AM, David Miller wrote:
> From: Saeed Mahameed
> Date: Fri, 10 Nov 2017 15:50:15 +0900
>
>> The following series includes some fixes for mlx5 core and ethernet
>> driver.
>>
>> Sorry for the late submission but as you can see i have some very
>> critical fixes below that i would like them merged into this RC.
>>
>> Please pull and let me know if there is any problem.
>
> Pulled.
>
>> For -stable:
>> ('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13
>> ('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13
>> ('net/mlx5: Cancel health poll before sending panic teardown command')
>> kernels >= 4.13
>
> That FPGA change doesn't appear in this pull request.
>

Sorry about that, I had to drop it as you see in "V1->V2" log, but forgot
to remove it from the -stable list.
[PATCH] usbnet: ipheth: fix potential null pointer dereference in ipheth_carrier_set
_dev_ is being dereferenced before it is null checked, hence there is a
potential null pointer dereference.

Fix this by moving the pointer dereference after _dev_ has been null
checked.

Addresses-Coverity-ID: 1462020
Fixes: bb1b40c7cb86 ("usbnet: ipheth: prevent TX queue timeouts when device not ready")
Signed-off-by: Gustavo A. R. Silva
---
 drivers/net/usb/ipheth.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
index ca71f6c..7275761 100644
--- a/drivers/net/usb/ipheth.c
+++ b/drivers/net/usb/ipheth.c
@@ -291,12 +291,15 @@ static void ipheth_sndbulk_callback(struct urb *urb)

 static int ipheth_carrier_set(struct ipheth_device *dev)
 {
-	struct usb_device *udev = dev->udev;
+	struct usb_device *udev;
 	int retval;
+
 	if (!dev)
 		return 0;
 	if (!dev->confirmed_pairing)
 		return 0;
+
+	udev = dev->udev;
 	retval = usb_control_msg(udev,
 			usb_rcvctrlpipe(udev, IPHETH_CTRL_ENDP),
 			IPHETH_CMD_CARRIER_CHECK, /* request */
--
2.7.4
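The ordering problem this patch addresses can be shown in isolation. What follows is a minimal userspace sketch, not the driver code: struct fake_dev, struct fake_udev and carrier_check() are hypothetical stand-ins for the ipheth types, and the only point is that the pointer is loaded after the NULL check rather than in the initializer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the real ipheth structures. */
struct fake_udev {
	int id;
};

struct fake_dev {
	struct fake_udev *udev;
	int confirmed_pairing;
};

/* Returns the id of the underlying device, or 0 when nothing can be queried. */
int carrier_check(struct fake_dev *dev)
{
	struct fake_udev *udev;	/* declared here, but not yet loaded from dev */

	if (!dev)
		return 0;
	if (!dev->confirmed_pairing)
		return 0;

	udev = dev->udev;	/* safe: dev is known to be non-NULL at this point */
	return udev->id;
}
```

With this ordering, carrier_check(NULL) returns 0. With the initializer form (udev = dev->udev in the declaration) the same call would read through a NULL pointer before the check runs.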
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 17.11.2017 21:54, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 19:29, Eric W. Biederman wrote: >>> Kirill Tkhai writes: >>> On 15.11.2017 09:25, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> Curently mutex is used to protect pernet operations list. It makes >> cleanup_net() to execute ->exit methods of the same operations set, >> which was used on the time of ->init, even after net namespace is >> unlinked from net_namespace_list. >> >> But the problem is it's need to synchronize_rcu() after net is removed >> from net_namespace_list(): >> >> Destroy net_ns: >> cleanup_net() >> mutex_lock(&net_mutex) >> list_del_rcu(&net->list) >> synchronize_rcu() <--- Sleep there >> for ages >> list_for_each_entry_reverse(ops, &pernet_list, list) >> ops_exit_list(ops, &net_exit_list) >> list_for_each_entry_reverse(ops, &pernet_list, list) >> ops_free_list(ops, &net_exit_list) >> mutex_unlock(&net_mutex) >> >> This primitive is not fast, especially on the systems with many >> processors >> and/or when preemptible RCU is enabled in config. So, all the time, while >> cleanup_net() is waiting for RCU grace period, creation of new net >> namespaces >> is not possible, the tasks, who makes it, are sleeping on the same mutex: >> >> Create net_ns: >> copy_net_ns() >> mutex_lock_killable(&net_mutex)<--- Sleep there >> for ages >> >> The solution is to convert net_mutex to the rw_semaphore. Then, >> pernet_operations::init/::exit methods, modifying the net-related data, >> will require down_read() locking only, while down_write() will be used >> for changing pernet_list. >> >> This gives signify performance increase, like you may see below. 
There >> is measured sequential net namespace creation in a cycle, in single >> thread, without other tasks (single user mode): >> >> 1)int main(int argc, char *argv[]) >> { >> unsigned nr; >> if (argc < 2) { >> fprintf(stderr, "Provide nr iterations arg\n"); >> return 1; >> } >> nr = atoi(argv[1]); >> while (nr-- > 0) { >> if (unshare(CLONE_NEWNET)) { >> perror("Can't unshare"); >> return 1; >> } >> } >> return 0; >> } >> >> Origin, 10 unshare(): >> 0.03user 23.14system 1:39.85elapsed 23%CPU >> >> Patched, 10 unshare(): >> 0.03user 67.49system 1:08.34elapsed 98%CPU >> >> 2)for i in {1..1}; do unshare -n bash -c exit; done >> >> Origin: >> real 1m24,190s >> user 0m6,225s >> sys 0m15,132s >> >> Patched: >> real 0m18,235s (4.6 times faster) >> user 0m4,544s >> sys 0m13,796s >> >> This patch requires commit 76f8507f7a64 "locking/rwsem: Add >> down_read_killable()" >> from Linus tree (not in net-next yet). > > Using a rwsem to protect the list of operations makes sense. > > That should allow removing the sing > > I am not wild about taking a the rwsem down_write in > rtnl_link_unregister, and net_ns_barrier. I think that works but it > goes from being a mild hack to being a pretty bad hack and something > else that can kill the parallelism you are seeking it add. > > There are about 204 instances of struct pernet_operations. That is a > lot of code to have carefully audited to ensure it can in parallel all > at once. The existence of the exit_batch method, net_ns_barrier, > for_each_net and taking of net_mutex in rtnl_link_unregister all testify > to the fact that there are data structures accessed by multiple network > namespaces. > > My preference would be to: > > - Add the net_sem in addition to net_mutex with down_write only held in > register and unregister, and maybe net_ns_barrier and > rtnl_link_unregister. > > - Factor out struct pernet_ops out of struct pernet_operations. With > struct pernet_ops not having the exit_batch method. 
With pernet_ops > being embedded an anonymous member of the old struct pernet_operations. > > - Add [un]register_pernet_{sys,dev} functions that take a struct > pernet_ops, that don't take net_mutex. Have them order the > pernet_list as: > > pernet_sys > pernet_subsys > pernet_device > pernet_dev > > With the chunk in the middle taking the net_mutex. I think this approach will work. Thanks for the suggestion. Some more thoughts to the plan below. The only difficult thing there will be to choose the right order to move ops from pernet_subsys to per
Re: [PATCH v10 5/8] ARM: dts: sunxi: Restore EMAC changes (boards)
Hey,

Sorry for bringing this up again. Isn't there a:
	ethernet0 = &emac;
for some boards missing?

Best,
Philipp

(Sorry for sending this to some persons more than once! My Thunderbird
sent mails in html and didn't reach the mailing lists. I hope it works
now :) )

On 31.10.2017 09:19, Corentin Labbe wrote:
The original dwmac-sun8i DT bindings have some issue on how to handle
integrated PHY and was reverted in last RC of 4.13. But now we have a
solution so we need to get back that was reverted.

This patch restore all boards DT about dwmac-sun8i
This reverts partially
commit fe45174b72ae ("arm: dts: sunxi: Revert EMAC changes")

Signed-off-by: Corentin Labbe
Acked-by: Florian Fainelli
---
 arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts |  9 +
 arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts   | 19 +++
 arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts     | 19 +++
 arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts         |  7 +++
 arch/arm/boot/dts/sun8i-h3-orangepi-2.dts         |  8
 arch/arm/boot/dts/sun8i-h3-orangepi-one.dts       |  8
 arch/arm/boot/dts/sun8i-h3-orangepi-pc-plus.dts   |  5 +
 arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts        |  8
 arch/arm/boot/dts/sun8i-h3-orangepi-plus.dts      | 22 ++
 arch/arm/boot/dts/sun8i-h3-orangepi-plus2e.dts    | 16
 10 files changed, 121 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
index b1502df7b509..6713d0f2b3f4 100644
--- a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
+++ b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
@@ -56,6 +56,8 @@

 	aliases {
 		serial0 = &uart0;
+		/* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */
+		ethernet0 = &emac;
 		ethernet1 = &xr819;
 	};

@@ -102,6 +104,13 @@
 	status = "okay";
 };

+&emac {
+	phy-handle = <&int_mii_phy>;
+	phy-mode = "mii";
+	allwinner,leds-active-low;
+	status = "okay";
+};
+
 &mmc0 {
 	pinctrl-names = "default";
 	pinctrl-0 = <&mmc0_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
index e1dba9ffa94b..f2292deaa590 100644
--- a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
+++ b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
@@ -52,6 +52,7 @@
 	compatible = "sinovoip,bpi-m2-plus", "allwinner,sun8i-h3";

 	aliases {
+		ethernet0 = &emac;
 		serial0 = &uart0;
 		serial1 = &uart1;
 	};
@@ -111,6 +112,24 @@
 	status = "okay";
 };

+&emac {
+	pinctrl-names = "default";
+	pinctrl-0 = <&emac_rgmii_pins>;
+	phy-supply = <&reg_gmac_3v3>;
+	phy-handle = <&ext_rgmii_phy>;
+	phy-mode = "rgmii";
+
+	allwinner,leds-active-low;
+	status = "okay";
+};
+
+&external_mdio {
+	ext_rgmii_phy: ethernet-phy@1 {
+		compatible = "ethernet-phy-ieee802.3-c22";
+		reg = <0>;
+	};
+};
+
 &ir {
 	pinctrl-names = "default";
 	pinctrl-0 = <&ir_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
index 73766d38ee6c..cfb96da3cfef 100644
--- a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
+++ b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
@@ -66,6 +66,25 @@
 	status = "okay";
 };

+&emac {
+	pinctrl-names = "default";
+	pinctrl-0 = <&emac_rgmii_pins>;
+	phy-supply = <&reg_gmac_3v3>;
+	phy-handle = <&ext_rgmii_phy>;
+	phy-mode = "rgmii";
+
+	allwinner,leds-active-low;
+
+	status = "okay";
+};
+
+&external_mdio {
+	ext_rgmii_phy: ethernet-phy@1 {
+		compatible = "ethernet-phy-ieee802.3-c22";
+		reg = <7>;
+	};
+};
+
 &ir {
 	pinctrl-names = "default";
 	pinctrl-0 = <&ir_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
index 8d2cc6e9a03f..78f6c24952dd 100644
--- a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
+++ b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
@@ -46,3 +46,10 @@
 	model = "FriendlyARM NanoPi NEO";
 	compatible = "friendlyarm,nanopi-neo", "allwinner,sun8i-h3";
 };
+
+&emac {
+	phy-handle = <&int_mii_phy>;
+	phy-mode = "mii";
+	allwinner,leds-active-low;
+	status = "okay";
+};
diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
index
1bf51802f5aa..b20be95b49d5 100644 --- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts @@ -54,6 +54,7 @@ aliases { serial0 = &uart0; /* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */ + ethernet0 = &emac; ethernet1 = &rtl81
Re: [patch net-next RFC v2 00/11] Add support for resource abstraction
On 11/14/17 9:18 AM, Jiri Pirko wrote:
> From: Jiri Pirko
>
> Arkadi says:
>
> Many of the ASIC's internal resources are limited and are shared between
> several hardware procedures. For example, unified hash-based memory can
> be used for many lookup purposes, like FDB and LPM. In many cases the user
> can provide a partitioning scheme for such a resource in order to perform
> fine tuning for his application. In many cases after setting the
> partitioning of the resource driver reload is needed. This patchset add
> support for hot reset of the driver.
>
> Such an abstraction can be coupled with devlink's dpipe interface, which
> models the ASIC's pipeline as an graph of match/action tables. By modeling
> the hardware resource object, and by coupling it to several dpipe tables,
> further visibility can be achieved in order to debug ASIC-wide issues.
>
> The proposed interface will provide the user the ability to understand the
> limitations of the hardware, and receive notification regarding its occupancy.
> Furthermore, monitoring the resource occupancy can be done in real-time and
> can be useful in many cases.
>
> Userspace part prototype can be found at https://github.com/arkadis/iproute2/
> at resource_dev branch.
>

now that my firmware problem is fixed, I installed a build with this
patch set. Trying to run devlink to split a port hangs:

$ devlink port split swp1 count 4
[ 615.373359] INFO: task devlink:804 blocked for more than 120 seconds.
[ 615.379934] Tainted: GW 4.14.0+ #38
[ 615.385238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 615.393111] devlink D0 804771 0x0080
[ 615.393115] Call Trace:
[ 615.393126] __schedule+0x1de/0x690
[ 615.393130] schedule+0x36/0x80
[ 615.393139] schedule_preempt_disabled+0xe/0x10
[ 615.393146] __mutex_lock.isra.4+0x211/0x530
[ 615.393152] __mutex_lock_slowpath+0x13/0x20
[ 615.393155] ? __mutex_lock_slowpath+0x13/0x20
[ 615.393158] mutex_lock+0x2f/0x40
[ 615.393164] devlink_port_unregister+0x29/0x60 [devlink]
[ 615.393169] mlxsw_core_port_fini+0x25/0x50 [mlxsw_core]
[ 615.393179] mlxsw_sp_port_remove+0xf0/0x100 [mlxsw_spectrum]
[ 615.393186] mlxsw_sp_port_split+0xdc/0x260 [mlxsw_spectrum]
[ 615.393193] ? _cond_resched+0x19/0x30
[ 615.393200] mlxsw_devlink_port_split+0x36/0x50 [mlxsw_core]
[ 615.393206] devlink_nl_cmd_port_split_doit+0x42/0x50 [devlink]
[ 615.393212] genl_family_rcv_msg+0x1c9/0x390
[ 615.393217] genl_rcv_msg+0x4c/0xa0
[ 615.393220] ? _cond_resched+0x19/0x30
[ 615.393228] ? genl_family_rcv_msg+0x390/0x390
[ 615.393232] netlink_rcv_skb+0xec/0x120
[ 615.393235] genl_rcv+0x28/0x40
[ 615.393239] netlink_unicast+0x170/0x230
[ 615.393244] netlink_sendmsg+0x28e/0x370
[ 615.393251] SYSC_sendto+0x10e/0x1b0
[ 615.393258] ? __audit_syscall_entry+0xc1/0x110
[ 615.393261] ? syscall_trace_enter+0x1c6/0x2d0
[ 615.393264] ? __do_page_fault+0x231/0x4b0
[ 615.393268] SyS_sendto+0xe/0x10
[ 615.393272] do_syscall_64+0x60/0x1f0
[ 615.393277] entry_SYSCALL64_slow_path+0x25/0x25
[ 615.393280] RIP: 0033:0x7f4ef43c16f3
[ 615.393284] RSP: 002b:7fffb907fbc8 EFLAGS: 0246 ORIG_RAX: 002c
[ 615.393287] RAX: ffda RBX: 013660e0 RCX: 7f4ef43c16f3
[ 615.393290] RDX: 0040 RSI: 01366110 RDI: 0003
[ 615.393291] RBP: R08: 7f4ef4686d80 R09: 000c
[ 615.393292] R10: R11: 0246 R12:
[ 615.393296] R13: 0004 R14: R15:
Re: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
> I really need to monitor the DSA discussion to better contribute to its
> success.
> I just found out the DSA API set_addr was removed last month due to not
> everybody is using it. It cited the Marvell switch was the only switch using
> that
> API and found a new way to program the MAC address. But looking at that
> driver I found it simply uses a randomized MAC address.
>
> For big switch with many ports where the main function is forwarding that MAC
> address may not matter. For small switch with 2 ports it acts more like an
> Ethernet
> controller where the switch is mainly used for daisy chaining in a ring
> network the MAC
> address can be used in feature like source address filtering.

Hi Tristram

The MAC address set by set_addr was only used for pause frames. Nothing
else. So a random address is fine. The switch itself should not be sending
any other frames.

Andrew
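For readers unfamiliar with what a "random MAC address" involves, here is a minimal userspace sketch. It is not taken from any DSA driver: random_mac() and MAC_LEN are hypothetical names, and rand() stands in for a proper random source. The two bit operations on the first octet are the same ones the kernel helper eth_random_addr() applies, so that the result is a unicast, locally administered address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAC_LEN 6	/* octets in an Ethernet address */

/*
 * Fill addr with a random unicast, locally administered MAC address.
 * Userspace illustration only; rand() is not a good random source.
 */
void random_mac(uint8_t addr[MAC_LEN])
{
	for (int i = 0; i < MAC_LEN; i++)
		addr[i] = (uint8_t)(rand() & 0xff);

	addr[0] &= 0xfe;	/* clear the multicast (group) bit */
	addr[0] |= 0x02;	/* set the locally administered bit */
}
```

An address built this way cannot collide with a vendor-assigned (globally administered) address, which is sufficient when the address only appears as the source of pause frames.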
RE: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
> On Thu, Nov 16, 2017 at 06:41:24PM -0800, tristram...@microchip.com
> wrote:
> > From: Tristram Ha
> >
> > This series of patches is to modify the original KSZ9477 DSA driver so
> > that other KSZ switch drivers can be added and use the common code.
>
> Hi Tristram
>
> http://vger.kernel.org/~davem/net-next.html
>
> It is better to send an RFC patchset while netdev is closed and not
> send it to David. He will shout at you otherwise.

Noted.

I really need to monitor the DSA discussion to better contribute to its
success.
I just found out the DSA API set_addr was removed last month due to not
everybody is using it. It cited the Marvell switch was the only switch
using that API and found a new way to program the MAC address. But looking
at that driver I found it simply uses a randomized MAC address.

For big switch with many ports where the main function is forwarding that
MAC address may not matter. For small switch with 2 ports it acts more like
an Ethernet controller where the switch is mainly used for daisy chaining
in a ring network the MAC address can be used in feature like source
address filtering.
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
Kirill Tkhai writes: > On 15.11.2017 19:29, Eric W. Biederman wrote: >> Kirill Tkhai writes: >> >>> On 15.11.2017 09:25, Eric W. Biederman wrote: Kirill Tkhai writes: > Curently mutex is used to protect pernet operations list. It makes > cleanup_net() to execute ->exit methods of the same operations set, > which was used on the time of ->init, even after net namespace is > unlinked from net_namespace_list. > > But the problem is it's need to synchronize_rcu() after net is removed > from net_namespace_list(): > > Destroy net_ns: > cleanup_net() > mutex_lock(&net_mutex) > list_del_rcu(&net->list) > synchronize_rcu() <--- Sleep there for > ages > list_for_each_entry_reverse(ops, &pernet_list, list) > ops_exit_list(ops, &net_exit_list) > list_for_each_entry_reverse(ops, &pernet_list, list) > ops_free_list(ops, &net_exit_list) > mutex_unlock(&net_mutex) > > This primitive is not fast, especially on the systems with many processors > and/or when preemptible RCU is enabled in config. So, all the time, while > cleanup_net() is waiting for RCU grace period, creation of new net > namespaces > is not possible, the tasks, who makes it, are sleeping on the same mutex: > > Create net_ns: > copy_net_ns() > mutex_lock_killable(&net_mutex)<--- Sleep there for > ages > > The solution is to convert net_mutex to the rw_semaphore. Then, > pernet_operations::init/::exit methods, modifying the net-related data, > will require down_read() locking only, while down_write() will be used > for changing pernet_list. > > This gives signify performance increase, like you may see below. 
There > is measured sequential net namespace creation in a cycle, in single > thread, without other tasks (single user mode): > > 1)int main(int argc, char *argv[]) > { > unsigned nr; > if (argc < 2) { > fprintf(stderr, "Provide nr iterations arg\n"); > return 1; > } > nr = atoi(argv[1]); > while (nr-- > 0) { > if (unshare(CLONE_NEWNET)) { > perror("Can't unshare"); > return 1; > } > } > return 0; > } > > Origin, 10 unshare(): > 0.03user 23.14system 1:39.85elapsed 23%CPU > > Patched, 10 unshare(): > 0.03user 67.49system 1:08.34elapsed 98%CPU > > 2)for i in {1..1}; do unshare -n bash -c exit; done > > Origin: > real 1m24,190s > user 0m6,225s > sys 0m15,132s > > Patched: > real 0m18,235s (4.6 times faster) > user 0m4,544s > sys 0m13,796s > > This patch requires commit 76f8507f7a64 "locking/rwsem: Add > down_read_killable()" > from Linus tree (not in net-next yet). Using a rwsem to protect the list of operations makes sense. That should allow removing the sing I am not wild about taking a the rwsem down_write in rtnl_link_unregister, and net_ns_barrier. I think that works but it goes from being a mild hack to being a pretty bad hack and something else that can kill the parallelism you are seeking it add. There are about 204 instances of struct pernet_operations. That is a lot of code to have carefully audited to ensure it can in parallel all at once. The existence of the exit_batch method, net_ns_barrier, for_each_net and taking of net_mutex in rtnl_link_unregister all testify to the fact that there are data structures accessed by multiple network namespaces. My preference would be to: - Add the net_sem in addition to net_mutex with down_write only held in register and unregister, and maybe net_ns_barrier and rtnl_link_unregister. - Factor out struct pernet_ops out of struct pernet_operations. With struct pernet_ops not having the exit_batch method. With pernet_ops being embedded an anonymous member of the old struct pernet_operations. 
- Add [un]register_pernet_{sys,dev} functions that take a struct pernet_ops, that don't take net_mutex. Have them order the pernet_list as: pernet_sys pernet_subsys pernet_device pernet_dev With the chunk in the middle taking the net_mutex. >>> >>> I think this approach will work. Thanks for the suggestion. Some more >>> thoughts to the plan below. >>> >>> The only difficult thing there will be to choose the right order >>> to move ops from pernet_subsys to pernet_sys and from pernet_device >>> to pernet_dev one by one. >>> >>> This is rather easy in case of tristate drivers, as modules may be loaded >>> at any time, and the only important o
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
Kirill Tkhai writes: > On 15.11.2017 19:31, Eric W. Biederman wrote: >> Kirill Tkhai writes: >> >>> On 15.11.2017 12:51, Kirill Tkhai wrote: On 15.11.2017 06:19, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 14.11.2017 21:39, Cong Wang wrote: >>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai >>> wrote: @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags, get_user_ns(user_ns); - rv = mutex_lock_killable(&net_mutex); + rv = down_read_killable(&net_sem); if (rv < 0) { net_free(net); dec_net_namespaces(ucounts); @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags, list_add_tail_rcu(&net->list, &net_namespace_list); rtnl_unlock(); } - mutex_unlock(&net_mutex); + up_read(&net_sem); if (rv < 0) { dec_net_namespaces(ucounts); put_user_ns(user_ns); @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work) list_replace_init(&cleanup_list, &net_kill_list); spin_unlock_irq(&cleanup_list_lock); - mutex_lock(&net_mutex); + down_read(&net_sem); /* Don't let anyone else find us. */ rtnl_lock(); @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work) list_for_each_entry_reverse(ops, &pernet_list, list) ops_free_list(ops, &net_exit_list); - mutex_unlock(&net_mutex); + up_read(&net_sem); >>> >>> After your patch setup_net() could run concurrently with cleanup_net(), >>> given that ops_exit_list() is called on error path of setup_net() too, >>> it means ops->exit() now could run concurrently if it doesn't have its >>> own lock. Not sure if this breaks any existing user. >> >> Yes, there will be possible concurrent ops->init() for a net namespace, >> and ops->exit() for another one. I hadn't found pernet operations, which >> have a problem with that. If they exist, they are hidden and not clear >> seen. >> The pernet operations in general do not touch someone else's memory. >> If suddenly there is one, KASAN should show it after a while. 
> > Certainly the use of hash tables shared between multiple network > namespaces would count. I don't rembmer how many of these we have but > there used to be quite a few. Could you please provide an example of hash tables, you mean? >>> >>> Ah, I see, it's dccp_hashinfo etc. > > JFI, I've checked dccp_hashinfo, and it seems to be safe. > >> >> The big one used to be the route cache. With resizable hash tables >> things may be getting better in that regard. > > I've checked some fib-related things, and wasn't able to find that. > Excuse me, could you please clarify, if it's an assumption, or > there is exactly a problem hash table, you know? Could you please > point it me more exactly, if it's so. Two things. 1) Hash tables are one case I know where we access data from multiple network namespaces. As such it can not be asserted that is no possibility for problems. 2) The responsible way to handle this is one patch for each set of methods explaining why those methods are safe to run in parallel. That ensures there is opportunity for review and people are going slowly enough that they actually look at these issues. The reason I want to see this broken up is that at 200ish sets of methods it is too much to review all at once. I completely agree that odds are that this can be made safe and that it is mostly likely already safe in practically every instance.My guess would be that if there are problems that need to be addressed they happen in one or two places and we need to find them. If possible I don't want to find them after the code has shipped in a stable release. Eric
Re: [PATCH] net: bridge: add max_fdb_count
Hi Andrew,

On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote:
> > Usually it's better to apply LRU or random here in my opinion, as the
> > new entry is much more likely to be needed than older ones by definition.
>
> Hi Willy
>
> I think this depends on why you need to discard. If it is normal
> operation and the limits are simply too low, i would agree.
>
> If however it is a DoS, throwing away the new entries makes sense,
> leaving the old ones which are more likely to be useful.
>
> Most of the talk in this thread has been about limits for DoS
> prevention...

Sure but my point is that it can kick in on regular traffic and in this
case it can be catastrophic. That's only what bothers me. If we have an
unlimited default value with this algorithm I'm fine because nobody will
get caught by accident with a bridge suddenly replicating high traffic on
all ports because an unknown limit was reached. That's the principle of
least surprise.

I know that when fighting DoSes there's never any universally good
solutions and one has to make tradeoffs. I'm perfectly fine with this.

Cheers,
Willy
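The two behaviours being weighed in this exchange can be compared with a small model. This is a sketch under stated assumptions, not bridge code: struct fdb, fdb_learn(), fdb_has() and FDB_MAX are hypothetical names, FDB_MAX stands in for the proposed max_fdb_count, and "evict oldest" replaces the entry learned longest ago, which is a simplification of LRU that ignores lookups.

```c
#include <assert.h>

#define FDB_MAX 4	/* hypothetical limit, standing in for max_fdb_count */

enum full_policy { DROP_NEW, EVICT_OLDEST };

struct fdb {
	unsigned long mac[FDB_MAX];	/* learned addresses */
	unsigned long stamp[FDB_MAX];	/* time each entry was learned */
	int count;
};

/* Learn mac at time now. Returns 1 if stored, 0 if the entry was refused. */
int fdb_learn(struct fdb *t, unsigned long mac, unsigned long now,
	      enum full_policy policy)
{
	int i, victim = 0;

	if (t->count < FDB_MAX) {
		i = t->count++;
	} else if (policy == DROP_NEW) {
		return 0;	/* table full: keep the old entries */
	} else {
		/* table full: replace the entry learned longest ago */
		for (i = 1; i < FDB_MAX; i++)
			if (t->stamp[i] < t->stamp[victim])
				victim = i;
		i = victim;
	}
	t->mac[i] = mac;
	t->stamp[i] = now;
	return 1;
}

/* Returns 1 if mac is currently in the table. */
int fdb_has(const struct fdb *t, unsigned long mac)
{
	for (int i = 0; i < t->count; i++)
		if (t->mac[i] == mac)
			return 1;
	return 0;
}
```

With DROP_NEW, a fifth address is never learned once the table is full, so in a real bridge its traffic would keep being flooded; that is the surprise Willy describes on regular traffic. With EVICT_OLDEST, a flood of bogus addresses can push out legitimate entries, which is the DoS concern Andrew raises.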
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 15.11.2017 19:31, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 12:51, Kirill Tkhai wrote: >>> On 15.11.2017 06:19, Eric W. Biederman wrote: Kirill Tkhai writes: > On 14.11.2017 21:39, Cong Wang wrote: >> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai >> wrote: >>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags, >>> >>> get_user_ns(user_ns); >>> >>> - rv = mutex_lock_killable(&net_mutex); >>> + rv = down_read_killable(&net_sem); >>> if (rv < 0) { >>> net_free(net); >>> dec_net_namespaces(ucounts); >>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags, >>> list_add_tail_rcu(&net->list, &net_namespace_list); >>> rtnl_unlock(); >>> } >>> - mutex_unlock(&net_mutex); >>> + up_read(&net_sem); >>> if (rv < 0) { >>> dec_net_namespaces(ucounts); >>> put_user_ns(user_ns); >>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work) >>> list_replace_init(&cleanup_list, &net_kill_list); >>> spin_unlock_irq(&cleanup_list_lock); >>> >>> - mutex_lock(&net_mutex); >>> + down_read(&net_sem); >>> >>> /* Don't let anyone else find us. */ >>> rtnl_lock(); >>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work) >>> list_for_each_entry_reverse(ops, &pernet_list, list) >>> ops_free_list(ops, &net_exit_list); >>> >>> - mutex_unlock(&net_mutex); >>> + up_read(&net_sem); >> >> After your patch setup_net() could run concurrently with cleanup_net(), >> given that ops_exit_list() is called on error path of setup_net() too, >> it means ops->exit() now could run concurrently if it doesn't have its >> own lock. Not sure if this breaks any existing user. > > Yes, there will be possible concurrent ops->init() for a net namespace, > and ops->exit() for another one. I hadn't found pernet operations, which > have a problem with that. If they exist, they are hidden and not clear > seen. > The pernet operations in general do not touch someone else's memory. > If suddenly there is one, KASAN should show it after a while. 
>>>> Certainly the use of hash tables shared between multiple network
>>>> namespaces would count. I don't remember how many of these we have
>>>> but there used to be quite a few.
>>>
>>> Could you please provide an example of the hash tables you mean?
>>
>> Ah, I see, it's dccp_hashinfo etc.

JFI, I've checked dccp_hashinfo, and it seems to be safe.

> The big one used to be the route cache. With resizable hash tables
> things may be getting better in that regard.

I've checked some fib-related things and wasn't able to find that.
Excuse me, could you please clarify whether this is an assumption, or
whether there is a specific problematic hash table that you know of?
If there is one, could you please point me to it more exactly?
[PATCH RFC 05/25] net: Add primitives to update heads of pernet_list sublists
Currently we have the first_device pointer, and the device and subsys
sublists. The next patches introduce one more sublist. So, move the
functionality that would otherwise be repeated into primitives.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a8ea580885d9..1d9712973695 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -939,6 +939,18 @@ static void __unregister_pernet_operations(struct pernet_operations *ops)

 static DEFINE_IDA(net_generic_ids);

+#define update_first_on_add(first, delim, added)	\
+	do {						\
+		if (first == delim)			\
+			first = added;			\
+	} while (0)
+
+#define update_first_on_del(first, to_delete)		\
+	do {						\
+		if (first == to_delete)			\
+			first = (to_delete)->next;	\
+	} while (0)
+
 static int register_pernet_operations(struct list_head *list,
 				      struct pernet_operations *ops)
 {
@@ -1045,8 +1057,8 @@ int register_pernet_device(struct pernet_operations *ops)
 	int error;
 	down_write(&net_sem);
 	error = register_pernet_operations(&pernet_list, ops);
-	if (!error && (first_device == &pernet_list))
-		first_device = &ops->list;
+	if (!error)
+		update_first_on_add(first_device, &pernet_list, &ops->list);
 	up_write(&net_sem);
 	return error;
 }
@@ -1064,8 +1076,7 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
 void unregister_pernet_device(struct pernet_operations *ops)
 {
 	down_write(&net_sem);
-	if (&ops->list == first_device)
-		first_device = first_device->next;
+	update_first_on_del(first_device, &ops->list);
 	unregister_pernet_operations(ops);
 	up_write(&net_sem);
 }
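For readers following along, here is a small userspace sketch (not kernel code) of what the two primitives do to a sublist head. The list helpers list_init(), list_add_before() and list_del_entry(), as well as demo_sublist_head(), are hypothetical stand-ins written for this illustration; only the two macros follow the patch, with extra parentheses added around the arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'entry' immediately before 'pos' (the effect of list_add_tail()). */
static void list_add_before(struct list_head *entry, struct list_head *pos)
{
	entry->prev = pos->prev;
	entry->next = pos;
	pos->prev->next = entry;
	pos->prev = entry;
}

static void list_del_entry(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* If the sublist was empty (head == delimiter), the new entry becomes its head. */
#define update_first_on_add(first, delim, added)	\
	do {						\
		if ((first) == (delim))			\
			(first) = (added);		\
	} while (0)

/* If the head itself is going away, advance the head to its successor. */
#define update_first_on_del(first, to_delete)		\
	do {						\
		if ((first) == (to_delete))		\
			(first) = (to_delete)->next;	\
	} while (0)

/* Returns 1 if the sublist head moves as expected, 0 otherwise. */
static int demo_sublist_head(void)
{
	struct list_head pernet_list, a, b;
	struct list_head *first_device = &pernet_list;

	list_init(&pernet_list);

	/* First device entry: the sublist was empty, so the head moves to it. */
	list_add_before(&a, &pernet_list);
	update_first_on_add(first_device, &pernet_list, &a);
	if (first_device != &a)
		return 0;

	/* Second device entry is appended; the head stays on 'a'. */
	list_add_before(&b, &pernet_list);
	update_first_on_add(first_device, &pernet_list, &b);
	if (first_device != &a)
		return 0;

	/* Removing the head advances it to the next entry. */
	update_first_on_del(first_device, &a);
	list_del_entry(&a);
	if (first_device != &b)
		return 0;

	/* Removing the last entry returns the head to the list sentinel. */
	update_first_on_del(first_device, &b);
	list_del_entry(&b);
	return first_device == &pernet_list;
}
```

Note that update_first_on_del() has to run before the entry is unlinked, as in unregister_pernet_device() above, because it reads the entry's next pointer.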
[PATCH RFC 04/25] net: Move mutex_unlock() in cleanup_net() up
net_sem protects pernet_list from changing, while ops_free_list()
does a simple kfree() and can't race with other pernet_operations
callbacks.

So we may release net_mutex earlier than before.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2254b1639209..a8ea580885d9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -489,11 +489,12 @@ static void cleanup_net(struct work_struct *work)
 	list_for_each_entry_reverse(ops, &pernet_list, list)
 		ops_exit_list(ops, &net_exit_list);

+	mutex_unlock(&net_mutex);
+
 	/* Free the net generic variables */
 	list_for_each_entry_reverse(ops, &pernet_list, list)
 		ops_free_list(ops, &net_exit_list);

-	mutex_unlock(&net_mutex);
 	up_read(&net_sem);

 	/* Ensure there are no outstanding rcu callbacks using this
[PATCH RFC 06/25] net: Add pernet sys and registration functions
This is a new sublist of pernet_list, which will live ahead of the
already existing ones, giving the order: sys, subsys, device. It's
aimed at subsystems whose pernet_operations may execute in parallel
with any other pernet_operations.

Later, step by step, we will move all subsys entries there, adding the
necessary small synchronization locks where they are needed. After all
subsys entries are moved to sys, we'll kill the subsys list, and all
current subsys entries will not require net_mutex and will be able to
init and exit in parallel with others. Then we'll add a dev sublist
ahead of device, and will repeat the cycle.

Suggested-by: Eric W. Biederman
Signed-off-by: Kirill Tkhai
---
 include/net/net_namespace.h |2 +
 net/core/net_namespace.c| 75 ++-
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 10f99dafd5ac..2cde5f766ec6 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -324,6 +324,8 @@ struct pernet_operations {
  * device which caused kernel oops, and panics during network
  * namespace cleanup. So please don't get this wrong.
  */
+int register_pernet_sys(struct pernet_operations *);
+void unregister_pernet_sys(struct pernet_operations *);
 int register_pernet_subsys(struct pernet_operations *);
 void unregister_pernet_subsys(struct pernet_operations *);
 int register_pernet_device(struct pernet_operations *);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1d9712973695..f4f4aaa5ce1f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -24,10 +24,24 @@
 #include

 /*
- *	Our network namespace constructor/destructor lists
+ *	Our network namespace constructor/destructor lists
+ *	one by one linked in pernet_list. They are (in order
+ *	of linking): sys, subsys, device.
+ *
+ *	The methods from sys for a network namespace may be
+ *	called in parallel with any method from any list
+ *	for another net namespace.
+ * + * The methods from subsys and device can't be called + * in parallel with a method from subsys or device. + * + * When all subsys pernet_operations are moved to sys + * sublist, we'll kill subsys sublist, and create dev + * ahead of device sublist, and repeat the cycle. */ static LIST_HEAD(pernet_list); +static struct list_head *first_subsys = &pernet_list; static struct list_head *first_device = &pernet_list; DEFINE_MUTEX(net_mutex); @@ -987,6 +1001,57 @@ static void unregister_pernet_operations(struct pernet_operations *ops) ida_remove(&net_generic_ids, *ops->id); } +/** + * register_pernet_sys - register a network namespace system + * @ops: pernet operations structure for the system + * + * Register a subsystem which has init and exit functions + * that are called when network namespaces are created and + * destroyed respectively. + * + * When registered all network namespace init functions are + * called for every existing network namespace. Allowing kernel + * modules to have a race free view of the set of network namespaces. + * + * When a new network namespace is created all of the init + * methods are called in the order in which they were registered. + * + * When a network namespace is destroyed all of the exit methods + * are called in the reverse of the order with which they were + * registered. 
+ */ +int register_pernet_sys(struct pernet_operations *ops) +{ + int error; + down_write(&net_sem); + if (first_subsys != first_device) { + panic("Pernet %ps registered out of order.\n" + "There is already %ps.\n", ops, + list_entry(first_subsys, struct pernet_operations, list)); + } + error = register_pernet_operations(first_subsys, ops); + up_write(&net_sem); + return error; +} +EXPORT_SYMBOL_GPL(register_pernet_sys); + +/** + * unregister_pernet_sys - unregister a network namespace system + * @ops: pernet operations structure to manipulate + * + * Remove the pernet operations structure from the list to be + * used when network namespaces are created or destroyed. In + * addition run the exit method for all existing network + * namespaces. + */ +void unregister_pernet_sys(struct pernet_operations *ops) +{ + down_write(&net_sem); + unregister_pernet_operations(ops); + up_write(&net_sem); +} +EXPORT_SYMBOL_GPL(unregister_pernet_sys); + /** * register_pernet_subsys - register a network namespace subsystem * @ops: pernet operations structure for the subsystem @@ -1011,6 +1076,8 @@ int register_pernet_subsys(struct pernet_operations *ops) int error; down_write(&net_sem); error = register_pernet_operations(first_device, ops); + if (!error) + update_first_on_add(first_subsys, first_device, &ops->list); up_write(&net_sem); return error;
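The ordering rule for the three sublists can be pictured with a flat userspace model. This is only a sketch under stated assumptions: struct pernet_model, insert_at() and the model_register_*() helpers are hypothetical names, the sublist heads are array indices instead of list pointers, and the model returns -1 where register_pernet_sys() in the patch would panic().

```c
#include <assert.h>
#include <string.h>

#define MAX_OPS 16

/* Hypothetical flat model of pernet_list: [ sys... | subsys... | device... ] */
struct pernet_model {
	const char *ops[MAX_OPS];
	int count;
	int first_subsys;	/* index of first subsys entry (== first_device if none) */
	int first_device;	/* index of first device entry (== count if none) */
};

/* Insert 'name' at index 'pos', shifting later entries towards the tail. */
static int insert_at(struct pernet_model *m, int pos, const char *name)
{
	if (m->count >= MAX_OPS)
		return -1;
	memmove(&m->ops[pos + 1], &m->ops[pos],
		(size_t)(m->count - pos) * sizeof(m->ops[0]));
	m->ops[pos] = name;
	m->count++;
	return 0;
}

/* sys entries may only be added while the subsys sublist is still empty. */
static int model_register_sys(struct pernet_model *m, const char *name)
{
	if (m->first_subsys != m->first_device)
		return -1;	/* the patch calls panic() in this case */
	if (insert_at(m, m->first_subsys, name))
		return -1;
	m->first_subsys++;
	m->first_device++;
	return 0;
}

/* subsys entries go right before the first device entry. */
static int model_register_subsys(struct pernet_model *m, const char *name)
{
	if (insert_at(m, m->first_device, name))
		return -1;
	m->first_device++;
	return 0;
}

/* device entries are appended at the tail. */
static int model_register_device(struct pernet_model *m, const char *name)
{
	return insert_at(m, m->count, name);
}
```

With a zero-initialized model, registering sys_a, dev_a, sys_b and subsys_a in that order yields sys_a, sys_b, subsys_a, dev_a; a further model_register_sys() call is then refused because a subsys entry already exists.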
[PATCH RFC 03/25] net: Introduce net_sem for protection of pernet_list
Currently a mutex is used to protect the pernet operations list. It
makes cleanup_net() execute the ->exit methods of the same operations
set that was used at the time of ->init, even after the net namespace
is unlinked from net_namespace_list.

But the problem is that synchronize_rcu() is needed after the net is
removed from net_namespace_list:

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                            <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors
and/or when preemptible RCU is enabled in the config. So, for the whole
time cleanup_net() is waiting for the RCU grace period, creation of new
net namespaces is not possible; the tasks that attempt it sleep on the
same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)              <--- Sleep there for ages

I observed 20-30 second hangs of "unshare -n" on an ordinary 8-cpu
laptop with preemptible RCU enabled.

The solution is to convert net_mutex to an rw_semaphore and add small
locks to the really small number of pernet_operations that really need
them. Then the pernet_operations::init/::exit methods, which modify the
net-related data, will require down_read() locking only, while
down_write() will be used for changing pernet_list.

This gives a significant performance increase, as you may see here:

https://www.spinics.net/lists/netdev/msg467095.html

It's a 4.6 times performance increase on a one-thread test. For
multi-thread tests the increase may be close to 4.6 multiplied by the
number of threads.

This patch starts replacing net_mutex with net_sem. It adds the
rw_semaphore, describes the variables it protects, and uses it where
appropriate. net_mutex is still present, and the next patches will kick
it out step by step.
Signed-off-by: Kirill Tkhai --- include/linux/rtnetlink.h |1 + net/core/net_namespace.c | 37 + net/core/rtnetlink.c |4 ++-- 3 files changed, 28 insertions(+), 14 deletions(-) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 2032ce2eb20b..f640fc87fe1d 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -35,6 +35,7 @@ extern int rtnl_is_locked(void); extern wait_queue_head_t netdev_unregistering_wq; extern struct mutex net_mutex; +extern struct rw_semaphore net_sem; #ifdef CONFIG_PROVE_LOCKING extern bool lockdep_rtnl_is_held(void); diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 2e512965bf42..2254b1639209 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -41,6 +41,11 @@ struct net init_net = { EXPORT_SYMBOL(init_net); static bool init_net_initialized; +/* + * net_sem: protects: pernet_list, net_generic_ids, + * init_net_initialized and first_* pointers. + */ +DECLARE_RWSEM(net_sem); #define MIN_PERNET_OPS_ID \ ((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *)) @@ -411,12 +416,16 @@ struct net *copy_net_ns(unsigned long flags, net->ucounts = ucounts; get_user_ns(user_ns); - rv = mutex_lock_killable(&net_mutex); + rv = down_read_killable(&net_sem); if (rv < 0) goto put_userns; - + rv = mutex_lock_killable(&net_mutex); + if (rv < 0) + goto up_read; rv = setup_net(net, user_ns); mutex_unlock(&net_mutex); +up_read: + up_read(&net_sem); if (rv < 0) { put_userns: put_user_ns(user_ns); @@ -443,6 +452,7 @@ static void cleanup_net(struct work_struct *work) list_replace_init(&cleanup_list, &net_kill_list); spin_unlock_irq(&cleanup_list_lock); + down_read(&net_sem); mutex_lock(&net_mutex); /* Don't let anyone else find us. */ @@ -484,6 +494,7 @@ static void cleanup_net(struct work_struct *work) ops_free_list(ops, &net_exit_list); mutex_unlock(&net_mutex); + up_read(&net_sem); /* Ensure there are no outstanding rcu callbacks using this * network namespace. 
@@ -510,8 +521,10 @@ static void cleanup_net(struct work_struct *work) */ void net_ns_barrier(void) { + down_write(&net_sem); mutex_lock(&net_mutex); mutex_unlock(&net_mutex); + up_write(&net_sem); } EXPORT_SYMBOL(net_ns_barrier); @@ -838,12 +851,12 @@ static int __init net_ns_init(void) rcu_assign_pointer(init_net.gen, ng); - mutex_lock(&net_mutex); + down_write(&net_sem); if (setup_net(&init_net, &init_user_ns)) panic("Could not setup the initial network namespace"); init_net_initialized = true; - mutex_unlock(&net_mutex); + up_write(&net_sem); register_pernet_
[PATCH RFC 11/25] net: Move netfilter_net_ops to pernet_sys list
Since net/socket.o is the first linked file in net/Makefile, its core
initcalls execute first. netfilter_net_ops is executed right after
sysctl_pernet_ops.

Methods netfilter_net_init() and netfilter_net_exit() initialize
net::nf::hooks and change the net-related proc directory of a net.
Other pernet_operations are not interested in foreign net::nf::hooks
or proc entries, so it's safe to move netfilter_net_ops to the
pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/netfilter/core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 52cd2901a097..2bed28281b67 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -606,7 +606,7 @@ int __init netfilter_init(void)
 {
 	int ret;

-	ret = register_pernet_subsys(&netfilter_net_ops);
+	ret = register_pernet_sys(&netfilter_net_ops);
 	if (ret < 0)
 		goto err;
[PATCH RFC 10/25] net: Move sysctl_pernet_ops to pernet_sys list
This patch starts to convert the pernet_subsys entries registered from
core initcalls.

Since net/socket.o is the first linked file in net/Makefile, its core
initcalls execute first. sysctl_pernet_ops is the first pernet_subsys
registered from sock_init(), so it goes ahead of the others registered
via core_initcall().

Methods sysctl_net_init() and sysctl_net_exit() initialize net::sysctls
of a namespace. The pernet_operations::init()/exit() methods from the
rest of the list do not touch net::sysctls of foreign namespaces, so
it's safe to execute sysctl_pernet_ops's methods in parallel with any
other pernet_operations.

Signed-off-by: Kirill Tkhai
---
 net/sysctl_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 9aed6fe1bf1a..1b91db88e54a 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -103,7 +103,7 @@ __init int net_sysctl_init(void)
 	net_header = register_sysctl("net", empty);
 	if (!net_header)
 		goto out;
-	ret = register_pernet_subsys(&sysctl_pernet_ops);
+	ret = register_pernet_sys(&sysctl_pernet_ops);
 	if (ret)
 		goto out1;
 out:
[PATCH RFC 16/25] net: Move rtnetlink_net_ops to pernet_sys list
rtnetlink_net_ops are added in the same core initcall as
netlink_net_ops, so they have to be added right after netlink_net_ops.

rtnetlink_net_init() and rtnetlink_net_exit() create and destroy a
netlink socket. It looks like other pernet_operations are not
interested in a foreign net::rtnl, so rtnetlink_net_ops may be safely
moved to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/core/rtnetlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index cb06d43c4230..d9cf13554e4d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4503,7 +4503,7 @@ void __init rtnetlink_init(void)
 	for (i = 0; i < ARRAY_SIZE(rtnl_msg_handlers_ref); i++)
 		refcount_set(&rtnl_msg_handlers_ref[i], 1);

-	if (register_pernet_subsys(&rtnetlink_net_ops))
+	if (register_pernet_sys(&rtnetlink_net_ops))
 		panic("rtnetlink_init: cannot initialize rtnetlink\n");

 	register_netdevice_notifier(&rtnetlink_dev_notifier);
[PATCH RFC 17/25] net: Move audit_net_ops to pernet_sys list
This patch starts to convert the pernet_subsys entries registered from
postcore initcalls.

These pernet_operations are in the ./kernel directory, and there is
only one more postcore initcall, in ./lib. So, audit_net_ops has to go
first.

audit_net_init() creates a netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list is not interested in the
socket, so we move audit_net_ops to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 kernel/audit.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 227db99b0f19..bb4626d7e712 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1549,7 +1549,7 @@ static int __init audit_init(void)

 	pr_info("initializing netlink subsys (%s)\n",
 		audit_default ? "enabled" : "disabled");
-	register_pernet_subsys(&audit_net_ops);
+	register_pernet_sys(&audit_net_ops);

 	audit_initialized = AUDIT_INITIALIZED;
[PATCH RFC 18/25] net: Move uevent_net_ops to pernet_sys list
These pernet_operations, created from a postcore_initcall(), are
registered from the ./lib directory, and they have to go right after
audit_net_ops.

uevent_net_init() and uevent_net_exit() create and destroy a netlink
socket, and these actions are serialized in the netlink code.

Parallel execution with other pernet_operations makes the socket
disappear earlier from uevent_sock_list on ->exit. As userspace can't
be interested in broadcast messages of a dying net, and, as far as I
can see, no one in the kernel listens to them, we may safely move
uevent_net_ops to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 lib/kobject_uevent.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index c3e84edc47c9..84c9d85477cc 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -647,7 +647,7 @@ static struct pernet_operations uevent_net_ops = {

 static int __init kobject_uevent_init(void)
 {
-	return register_pernet_subsys(&uevent_net_ops);
+	return register_pernet_sys(&uevent_net_ops);
 }
[PATCH RFC 14/25] net: Move net_defaults_ops to pernet_sys list
According to net/core/Makefile, net/core/net_namespace.o core initcalls
execute right after net/core/sock.o.

net_defaults_ops provides only the net_defaults_init_net method, which
acts on net::core::sysctl_somaxconn, and the rest of the pernet_subsys
and pernet_device lists are not interested in that. So, move it to
pernet_sys.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2e8295aa7003..7fc9d44c1817 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -371,7 +371,7 @@ static struct pernet_operations net_defaults_ops = {

 static __init int net_defaults_init(void)
 {
-	if (register_pernet_subsys(&net_defaults_ops))
+	if (register_pernet_sys(&net_defaults_ops))
 		panic("Cannot initialize net default settings");

 	return 0;
[PATCH RFC 24/25] net: Move wext_pernet_ops to pernet_sys list
These pernet_operations initialize and purge the net::wext_nlevents
queue, which is not touched by foreign pernet_operations.

Signed-off-by: Kirill Tkhai
---
 net/wireless/wext-core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c
index 6cdb054484d6..2103c2a003ed 100644
--- a/net/wireless/wext-core.c
+++ b/net/wireless/wext-core.c
@@ -394,7 +394,7 @@ static struct pernet_operations wext_pernet_ops = {

 static int __init wireless_nlevent_init(void)
 {
-	int err = register_pernet_subsys(&wext_pernet_ops);
+	int err = register_pernet_sys(&wext_pernet_ops);

 	if (err)
 		return err;
[PATCH RFC 25/25] net: Move sysctl_core_ops to pernet_sys list
These pernet_operations register and destroy a sysctl directory, which
is of no interest to foreign pernet_operations.

Signed-off-by: Kirill Tkhai
---
 net/core/sysctl_net_core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde4cfcc..0dab679b33fa 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -525,7 +525,7 @@ static __net_initdata struct pernet_operations sysctl_core_ops = {
 static __init int sysctl_core_init(void)
 {
 	register_net_sysctl(&init_net, "net/core", net_core_table);
-	return register_pernet_subsys(&sysctl_core_ops);
+	return register_pernet_sys(&sysctl_core_ops);
 }

 fs_initcall(sysctl_core_init);
[PATCH RFC 23/25] net: Move genl_pernet_ops to pernet_sys list
These pernet_operations create and destroy net::genl_sock. Foreign
pernet_operations don't touch it.

Signed-off-by: Kirill Tkhai
---
 net/netlink/genetlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index d444daf1ac04..da7ab3dd5609 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1045,7 +1045,7 @@ static int __init genl_init(void)
 	if (err < 0)
 		goto problem;

-	err = register_pernet_subsys(&genl_pernet_ops);
+	err = register_pernet_sys(&genl_pernet_ops);
 	if (err)
 		goto problem;
[PATCH RFC 20/25] net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list
net/core/dev.o is linked after net/core/sock.o. There are:

1) dev_proc_ops and dev_mc_net_ops, which create and destroy pernet
   proc files and are not interested in other net namespaces;
2) netdev_net_ops, which creates a pernet hash that is not touched by
   other pernet_operations.

So, move them to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/core/dev.c|2 +-
 net/core/net-procfs.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ee29f4f5fa9..b90a503a9e1a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8787,7 +8787,7 @@ static int __init net_dev_init(void)

 	INIT_LIST_HEAD(&offload_base);

-	if (register_pernet_subsys(&netdev_net_ops))
+	if (register_pernet_sys(&netdev_net_ops))
 		goto out;

 	/*
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 615ccab55f38..46096219d574 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -413,8 +413,8 @@ static struct pernet_operations __net_initdata dev_mc_net_ops = {

 int __init dev_proc_init(void)
 {
-	int ret = register_pernet_subsys(&dev_proc_ops);
+	int ret = register_pernet_sys(&dev_proc_ops);
 	if (!ret)
-		return register_pernet_subsys(&dev_mc_net_ops);
+		return register_pernet_sys(&dev_mc_net_ops);
 	return ret;
 }
[PATCH RFC 22/25] net: Move subsys_initcall() registered pernet_operations from net/sched to pernet_sys list
psched_net_ops only creates and destroys a /proc entry, and is safe to
execute in parallel with any foreign pernet_operations.

tcf_action_net_ops initializes and destroys tcf_action_net::egdev_ht,
which is not touched by foreign pernet_operations.

So, move them to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/sched/act_api.c |2 +-
 net/sched/sch_api.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4d33a50a8a6d..f1de2146e6e0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1470,7 +1470,7 @@ static int __init tc_action_init(void)
 {
 	int err;

-	err = register_pernet_subsys(&tcf_action_net_ops);
+	err = register_pernet_sys(&tcf_action_net_ops);
 	if (err)
 		return err;

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b6c4f536876b..68938ca4bbe1 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -2008,7 +2008,7 @@ static int __init pktsched_init(void)
 {
 	int err;

-	err = register_pernet_subsys(&psched_net_ops);
+	err = register_pernet_sys(&psched_net_ops);
 	if (err) {
 		pr_err("pktsched_init: "
 		       "cannot initialize per netns operations\n");
[PATCH RFC 21/25] net: Move fib_* pernet_operations, registered via subsys_initcall(), to pernet_sys list
Both of them create and initialize lists which are not touched by
other, foreign pernet_operations.

Signed-off-by: Kirill Tkhai
---
 net/core/fib_notifier.c |2 +-
 net/core/fib_rules.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 0c048bdeb016..782a1475a32e 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -175,7 +175,7 @@ static struct pernet_operations fib_notifier_net_ops = {

 static int __init fib_notifier_init(void)
 {
-	return register_pernet_subsys(&fib_notifier_net_ops);
+	return register_pernet_sys(&fib_notifier_net_ops);
 }

 subsys_initcall(fib_notifier_init);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 98e1066c3d55..b2706c18f0f3 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -1039,7 +1039,7 @@ static int __init fib_rules_init(void)
 	rtnl_register(PF_UNSPEC, RTM_DELRULE, fib_nl_delrule, NULL, 0);
 	rtnl_register(PF_UNSPEC, RTM_GETRULE, NULL, fib_nl_dumprule, 0);

-	err = register_pernet_subsys(&fib_rules_net_ops);
+	err = register_pernet_sys(&fib_rules_net_ops);
 	if (err < 0)
 		goto fail;
[PATCH RFC 19/25] net: Move proto_net_ops to pernet_sys list
This patch starts to convert the pernet_subsys entries registered from
subsys initcalls.

According to net/Makefile and net/core/Makefile, this is the first
executed subsys_initcall() that registers a pernet_subsys. It seems
safe to execute it in parallel with others, as it only creates/destroys
a proc entry, which nobody else is interested in.

Signed-off-by: Kirill Tkhai
---
 net/core/sock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index be050b044699..ed12e115458b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3349,7 +3349,7 @@ static __net_initdata struct pernet_operations proto_net_ops = {

 static int __init proto_init(void)
 {
-	return register_pernet_subsys(&proto_net_ops);
+	return register_pernet_sys(&proto_net_ops);
 }

 subsys_initcall(proto_init);
[PATCH RFC 13/25] net: Move net_inuse_ops to pernet_sys list
net/core/sock.o is the first linked file in net/core/Makefile, so its
core initcall executes first in the directory.

The net_inuse_ops methods expose statistics in /proc. Nothing in the
rest of the pernet_subsys or pernet_device lists touches
net::core::inuse.

So, it's safe to move net_inuse_ops to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/core/sock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 13719af7b4e3..be050b044699 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3081,7 +3081,7 @@ static struct pernet_operations net_inuse_ops = {

 static __init int net_inuse_init(void)
 {
-	if (register_pernet_subsys(&net_inuse_ops))
+	if (register_pernet_sys(&net_inuse_ops))
 		panic("Cannot initialize net inuse counters");

 	return 0;
[PATCH RFC 15/25] net: Move netlink_net_ops to pernet_sys list
According to net/Makefile, net/netlink/af_netlink.o core initcalls
execute right after net/core/net_namespace.o.

The methods of netlink_net_ops create and destroy the "netlink" file,
which is of no interest to foreign pernet_operations. So,
netlink_net_ops may safely be moved to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/netlink/af_netlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b9e0ee4e22f5..a4f1f5222b79 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2735,7 +2735,7 @@ static int __init netlink_proto_init(void)
 	netlink_add_usersock_entry();

 	sock_register(&netlink_family_ops);
-	register_pernet_subsys(&netlink_net_ops);
+	register_pernet_sys(&netlink_net_ops);
 	/* The netlink device handler may be needed early. */
 	rtnetlink_init();
 out:
[PATCH RFC 09/25] net: Move net_ns_ops to pernet_sys list
This patch starts to convert the pernet_subsys entries registered from
pure initcalls.

Since net_ns_init() is the only pure initcall in the net subsystem, and
there are no early initcalls, the pernet subsys it registers is the
first in the pernet_operations list. So, we start with it.

The net_ns_ops::net_ns_net_init/net_ns_net_exit methods use only
ida_simple_* functions, which do not need synchronization. So it's safe
to execute them in parallel with any other pernet_operations, and thus
we convert net_ns_ops to the pernet_sys type.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7aec8c1afe50..2e8295aa7003 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -899,7 +899,7 @@ static int __init net_ns_init(void)
 	init_net_initialized = true;
 	up_write(&net_sem);

-	register_pernet_subsys(&net_ns_ops);
+	register_pernet_sys(&net_ns_ops);

 	rtnl_register(PF_UNSPEC, RTM_NEWNSID, rtnl_net_newid, NULL,
 		      RTNL_FLAG_DOIT_UNLOCKED);
[PATCH RFC 12/25] net: Move nf_log_net_ops to pernet_sys list
nf_log_net_ops are registered in the same initcall as
netfilter_net_ops, so they have to be moved right after
netfilter_net_ops. The ops would have had a problem with parallel
execution with others if it were possible for init_net to be released.
But it is not, and the rest is safe in that respect. There is a memory
allocation, which nobody else is interested in, and a sysctl
registration. So, we move it to the pernet_sys list.

Signed-off-by: Kirill Tkhai
---
 net/netfilter/nf_log.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 8bb152a7cca4..08868afad813 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -582,5 +582,5 @@ static struct pernet_operations nf_log_net_ops = {

 int __init netfilter_log_init(void)
 {
-	return register_pernet_subsys(&nf_log_net_ops);
+	return register_pernet_sys(&nf_log_net_ops);
 }
[PATCH RFC 08/25] net: Move proc_net_ns_ops to pernet_sys list
This patch starts to convert the pernet_subsys entries registered
before initcalls run.

Since proc_net_ns_ops is a pernet_subsys registered from:

  start_kernel()->proc_root_init()->proc_net_init(),

and there is no pernet_subsys which is registered earlier, we start
from it.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit() register pernet
net->proc_net and ->proc_net_stat, and the constructors and destructors
of other pernet_operations are not interested in a foreign net's
proc_net and proc_net_stat. Proc filesystem primitives are synchronized
on proc_subdir_lock.

So, it's safe to move proc_net_ns_ops to the pernet_sys list and execute
its methods in parallel with other pernet operations.

Signed-off-by: Kirill Tkhai
---
 fs/proc/proc_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index a2bf369c923d..5eb52765eeab 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -243,5 +243,5 @@ int __init proc_net_init(void)
 {
 	proc_symlink("net", NULL, "self/net");

-	return register_pernet_subsys(&proc_net_ns_ops);
+	return register_pernet_sys(&proc_net_ns_ops);
 }
[PATCH RFC 07/25] net: Make sys sublist pernet_operations executed out of net_mutex
Move net_mutex to setup_net() and cleanup_net(), and do not hold it,
while sys sublist methods are executed.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c | 44 +++-
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f4f4aaa5ce1f..7aec8c1afe50 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 {
 	struct net_generic *ng, *old_ng;
 
-	BUG_ON(!mutex_is_locked(&net_mutex));
+	BUG_ON(!rwsem_is_locked(&net_sem));
 	BUG_ON(id < MIN_PERNET_OPS_ID);
 
 	old_ng = rcu_dereference_protected(net->gen,
-					   lockdep_is_held(&net_mutex));
+					   lockdep_is_held(&net_sem));
 	if (old_ng->s.len > id) {
 		old_ng->ptr[id] = data;
 		return 0;
@@ -300,6 +300,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 {
 	/* Must be called with net_mutex held */
 	const struct pernet_operations *ops, *saved_ops;
+	bool locked = false;
 	int error = 0;
 	LIST_HEAD(net_exit_list);
 
@@ -311,14 +312,34 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	spin_lock_init(&net->nsid_lock);
 
 	list_for_each_entry(ops, &pernet_list, list) {
+		if (&ops->list == first_subsys) {
+			BUG_ON(locked);
+			error = mutex_lock_killable(&net_mutex);
+			if (error)
+				goto out_undo;
+			locked = true;
+		}
+
 		error = ops_init(ops, net);
 		if (error < 0)
 			goto out_undo;
 	}
+
+	if (!locked) {
+		/*
+		 * This may happen only on early boot, so we don't
+		 * care about possibility to interrupt the locking.
+		 */
+		mutex_lock(&net_mutex);
+		locked = true;
+	}
+
 	rtnl_lock();
 	list_add_tail_rcu(&net->list, &net_namespace_list);
 	rtnl_unlock();
 out:
+	if (locked)
+		mutex_unlock(&net_mutex);
 	return error;
 
 out_undo:
@@ -433,12 +454,7 @@ struct net *copy_net_ns(unsigned long flags,
 	rv = down_read_killable(&net_sem);
 	if (rv < 0)
 		goto put_userns;
-	rv = mutex_lock_killable(&net_mutex);
-	if (rv < 0)
-		goto up_read;
 	rv = setup_net(net, user_ns);
-	mutex_unlock(&net_mutex);
-up_read:
 	up_read(&net_sem);
 	if (rv < 0) {
 put_userns:
@@ -460,6 +476,7 @@ static void cleanup_net(struct work_struct *work)
 	struct net *net, *tmp;
 	struct list_head net_kill_list;
 	LIST_HEAD(net_exit_list);
+	bool locked;
 
 	/* Atomically snapshot the list of namespaces to cleanup */
 	spin_lock_irq(&cleanup_list_lock);
@@ -468,6 +485,7 @@ static void cleanup_net(struct work_struct *work)
 
 	down_read(&net_sem);
 	mutex_lock(&net_mutex);
+	locked = true;
 
 	/* Don't let anyone else find us. */
 	rtnl_lock();
@@ -500,10 +518,18 @@ static void cleanup_net(struct work_struct *work)
 	synchronize_rcu();
 
 	/* Run all of the network namespace exit methods */
-	list_for_each_entry_reverse(ops, &pernet_list, list)
+	list_for_each_entry_reverse(ops, &pernet_list, list) {
 		ops_exit_list(ops, &net_exit_list);
 
-	mutex_unlock(&net_mutex);
+		if (&ops->list == first_subsys) {
+			BUG_ON(!locked);
+			mutex_unlock(&net_mutex);
+			locked = false;
+		}
+	}
+
+	if (locked)
+		mutex_unlock(&net_mutex);
 
 	/* Free the net generic variables */
 	list_for_each_entry_reverse(ops, &pernet_list, list)
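The locking boundary in setup_net() above may be easier to follow in a small userspace model. This is only a sketch under stated assumptions: pernet_ops_sketch, setup_net_sketch() and the boolean standing in for net_mutex are hypothetical names, the list is modelled as an array, and the out_undo rollback is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one pernet_operations entry. */
struct pernet_ops_sketch {
	int init_error;   /* 0 on success, negative on failure */
	bool ran_locked;  /* filled in by the walk: was the mutex held? */
};

/*
 * Walk ops[0..n). Entries before first_subsys model the "sys" sublist
 * and run without the mutex; from first_subsys on, the mutex is held.
 * *mutex_held models net_mutex, *linked models adding net to the
 * namespace list. Returns 0 or the first init error.
 */
int setup_net_sketch(struct pernet_ops_sketch *ops, size_t n,
		     size_t first_subsys, bool *mutex_held, bool *linked)
{
	bool locked = false;
	int error = 0;
	size_t i;

	*linked = false;
	for (i = 0; i < n; i++) {
		if (i == first_subsys) {
			*mutex_held = true;   /* mutex_lock_killable() in the patch */
			locked = true;
		}
		ops[i].ran_locked = locked;
		error = ops[i].init_error;
		if (error < 0)
			goto out;
	}

	if (!locked) {
		/* No subsys/device entries yet: lock before linking anyway. */
		*mutex_held = true;
		locked = true;
	}
	*linked = *mutex_held;  /* linking always happens under the mutex */
out:
	if (locked)
		*mutex_held = false;  /* mutex_unlock() */
	return error;
}
```

With three entries and first_subsys = 1, only the first entry runs unlocked; a failure in that first entry returns without the mutex ever being taken.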
[PATCH RFC 02/25] net: Cleanup copy_net_ns()
Line up the destructor actions in the reverse order of the constructors.
The next patches will add more actions, and that is more convenient when
this order is in place.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7ecf71050ffa..2e512965bf42 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -404,27 +404,25 @@ struct net *copy_net_ns(unsigned long flags,
 
 	net = net_alloc();
 	if (!net) {
-		dec_net_namespaces(ucounts);
-		return ERR_PTR(-ENOMEM);
+		rv = -ENOMEM;
+		goto dec_ucounts;
 	}
-
+	refcount_set(&net->passive, 1);
+	net->ucounts = ucounts;
 	get_user_ns(user_ns);
 
 	rv = mutex_lock_killable(&net_mutex);
-	if (rv < 0) {
-		net_free(net);
-		dec_net_namespaces(ucounts);
-		put_user_ns(user_ns);
-		return ERR_PTR(rv);
-	}
+	if (rv < 0)
+		goto put_userns;
 
-	net->ucounts = ucounts;
 	rv = setup_net(net, user_ns);
 	mutex_unlock(&net_mutex);
 	if (rv < 0) {
-		dec_net_namespaces(ucounts);
+put_userns:
 		put_user_ns(user_ns);
 		net_drop_ns(net);
+dec_ucounts:
+		dec_net_namespaces(ucounts);
 		return ERR_PTR(rv);
 	}
 	return net;
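The reverse-order unwinding this patch lines up can be illustrated with a small trace. This is a sketch, not the kernel code: the step names, unwind_trace and copy_net_ns_sketch() are hypothetical, and each constructor or destructor is reduced to recording a number.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical step ids; a negative id in the trace means "undo that step". */
enum { ACQ_UCOUNTS = 1, ACQ_NET = 2, ACQ_USERNS = 3, ACQ_SETUP = 4 };

struct unwind_trace {
	int log[8];
	size_t len;
};

static void record(struct unwind_trace *t, int ev)
{
	if (t->len < 8)
		t->log[t->len++] = ev;
}

/*
 * Run the constructors in order; fail_at names the step that fails
 * (0 = none). Labels are stacked so that each failure point undoes
 * exactly what was done before it, in reverse order.
 */
int copy_net_ns_sketch(int fail_at, struct unwind_trace *t)
{
	int rv;

	record(t, ACQ_UCOUNTS);
	if (fail_at == ACQ_NET) {
		rv = -ENOMEM;
		goto dec_ucounts;
	}
	record(t, ACQ_NET);
	record(t, ACQ_USERNS);
	if (fail_at == ACQ_SETUP) {
		rv = -EINTR;
		goto put_userns;
	}
	record(t, ACQ_SETUP);
	return 0;

put_userns:
	record(t, -ACQ_USERNS);
	record(t, -ACQ_NET);
dec_ucounts:
	record(t, -ACQ_UCOUNTS);
	return rv;
}
```

Adding a new constructor then means adding one step in the body and one label in the mirrored position, which is the convenience the commit message refers to.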
[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore
Hi,

this is a continuation of the discussion from here:

https://lkml.org/lkml/2017/11/14/298

The plan has changed a little bit, so I'd be happy to hear people's
comments before I dive into all 400+ pernet subsys and devices.

The patch set adds a pernet sys list ahead of subsys and device. It is used
for pernet_operations that may be executed in parallel with any other
pernet_operations methods. Also, some high-priority ops are converted, in
order of appearance (up to those registered using postcore_initcall(), and
some subsys_initcall()).

The sequence in setup_net() is the following:

1) execute all the callbacks from the pernet_sys list
2) lock net_mutex
3) execute all the callbacks from the pernet_subsys list
4) execute all the callbacks from the pernet_device list
5) unlock net_mutex

There have been no pernet_operations requiring additional synchronization
yet, but I've bumped into another problem. Some drivers may be compiled
either as modules or as part of the kernel image. They register
pernet_operations from device_initcall(), for example. This initcall
executes at a different time compared to drivers that can only be built in.
Imagine we have a tristate driverA and a boolean driverB. driverA registers
pernet_subsys from subsys_initcall(). driverB registers pernet_subsys from
fs_initcall(). So, here we have two cases:

driverA is module          driverA is built-in
----------------------------------------------
register driverB ops       register driverA ops
register driverA ops       register driverB ops

So, the order is different. When converting drivers one by one, it's
impossible to keep the order correct for all .config states, because of the
above. So, bisect won't work. And it seems it's just the same as converting
pernet_operations from all the files in alphabetical file order. What do you
think about this? (Note that the patches have no such problem at the moment,
as they all touch in-kernel early core drivers.)

Maybe there are other comments on the code.
---

Kirill Tkhai (25):
      net: Assign net to net_namespace_list in setup_net()
      net: Cleanup copy_net_ns()
      net: Introduce net_sem for protection of pernet_list
      net: Move mutex_unlock() in cleanup_net() up
      net: Add primitives to update heads of pernet_list sublists
      net: Add pernet sys and registration functions
      net: Make sys sublist pernet_operations executed out of net_mutex
      net: Move proc_net_ns_ops to pernet_sys list
      net: Move net_ns_ops to pernet_sys list
      net: Move sysctl_pernet_ops to pernet_sys list
      net: Move netfilter_net_ops to pernet_sys list
      net: Move nf_log_net_ops to pernet_sys list
      net: Move net_inuse_ops to pernet_sys list
      net: Move net_defaults_ops to pernet_sys list
      net: Move netlink_net_ops to pernet_sys list
      net: Move rtnetlink_net_ops to pernet_sys list
      net: Move audit_net_ops to pernet_sys list
      net: Move uevent_net_ops to pernet_sys list
      net: Move proto_net_ops to pernet_sys list
      net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list
      net: Move fib_* pernet_operations, registered via subsys_initcall(), to pernet_sys list
      net: Move subsys_initcall() registered pernet_operations from net/sched to pernet_sys list
      net: Move genl_pernet_ops to pernet_sys list
      net: Move wext_pernet_ops to pernet_sys list
      net: Move sysctl_core_ops to pernet_sys list

 fs/proc/proc_net.c          |    2 
 include/linux/rtnetlink.h   |    1 
 include/net/net_namespace.h |    2 
 kernel/audit.c              |    2 
 lib/kobject_uevent.c        |    2 
 net/core/dev.c              |    2 
 net/core/fib_notifier.c     |    2 
 net/core/fib_rules.c        |    2 
 net/core/net-procfs.c       |    4 -
 net/core/net_namespace.c    |  203 +--
 net/core/rtnetlink.c        |    6 +
 net/core/sock.c             |    4 -
 net/core/sysctl_net_core.c  |    2 
 net/netfilter/core.c        |    2 
 net/netfilter/nf_log.c      |    2 
 net/netlink/af_netlink.c    |    2 
 net/netlink/genetlink.c     |    2 
 net/sched/act_api.c         |    2 
 net/sched/sch_api.c         |    2 
 net/sysctl_net.c            |    2 
 net/wireless/wext-core.c    |    2 
 21 files changed, 183 insertions(+), 67 deletions(-)

--
Signed-off-by: Kirill Tkhai
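The ordering problem described in this cover letter (driverA as a module versus built-in) can be reproduced with a toy model. The sketch below assumes that built-in drivers register in initcall-level order and that a module registers only when it is loaded, after all built-in initcalls; driver_sketch and registration_order_sketch() are hypothetical names. The level numbers follow the kernel's initcall ordering (subsys = 4, fs = 5).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of a driver that registers pernet_operations. */
struct driver_sketch {
	const char *name;
	int initcall_level;  /* e.g. 4 = subsys_initcall, 5 = fs_initcall */
	bool is_module;      /* modules register at load time, after boot */
};

static int cmp_registration(const void *a, const void *b)
{
	const struct driver_sketch *x = a, *y = b;

	if (x->is_module != y->is_module)
		return x->is_module ? 1 : -1;
	return x->initcall_level - y->initcall_level;
}

/* Sort drv[] into the order in which the ops would be registered. */
void registration_order_sketch(struct driver_sketch *drv, size_t n)
{
	qsort(drv, n, sizeof(*drv), cmp_registration);
}
```

With driverA at level 4 and driverB at level 5, flipping only driverA's is_module flag swaps the resulting order, which is why no single conversion order is right for every .config.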
[PATCH RFC 01/25] net: Assign net to net_namespace_list in setup_net()
This patch merges two repeated pieces of code into one, which now lives in
setup_net().

It acts as a cleanup, even though the init_net_initialized assignment is now
reordered with respect to the linking of net. This variable is needed by
proc_net_init(), called from:

	start_kernel()->proc_root_init()->proc_net_init()

which can't race with net_ns_init(), called from an initcall.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b797832565d3..7ecf71050ffa 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,6 +296,9 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 		if (error < 0)
 			goto out_undo;
 	}
+	rtnl_lock();
+	list_add_tail_rcu(&net->list, &net_namespace_list);
+	rtnl_unlock();
 out:
 	return error;
 
@@ -417,11 +420,6 @@ struct net *copy_net_ns(unsigned long flags,
 
 	net->ucounts = ucounts;
 	rv = setup_net(net, user_ns);
-	if (rv == 0) {
-		rtnl_lock();
-		list_add_tail_rcu(&net->list, &net_namespace_list);
-		rtnl_unlock();
-	}
 	mutex_unlock(&net_mutex);
 	if (rv < 0) {
 		dec_net_namespaces(ucounts);
@@ -847,11 +845,6 @@ static int __init net_ns_init(void)
 		panic("Could not setup the initial network namespace");
 
 	init_net_initialized = true;
-
-	rtnl_lock();
-	list_add_tail_rcu(&init_net.list, &net_namespace_list);
-	rtnl_unlock();
-
 	mutex_unlock(&net_mutex);
 
 	register_pernet_subsys(&net_ns_ops);
[PATCH net-next 2/2] net-next: copy user configured flowlabel to reset packet
From: Shaohua Li

The reset packet doesn't use the user-configured flowlabel; instead, it
always uses 0. This makes the flowlabel inconsistent. The tw sock already
records the flowlabel info, so we can use it directly.

Cc: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Cc: Tom Herbert
Signed-off-by: Shaohua Li
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1a5802..9b678cd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -901,6 +901,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 	struct sock *sk1 = NULL;
 #endif
 	int oif = 0;
+	u8 tclass = 0;
+	__be32 flowlabel = 0;
 
 	if (th->rst)
 		return;
@@ -954,7 +956,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 		trace_tcp_send_reset(sk, skb);
 	}
 
-	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+	if (sk) {
+		if (sk_fullsock(sk)) {
+			struct ipv6_pinfo *np = inet6_sk(sk);
+
+			tclass = np->tclass;
+			flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+		} else {
+			struct inet_timewait_sock *tw = inet_twsk(sk);
+
+			tclass = tw->tw_tclass;
+			flowlabel = cpu_to_be32(tw->tw_flowlabel);
+		}
+	}
+	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+			     tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.5
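The two branches in tcp_v6_send_reset() differ in byte order and masking, which a userspace model can make explicit. This sketch assumes, as the hunk above suggests, that a full socket keeps the label in network byte order (possibly with other header bits set, hence the mask) while a timewait socket keeps a 20-bit host-order value. All struct and function names here are hypothetical.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

#define FLOWLABEL_MASK_HOST_SKETCH 0x000FFFFFu  /* low 20 bits */

/* Hypothetical miniature of the two socket flavours. */
struct reset_sock_sketch {
	int is_full;             /* models sk_fullsock() */
	uint8_t tclass;
	uint32_t flow_label_be;  /* full sock: network order, unmasked */
	uint32_t tw_flowlabel;   /* timewait sock: host order, 20 bits */
};

/* Pick tclass and the network-order flow label for a reset packet. */
void reset_params_sketch(const struct reset_sock_sketch *sk,
			 uint8_t *tclass, uint32_t *flowlabel_be)
{
	*tclass = 0;
	*flowlabel_be = 0;
	if (sk == NULL)
		return;  /* no socket: keep the old all-zero behaviour */

	*tclass = sk->tclass;
	if (sk->is_full)
		*flowlabel_be = sk->flow_label_be &
				htonl(FLOWLABEL_MASK_HOST_SKETCH);
	else
		*flowlabel_be = htonl(sk->tw_flowlabel);
}
```

Both socket flavours then yield the same on-wire label for the same connection, which is the consistency the patch is after.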
[PATCH net-next 0/2] net: fix flowlabel inconsistency in reset packet
From: Shaohua Li

Hi,

Please see the tcpdump output below:

21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, so our router doesn't
correctly close the tcp connection. We are using the flowlabel to do load
balancing. Routers in the path maintain connection state, so if the flow
label changes, the packet is routed through a different router. In this
case, the old router doesn't get the reset packet to close the tcp
connection.

The reason is that a normal packet gets its skb->hash from sk->sk_txhash,
which is generated randomly. ip6_make_flowlabel then uses the hash to create
a flowlabel. The reset packet doesn't get assigned a hash, so the flowlabel
is calculated with flowi6. The patches fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet

 include/net/sock.h    | 18 --
 include/net/tcp.h     |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 36 ++--
 10 files changed, 56 insertions(+), 32 deletions(-)

-- 
2.9.5
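Why a hash derived from the connection's five-tuple keeps the label stable, while a per-socket random hash cannot be reproduced for a reset, can be shown with a small sketch. The hash here is FNV-1a, chosen only for illustration; it is not the kernel's flow hash, and five_tuple_sketch and both functions are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical five-tuple; fields are hashed one by one to avoid padding. */
struct five_tuple_sketch {
	uint8_t saddr[16];
	uint8_t daddr[16];
	uint16_t sport;
	uint16_t dport;
	uint8_t proto;
};

/* 32-bit FNV-1a over a byte range, continuing from hash value h. */
static uint32_t fnv1a_sketch(uint32_t h, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Deterministic hash of the tuple: same tuple, same hash, on any packet. */
uint32_t tuple_hash_sketch(const struct five_tuple_sketch *t)
{
	uint32_t h = 2166136261u;

	h = fnv1a_sketch(h, t->saddr, sizeof(t->saddr));
	h = fnv1a_sketch(h, t->daddr, sizeof(t->daddr));
	h = fnv1a_sketch(h, &t->sport, sizeof(t->sport));
	h = fnv1a_sketch(h, &t->dport, sizeof(t->dport));
	h = fnv1a_sketch(h, &t->proto, sizeof(t->proto));
	return h;
}

/* A flow label is 20 bits wide, so keep only the low 20 bits. */
uint32_t flowlabel_from_tuple_sketch(const struct five_tuple_sketch *t)
{
	return tuple_hash_sketch(t) & 0xFFFFFu;
}
```

Because the label depends only on addresses, ports and protocol, code that sends a reset without a full socket can still recompute the same value from the packet it is answering.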
[PATCH net-next 1/2] net-next: use five-tuple hash for sk_txhash
From: Shaohua Li

We are using sk_txhash to calculate the flowlabel, but sk_txhash isn't
always available, for example in inet_timewait_sock. This causes a problem
for the reset packet, which will have a different flowlabel. As a result,
our router doesn't correctly close the tcp connection. We are using the
flowlabel to do load balancing. Routers in the path maintain connection
state, so if the flow label changes, the packet is routed through a
different router. In this case, the old router doesn't get the reset packet
to close the tcp connection.

Per Tom's suggestion, we switch back to a five-tuple hash, so we can
reconstruct the correct flowlabel for the reset packet. In most places we
already have the flowi info, so we use it directly to build sk_txhash. For
synack, we do this after the route lookup. At that time we have the flowi
info ready, so we don't need to create it again.

I don't change sk_rethink_txhash() though. It still uses a random hash,
since the whole point is to select a different path after negative routing
advice.

Cc: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Cc: Tom Herbert
Signed-off-by: Shaohua Li
---
 include/net/sock.h    | 18 --
 include/net/tcp.h     |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 18 +-
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index f8715c5..85a6192 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1732,22 +1732,12 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
 	return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }
 
-static inline u32 net_tx_rndhash(void)
-{
-	u32 v = prandom_u32();
-
-	return v ?: 1;
-}
-
-static inline void sk_set_txhash(struct sock *sk)
-{
-	sk->sk_txhash = net_tx_rndhash();
-}
-
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-	if (sk->sk_txhash)
-		sk_set_txhash(sk);
+	if (sk->sk_txhash) {
+		u32 v = prandom_u32();
+		sk->sk_txhash = v ?: 1;
+	}
 }
 
 static inline struct dst_entry *
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578..8d68fde 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
 				 __u16 *mss);
 #endif
 	struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-				       const struct request_sock *req);
+				       struct request_sock *req);
 	u32 (*init_seq)(const struct sk_buff *skb);
 	u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..ed9ccb7 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	inet->inet_daddr = fl4->daddr;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;
-	sk_set_txhash(sk);
+	sk->sk_txhash = get_hash_from_flowi4(fl4);
 	inet->inet_id = jiffies;
 
 	sk_dst_set(sk, &rt->dst);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index fda37f2..76f1cf6 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	treq->rcv_isn		= ntohl(th->seq) - 1;
 	treq->snt_isn		= cookie;
 	treq->ts_off		= 0;
-	treq->txhash		= net_tx_rndhash();
 	req->mss		= mss;
 	ireq->ir_num		= ntohs(th->dest);
 	ireq->ir_rmt_port	= th->source;
@@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 			   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
 			   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
 	security_req_classify_flow(req, flowi4_to_flowi(&fl4));
+
+	treq->txhash = get_hash_from_flowi4(&fl4);
+
 	rt = ip_route_output_key(sock_net(sk), &fl4);
 	if (IS_ERR(rt)) {
 		reqsk_free(req);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d..92b4a10 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6289,7 +6289,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	}
 
 	tcp_rsk(req)->snt_isn = isn;
-	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
diff --git a/net/ipv4
Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86
On 11/17/17 9:25 AM, Oleg Nesterov wrote:
> On 11/15, Yonghong Song wrote:
>>
>> v3 -> v4:
>>   . Revert most of v3 change as 32bit emulation is not really working
>>     on x86_64 platform as among other issues, function emulate_push_stack()
>>     needs to account for 32bit app on 64bit platform.
>>     A separate effort is ongoing to address this issue.
>
> Reviewed-by: Oleg Nesterov
>
> Please test your patch with the fix below, in this particular case the
> TIF_IA32 check should be fine. Although this is not what we really want,
> we should probably use user_64bit_mode(regs) which checks ->cs. But this
> needs more changes and doesn't solve other problems (get_unmapped_area)
> so I still can't decide what should we do right now...

I tested the below change with my patch. On x86_64, both 64-bit and 32-bit
programs can be uprobe-emulated properly. On x86_32, however, there is a
compilation error like the one below:

In function ‘check_copy_size’,
    inlined from ‘copy_to_user’ at /home/yhs/work/tip/include/linux/uaccess.h:154:6,
    inlined from ‘emulate_push_stack.isra.9’ at /home/yhs/work/tip/arch/x86/kernel/uprobes.c:535:6:
/home/yhs/work/tip/include/linux/thread_info.h:139:4: error: call to ‘__bad_copy_from’ declared with attribute error: copy source size is too small
    __bad_copy_from();

Basically, test_thread_flag(TIF_IA32) returns 0 on an x86_32 system.

> Oleg.
>
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -516,7 +516,7 @@ struct uprobe_xol_ops {
>  
>  static inline int sizeof_long(void)
>  {
> -	return in_ia32_syscall() ? 4 : 8;
> +	return test_thread_flag(TIF_IA32) ? 4 : 8;
>  }
>  
>  static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
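The role of the word size in push emulation, and why copying 8 bytes out of a 4-byte source is rejected as the error message suggests, can be sketched in userspace. push_sketch() below is a hypothetical model and not the kernel's emulate_push_stack(): the user stack is a byte array, sp is an offset into it, and the value is first narrowed into a temporary of exactly the size being copied.

```c
#include <stdint.h>
#include <string.h>

/*
 * Emulate "push val" on a byte-array stack that grows downwards.
 * word_size plays the role of sizeof_long(): 4 for a 32-bit task,
 * 8 for a 64-bit one. Returns the new stack offset, or -1 if the
 * word size is unsupported or the stack has no room left.
 */
long push_sketch(uint8_t *stack, long sp, uint64_t val, int word_size)
{
	if (word_size != 4 && word_size != 8)
		return -1;
	if (sp < word_size)
		return -1;

	sp -= word_size;
	if (word_size == 4) {
		/* Copy from a source that really is 4 bytes wide. */
		uint32_t v32 = (uint32_t)val;

		memcpy(stack + sp, &v32, sizeof(v32));
	} else {
		memcpy(stack + sp, &val, sizeof(val));
	}
	return sp;
}
```

Here the copy length never exceeds the size of the object it is taken from, which is the property the compile-time check in the quoted error enforces.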
[PATCH 3/4] bpf: add a bpf_override_function helper
From: Josef Bacik

Error injection is sloppy and very ad-hoc. BPF could fill this niche
perfectly with its kprobe functionality. We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations. Accomplish this with the bpf_override_return
helper. This will modify the probe'd caller's return value to the specified
value and set the PC to an override function that simply returns,
bypassing the originally probed function. This gives us a nice clean way
to implement systematic error injection for all of our code paths.

Acked-by: Alexei Starovoitov
Signed-off-by: Josef Bacik
---
 arch/Kconfig                     |  3 +++
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h    |  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h           |  3 ++-
 include/linux/trace_events.h     |  1 +
 include/uapi/linux/bpf.h         |  7 -
 kernel/bpf/core.c                |  3 +++
 kernel/bpf/verifier.c            |  2 ++
 kernel/events/core.c             |  7 +
 kernel/trace/Kconfig             | 11 
 kernel/trace/bpf_trace.c         | 38 +++
 kernel/trace/trace_kprobe.c      | 55 +++-
 kernel/trace/trace_probe.h       | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
 	bool
 
+config HAVE_KPROBE_OVERRIDE
+	bool
+
 config HAVE_NMI
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
 	select HAVE_KERNEL_XZ
 	select HAVE_KPROBES
 	select HAVE_KPROBES_ON_FTRACE
+	select HAVE_KPROBE_OVERRIDE
 	select HAVE_KRETPROBES
 	select HAVE_KVM
 	select HAVE_LIVEPATCH if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
 	/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
 	return regs->ax;
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+{
+	regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode. On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
 	p->ainsn.boostable = false;
 	return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+	".type override_func, @function\n"
+	"override_func:\n"
+	" ret\n"
+	".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+	regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
 				locked:1,	/* Program image locked? */
 				gpl_compatible:1, /* Is filter GPL compatible? */
 				cb_access:1,	/* Is control block accessed? */
-				dst_needed:1;	/* Do we need dst entry? */
+				dst_needed:1,	/* Do we need dst entry? */
+				kprobe_override:1; /* Do we override a kprobe? */
 	kmemcheck_bitfield_end(meta);
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..be8bd5a8efaa 100644
--- a/include/linux/trace_events.h
+++ b/include/linu
[PATCH 4/4] samples/bpf: add a test for bpf_override_return
From: Josef Bacik

This adds a basic test for bpf_override_return to verify it works. We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Acked-by: Alexei Starovoitov
Signed-off-by: Josef Bacik
---
 samples/bpf/Makefile                      |  4 
 samples/bpf/test_override_return.sh       | 15 +++
 samples/bpf/tracex7_kern.c                | 16 
 samples/bpf/tracex7_user.c                | 28 
 tools/include/uapi/linux/bpf.h            |  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+	echo "SUCCESS!"
+else
+	echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include
+#include
+#include
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	unsigned long rc = -12;
+
+	bpf_override_return(ctx, rc);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include
+#include
+#include
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+	FILE *f;
+	char filename[256];
+	char command[256];
+	int ret;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+	f = popen(command, "r");
+	ret = pclose(f);
+
+	return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  *     @buf: buf to fill
  *     @buf_size: size of the buf
  *     Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ *	@pt_regs: pointer to struct pt_regs
+ *	@rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -732,7 +736,8 @@ union bpf_attr {
 	FN(xdp_adjust_meta),		\
 	FN(perf_event_read_value),	\
 	FN(perf_prog_read_value),	\
-	FN(getsockopt),
+	FN(getsockopt),			\
+	FN(override_return),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/to
[PATCH 1/4] add infrastructure for tagging functions as error injectable
From: Josef Bacik

Using BPF we can override kprobe'd functions and return arbitrary
values. Obviously this can be a bit unsafe, so make this feature opt-in
for functions. Simply tag a function with BPF_ALLOW_ERROR_INJECTION in
order to give BPF access to that function for error injection purposes.

Signed-off-by: Josef Bacik
---
 arch/x86/include/asm/asm.h        |   6 ++
 include/asm-generic/kprobes.h     |   9 +++
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/kprobes.h           |   1 +
 include/linux/module.h            |   5 ++
 kernel/kprobes.c                  | 163 ++
 kernel/module.c                   |   6 +-
 7 files changed, 199 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..340f4cc43255 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -85,6 +85,12 @@
 	_ASM_PTR (entry);					\
 	.popsection
 
+# define _ASM_KPROBE_ERROR_INJECT(entry)			\
+	.pushsection "_kprobe_error_inject_list","aw" ;		\
+	_ASM_ALIGN ;						\
+	_ASM_PTR (entry);					\
+	.popsection
+
 	.macro ALIGN_DESTINATION
 	/* check for bad alignment of destination */
 	movl %edi,%ecx
diff --git a/include/asm-generic/kprobes.h b/include/asm-generic/kprobes.h
index 57af9f21d148..f96c4de5d7b0 100644
--- a/include/asm-generic/kprobes.h
+++ b/include/asm-generic/kprobes.h
@@ -22,4 +22,13 @@ static unsigned long __used \
 #endif
 #endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)				\
+static unsigned long __used						\
+	__attribute__((__section__("_kprobe_error_inject_list")))	\
+	_eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+
 #endif /* _ASM_GENERIC_KPROBES_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST()	. = ALIGN(8);						\
+			VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;	\
+			KEEP(*(_kprobe_error_inject_list))			\
+			VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS()	. = ALIGN(8);					\
 			VMLINUX_SYMBOL(__start_ftrace_events) = .;	\
@@ -560,6 +569,7 @@
 	FTRACE_EVENTS()							\
 	TRACE_SYSCALLS()						\
 	KPROBE_BLACKLIST()						\
+	ERROR_INJECT_LIST()						\
 	MEM_DISCARD(init.rodata)					\
 	CLK_OF_TABLES()							\
 	RESERVEDMEM_OF_TABLES()						\
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
 extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);
 
 struct kprobe_insn_cache {
 	struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
 	ctor_fn_t *ctors;
 	unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+	unsigned int num_kprobe_ei_funcs;
+	unsigned long *kprobe_ei_funcs;
+#endif
 } cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..7afadf07b34e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash)
 	return &(kretprobe_table_locks[hash].lock);
 }
 
+/* List of symbols that can be overridden for error injection. */
+static LIST_HEAD(kprobe_error_injection_list);
+static DEFIN
[PATCH 2/4] btrfs: make open_ctree error injectable
From: Josef Bacik

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..c6b4e1f07072 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include "ctree.h"
 #include "disk-io.h"
 #include "hash.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
 		goto fail_block_groups;
 	goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);

 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
--
2.7.5
[PATCH 0/4][v6] Add the ability to do BPF directed error injection
I've reworked this to be opt-in only as per Ingo and Alexei. Still needs to go through Dave because of the bpf bits, but I need tracing guys to weigh in and sign off on my approach please.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this feature. This way only functions that opt-in will be allowed to be overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we don't tail call into something we didn't check. This allows us to make the normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough apparently.)
- Added a warning message as per Daniel's suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this value in the kprobe path, and thus only write to it if we're overriding or clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of injecting errors generically. Some subsystems (block, memory) have ways to inject errors, but they are random so it's hard to get reproducible results.

With BPF we can add determinism to our error injection. We can use kprobes and other things to verify we are injecting errors at the exact case we are trying to test. This patch gives us the tool to actually do the error injection part. It is very simple: we just set the return value of the pt_regs we're given to whatever we provide, and then override the PC with a dummy function that simply returns.

Right now this only works on x86, but it would be simple enough to expand to other architectures. Thanks,

Josef
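The "set the return value, then point the PC at a dummy function" trick described above can be modelled in plain userspace C. This is only a toy sketch of the idea, not the kernel implementation: `struct demo_regs` stands in for pt_regs, a function pointer stands in for the program counter, and every `demo_*` name is hypothetical.

```c
#include <errno.h>

/* Hypothetical stand-in for pt_regs: a return-value slot and a "PC". */
struct demo_regs {
	long ax;		/* value the caller will see as the return value */
	void (*ip)(void);	/* what runs when the probe handler finishes */
};

/* Set to 1 whenever the real function body runs. */
static int demo_body_ran;

/* The real body of the probed function, which injection wants to skip. */
static void demo_probed_body(void)
{
	demo_body_ran = 1;
}

/* Dummy target: does nothing, so the caller only sees regs->ax. */
static void demo_just_return(void)
{
}

/* Model of the override: choose the return value and skip the body. */
static void demo_override_return(struct demo_regs *regs, long rc)
{
	regs->ax = rc;
	regs->ip = demo_just_return;
}

/* Model of a probe firing at function entry, optionally injecting 'rc'. */
static long demo_call_with_probe(int inject, long rc)
{
	struct demo_regs regs = { .ax = 0, .ip = demo_probed_body };

	if (inject)
		demo_override_return(&regs, rc);
	regs.ip();
	return regs.ax;
}
```

With injection enabled, `demo_call_with_probe(1, -ENOMEM)` returns the injected error and the body never runs; with injection disabled the body runs and the call returns 0. Because the handler decides per call whether to inject, the failure can be made deterministic rather than random.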
Re: Bisected 4.14 Regression: IPsec transport mode breakage
On Fri, 2017-11-17 at 11:03 +0100, Steffen Klassert wrote: > On Wed, Nov 15, 2017 at 09:46:19AM -0700, Kevin Locke wrote: >> I have bisected the issue to commit c9f3f813d462. I have attached the >> client ipsec.conf as well as the syslog during the connection attempt >> for both c9f3f813d462 (bad) and cf3796675174 (good). > > The offending commit is already reverted in the 'net' tree > and will be available in mainline soon. Great, thank you! I tested davem/net#94802151894d and can confirm that it works and fixes the issue for me. Thanks again. -- Cheers, | ke...@kevinlocke.name| XMPP: ke...@kevinlocke.name Kevin| https://kevinlocke.name | IRC: kevinoid on freenode
Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86
On 11/15, Yonghong Song wrote:
>
> v3 -> v4:
>   . Revert most of v3 change as 32bit emulation is not really working
>     on x86_64 platform as among other issues, function emulate_push_stack()
>     needs to account for 32bit app on 64bit platform.
>     A separate effort is ongoing to address this issue.

Reviewed-by: Oleg Nesterov

Please test your patch with the fix below, in this particular case the TIF_IA32 check should be fine. Although this is not what we really want, we should probably use user_64bit_mode(regs) which checks ->cs. But this needs more changes and doesn't solve other problems (get_unmapped_area) so I still can't decide what should we do right now...

Oleg.

--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -516,7 +516,7 @@ struct uprobe_xol_ops {

 static inline int sizeof_long(void)
 {
-	return in_ia32_syscall() ? 4 : 8;
+	return test_thread_flag(TIF_IA32) ? 4 : 8;
 }

 static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
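To make it concrete why the word size matters for push emulation, here is a small userspace sketch. It is not the kernel's emulate_push_stack(); the stack is modelled as a byte buffer with an offset, the 32-bit/64-bit decision is passed in as a plain flag instead of being read from thread state, and all `demo_*` names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a user stack: a buffer plus a stack-pointer offset. */
struct demo_stack {
	uint8_t mem[64];
	size_t sp;	/* offset of the current top of stack within mem */
};

/* Word size of the probed task: 4 bytes for a 32-bit task, 8 otherwise. */
static int demo_sizeof_long(int is_32bit_task)
{
	return is_32bit_task ? 4 : 8;
}

/*
 * Emulate a push: move the stack pointer down by one word and store 'val'
 * there. Returns 0 on success, -1 if the modelled stack has no room left.
 */
static int demo_emulate_push(struct demo_stack *st, uint64_t val,
			     int is_32bit_task)
{
	size_t width = (size_t)demo_sizeof_long(is_32bit_task);

	if (st->sp < width)
		return -1;
	st->sp -= width;
	if (is_32bit_task) {
		uint32_t v32 = (uint32_t)val;	/* a 32-bit task stores 4 bytes */

		memcpy(&st->mem[st->sp], &v32, sizeof(v32));
	} else {
		memcpy(&st->mem[st->sp], &val, sizeof(val));
	}
	return 0;
}
```

If the width were taken from the wrong source, a 32-bit task would have its stack pointer moved by 8 bytes per push and the following pops would read the wrong slots, which is why the choice of check in sizeof_long() matters.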
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai writes:
>
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai writes:
>>>
>>>> Currently a mutex is used to protect the pernet operations list. It makes
>>>> cleanup_net() execute the ->exit methods of the same operations set
>>>> that was used at the time of ->init, even after the net namespace is
>>>> unlinked from net_namespace_list.
>>>>
>>>> But the problem is that it needs to synchronize_rcu() after net is removed
>>>> from net_namespace_list:
>>>>
>>>> Destroy net_ns:
>>>> cleanup_net()
>>>>   mutex_lock(&net_mutex)
>>>>   list_del_rcu(&net->list)
>>>>   synchronize_rcu()                                  <--- Sleep there for ages
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_exit_list(ops, &net_exit_list)
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_free_list(ops, &net_exit_list)
>>>>   mutex_unlock(&net_mutex)
>>>>
>>>> This primitive is not fast, especially on systems with many processors
>>>> and/or when preemptible RCU is enabled in config. So, all the time
>>>> cleanup_net() is waiting for the RCU grace period, creation of new net
>>>> namespaces is not possible; the tasks that attempt it sleep on the
>>>> same mutex:
>>>>
>>>> Create net_ns:
>>>> copy_net_ns()
>>>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>>>
>>>> The solution is to convert net_mutex to an rw_semaphore. Then the
>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>> will require down_read() locking only, while down_write() will be used
>>>> for changing pernet_list.
>>>>
>>>> This gives a significant performance increase, as you may see below. There
>>>> is measured sequential net namespace creation in a cycle, in single
>>>> thread, without other tasks (single user mode):
>>>>
>>>> 1)int main(int argc, char *argv[])
>>>> {
>>>>         unsigned nr;
>>>>         if (argc < 2) {
>>>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>>>                 return 1;
>>>>         }
>>>>         nr = atoi(argv[1]);
>>>>         while (nr-- > 0) {
>>>>                 if (unshare(CLONE_NEWNET)) {
>>>>                         perror("Can't unshare");
>>>>                         return 1;
>>>>                 }
>>>>         }
>>>>         return 0;
>>>> }
>>>>
>>>> Origin, 10 unshare(): 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>> Patched, 10 unshare(): 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>
>>>> 2)for i in {1..1}; do unshare -n bash -c exit; done
>>>>
>>>> Origin:
>>>> real 1m24,190s
>>>> user 0m6,225s
>>>> sys 0m15,132s
>>>>
>>>> Patched:
>>>> real 0m18,235s (4.6 times faster)
>>>> user 0m4,544s
>>>> sys 0m13,796s
>>>>
>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>>> from Linus tree (not in net-next yet).
>>>
>>> Using a rwsem to protect the list of operations makes sense.
>>>
>>> That should allow removing the sing
>>>
>>> I am not wild about taking the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier. I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking to add.
>>>
>>> There are about 204 instances of struct pernet_operations. That is a
>>> lot of code to have carefully audited to ensure it can run in parallel all
>>> at once. The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>> to the fact that there are data structures accessed by multiple network
>>> namespaces.
>>>
>>> My preference would be to:
>>>
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>   register and unregister, and maybe net_ns_barrier and
>>>   rtnl_link_unregister.
>>>
>>> - Factor out struct pernet_ops out of struct pernet_operations. With
>>>   struct pernet_ops not having the exit_batch method. With pernet_ops
>>>   being embedded as an anonymous member of the old struct pernet_operations.
>>>
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>   pernet_ops, that don't take net_mutex. Have them order the
>>>   pernet_list as:
>>>
>>>   pernet_sys
>>>   pernet_subsys
>>>   pernet_device
>>>   pernet_dev
>>>
>>>   With the chunk in the middle taking the net_mutex.
>>
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts to the plan below.
>>
>> The only difficult thing there will be to choose the right order
>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>> to pernet_dev one by one.
>>
>> This is rather easy in case of tristate drivers, as modules may be loaded
>> at any time, and the only important order is dependences between them.
>> So, it's possible to start from a module that has no dependences,
>> and move it to pernet_sys, and then continue with
[PATCH 1/2] gre6: use log_ecn_error module parameter in ip6_tnl_rcv()
After commit 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions") the log_ecn_error module parameter is not used anywhere in the module; previously it was used in ip6gre_rcv().

Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Signed-off-by: Alexey Kodanev
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 59c121b..5d6bee0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -461,7 +461,7 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi)
 				  &ipv6h->saddr, &ipv6h->daddr, tpi->key,
 				  tpi->proto);
 	if (tunnel) {
-		ip6_tnl_rcv(tunnel, skb, tpi, NULL, false);
+		ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);

 		return PACKET_RCVD;
 	}
--
1.8.3.1
[PATCH 2/2] ip6_tunnel: pass tun_dst arg from ip6_tnl_rcv() to __ip6_tnl_rcv()
Otherwise the tun_dst argument is unused there. Currently, ip6_tnl_rcv() is invoked with tun_dst set to NULL, so this patch introduces no functional change.

Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Alexey Kodanev
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a1c2444..bc050e8 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -869,7 +869,7 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb,
 		struct metadata_dst *tun_dst,
 		bool log_ecn_err)
 {
-	return __ip6_tnl_rcv(t, skb, tpi, NULL, ip6ip6_dscp_ecn_decapsulate,
+	return __ip6_tnl_rcv(t, skb, tpi, tun_dst, ip6ip6_dscp_ecn_decapsulate,
 			     log_ecn_err);
 }
 EXPORT_SYMBOL(ip6_tnl_rcv);
--
1.8.3.1
Re: [PATCH] qed: fix unnecessary call to memset cocci warnings
On Fri, Nov 17, 2017 at 12:04 AM, Vasyl Gomonovych wrote:
> Use kzalloc rather than kmalloc followed by memset with 0
>
> drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING:
> kzalloc should be used for dcbx_info, instead of kmalloc/memset
> Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

While this looks okay per se now, it would be good if you put the version of the patch and added a changelog to it.

I think there is no need to resend this one, just for your information.

Reviewed-by: Andy Shevchenko

> Signed-off-by: Vasyl Gomonovych
> ---
>  drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> index 8f6ccc0c39e5..cc9e0dfcee48 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> @@ -1277,11 +1277,10 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct qed_hwfn *hwfn,
>  {
>  	struct qed_dcbx_get *dcbx_info;
>
> -	dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC);
> +	dcbx_info = kzalloc(sizeof(*dcbx_info), GFP_ATOMIC);
>  	if (!dcbx_info)
>  		return NULL;
>
> -	memset(dcbx_info, 0, sizeof(*dcbx_info));
>  	if (qed_dcbx_query_params(hwfn, dcbx_info, type)) {
>  		kfree(dcbx_info);
>  		return NULL;
> --
> 1.9.1
>

--
With Best Regards,
Andy Shevchenko
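The same cleanup pattern exists outside the kernel, which may help when reading the coccinelle warning. The sketch below uses the userspace analogues malloc()+memset() versus calloc(); `struct demo_info` and the `demo_*` functions are hypothetical and only illustrate that the two forms yield the same zeroed object.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure standing in for the driver's info block. */
struct demo_info {
	int enabled;
	unsigned char prio[8];
};

/* Two-step form: allocate, then clear. Analogous to kmalloc() + memset(). */
static struct demo_info *demo_alloc_two_step(void)
{
	struct demo_info *info = malloc(sizeof(*info));

	if (!info)
		return NULL;
	memset(info, 0, sizeof(*info));
	return info;
}

/* One-step form: allocate already-zeroed memory. Analogous to kzalloc(). */
static struct demo_info *demo_alloc_zeroed(void)
{
	return calloc(1, sizeof(struct demo_info));
}
```

The one-step form is shorter and removes the window in which the object exists but is not yet cleared, so an error path added later cannot accidentally observe uninitialised memory.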
Re: [PATCH net] sctp: report SCTP_ERROR_INV_STRM as cpu endian
On Fri, Nov 17, 2017 at 02:15:02PM +0800, Xin Long wrote:
> rfc6458 demands the send_error in SCTP_SEND_FAILED_EVENT should
> be in cpu endian, while SCTP_ERROR_INV_STRM is in big endian.
>
> This issue is there since very beginning, Eric noticed it by
> running 'make C=2 M=net/sctp/'.
>
> This patch is to convert it before reporting it.

Unfortunately we can't fix this as this will break UAPI. It will break applications that are currently matching on the current value.

>
> Reported-by: Eric Dumazet
> Signed-off-by: Xin Long
> ---
>  net/sctp/stream.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index a11db21..f86ceee 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -64,7 +64,7 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
>  			 */
>
>  			/* Mark as failed send. */
> -			sctp_chunk_fail(ch, SCTP_ERROR_INV_STRM);
> +			sctp_chunk_fail(ch, be16_to_cpu(SCTP_ERROR_INV_STRM));
>  			if (asoc->peer.prsctp_capable &&
>  			    SCTP_PR_PRIO_ENABLED(ch->sinfo.sinfo_flags))
>  				asoc->sent_cnt_removable--;
> --
> 2.1.0
>
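The UAPI concern can be illustrated with a few lines of userspace C. This sketch assumes the error code in question is 1 (the "Invalid Stream Identifier" cause) stored in network byte order; it uses htons()/ntohs() as the userspace counterparts of cpu_to_be16()/be16_to_cpu(), and the `demo_*` names are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Value as reported today: the big-endian constant, passed through as-is. */
static uint16_t demo_reported_raw(void)
{
	return htons(0x0001);
}

/* Value as the proposed patch would report it: converted to host order. */
static uint16_t demo_reported_converted(void)
{
	return ntohs(htons(0x0001));
}

/* An existing application that compares against the current raw value. */
static int demo_app_matches(uint16_t send_error)
{
	return send_error == htons(0x0001);
}
```

On a little-endian host the raw value is 0x0100 while the converted value is 1, so an application written against today's behaviour stops matching after the conversion. On a big-endian host both values are 1 and nothing changes.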
Re: [PATCH net] sctp: set frag_point in sctp_setsockopt_maxseg correctly
On Fri, Nov 17, 2017 at 02:11:11PM +0800, Xin Long wrote:
> Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with
> val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect.
>
> val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT.
> Then in sctp_datamsg_from_user(), when its value is greater than cookie
> echo len and trying to bundle with cookie echo chunk, the first_len will
> overflow.
>
> The worst case is when its value is equal to cookie echo len, first_len
> becomes 0, it will go into a dead loop for fragment later on. In Hangbin
> syzkaller testing env, oom was even triggered due to consecutive memory
> allocation in that loop.
>
> Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, it should
> deduct the data header for frag_point or user_frag check.
>
> This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting
> the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when
> setting frag_point via sockopt. It also improves sctp_setsockopt_maxseg
> codes.
>
> Suggested-by: Marcelo Ricardo Leitner
> Reported-by: Hangbin Liu
> Signed-off-by: Xin Long

Acked-by: Marcelo Ricardo Leitner

> ---
>  include/net/sctp/sctp.h |  3 ++-
>  net/sctp/socket.c       | 29 +++--
>  2 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index d7d8cba..749a428 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -444,7 +444,8 @@ static inline int sctp_frag_point(const struct sctp_association *asoc, int pmtu)
>  	if (asoc->user_frag)
>  		frag = min_t(int, frag, asoc->user_frag);
>
> -	frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN));
> +	frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN -
> +					    sizeof(struct sctp_data_chunk)));
>
>  	return frag;
>  }
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 4c0a772..3204a9b 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -3140,9 +3140,9 @@ static int sctp_setsockopt_mappedv4(struct sock *sk, char __user *optval, unsign
>   */
>  static int sctp_setsockopt_maxseg(struct sock *sk, char __user *optval, unsigned int optlen)
>  {
> +	struct sctp_sock *sp = sctp_sk(sk);
>  	struct sctp_assoc_value params;
>  	struct sctp_association *asoc;
> -	struct sctp_sock *sp = sctp_sk(sk);
>  	int val;
>
>  	if (optlen == sizeof(int)) {
> @@ -3158,26 +3158,35 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char __user *optval, unsigned
>  		if (copy_from_user(&params, optval, optlen))
>  			return -EFAULT;
>  		val = params.assoc_value;
> -	} else
> +	} else {
>  		return -EINVAL;
> +	}
>
> -	if ((val != 0) && ((val < 8) || (val > SCTP_MAX_CHUNK_LEN)))
> -		return -EINVAL;
> +	if (val) {
> +		int min_len, max_len;
>
> -	asoc = sctp_id2assoc(sk, params.assoc_id);
> -	if (!asoc && params.assoc_id && sctp_style(sk, UDP))
> -		return -EINVAL;
> +		min_len = SCTP_DEFAULT_MINSEGMENT - sp->pf->af->net_header_len;
> +		min_len -= sizeof(struct sctphdr) +
> +			   sizeof(struct sctp_data_chunk);
> +
> +		max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
>
> +		if (val < min_len || val > max_len)
> +			return -EINVAL;
> +	}
> +
> +	asoc = sctp_id2assoc(sk, params.assoc_id);
>  	if (asoc) {
>  		if (val == 0) {
> -			val = asoc->pathmtu;
> -			val -= sp->pf->af->net_header_len;
> +			val = asoc->pathmtu - sp->pf->af->net_header_len;
>  			val -= sizeof(struct sctphdr) +
> -					sizeof(struct sctp_data_chunk);
> +			       sizeof(struct sctp_data_chunk);
>  		}
>  		asoc->user_frag = val;
>  		asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
>  	} else {
> +		if (params.assoc_id && sctp_style(sk, UDP))
> +			return -EINVAL;
>  		sp->user_frag = val;
>  	}
>
> --
> 2.1.0
>
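A worked example of the new bounds may help when reviewing the range check. The sketch below mirrors the arithmetic from the quoted patch in userspace. The numeric constants are assumptions, not taken from the patch text: 512 for SCTP_DEFAULT_MINSEGMENT, 65532 for SCTP_MAX_CHUNK_LEN, 12 bytes for the SCTP common header, 16 bytes for a DATA chunk header, and 20 bytes for an IPv4 header. All `DEMO_*` / `demo_*` names are hypothetical.

```c
#include <stdbool.h>

/* Assumed sizes and limits; see the lead-in for where each comes from. */
enum {
	DEMO_DEFAULT_MINSEGMENT	= 512,
	DEMO_MAX_CHUNK_LEN	= 65532,
	DEMO_SCTPHDR_LEN	= 12,
	DEMO_DATA_CHUNK_LEN	= 16,
	DEMO_IPV4_HDR_LEN	= 20,
};

/* Smallest accepted non-zero value for a given network header length. */
static int demo_maxseg_min(int net_header_len)
{
	int min_len = DEMO_DEFAULT_MINSEGMENT - net_header_len;

	min_len -= DEMO_SCTPHDR_LEN + DEMO_DATA_CHUNK_LEN;
	return min_len;
}

/* Largest accepted value: a whole chunk minus the DATA chunk header. */
static int demo_maxseg_max(void)
{
	return DEMO_MAX_CHUNK_LEN - DEMO_DATA_CHUNK_LEN;
}

/* Zero means "use the default" and is always accepted. */
static bool demo_maxseg_valid(int val, int net_header_len)
{
	if (val == 0)
		return true;
	return val >= demo_maxseg_min(net_header_len) &&
	       val <= demo_maxseg_max();
}
```

Under these assumptions the accepted IPv4 range is 464 to 65516 bytes, so a value such as 8, which the old check allowed, is now rejected.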
Re: [PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define
On Fri, Nov 17, 2017 at 3:19 PM, Greg Kroah-Hartman wrote: > There is no need to #define the license of the driver, just put it in > the MODULE_LICENSE() line directly as a text string. > > This allows tools that check that the module license matches the source > code license to work properly, as there is no need to unwind the > unneeded dereference. > > Cc: "David S. Miller" > Cc: Andreas Kemnade > Cc: Johan Hovold > Reported-by: Philippe Ombredanne > Signed-off-by: Greg Kroah-Hartman Reviewed-by: Philippe Ombredanne -- Cordially Philippe Ombredanne
Re: regression: UFO removal breaks kvm live migration
>> Okay, I will send a patch to reinstate UFO for this use case (only). There >> is some related work in tap_handle_frame and packet_direct_xmit to >> segment directly in the device. I will be traveling the next few days, so >> it won't be in time for 4.14 (but can go in stable later, of course). > > I'm finishing up and running some tests. The majority of the patch is a > straightforward partial revert of the patchset, so while fairly large for a > patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all > thoroughly tested code. Notably absent are the protocol layer and > hardware support (NETIF_F_UFO) portions. > > The only open issue is whether to rely on existing skb_gso_segment > processing in the transmit path from validate_xmit_skb or to add new > skb_gso_segment calls directly to tun_get_user, tap_get_user and > pf_packet. Tun has to loop around four different ways of injecting > packets into the device. Something like the below snippet. > > More conservative is to introduce no completely new code and rely on > validate_xmit_skb, but that means having to protect the entire stack > against skbs with SKB_GSO_UDP, so also bringing back some > checksum and fragment handling snippets in gre_gso_segment, > __skb_udp_tunnel_segment, act_csum and openvswitch. Come to think of it, as this patch does not bring back NETIF_F_UFO support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be excluded. Then this is probably the simpler and more obviously correct approach.
Re: [PATCH] sfp: Add support for DWDM SFP modules
From: Russell King - ARM Linux
Date: Fri, 17 Nov 2017 09:52:10 +

> I already have a stack of patches for phy, phylink and sfp that I
> need to send, including documentation patches which Florian has
> already found very useful and helpful. I had assumed that net-next
> was already closed, being almost a week into the merge window.

Yes it is.

Thanks for the info, I'll mark this 'deferred' in patchwork. Please have this respun and posted once net-next is opened back up and the various issues have been sorted out.

Thank you.
Re: regression: UFO removal breaks kvm live migration
On Fri, Nov 10, 2017 at 12:32 AM, Willem de Bruijn wrote:
> On Wed, Nov 8, 2017 at 9:53 PM, Jason Wang wrote:
>>
>>
>> On 2017年11月08日 20:32, David Miller wrote:
>>>
>>> From: Jason Wang
>>> Date: Wed, 8 Nov 2017 17:25:48 +0900
>>>
>>>> On 2017年11月08日 17:08, Willem de Bruijn wrote:
>>>>>
>>>>> That won't help in the short term. I'm still reading up to see if there are
>>>>> any other options besides reimplement or advertise-but-drop, such as
>>>>> an implicit trigger that would make the guest renegotiate. It's unlikely, but
>>>>> worth a look..
>>>>
>>>> Yes, this looks hard. And even if we can manage to do this, it looks
>>>> an overkill since it will impact all guest after migration.
>>>
>>> Like Willem I would much prefer "advertise-but-drop" if it works.
>>
>>
>> This makes migration work but all guest UFO traffic will stall.
>>
>>>
>>> In the long term feature renegotiation triggers are a must.
>>>
>>> There is no way for us to remove features otherwise.
>>
>>
>> We can remove if we don't break userspace(guest).
>>
>>> In my opinion
>>> this will even make migrations more powerful.
>>
>>
>> But this does not help for guest running old version of kernel which still
>> think UFO work.
>
> Indeed, if we have to support live migration of arbitrary old guests
> without any expectations on hypervisor version either, features can
> simply never be reverted, even if a negotiation interface exists.
>
> At least for upcoming features and devices, guest code should not
> have this expectation, but from the start allow renegotiation such as
> CTRL_GUEST_OFFLOADS [1] based on a host trigger. But for
> tuntap TUNSETOFFLOAD it seems that ship has sailed.
>
> Okay, I will send a patch to reinstate UFO for this use case (only). There
> is some related work in tap_handle_frame and packet_direct_xmit to
> segment directly in the device. I will be traveling the next few days, so
> it won't be in time for 4.14 (but can go in stable later, of course).

I'm finishing up and running some tests. The majority of the patch is a straightforward partial revert of the patchset, so while fairly large for a patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all thoroughly tested code. Notably absent are the protocol layer and hardware support (NETIF_F_UFO) portions.

The only open issue is whether to rely on existing skb_gso_segment processing in the transmit path from validate_xmit_skb or to add new skb_gso_segment calls directly to tun_get_user, tap_get_user and pf_packet. Tun has to loop around four different ways of injecting packets into the device. Something like the below snippet.

More conservative is to introduce no completely new code and rely on validate_xmit_skb, but that means having to protect the entire stack against skbs with SKB_GSO_UDP, so also bringing back some checksum and fragment handling snippets in gre_gso_segment, __skb_udp_tunnel_segment, act_csum and openvswitch.

A third option is to send the conservative approach to net, then in net-next follow up with a patch to plug the SKB_GSO_UDP directly in the devices and revert the tunnel/act/openvswitch stanzas. I'm leaning towards that approach.

@@ -1380,7 +1380,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    int noblock, bool more)
 {
 	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
-	struct sk_buff *skb;
+	struct sk_buff *skb, *segs = NULL;
 	size_t total_len = iov_iter_count(from);
 	size_t len = total_len, align = tun->align, linear;
 	struct virtio_net_hdr gso = { 0 };
@@ -1552,12 +1552,33 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	}

 	rxhash = __skb_get_hash_symmetric(skb);
+
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) {
+		skb_push(skb, ETH_HLEN);
+		segs = __skb_gso_segment(skb, netif_skb_features(skb), false);
+
+		if (IS_ERR(segs)) {
+			kfree_skb(skb);
+			return PTR_ERR(segs);
+		}
+
+		if (segs) {
+			consume_skb(skb);
+			skb = segs;
+		}
+again:
+		skb_pull(skb, ETH_HLEN);
+		segs = skb->next;
+		skb->next = NULL;
+	}
+
 #ifndef CONFIG_4KSTACKS
-	tun_rx_batched(tun, tfile, skb, more);
+	tun_rx_batched(tun, tfile, skb, more || segs);
 #else
 	netif_rx_ni(skb);
 #endif
+	if (segs) {
+		skb = segs;
+		goto again;
+	}
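The control flow of the snippet (detach each segment from the chain, hand it up, and signal "more" while segments remain) can be written as a plain loop. The following userspace sketch is only a model of that pattern: `struct demo_seg` stands in for an skb linked through ->next, `demo_deliver()` stands in for the receive call, and all `demo_*` names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for a segment linked through ->next. */
struct demo_seg {
	int id;
	struct demo_seg *next;
};

/* Records what was delivered, and with which "more" hint. */
struct demo_sink {
	int ids[8];
	int more[8];
	int count;
};

/* Stand-in for the receive call; 'more' hints that further data follows. */
static void demo_deliver(struct demo_sink *sink, struct demo_seg *seg, int more)
{
	if (sink->count < 8) {
		sink->ids[sink->count] = seg->id;
		sink->more[sink->count] = more;
		sink->count++;
	}
}

/*
 * Walk a segment chain: unlink each segment before delivering it, and keep
 * the "more" hint set until the last segment of the chain.
 */
static void demo_rx_segments(struct demo_sink *sink, struct demo_seg *head,
			     int more)
{
	struct demo_seg *seg = head;

	while (seg) {
		struct demo_seg *next = seg->next;

		seg->next = NULL;
		demo_deliver(sink, seg, more || next != NULL);
		seg = next;
	}
}
```

Unlinking before delivery matters because the receiver takes ownership of a single segment; leaving ->next set would hand it the rest of the chain as well.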