[PATCH v2] net: alx: use custom skb allocator
This patch follows Eric Dumazet's commit 7b70176421 for the Atheros atl1c
driver to fix exactly the same bug in the alx driver: the network link is
lost 1-5 minutes after the device is brought up.

My laptop, a Lenovo Y580 with an Atheros AR8161 ethernet device, hit this
problem with kernel 4.4, and it is cured by Jarod Wilson's commit c406700c
for the alx driver, which was merged in 4.5. But there are still some alx
devices that can't function well even with Jarod's patch, while this patch
makes them work fine. More details on
https://bugzilla.kernel.org/show_bug.cgi?id=70761

Debugging shows the issue is very likely related to the RX DMA address,
specifically one ending in 0x...f80: if the RX buffer gets an 0x...f80
address several times, there will be an RX overflow error and the device
will stop working.

For kernel 4.5.0 with Jarod's patch, which works fine with my
AR8161/Lenovo Y580, if I change __netdev_alloc_skb --> __alloc_page_frag()
so that the allocated buffer gets an address ending in 0x...f80, then the
same error happens. If I make it 0x...f40 or 0x...fc0, everything is still
fine. So I tend to believe that the 0x...f80 address causes the silicon to
behave abnormally.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=70761
Cc: Eric Dumazet
Cc: Johannes Berg
Cc: Jarod Wilson
Signed-off-by: Feng Tang
Tested-by: Ole Lukoie
---
 drivers/net/ethernet/atheros/alx/alx.h  |  4 +++
 drivers/net/ethernet/atheros/alx/main.c | 48 -
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/atheros/alx/alx.h b/drivers/net/ethernet/atheros/alx/alx.h
index 8fc93c5..d02c424 100644
--- a/drivers/net/ethernet/atheros/alx/alx.h
+++ b/drivers/net/ethernet/atheros/alx/alx.h
@@ -96,6 +96,10 @@ struct alx_priv {
 	unsigned int rx_ringsz;
 	unsigned int rxbuf_size;
 
+	struct page *rx_page;
+	unsigned int rx_page_offset;
+	unsigned int rx_frag_size;
+
 	struct napi_struct napi;
 	struct alx_tx_queue txq;
 	struct alx_rx_queue rxq;
diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index 9fe8b5e..c98acdc 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -70,6 +70,35 @@ static void alx_free_txbuf(struct alx_priv *alx, int entry)
 	}
 }
 
+static struct sk_buff *alx_alloc_skb(struct alx_priv *alx, gfp_t gfp)
+{
+	struct sk_buff *skb;
+	struct page *page;
+
+	if (alx->rx_frag_size > PAGE_SIZE)
+		return __netdev_alloc_skb(alx->dev, alx->rxbuf_size, gfp);
+
+	page = alx->rx_page;
+	if (!page) {
+		alx->rx_page = page = alloc_page(gfp);
+		if (unlikely(!page))
+			return NULL;
+		alx->rx_page_offset = 0;
+	}
+
+	skb = build_skb(page_address(page) + alx->rx_page_offset,
+			alx->rx_frag_size);
+	if (likely(skb)) {
+		alx->rx_page_offset += alx->rx_frag_size;
+		if (alx->rx_page_offset >= PAGE_SIZE)
+			alx->rx_page = NULL;
+		else
+			get_page(page);
+	}
+	return skb;
+}
+
+
 static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 {
 	struct alx_rx_queue *rxq = &alx->rxq;
@@ -86,7 +115,7 @@ static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 	while (!cur_buf->skb && next != rxq->read_idx) {
 		struct alx_rfd *rfd = &rxq->rfd[cur];
 
-		skb = __netdev_alloc_skb(alx->dev, alx->rxbuf_size, gfp);
+		skb = alx_alloc_skb(alx, gfp);
 		if (!skb)
 			break;
 		dma = dma_map_single(&alx->hw.pdev->dev,
@@ -124,6 +153,7 @@ static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 		alx_write_mem16(&alx->hw, ALX_RFD_PIDX, cur);
 	}
 
+
 	return count;
 }
 
@@ -592,6 +622,11 @@ static void alx_free_rings(struct alx_priv *alx)
 	kfree(alx->txq.bufs);
 	kfree(alx->rxq.bufs);
 
+	if (alx->rx_page) {
+		put_page(alx->rx_page);
+		alx->rx_page = NULL;
+	}
+
 	dma_free_coherent(&alx->hw.pdev->dev,
 			  alx->descmem.size,
 			  alx->descmem.virt,
@@ -646,6 +681,7 @@ static int alx_request_irq(struct alx_priv *alx)
 				  alx->dev->name, alx);
 		if (!err)
 			goto out;
+		/* fall back to legacy interrupt */
 		pci_disable_msi(alx->hw.pdev);
 	}
 
@@ -689,6 +725,7 @@ static int alx_init_sw(struct alx_priv *alx)
 	struct pci_dev *pdev = alx->hw.pdev;
 	struct alx_hw *hw = &alx->hw;
 	int err;
+	unsigned int head_size;
 
 	err = alx_identify_hw(alx);
 	if (err) {
@@ -704,7 +741,12 @@ static int alx_init_sw(struct alx
Re: [PATCH] net: alx: use custom skb allocator
Hi Jarod,

On Fri, May 20, 2016 at 02:26:57PM -0400, Jarod Wilson wrote:
> On Fri, May 20, 2016 at 03:56:23PM +0800, Feng Tang wrote:
> > Hi,
> >
> > Please ignore this patch.
> >
> > I found the problem and made the patch with kernel 4.4 with Ubuntu 12.04
> > on Lenovo Y580.
> >
> > After submitting the patch, I tried to google the datasheet for the
> > Atheros 8161 dev, and found there was already a kernel bugzilla
> > https://bugzilla.kernel.org/show_bug.cgi?id=70761
> > that had a fix from Jarod Wilson which got merged in v4.5 (commit
> > c406700cd)
> >
> > Per my test, the new 4.6.0 kernel works fine with Jarod's patch.
>
> Might not hurt to get this some testing by people for whom my patch didn't
> fix things. It seems to help for a fair number of people, but not
> everyone, so perhaps this would be of use as well.

Thanks for the suggestion, and you are right. Some users whose AR8161
can't be cured by your patch report that the device works with my patch,
though one reported that TX is slower while RX is not (which I can't
reproduce on my Lenovo Y580).
https://bugzilla.kernel.org/show_bug.cgi?id=70761

So I think this patch is better off merged, and I will send out a v2.

I did some more debugging; the issue looks very likely related to the RX
DMA address, specifically one ending in 0x...f80: if the RX buffer gets an
0x...f80 address several times, there will be an RX overflow error and the
card will stop working.

For kernel 4.5.0, which merged Jarod's patch and works fine with my
AR8161/Lenovo Y580, if I change __netdev_alloc_skb --> __alloc_page_frag()
so that the allocated buffer gets an address ending in 0x...f80, then the
same error happens. If I make it 0x...f40 or 0x...fc0, everything is still
fine. So I tend to believe that the 0x...f80 address causes the silicon to
behave abnormally.

Thanks,
Feng

> Personally, my own alx-driven hardware didn't have a problem to begin
> with, so I'm not able to do more than make sure things don't regress
> w/already functioning hardware.
> > On Fri, May 20, 2016 at 01:41:03PM +0800, Feng Tang wrote:
> > > This patch follows Eric Dumazet's commit 7b70176421 to fix exactly
> > > the same bug in the alx driver, that the network link will
> > > be lost in 1-5 minutes after the device is up.
> > >
> > > Following is a git log from Eric's 7b70176421:
> > >
> > > "We had reports ( https://bugzilla.kernel.org/show_bug.cgi?id=54021 )
> > > that using high order pages for skb allocations is problematic for atl1c
> > >
> > > We do not know exactly what the problem is, but we suspect that crossing
> > > 4K pages is not well supported by this hardware.
> > >
> > > Use a custom allocator, using page allocator and 2K fragments for
> > > optimal stack behavior. We might make this allocator generic
> > > in future kernels."
> > >
> > > And my debug shows the same suspect: most of the errors happen
> > > when there is an RX buffer address ending in 0x..f80; hopefully
> > > this will get noticed and fixed on the silicon side.
> > >
> > > My victim is a Lenovo Y580 laptop with an Atheros ALX AR8161 ethernet
> > > device (PCI ID 1969:1091); with this patch the ethernet dev
> > > works just fine.
> > >
> > > Signed-off-by: Feng Tang
Re: [RFC 08/12] ipv6: introduce neighbour discovery ops
Hi,

Alexander Aring wrote:
> This patch introduces a neighbour discovery ops callback structure. The
> idea is to separate the handling for 6LoWPAN into the 6lowpan module.
>
> These callbacks offer 6lowpan different handling, such as 802.15.4 short
> address handling or RFC 6775 (Neighbor Discovery Optimization for IPv6
> over 6LoWPANs).
>
> Cc: David S. Miller
> Cc: Alexey Kuznetsov
> Cc: James Morris
> Cc: Hideaki YOSHIFUJI
> Cc: Patrick McHardy
> Signed-off-by: Alexander Aring
> ---
>  include/linux/netdevice.h |   5 ++
>  include/net/ndisc.h       | 176 +-
>  net/ipv6/addrconf.c       |   9 ++-
>  net/ipv6/ndisc.c          | 119 +--
>  net/ipv6/route.c          |  14 ++--
>  5 files changed, 275 insertions(+), 48 deletions(-)
>
> @@ -205,6 +376,9 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target);
>  int ndisc_mc_map(const struct in6_addr *addr, char *buf, struct net_device *dev,
>  		 int dir);
>
> +void ndisc_neigh_update(const struct net_device *dev, struct neighbour *neigh,
> +			const u8 *lladdr, u8 new, u32 flags, u8 icmp6_type,
> +			struct ndisc_options *ndopts);
>

I prefer ndisc_update().
>  /*
>   *	IGMP
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 393cdbf..4506cac 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -2531,7 +2531,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
>
>  	if (pinfo->autoconf && in6_dev->cnf.autoconf) {
>  		struct in6_addr addr;
> -		bool tokenized = false;
> +		bool tokenized = false, dev_addr_generated = false;
>
>  		if (pinfo->prefix_len == 64) {
>  			memcpy(&addr, &pinfo->prefix, 8);
> @@ -2551,6 +2551,8 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
>  				 ipv6_inherit_eui64(addr.s6_addr + 8, in6_dev)) {
>  				in6_dev_put(in6_dev);
>  				return;
> +			} else {
> +				dev_addr_generated = true;
>  			}
>  			goto ok;
>  		}
> @@ -2564,6 +2566,11 @@ ok:
>  					  addr_type, addr_flags, sllao,
>  					  tokenized, valid_lft,
>  					  prefered_lft);
> +		ndisc_ops_prefix_rcv_add_addr(net, dev, pinfo, in6_dev, &addr,
> +					      addr_type, addr_flags, sllao,
> +					      tokenized, valid_lft,
> +					      prefered_lft,
> +					      dev_addr_generated);
>  	}
>  	inet6_prefix_notify(RTM_NEWPREFIX, in6_dev, pinfo);
>  	in6_dev_put(in6_dev);
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index d794d64..99fd53c 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -191,24 +191,28 @@ static struct nd_opt_hdr *ndisc_next_option(struct nd_opt_hdr *cur,
>  	return cur <= end && cur->nd_opt_type == type ? cur : NULL;
>  }
>
> -static inline int ndisc_is_useropt(struct nd_opt_hdr *opt)
> +static inline int ndisc_is_useropt(const struct net_device *dev,
> +				   struct nd_opt_hdr *opt)
>  {
>  	return opt->nd_opt_type == ND_OPT_RDNSS ||
> -		opt->nd_opt_type == ND_OPT_DNSSL;
> +		opt->nd_opt_type == ND_OPT_DNSSL ||
> +		ndisc_ops_is_useropt(dev, opt->nd_opt_type);
>  }
>
> -static struct nd_opt_hdr *ndisc_next_useropt(struct nd_opt_hdr *cur,
> +static struct nd_opt_hdr *ndisc_next_useropt(const struct net_device *dev,
> +					     struct nd_opt_hdr *cur,
>  					     struct nd_opt_hdr *end)
>  {
>  	if (!cur || !end || cur >= end)
>  		return NULL;
>  	do {
>  		cur = ((void *)cur) + (cur->nd_opt_len << 3);
> -	} while (cur < end && !ndisc_is_useropt(cur));
> -	return cur <= end && ndisc_is_useropt(cur) ? cur : NULL;
> +	} while (cur < end && !ndisc_is_useropt(dev, cur));
> +	return cur <= end && ndisc_is_useropt(dev, cur) ? cur : NULL;
>  }
>
> -struct ndisc_options *ndisc_parse_options(u8 *opt, int opt_len,
> +struct ndisc_options *ndisc_parse_options(const struct net_device *dev,
> +					  u8 *opt, int opt_len,
>  					  struct ndisc_options *ndopts)
>  {
>  	struct nd_opt_hdr *nd_opt = (struct nd_opt_hdr *)opt;
> @@ -223,6 +227,8 @@ struct ndisc_options *ndisc_parse_options(u8 *opt, int opt_len,
>  		l = nd_opt->nd_opt_len << 3;
>  		if (opt_len < l || l == 0)
>  			return NULL;
> +
Re: [RFC 04/12] ndisc: get rid off dev parameter in ndisc_opt_addr_space
Alexander Aring wrote:
> This patch removes the net_device parameter from the ndisc_opt_addr_space
> function. This is useful for calling such functionality from code which
> doesn't depend on a dev parameter. For the existing functionality which
> does depend on the dev parameter, we introduce ndisc_dev_opt_addr_space
> as an easy replacement for the ndisc_opt_addr_space function.
>
> Cc: David S. Miller
> Cc: Alexey Kuznetsov
> Cc: James Morris
> Cc: Hideaki YOSHIFUJI
> Cc: Patrick McHardy
> Signed-off-by: Alexander Aring
> ---
>  include/net/ndisc.h | 13 +
>  net/ipv6/ndisc.c    | 12 ++--
>  2 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/ndisc.h b/include/net/ndisc.h
> index 2d8edaa..dbc8d01 100644
> --- a/include/net/ndisc.h
> +++ b/include/net/ndisc.h
> @@ -127,10 +127,15 @@ static inline int ndisc_addr_option_pad(unsigned short type)
>  	}
>  }
>
> -static inline int ndisc_opt_addr_space(struct net_device *dev)
> +static inline int ndisc_opt_addr_space(unsigned char addr_len, int pad)
>  {
> -	return NDISC_OPT_SPACE(dev->addr_len +
> -			       ndisc_addr_option_pad(dev->type));
> +	return NDISC_OPT_SPACE(addr_len + pad);
> +}
> +
> +static inline int ndisc_dev_opt_addr_space(const struct net_device *dev)
> +{
> +	return ndisc_opt_addr_space(dev->addr_len,
> +				    ndisc_addr_option_pad(dev->type));
> +}
>

I prefer not to change existing functions such as ndisc_opt_addr_space(),
and to name the new functions __ndisc_opt_addr_space() etc.

Plus, my original thought (when I implemented these functions) was to have
a per-net_device ndisc_opt_addr_space(), ndisc_opt_addr_data() etc.
What do you think of that?
>  static inline u8 *ndisc_opt_addr_data(struct nd_opt_hdr *p,
>
> @@ -139,7 +144,7 @@ static inline u8 *ndisc_opt_addr_data(struct nd_opt_hdr *p,
>  	u8 *lladdr = (u8 *)(p + 1);
>  	int lladdrlen = p->nd_opt_len << 3;
>  	int prepad = ndisc_addr_option_pad(dev->type);
> -	if (lladdrlen != ndisc_opt_addr_space(dev))
> +	if (lladdrlen != ndisc_opt_addr_space(dev->addr_len, prepad))
>  		return NULL;
>  	return lladdr + prepad;
>  }
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index c245895..2241f06 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -152,9 +152,9 @@ EXPORT_SYMBOL_GPL(nd_tbl);
>
>  static void ndisc_fill_addr_option(struct sk_buff *skb, int type, void *data)
>  {
> -	int pad = ndisc_addr_option_pad(skb->dev->type);
> +	int pad = ndisc_addr_option_pad(skb->dev->type);
>  	int data_len = skb->dev->addr_len;
> -	int space = ndisc_opt_addr_space(skb->dev);
> +	int space = ndisc_opt_addr_space(data_len, pad);
>  	u8 *opt = skb_put(skb, space);
>
>  	opt[0] = type;
> @@ -509,7 +509,7 @@ void ndisc_send_na(struct net_device *dev, const struct in6_addr *daddr,
>  	if (!dev->addr_len)
>  		inc_opt = 0;
>  	if (inc_opt)
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>
>  	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
>  	if (!skb)
> @@ -574,7 +574,7 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
>  	if (ipv6_addr_any(saddr))
>  		inc_opt = false;
>  	if (inc_opt)
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>
>  	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
>  	if (!skb)
> @@ -626,7 +626,7 @@ void ndisc_send_rs(struct net_device *dev, const struct in6_addr *saddr,
>  	}
>  #endif
>  	if (send_sllao)
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>
>  	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
>  	if (!skb)
> @@ -1563,7 +1563,7 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
>  		memcpy(ha_buf, neigh->ha, dev->addr_len);
>  		read_unlock_bh(&neigh->lock);
>  		ha = ha_buf;
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>  	} else
>  		read_unlock_bh(&neigh->lock);
>

-- 
Hideaki Yoshifuji
Technical Division, MIRACLE LINUX CORPORATION
Re: [RFC 00/12] 6lowpan: introduce 6lowpan-nd
Hi.

Alexander Aring wrote:
> Hi,
>
> this patch series introduces the ndisc ops callback structure to add
> different handling for IPv6 neighbour discovery cache functionality. It
> implements at first the two following use-cases:
>
>  - 6CO handling as userspace option (for all 6LoWPAN layers, BTLE/802.15.4) [0]
>  - short address handling for 802.15.4 6LoWPAN only [1]
>
> Since my last patch series, I completely changed the whole ndisc_ops
> callback structure so that it does not replace the whole ndisc
> functionality at the recv/send level of NS/NA/RS/RA, as in my previous
> patch series "6lowpan: introduce basic 6lowpan-nd". It now adds
> different handling at a very low level of the ndisc functionality.

Thank you for your work! It looks much better now.
Some comments will follow.

-- 
Hideaki Yoshifuji
Technical Division, MIRACLE LINUX CORPORATION
[PATCH net v2 1/3] Documentation: networking: dsa: Remove poll_link description
This function has been removed in 4baee937b8d5 ("net: dsa: remove DSA link
polling") in favor of using the PHYLIB polling mechanism.

Reviewed-by: Vivien Didelot
Signed-off-by: Florian Fainelli
---
 Documentation/networking/dsa/dsa.txt | 5 -
 1 file changed, 5 deletions(-)

diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 631b0f7ae16f..8303eb8ced79 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -416,11 +416,6 @@ PHY devices and link management
   to the switch port MDIO registers. If unavailable return a negative
   error code.
 
-- poll_link: Function invoked by DSA to query the link state of the switch
-  builtin Ethernet PHYs, per port. This function is responsible for calling
-  netif_carrier_{on,off} when appropriate, and can be used to poll all ports
-  in a single call. Executes from workqueue context.
-
 - adjust_link: Function invoked by the PHY library when a slave network device
   is attached to a PHY device. This function is responsible for appropriately
   configuring the switch port link parameters: speed, duplex, pause based on
-- 
2.7.4
[PATCH net v2 2/3] Documentation: networking: dsa: Remove priv_size description
We no longer have a priv_size structure member since 5feebd0a8a79 ("net:
dsa: Remove allocation of driver private memory").

Reviewed-by: Vivien Didelot
Signed-off-by: Florian Fainelli
---
 Documentation/networking/dsa/dsa.txt | 2 --
 1 file changed, 2 deletions(-)

diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 8303eb8ced79..411b57fd73aa 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -369,8 +369,6 @@ does not allocate any driver private context space.
 Switch configuration
 
-- priv_size: additional size needed by the switch driver for its private
-  context
-
 - tag_protocol: this is to indicate what kind of tagging protocol is
   supported, should be a valid value from the dsa_tag_protocol enum
-- 
2.7.4
[PATCH net v2 3/3] Documentation: networking: dsa: Describe port_vlan_filtering
Described what the port_vlan_filtering function is supposed to accomplish.

Fixes: fb2dabad69f0 ("net: dsa: support VLAN filtering switchdev attr")
Signed-off-by: Florian Fainelli
---
 Documentation/networking/dsa/dsa.txt | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 411b57fd73aa..9d05ed7f7da5 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -535,6 +535,16 @@ Bridge layer
 Bridge VLAN filtering
 
+- port_vlan_filtering: bridge layer function invoked when the bridge gets
+  configured for turning on or off VLAN filtering. If nothing specific needs
+  to be done at the hardware level, this callback does not need to be
+  implemented. When VLAN filtering is turned on, the hardware must be
+  programmed with rejecting 802.1Q frames which have VLAN IDs outside of the
+  programmed allowed VLAN ID map/rules. If there is no PVID programmed into
+  the switch port, untagged frames must be rejected as well. When turned off
+  the switch must accept any 802.1Q frames irrespective of their VLAN ID, and
+  untagged frames are allowed.
+
 - port_vlan_prepare: bridge layer function invoked when the bridge prepares
   the configuration of a VLAN on the given port. If the operation is not
   supported by the hardware, this function should return -EOPNOTSUPP to
   inform the bridge
-- 
2.7.4
[PATCH net v2 0/3] Documentation: dsa: misc fixes
Hi David,

Here are some miscellaneous documentation fixes for DSA. I targeted "net"
because these are not functional code changes, but they are still
documentation fixes per se.

Changes in v2:
- reword what port_vlan_filtering is about based on feedback from
  Vivien and Ido

Florian Fainelli (3):
  Documentation: networking: dsa: Remove poll_link description
  Documentation: networking: dsa: Remove priv_size description
  Documentation: networking: dsa: Describe port_vlan_filtering

 Documentation/networking/dsa/dsa.txt | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

-- 
2.7.4
Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer
Francois, Shuyu, this is the patch with the discussed changes. Shuyu, it
would be great if you could test this one. If it passes and there are no
further objections I will resend it as a regular patch (including commit
message, etc.) to the mailing list.

diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index a3a9392..ec656b3 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -153,18 +153,29 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 {
 	struct arc_emac_priv *priv = netdev_priv(ndev);
 	struct net_device_stats *stats = &ndev->stats;
+	unsigned int curr = priv->txbd_curr;
 	unsigned int i;
 
+	/* Make sure buffers and txbd_curr are consistent */
+	smp_rmb();
+
 	for (i = 0; i < TX_BD_NUM; i++) {
 		unsigned int *txbd_dirty = &priv->txbd_dirty;
 		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
 		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
-		struct sk_buff *skb = tx_buff->skb;
-		unsigned int info = le32_to_cpu(txbd->info);
+		unsigned int info;
+		struct sk_buff *skb;
 
-		if ((info & FOR_EMAC) || !txbd->data || !skb)
+		if (*txbd_dirty == curr)
 			break;
 
+		info = le32_to_cpu(txbd->info);
+
+		if (info & FOR_EMAC)
+			break;
+
+		skb = tx_buff->skb;
+
 		if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
 			stats->tx_errors++;
 			stats->tx_dropped++;
@@ -195,8 +206,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
 	}
 
-	/* Ensure that txbd_dirty is visible to tx() before checking
-	 * for queue stopped.
+	/* Ensure that txbd_dirty is visible to tx() and we see the most recent
+	 * value for txbd_curr.
 	 */
 	smp_mb();
 
@@ -680,27 +691,24 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
 	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
 	priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
-
-	/* Make sure pointer to data buffer is set */
-	wmb();
+	priv->tx_buff[*txbd_curr].skb = skb;
 
 	skb_tx_timestamp(skb);
 
 	*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-	/* Make sure info word is set */
+	/* 1. Make sure that with respect to tx_clean everything is set up
+	 * properly before we advance txbd_curr.
+	 * 2. Make sure writes to DMA descriptors are completed before we inform
+	 * the hardware.
+	 */
 	wmb();
 
-	priv->tx_buff[*txbd_curr].skb = skb;
-
 	/* Increment index to point to the next BD */
 	*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-	/* Ensure that tx_clean() sees the new txbd_curr before
-	 * checking the queue status. This prevents an unneeded wake
-	 * of the queue in tx_clean().
-	 */
-	smp_mb();
+	/* Ensure tx_clean() sees the updated value of txbd_curr */
+	smp_wmb();
 
 	if (!arc_emac_tx_avail(priv)) {
 		netif_stop_queue(ndev);
-- 
1.9.1
Re: [ethtool 0/3][pull request] Intel Wired LAN Driver Updates 2016-05-03
On Wed, 2016-05-04 at 09:44 -0700, Jeff Kirsher wrote:
> This series contains updates to ixgbe in ethtool.
>
> Preethi adds missing device IDs and mac_type definitions, also updates
> the display registers for x550, x550em_x/a, and cleans up the format
> string storage by taking advantage of "for" loops.
>
> The following are changes since commit deb1c6613ec14fd828d321e38c7bea45fe559bd5:
>   Release version 4.5.
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/ethtool master
>
> Preethi Banala (3):
>   ethtool/ixgbe: Add device ID and mac_type definitions
>   ethtool/ixgbe: Correct offsets and support x550, x550em_x, x550em_a
>   ethtool/ixgbe: Reduce format string storage
>
>  ixgbe.c | 173 +++---
>  1 file changed, 95 insertions(+), 78 deletions(-)
>

Is Ben still maintaining ethtool? I ask because I have this series, which
I sent out earlier this month with no word, and I know there is at least
one other ethtool patch series that has had no response or committal from
whoever is maintaining ethtool.

I know we discussed at the last netconf that we should look at possibly a
new tool to address the shortcomings of ethtool, but I was not aware we
had abandoned maintaining the current ethtool already, before any
replacement tool has been developed.
Re: [PATCH net v2] fou: avoid using sk_user_data before it is initialised
From: Cong Wang
Date: Mon, 23 May 2016 14:05:17 -0700

> On Mon, May 23, 2016 at 1:59 PM, David Miller wrote:
>> From: Simon Horman
>> Date: Fri, 20 May 2016 14:57:17 +0900
>>
>>> During initialisation sk->sk_user_data should not be used before
>>> it is initialised.
>>>
>>> Found by bisection after noticing the following:
>> ...
>>> Fixes: d92283e338f6 ("fou: change to use UDP socket GRO")
>>> Signed-off-by: Simon Horman
>>> ---
>>> v2
>>> * Updated implementation to simply access fou->protocol directly
>>>   as suggested by Tom Herbert and Cong Wang
>>
>> I think this was resolved in another way meanwhile. If a fix is still
>> needed, please respin.
>
> This fix is only needed for -stable to backport, because Tom's
> fix is within a series of patches, which makes it hard to backport.

Ok, queued up for -stable.
Re: [net PATCH v2 1/1] net sched actions: policer missing timestamp processing
From: Jamal Hadi Salim
Date: Mon, 23 May 2016 21:07:20 -0400

> From: Jamal Hadi Salim
>
> Policer was not dumping or updating timestamps.
>
> Signed-off-by: Jamal Hadi Salim

Applied.
Re: [RFC PATCH 1/7] fou: Get net from sock_net if dev_net unavailable
On Tue, May 24, 2016 at 3:01 PM, David Miller wrote:
> From: Tom Herbert
> Date: Mon, 23 May 2016 15:48:20 -0700
>
>> diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
>> index 5f9207c..96260c6 100644
>> --- a/net/ipv4/fou.c
>> +++ b/net/ipv4/fou.c
>> @@ -807,13 +807,20 @@ int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
>>  			 u8 *protocol, __be16 *sport, int type)
>>  {
>>  	int err;
>> +	struct net *net;
>>
>
> Please order local variables from longest to shortest line.
>
>>  	err = iptunnel_handle_offloads(skb, type);
>>  	if (err)
>>  		return err;
>>
>> -	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
>> -						skb, 0, 0, false);
>> +	if (skb->dev)
>> +		net = dev_net(skb->dev);
>> +	else if (skb->sk)
>> +		net = sock_net(skb->sk);
>
> This is getting ridiculous. Why not just put the net namespace pointer
> into the tunnel encap object?

Thanks. In this case it will be easier to just pass net in as an argument.

Tom
[PATCH v2] Make builds default to quiet mode
Similar to the Linux kernel and perf, add infrastructure to reduce the
amount of output tossed at a user during a build. Full build output can
be obtained with 'make V=1'.

Builds go from:

  make[1]: Leaving directory `/home/dsa/iproute2.git/lib'
  make[1]: Entering directory `/home/dsa/iproute2.git/ip'
  gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ip.o ip.c
  gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ipaddress.o ipaddress.c

to:

  ...
      AR       libutil.a

  ip
      CC       ip.o
      CC       ipaddress.o
  ...

Signed-off-by: David Ahern
---
v2
- changed 'echo "\n$$i"' to 'echo; echo $$i' per Andreas' comment

 Makefile         |  6 +-
 bridge/Makefile  |  1 +
 configure        | 32 
 devlink/Makefile |  1 +
 genl/Makefile    |  1 +
 ip/Makefile      |  2 ++
 lib/Makefile     |  4 ++--
 misc/Makefile    | 12 +++-
 tc/Makefile      | 15 ---
 tipc/Makefile    |  1 +
 10 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/Makefile b/Makefile
index eb571a5accf8..15c81ecfdca3 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,7 @@
+ifndef VERBOSE
+MAKEFLAGS += --no-print-directory
+endif
+
 PREFIX?=/usr
 LIBDIR?=$(PREFIX)/lib
 SBINDIR?=/sbin
@@ -50,7 +54,7 @@ LDLIBS += $(LIBNETLINK)
 all: Config
 	@set -e; \
 	for i in $(SUBDIRS); \
-	do $(MAKE) $(MFLAGS) -C $$i; done
+	do echo; echo $$i; $(MAKE) $(MFLAGS) -C $$i; done
 
 Config:
 	sh configure $(KERNEL_INCLUDE)
diff --git a/bridge/Makefile b/bridge/Makefile
index 98007530240a..7203f70bc510 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -9,6 +9,7 @@ endif
 all: bridge
 
 bridge: $(BROBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 bridge $(DESTDIR)$(SBINDIR)
diff --git a/configure b/configure
index d2540b0d3219..60eb6b51a571 100755
--- a/configure
+++ b/configure
@@ -317,7 +317,35 @@ EOF
 	rm -f $TMPDIR/dbtest.c $TMPDIR/dbtest
 }
 
+quiet_config()
+{
+	cat > Config
+
 check_toolchain
 echo "TC schedulers"
@@ -357,3 +385,7 @@ echo
 echo -n "docs:"
 check_docs
 echo
+
+echo >> Config
+echo "%.o: %.c" >> Config
+echo '	$(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) -c -o $@ $<' >> Config
diff --git a/devlink/Makefile b/devlink/Makefile
index 3fdaa6904ee1..7256c28708eb 100644
--- a/devlink/Makefile
+++ b/devlink/Makefile
@@ -12,6 +12,7 @@ endif
 all: $(TARGETS) $(LIBS)
 
 devlink: $(DEVLINKOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
diff --git a/genl/Makefile b/genl/Makefile
index 03d1f26a1cee..f5a0bfe42aff 100644
--- a/genl/Makefile
+++ b/genl/Makefile
@@ -20,6 +20,7 @@ endif
 all: genl
 
 genl: $(GENLOBJ) $(LIBNETLINK) $(LIBUTIL) $(GENLLIB)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 genl $(DESTDIR)$(SBINDIR)
diff --git a/ip/Makefile b/ip/Makefile
index f3d298739cac..a7f9c1101798 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -24,8 +24,10 @@ TARGETS=ip rtmon
 all: $(TARGETS) $(SCRIPTS)
 
 ip: $(IPOBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 rtmon: $(RTMONOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
diff --git a/lib/Makefile b/lib/Makefile
index 9d1307dd4df2..52e016db50b6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -15,10 +15,10 @@ NLOBJ=libgenl.o ll_map.o libnetlink.o
 all: libnetlink.a libutil.a
 
 libnetlink.a: $(NLOBJ)
-	$(AR) rcs $@ $(NLOBJ)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 libutil.a: $(UTILOBJ) $(ADDLIB)
-	$(AR) rcs $@ $(UTILOBJ) $(ADDLIB)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 install:
diff --git a/misc/Makefile b/misc/Makefile
index f50e7403a33b..72807678054b 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -21,23 +21,25 @@ endif
 all: $(TARGETS)
 
 ss: $(SSOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 nstat: nstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
 
 ifstat: ifstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
 
 rtacct: rtacct.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtac
Re: [RFC PATCH 2/7] tou: Base infrastructure for Transport over UDP
From: Tom Herbert
Date: Mon, 23 May 2016 15:48:21 -0700

> +int tou_encap_setsockopt(struct sock *sk, char __user *optval, int optlen,
> +			 bool is_ipv6)
> +{
> +	struct tou_encap te;
> +	struct ip_tunnel_encap encap;
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct ip_tunnel_encap *e = inet->tou_encap;

This doesn't compile, because you don't add the tou_encap member to
inet_sock until patch #3.
Re: [RFC PATCH 1/7] fou: Get net from sock_net if dev_net unavailable
From: Tom Herbert
Date: Mon, 23 May 2016 15:48:20 -0700

> diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
> index 5f9207c..96260c6 100644
> --- a/net/ipv4/fou.c
> +++ b/net/ipv4/fou.c
> @@ -807,13 +807,20 @@ int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
>  			u8 *protocol, __be16 *sport, int type)
>  {
>  	int err;
> +	struct net *net;
>

Please order local variables from longest to shortest line.

>  	err = iptunnel_handle_offloads(skb, type);
>  	if (err)
>  		return err;
>
> -	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
> -						skb, 0, 0, false);
> +	if (skb->dev)
> +		net = dev_net(skb->dev);
> +	else if (skb->sk)
> +		net = sock_net(skb->sk);

This is getting ridiculous. Why not just put the net namespace pointer
into the tunnel encap object?
Re: rtk8168 driver help needed
On 05/19/2016 08:31 PM, Francois Romieu wrote: > Murali Karicheri : > [...] >> Do you know what could be wrong with rtk8168? > > Hardly. > > What do the device registers (ethtool -d) and device stats (ethtool -S) > look like ? Will dump at the next opportunity and respond > > (please trim useless material from previous mail) > Ok. Will keep that in mind next time. -- Murali Karicheri Linux Kernel, Keystone
Re: [PATCH net] net_sched: avoid too many hrtimer_start() calls
From: Eric Dumazet Date: Mon, 23 May 2016 14:24:56 -0700 > From: Eric Dumazet > > I found a serious performance bug in packet schedulers using hrtimers. > > sch_htb and sch_fq are definitely impacted by this problem. > > We constantly rearm high resolution timers if some packets are throttled > in one (or more) class, and other packets are flying through qdisc on > another (non throttled) class. > > hrtimer_start() does not have the mod_timer() trick of doing nothing if > expires value does not change : > > if (timer_pending(timer) && > timer->expires == expires) > return 1; > > This issue is particularly visible when multiple cpus can queue/dequeue > packets on the same qdisc, as hrtimer code has to lock a remote base. > > I used following fix : > > 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding > it. > > 2) Cache watchdog prior expiration. hrtimer might provide this, but I > prefer to not rely on some hrtimer internal. > > Signed-off-by: Eric Dumazet This looks fine, applied, thanks Eric.
Re: [PATCH] net/qlge: Avoids recursive EEH error
From: Gavin Shan Date: Mon, 23 May 2016 11:58:28 +1000 > One timer, whose handler keeps reading on MMIO register for EEH > core to detect error in time, is started when the PCI device driver > is loaded. MMIO register can't be accessed during PE reset in EEH > recovery. Otherwise, the unexpected recursive error is triggered. > The timer isn't closed that time if the interface isn't brought > up. So the unexpected recursive error is seen during EEH recovery > when the interface is down. > > This avoids the unexpected recursive EEH error by closing the timer > in qlge_io_error_detected() before EEH PE reset unconditionally. The > timer is started unconditionally after EEH PE reset in qlge_io_resume(). > Also, the timer should be closed unconditionally when the device is > removed from the system permanently in qlge_io_error_detected(). > > Reported-by: Shriya R. Kulkarni > Signed-off-by: Gavin Shan Applied, thanks.
Re: [PATCH v2 2/2] ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path.
From: Haishuang Yan Date: Sat, 21 May 2016 18:17:35 +0800 > In gre6 xmit path, we are sending a GRE packet, so set fl6 proto > to IPPROTO_GRE properly. > > Signed-off-by: Haishuang Yan Applied.
Re: [PATCH v2 1/2] ip6_gre: Fix MTU setting for ip6gretap
From: Haishuang Yan Date: Sat, 21 May 2016 18:17:34 +0800 > When creating an ip6gretap interface with an unreachable route, > the MTU is about 14 bytes larger than what was needed. ... > Because rt is not NULL here, dev->mtu will subtract the ethernet > header length later. But when rt is NULL, it simply returns, so > dev->mtu doesn't update correctly in this situation. > > This patch first verifies that dev->type is ARPHRD_ETHER for the ip6gretap > interface, and then decreases the mtu as early as possible. > > Signed-off-by: Haishuang Yan Applied.
[PATCH V2] brcmfmac: print error if p2p_ifadd firmware command fails
This is helpful for debugging, without this all I was getting from "iw" command on device with BCM43602 was: > command failed: Too many open files in system (-23) Signed-off-by: Rafał Miłecki --- V2: s/in/if/ in commit message --- drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c index 1652a48..f7b7e29 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c @@ -2031,7 +2031,7 @@ static int brcmf_p2p_request_p2p_if(struct brcmf_p2p_info *p2p, err = brcmf_fil_iovar_data_set(ifp, "p2p_ifadd", &if_request, sizeof(if_request)); if (err) - return err; + brcmf_err("p2p_ifadd failed %d\n", err); return err; } -- 1.8.4.5
[PATCH] brcmfmac: print error in p2p_ifadd firmware command fails
This is helpful for debugging, without this all I was getting from "iw" command on device with BCM43602 was: > command failed: Too many open files in system (-23) Signed-off-by: Rafał Miłecki --- drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c index 1652a48..f7b7e29 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c @@ -2031,7 +2031,7 @@ static int brcmf_p2p_request_p2p_if(struct brcmf_p2p_info *p2p, err = brcmf_fil_iovar_data_set(ifp, "p2p_ifadd", &if_request, sizeof(if_request)); if (err) - return err; + brcmf_err("p2p_ifadd failed %d\n", err); return err; } -- 1.8.4.5
Re: bpf: use-after-free in array_map_alloc
On Tue, May 24, 2016 at 12:04 PM, Tejun Heo wrote: > Hello, > > Alexei, can you please verify this patch? Map extension got rolled > into balance work so that there's no sync issues between the two async > operations. Tests look good. No UAF, and the basic bpf tests that exercise per-cpu maps are fine. > > Thanks. > > Index: work/mm/percpu.c > === > --- work.orig/mm/percpu.c > +++ work/mm/percpu.c > @@ -112,7 +112,7 @@ struct pcpu_chunk { > int map_used; /* # of map entries used > before the sentry */ > int map_alloc; /* # of map entries allocated > */ > int *map; /* allocation map */ > - struct work_struct map_extend_work;/* async ->map[] extension */ > + struct list_headmap_extend_list;/* on pcpu_map_extend_chunks > */ > > void*data; /* chunk data */ > int first_free; /* no free below this */ > @@ -162,10 +162,13 @@ static struct pcpu_chunk *pcpu_reserved_ > static int pcpu_reserved_chunk_limit; > > static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ > -static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */ > +static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map > ext */ > > static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */ > > +/* chunks which need their map areas extended, protected by pcpu_lock */ > +static LIST_HEAD(pcpu_map_extend_chunks); > + > /* > * The number of empty populated pages, protected by pcpu_lock. The > * reserved chunk doesn't contribute to the count. 
> @@ -395,13 +398,19 @@ static int pcpu_need_to_extend(struct pc > { > int margin, new_alloc; > > + lockdep_assert_held(&pcpu_lock); > + > if (is_atomic) { > margin = 3; > > if (chunk->map_alloc < > - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW && > - pcpu_async_enabled) > - schedule_work(&chunk->map_extend_work); > + chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { > + if (list_empty(&chunk->map_extend_list)) { > + list_add_tail(&chunk->map_extend_list, > + &pcpu_map_extend_chunks); > + pcpu_schedule_balance_work(); > + } > + } > } else { > margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; > } > @@ -435,6 +444,8 @@ static int pcpu_extend_area_map(struct p > size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); > unsigned long flags; > > + lockdep_assert_held(&pcpu_alloc_mutex); > + > new = pcpu_mem_zalloc(new_size); > if (!new) > return -ENOMEM; > @@ -467,20 +478,6 @@ out_unlock: > return 0; > } > > -static void pcpu_map_extend_workfn(struct work_struct *work) > -{ > - struct pcpu_chunk *chunk = container_of(work, struct pcpu_chunk, > - map_extend_work); > - int new_alloc; > - > - spin_lock_irq(&pcpu_lock); > - new_alloc = pcpu_need_to_extend(chunk, false); > - spin_unlock_irq(&pcpu_lock); > - > - if (new_alloc) > - pcpu_extend_area_map(chunk, new_alloc); > -} > - > /** > * pcpu_fit_in_area - try to fit the requested allocation in a candidate area > * @chunk: chunk the candidate area belongs to > @@ -740,7 +737,7 @@ static struct pcpu_chunk *pcpu_alloc_chu > chunk->map_used = 1; > > INIT_LIST_HEAD(&chunk->list); > - INIT_WORK(&chunk->map_extend_work, pcpu_map_extend_workfn); > + INIT_LIST_HEAD(&chunk->map_extend_list); > chunk->free_size = pcpu_unit_size; > chunk->contig_hint = pcpu_unit_size; > > @@ -895,6 +892,9 @@ static void __percpu *pcpu_alloc(size_t > return NULL; > } > > + if (!is_atomic) > + mutex_lock(&pcpu_alloc_mutex); > + > spin_lock_irqsave(&pcpu_lock, flags); > > /* serve reserved allocations from the reserved chunk if available */ > @@ -967,12 +967,9 @@ 
restart: > if (is_atomic) > goto fail; > > - mutex_lock(&pcpu_alloc_mutex); > - > if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) { > chunk = pcpu_create_chunk(); > if (!chunk) { > - mutex_unlock(&pcpu_alloc_mutex); > err = "failed to allocate new chunk"; > goto fail; > } > @@ -983,7 +980,6 @@ restart: > spin_lock_irqsave(&pcpu_lock, flags); > } > > - mutex_unlock(&pcpu_alloc_mutex); > goto restart; > > area_found: > @@ -993,8 +989,6 @@ area_found: >
Re: ixgbe: ksoftirqd consumes 100% CPU w/ ~50 TCP conns
On Tue, 24 May 2016 12:46:56 -0700 Alexander Duyck wrote: > I'm guessing the issue is lock contention on the IOMMU resource table. > I resolved most of that for the Rx side back when we implemented the > Rx page reuse but the Tx side still has to perform a DMA mapping for > each individual buffer. Depending on the needs of the user if they > still need the IOMMU enabled for use with something like KVM one thing > they may try doing is use the kernel parameter "iommu=pt" to allow > host devices to access memory without the penalty for having to > allocate/free resources and still provide guests with IOMMU isolation. Listen to Alex, he knows what he is talking about. My longer-term plan for getting rid of the dma_map/unmap overhead is to _keep_ the pages DMA mapped and recycle them back via page-pool. Details in my slides, see slide 5: http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf Alex's RX recycle trick for the Intel drivers is described on slide 14. It seems like, in your use-case, the pages might be held "too" long for the RX recycling trick to work. If you want to understand the IOMMU problem in detail, I recommend reading the article "True IOMMU Protection from DMA Attacks" http://www.cs.technion.ac.il/~mad/publications/asplos2016-iommu.pdf (My solution is different, but they describe the problem very well) --Jesper > On Tue, May 24, 2016 at 9:40 AM, Brandon Philips wrote: > > Hello Everyone- > > > > So we tracked it down to IOMMU causing CPU affinity getting broken[1]. > > Can we provide any further details or is this a known issue? 
> > > > Thank You, > > > > Brandon > > > > [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219866601 > > > > On Tue, May 17, 2016 at 12:44 PM, Brandon Philips wrote: > >> Hello ixgbe team- > >> > >> With Linux v4.6 and the ixgbe driver (details below) a user is reporting > >> ksoftirqd consuming 100% of the CPU on all cores after a moderate ~20-50 > >> number of TCP connections. They are unable to reproduce this issue with > >> Cisco hardware. > >> > >> With Kernel v3.19 they cannot reproduce[1] the issue. Disabling IOMMU > >> (intel_iommu=off) does "fix" the issue[2]. > >> > >> Thank You, > >> > >> Brandon > >> > >> [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219157803 > >> [2] https://github.com/coreos/bugs/issues/1275#issuecomment-219819986 > >> > >> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) > >> ethtool -i eno1 > >> driver: ixgbe > >> version: 4.0.1-k > >> firmware-version: 0x84e0 > >> bus-info: :06:00.0 > >> supports-statistics: yes > >> supports-test: yes > >> supports-eeprom-access: yes > >> supports-register-dump: yes > >> supports-priv-flags: no > >> > >> CPU > >> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote: > > On Tue, 24 May 2016 12:28:09 +0200 > Jesper Dangaard Brouer wrote: > > > I do like perf, but it does not answer my questions about the > > performance of this queue. I will code something up in my own > > framework[2] to answer my own performance questions. > > > > Like what is be minimum overhead (in cycles) achievable with this type > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) > > for fastpath usage. > > Coded it up here: > https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184 > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c > > This is a really fake benchmark, but it sort of shows the minimum > overhead achievable with this type of queue, where it is the same > CPU enqueuing and dequeuing, and cache is guaranteed to be hot. > > Measured on a i7-4790K CPU @ 4.00GHz, the average cost of > enqueue+dequeue of a single object is around 102 cycles(tsc). > > To compare this with below, where enq and deq is measured separately: > 102 / 2 = 51 cycles > > > > Then I also want to know how this performs when two CPUs are involved. > > As this is also a primary use-case, for you when sending packets into a > > guest. > > Coded it up here: > https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c > > This parallel benchmark try to keep two (or more) CPUs busy enqueuing or > dequeuing on the same skb_array queue. It prefills the queue, > and stops the test as soon as queue is empty or full, or > completes a number of "loops"/cycles. 
> > For two CPUs the results are really good: > enqueue: 54 cycles(tsc) > dequeue: 53 cycles(tsc) > > Going to 4 CPUs, things break down (but it was not primary use-case?): > CPU(0) 927 cycles(tsc) enqueue > CPU(1) 921 cycles(tsc) dequeue > CPU(2) 927 cycles(tsc) enqueue > CPU(3) 898 cycles(tsc) dequeue It's mostly the spinlock contention I guess. Maybe we don't need fair spinlocks in this case. Try replacing spinlocks with simple cmpxchg and see what happens?
Re: [PATCH RESEND 7/8] pipe: account to kmemcg
On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote: > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote: > ... > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, > > > +struct pipe_buffer *buf) > > > +{ > > > + struct page *page = buf->page; > > > + > > > + if (page_count(page) == 1) { > > > > This looks racy : some cpu could have temporarily elevated page count. > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page > with refcount of 1 in ->steal, that means that we are the only its user > and it can't be spliced to another pipe. > > In fact, I just copied the code from generic_pipe_buf_steal, adding > kmemcg related checks along the way, so it should be fine. So you guarantee that no other cpu might have done get_page_unless_zero() right before this test ?
Re: ixgbe: ksoftirqd consumes 100% CPU w/ ~50 TCP conns
I'm guessing the issue is lock contention on the IOMMU resource table. I resolved most of that for the Rx side back when we implemented the Rx page reuse but the Tx side still has to perform a DMA mapping for each individual buffer. Depending on the needs of the user if they still need the IOMMU enabled for use with something like KVM one thing they may try doing is use the kernel parameter "iommu=pt" to allow host devices to access memory without the penalty for having to allocate/free resources and still provide guests with IOMMU isolation. - Alex On Tue, May 24, 2016 at 9:40 AM, Brandon Philips wrote: > Hello Everyone- > > So we tracked it down to IOMMU causing CPU affinity getting broken[1]. > Can we provide any further details or is this a known issue? > > Thank You, > > Brandon > > [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219866601 > > On Tue, May 17, 2016 at 12:44 PM, Brandon Philips wrote: >> Hello ixgbe team- >> >> With Linux v4.6 and the ixgbe driver (details below) a user is reporting >> ksoftirqd consuming 100% of the CPU on all cores after a moderate ~20-50 >> number of TCP connections. They are unable to reproduce this issue with >> Cisco hardware. >> >> With Kernel v3.19 they cannot reproduce[1] the issue. Disabling IOMMU >> (intel_iommu=off) does "fix" the issue[2]. >> >> Thank You, >> >> Brandon >> >> [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219157803 >> [2] https://github.com/coreos/bugs/issues/1275#issuecomment-219819986 >> >> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) >> ethtool -i eno1 >> driver: ixgbe >> version: 4.0.1-k >> firmware-version: 0x84e0 >> bus-info: :06:00.0 >> supports-statistics: yes >> supports-test: yes >> supports-eeprom-access: yes >> supports-register-dump: yes >> supports-priv-flags: no >> >> CPU >> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer
Lino Sanfilippo : [...] > I don't agree here. A dma_wmb would have an effect on "data" and "info", yes, > but it would have absolutely no effect on skb_tx_timestamp(), since there > is no dma access involved here. In fact skb_tx_timestamp() could probably > even be reordered to appear after the dma_wmb. > > Anyway, there is the wmb() directly after the assignment to "info". _This_ > barrier should ensure that skb_tx_timestamp() (along with a flush of data > and info to DMA) is executed before "txbd_curr" is advanced. > This means that the corresponding skb can't be freed prematurely by tx_clean(). The concern here is about sending adequate PTP payload on the network. skb_tx_timestamp() is used for network clock synchronization. Some extra information must be transmitted. Be it through direct payload change or through indirect control, it _is_ related to dma. Several (most ?) skb_tx_timestamp() misuses blur the picture: CPU vs device sync is of course way below the radar when the driver crashes because of a plain use-after-free skb_tx_timestamp() :o/ -- Ueimor
Re: bpf: use-after-free in array_map_alloc
Hello, Alexei, can you please verify this patch? Map extension got rolled into balance work so that there's no sync issues between the two async operations. Thanks. Index: work/mm/percpu.c === --- work.orig/mm/percpu.c +++ work/mm/percpu.c @@ -112,7 +112,7 @@ struct pcpu_chunk { int map_used; /* # of map entries used before the sentry */ int map_alloc; /* # of map entries allocated */ int *map; /* allocation map */ - struct work_struct map_extend_work;/* async ->map[] extension */ + struct list_headmap_extend_list;/* on pcpu_map_extend_chunks */ void*data; /* chunk data */ int first_free; /* no free below this */ @@ -162,10 +162,13 @@ static struct pcpu_chunk *pcpu_reserved_ static int pcpu_reserved_chunk_limit; static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ -static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */ +static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */ +/* chunks which need their map areas extended, protected by pcpu_lock */ +static LIST_HEAD(pcpu_map_extend_chunks); + /* * The number of empty populated pages, protected by pcpu_lock. The * reserved chunk doesn't contribute to the count. 
@@ -395,13 +398,19 @@ static int pcpu_need_to_extend(struct pc { int margin, new_alloc; + lockdep_assert_held(&pcpu_lock); + if (is_atomic) { margin = 3; if (chunk->map_alloc < - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW && - pcpu_async_enabled) - schedule_work(&chunk->map_extend_work); + chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { + if (list_empty(&chunk->map_extend_list)) { + list_add_tail(&chunk->map_extend_list, + &pcpu_map_extend_chunks); + pcpu_schedule_balance_work(); + } + } } else { margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; } @@ -435,6 +444,8 @@ static int pcpu_extend_area_map(struct p size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); unsigned long flags; + lockdep_assert_held(&pcpu_alloc_mutex); + new = pcpu_mem_zalloc(new_size); if (!new) return -ENOMEM; @@ -467,20 +478,6 @@ out_unlock: return 0; } -static void pcpu_map_extend_workfn(struct work_struct *work) -{ - struct pcpu_chunk *chunk = container_of(work, struct pcpu_chunk, - map_extend_work); - int new_alloc; - - spin_lock_irq(&pcpu_lock); - new_alloc = pcpu_need_to_extend(chunk, false); - spin_unlock_irq(&pcpu_lock); - - if (new_alloc) - pcpu_extend_area_map(chunk, new_alloc); -} - /** * pcpu_fit_in_area - try to fit the requested allocation in a candidate area * @chunk: chunk the candidate area belongs to @@ -740,7 +737,7 @@ static struct pcpu_chunk *pcpu_alloc_chu chunk->map_used = 1; INIT_LIST_HEAD(&chunk->list); - INIT_WORK(&chunk->map_extend_work, pcpu_map_extend_workfn); + INIT_LIST_HEAD(&chunk->map_extend_list); chunk->free_size = pcpu_unit_size; chunk->contig_hint = pcpu_unit_size; @@ -895,6 +892,9 @@ static void __percpu *pcpu_alloc(size_t return NULL; } + if (!is_atomic) + mutex_lock(&pcpu_alloc_mutex); + spin_lock_irqsave(&pcpu_lock, flags); /* serve reserved allocations from the reserved chunk if available */ @@ -967,12 +967,9 @@ restart: if (is_atomic) goto fail; - mutex_lock(&pcpu_alloc_mutex); - if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) { chunk = 
pcpu_create_chunk(); if (!chunk) { - mutex_unlock(&pcpu_alloc_mutex); err = "failed to allocate new chunk"; goto fail; } @@ -983,7 +980,6 @@ restart: spin_lock_irqsave(&pcpu_lock, flags); } - mutex_unlock(&pcpu_alloc_mutex); goto restart; area_found: @@ -993,8 +989,6 @@ area_found: if (!is_atomic) { int page_start, page_end, rs, re; - mutex_lock(&pcpu_alloc_mutex); - page_start = PFN_DOWN(off); page_end = PFN_UP(off + size); @@ -1005,7 +999,6 @@ area_found: spin_lock_irqsave(&pcpu_lock, flags); if (ret) { -
[PATCH 4.2.y-ckt 48/53] VSOCK: do not disconnect socket when peer has shutdown SEND only
4.2.8-ckt11 -stable review patch. If anyone has any objections, please let me know. ---8< From: Ian Campbell [ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ] The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell Cc: "David S. Miller" Cc: Stefan Hajnoczi Cc: Claudio Imbrenda Cc: Andy King Cc: Dmitry Torokhov Cc: Jorgen Hansen Cc: Adit Ranadive Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller Signed-off-by: Kamal Mostafa --- net/vmw_vsock/af_vsock.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index df5fc6b..d68716c 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -1795,27 +1795,8 @@ vsock_stream_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, else if (sk->sk_shutdown & RCV_SHUTDOWN) err = 0; - if (copied > 0) { - /* We only do these additional bookkeeping/notification steps -* if we actually copied something out of the queue pair -* instead of just peeking ahead. -*/ - - if (!(flags & MSG_PEEK)) { - /* If the other side has shutdown for sending and there -* is nothing more to read, then modify the socket -* state. -*/ - if (vsk->peer_shutdown & SEND_SHUTDOWN) { - if (vsock_stream_has_data(vsk) <= 0) { - sk->sk_state = SS_UNCONNECTED; - sock_set_flag(sk, SOCK_DONE); - sk->sk_state_change(sk); - } - } - } + if (copied > 0) err = copied; - } out_wait: finish_wait(sk_sleep(sk), &wait); -- 2.7.4
[PATCH net] sfc: on MC reset, clear PIO buffer linkage in TXQs
Otherwise, if we fail to allocate new PIO buffers, our TXQs will try to use the old ones, which aren't there any more. Fixes: 183233bec810 "sfc: Allocate and link PIO buffers; map them with write-combining" Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/ef10.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c index 1681084..1f30912 100644 --- a/drivers/net/ethernet/sfc/ef10.c +++ b/drivers/net/ethernet/sfc/ef10.c @@ -619,6 +619,17 @@ fail: return rc; } +static void efx_ef10_forget_old_piobufs(struct efx_nic *efx) +{ + struct efx_channel *channel; + struct efx_tx_queue *tx_queue; + + /* All our existing PIO buffers went away */ + efx_for_each_channel(channel, efx) + efx_for_each_channel_tx_queue(tx_queue, channel) + tx_queue->piobuf = NULL; +} + #else /* !EFX_USE_PIO */ static int efx_ef10_alloc_piobufs(struct efx_nic *efx, unsigned int n) @@ -635,6 +646,10 @@ static void efx_ef10_free_piobufs(struct efx_nic *efx) { } +static void efx_ef10_forget_old_piobufs(struct efx_nic *efx) +{ +} + #endif /* EFX_USE_PIO */ static void efx_ef10_remove(struct efx_nic *efx) @@ -1018,6 +1033,7 @@ static void efx_ef10_reset_mc_allocations(struct efx_nic *efx) nic_data->must_realloc_vis = true; nic_data->must_restore_filters = true; nic_data->must_restore_piobufs = true; + efx_ef10_forget_old_piobufs(efx); nic_data->rx_rss_context = EFX_EF10_RSS_CONTEXT_INVALID; /* Driver-created vswitches and vports must be re-created */
[4.2.y-ckt stable] Patch "VSOCK: do not disconnect socket when peer has shutdown SEND only" has been added to the 4.2.y-ckt tree
This is a note to let you know that I have just added a patch titled VSOCK: do not disconnect socket when peer has shutdown SEND only to the linux-4.2.y-queue branch of the 4.2.y-ckt extended stable tree which can be found at: https://git.launchpad.net/~canonical-kernel/linux/+git/linux-stable-ckt/log/?h=linux-4.2.y-queue This patch is scheduled to be released in version 4.2.8-ckt11. If you, or anyone else, feels it should not be added to this tree, please reply to this email. For more information about the 4.2.y-ckt tree, see https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable Thanks. -Kamal ---8< >From 89cf2ec58bee58f349e0cf4c1b77dc53490e6949 Mon Sep 17 00:00:00 2001 From: Ian Campbell Date: Wed, 4 May 2016 14:21:53 +0100 Subject: VSOCK: do not disconnect socket when peer has shutdown SEND only [ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ] The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell Cc: "David S. Miller" Cc: Stefan Hajnoczi Cc: Claudio Imbrenda Cc: Andy King Cc: Dmitry Torokhov Cc: Jorgen Hansen Cc: Adit Ranadive Cc: netdev@vger.kernel.org Signed-off-by: David S. 
Miller Signed-off-by: Kamal Mostafa --- net/vmw_vsock/af_vsock.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index df5fc6b..d68716c 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -1795,27 +1795,8 @@ vsock_stream_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, else if (sk->sk_shutdown & RCV_SHUTDOWN) err = 0; - if (copied > 0) { - /* We only do these additional bookkeeping/notification steps -* if we actually copied something out of the queue pair -* instead of just peeking ahead. -*/ - - if (!(flags & MSG_PEEK)) { - /* If the other side has shutdown for sending and there -* is nothing more to read, then modify the socket -* state. -*/ - if (vsk->peer_shutdown & SEND_SHUTDOWN) { - if (vsock_stream_has_data(vsk) <= 0) { - sk->sk_state = SS_UNCONNECTED; - sock_set_flag(sk, SOCK_DONE); - sk->sk_state_change(sk); - } - } - } + if (copied > 0) err = copied; - } out_wait: finish_wait(sk_sleep(sk), &wait); -- 2.7.4
Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches
On Tue, May 24, 2016 at 10:14 AM, Jeff Kirsher wrote: > On Wed, 2016-05-18 at 21:10 -0700, David Miller wrote: >> From: Jeff Kirsher >> Date: Wed, 18 May 2016 14:27:58 -0700 >> >> > On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote: >> >> This patch series is meant to be applied after: >> >> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 >> >> tunneling >> >> >> >> The first patch addresses an issue we already resolved in the GREv4 >> and >> >> is >> >> now present in GREv6 with the introduction of FOU/GUE for IPv6 based >> GRE >> >> tunnels. >> >> >> >> The second patch goes through and enables IPv6 tunnel offloads for the >> >> Intel >> >> NICs that already support the IPv4 based IP-in-IP tunnel offloads. I >> >> have >> >> only done a bit of touch testing but have seen ~20 Gb/s over an i40e >> >> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is >> still >> >> passing traffic at around the same rate. I plan to do further testing >> >> but >> >> with these patches present it should enable a wider audience to be >> able >> >> to >> >> test the new features introduced in Tom's patchset with hardware >> >> offloads. >> >> >> >> --- >> >> >> >> Alexander Duyck (2): >> >> ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled >> >> with FOU/GUE >> >> intel: Add support for IPv6 IP-in-IP offload >> > >> > Dave, I have this series added to my queue. >> >> Why would you if it depends upon Tom's series, as mentioned above, which >> isn't even in my tree yet? > > I was not able to apply his patches to my dev-queue due to conflicts and > from discussion with Alex, I made the mistake in assuming the issues I saw > was because it did not have Tom's driver changes. I thought Dave had already applied the patches to his tree? If so there shouldn't be a need for you to pull them in and apply them. That might also be the source of the conflicts if they are already in the tree. - Alex
[PATCH iproute2] Make builds default to quiet mode
Similar to the Linux kernel and perf, add infrastructure to reduce the
amount of output tossed to a user during a build. Full build output can
be obtained with 'make V=1'.

Builds go from:

make[1]: Leaving directory `/home/dsa/iproute2.git/lib'
make[1]: Entering directory `/home/dsa/iproute2.git/ip'
gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ip.o ip.c
gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ipaddress.o ipaddress.c

to:

...
  AR       libutil.a

ip
  CC       ip.o
  CC       ipaddress.o
...

Signed-off-by: David Ahern
---
 Makefile         |  6 +-
 bridge/Makefile  |  1 +
 configure        | 32
 devlink/Makefile |  1 +
 genl/Makefile    |  1 +
 ip/Makefile      |  2 ++
 lib/Makefile     |  4 ++--
 misc/Makefile    | 12 +++-
 tc/Makefile      | 15 ---
 tipc/Makefile    |  1 +
 10 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/Makefile b/Makefile
index 7f79cdd67f2d..99a1a6f31c0d 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,7 @@
+ifndef VERBOSE
+MAKEFLAGS += --no-print-directory
+endif
+
 PREFIX?=/usr
 LIBDIR?=$(PREFIX)/lib
 SBINDIR?=/sbin
@@ -50,7 +54,7 @@ LDLIBS += $(LIBNETLINK)
 all: Config
 	@set -e; \
 	for i in $(SUBDIRS); \
-	do $(MAKE) $(MFLAGS) -C $$i; done
+	do echo "\n$$i"; $(MAKE) $(MFLAGS) -C $$i; done
 
 Config:
 	sh configure $(KERNEL_INCLUDE)

diff --git a/bridge/Makefile b/bridge/Makefile
index 98007530240a..7203f70bc510 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -9,6 +9,7 @@ endif
 all: bridge
 
 bridge: $(BROBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 bridge $(DESTDIR)$(SBINDIR)

diff --git a/configure b/configure
index d2540b0d3219..60eb6b51a571 100755
--- a/configure
+++ b/configure
@@ -317,7 +317,35 @@ EOF
 	rm -f $TMPDIR/dbtest.c $TMPDIR/dbtest
 }
 
+quiet_config()
+{
+	cat> Config
+
 check_toolchain
 
 echo "TC schedulers"
@@ -357,3 +385,7 @@ echo
 echo -n "docs:"
 check_docs
 echo
+
+echo >> Config
+echo "%.o: %.c" >> Config
+echo '	$(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) -c -o $@ $<' >> Config

diff --git a/devlink/Makefile b/devlink/Makefile
index 3fdaa6904ee1..7256c28708eb 100644
--- a/devlink/Makefile
+++ b/devlink/Makefile
@@ -12,6 +12,7 @@ endif
 all: $(TARGETS) $(LIBS)
 
 devlink: $(DEVLINKOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)

diff --git a/genl/Makefile b/genl/Makefile
index 03d1f26a1cee..f5a0bfe42aff 100644
--- a/genl/Makefile
+++ b/genl/Makefile
@@ -20,6 +20,7 @@ endif
 all: genl
 
 genl: $(GENLOBJ) $(LIBNETLINK) $(LIBUTIL) $(GENLLIB)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 genl $(DESTDIR)$(SBINDIR)

diff --git a/ip/Makefile b/ip/Makefile
index 2194ed9642eb..bb85fe3e6610 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -24,8 +24,10 @@ TARGETS=ip rtmon
 all: $(TARGETS) $(SCRIPTS)
 
 ip: $(IPOBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 rtmon: $(RTMONOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)

diff --git a/lib/Makefile b/lib/Makefile
index 9d1307dd4df2..52e016db50b6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -15,10 +15,10 @@ NLOBJ=libgenl.o ll_map.o libnetlink.o
 all: libnetlink.a libutil.a
 
 libnetlink.a: $(NLOBJ)
-	$(AR) rcs $@ $(NLOBJ)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 libutil.a: $(UTILOBJ) $(ADDLIB)
-	$(AR) rcs $@ $(UTILOBJ) $(ADDLIB)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 install:

diff --git a/misc/Makefile b/misc/Makefile
index f50e7403a33b..72807678054b 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -21,23 +21,25 @@ endif
 all: $(TARGETS)
 
 ss: $(SSOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 nstat: nstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
 
 ifstat: ifstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
 
 rtacct: rtacct.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LIBNETLINK) -lm
 
 arpd: arpd.c
-	$(CC) $(CFLAGS) -I$(DBM_INC
Re: [PATCH iproute2] Make builds default to quiet mode
David Ahern writes: > @@ -50,7 +54,7 @@ LDLIBS += $(LIBNETLINK) > all: Config > @set -e; \ > for i in $(SUBDIRS); \ > - do $(MAKE) $(MFLAGS) -C $$i; done > + do echo "\n$$i"; $(MAKE) $(MFLAGS) -C $$i; done echo "\n" is not portable. Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches
On Wed, 2016-05-18 at 21:10 -0700, David Miller wrote: > From: Jeff Kirsher > Date: Wed, 18 May 2016 14:27:58 -0700 > > > On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote: > >> This patch series is meant to be applied after: > >> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 > >> tunneling > >> > >> The first patch addresses an issue we already resolved in the GREv4 > and > >> is > >> now present in GREv6 with the introduction of FOU/GUE for IPv6 based > GRE > >> tunnels. > >> > >> The second patch goes through and enables IPv6 tunnel offloads for the > >> Intel > >> NICs that already support the IPv4 based IP-in-IP tunnel offloads. I > >> have > >> only done a bit of touch testing but have seen ~20 Gb/s over an i40e > >> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is > still > >> passing traffic at around the same rate. I plan to do further testing > >> but > >> with these patches present it should enable a wider audience to be > able > >> to > >> test the new features introduced in Tom's patchset with hardware > >> offloads. > >> > >> --- > >> > >> Alexander Duyck (2): > >> ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled > >> with FOU/GUE > >> intel: Add support for IPv6 IP-in-IP offload > > > > Dave, I have this series added to my queue. > > Why would you if it depends upon Tom's series, as mentioned above, which > isn't even in my tree yet? I was not able to apply his patches to my dev-queue due to conflicts and from discussion with Alex, I made the mistake in assuming the issues I saw was because it did not have Tom's driver changes. signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH 7/7] tou: Support for GSO
On Tue, May 24, 2016 at 7:59 AM, Alexander Duyck wrote: > On Mon, May 23, 2016 at 3:48 PM, Tom Herbert wrote: >> Add SKB_GSO_TOU. In udp[64]_ufo_fragment check for SKB_GSO_TOU. If this >> is set call skb_udp_tou_segment. skb_udp_tou_segment is very similar >> to skb_udp_tunnel_segment except that we only need to deal with the >> L4 headers. >> >> Signed-off-by: Tom Herbert >> --- >> include/linux/skbuff.h | 2 + >> include/net/udp.h| 2 + >> net/ipv4/fou.c | 2 + >> net/ipv4/ip_output.c | 2 + >> net/ipv4/udp_offload.c | 164 >> +-- >> net/ipv6/inet6_connection_sock.c | 3 + >> net/ipv6/udp_offload.c | 128 +++--- >> 7 files changed, 236 insertions(+), 67 deletions(-) >> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h >> index 65968a9..b57e484 100644 >> --- a/include/linux/skbuff.h >> +++ b/include/linux/skbuff.h >> @@ -482,6 +482,8 @@ enum { >> SKB_GSO_PARTIAL = 1 << 13, >> >> SKB_GSO_TUNNEL_REMCSUM = 1 << 14, >> + >> + SKB_GSO_TOU = 1 << 15, >> }; >> > > So where do you add the netdev feature bit? From what I can tell that > was overlooked and as a result devices that support FCoE CRC will end > up corrupting TOU frames because netif_gso_ok currently ands the two > together. > An obvious omission, thanks for pointing it out. > Also I am pretty sure we can offload this on the Intel NICs using the > GSO partial approach as we can just stuff the UDP header into the > space that we would use for IPv4 options or IPv6 extension headers and > it shouldn't complain. > That would be cool! > - Alex
Re: ixgbe: ksoftirqd consumes 100% CPU w/ ~50 TCP conns
Hello Everyone- So we tracked it down to IOMMU causing CPU affinity getting broken[1]. Can we provide any further details or is this a known issue? Thank You, Brandon [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219866601 On Tue, May 17, 2016 at 12:44 PM, Brandon Philips wrote: > Hello ixgbe team- > > With Linux v4.6 and the ixgbe driver (details below) a user is reporting > ksoftirqd consuming 100% of the CPU on all cores after a moderate ~20-50 > number of TCP connections. They are unable to reproduce this issue with > Cisco hardware. > > With Kernel v3.19 they cannot reproduce[1] the issue. Disabling IOMMU > (intel_iommu=off) does "fix" the issue[2]. > > Thank You, > > Brandon > > [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219157803 > [2] https://github.com/coreos/bugs/issues/1275#issuecomment-219819986 > > Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) > ethtool -i eno1 > driver: ixgbe > version: 4.0.1-k > firmware-version: 0x84e0 > bus-info: :06:00.0 > supports-statistics: yes > supports-test: yes > supports-eeprom-access: yes > supports-register-dump: yes > supports-priv-flags: no > > CPU > Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, 24 May 2016 12:28:09 +0200 Jesper Dangaard Brouer wrote:

> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.
>
> Like what is the minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c

This is a really fake benchmark, but it sort of shows the minimum
overhead achievable with this type of queue, where it is the same CPU
enqueuing and dequeuing, and cache is guaranteed to be hot.

Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
enqueue+dequeue of a single object is around 102 cycles(tsc).

To compare this with below, where enq and deq is measured separately:
 102 / 2 = 51 cycles

> Then I also want to know how this performs when two CPUs are involved.
> As this is also a primary use-case, for you when sending packets into a
> guest.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c

This parallel benchmark tries to keep two (or more) CPUs busy enqueuing
or dequeuing on the same skb_array queue. It prefills the queue, and
stops the test as soon as the queue is empty or full, or completes a
number of "loops"/cycles.

For two CPUs the results are really good:
 enqueue: 54 cycles(tsc)
 dequeue: 53 cycles(tsc)

Going to 4 CPUs, things break down (but it was not the primary use-case?):
 CPU(0) 927 cycles(tsc) enqueue
 CPU(1) 921 cycles(tsc) dequeue
 CPU(2) 927 cycles(tsc) enqueue
 CPU(3) 898 cycles(tsc) dequeue

Next on my todo-list is to implement the same tests for e.g. alf_queue,
so we can compare the concurrency part (which is the important part).
But FYI I'll be busy the next days at conf http://fosd2016.itu.dk/

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
[2] https://github.com/netoptimizer/prototype-kernel
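For readers without the prototype-kernel framework handy, the kind of array-based FIFO being benchmarked here can be sketched in plain userspace C. This is only an illustration of the concept (one pointer slot per entry, NULL meaning "empty", as in ptr_ring/skb_array); the real skb_array adds locking, batching, and cache-line placement that this sketch omits:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8  /* must be a power of two */

struct ring {
	void *slot[RING_SIZE];
	unsigned int head;   /* next slot to enqueue into */
	unsigned int tail;   /* next slot to dequeue from */
};

/* Enqueue one pointer; returns 0 on success, -1 if the ring is full.
 * A NULL slot means "empty", so NULL itself cannot be stored. */
static int ring_enqueue(struct ring *r, void *obj)
{
	unsigned int idx = r->head & (RING_SIZE - 1);

	if (r->slot[idx])
		return -1;       /* consumer has not freed this slot yet */
	r->slot[idx] = obj;
	r->head++;
	return 0;
}

/* Dequeue one pointer; returns NULL if the ring is empty. */
static void *ring_dequeue(struct ring *r)
{
	unsigned int idx = r->tail & (RING_SIZE - 1);
	void *obj = r->slot[idx];

	if (obj) {
		r->slot[idx] = NULL;  /* hand the slot back to the producer */
		r->tail++;
	}
	return obj;
}
```

The benchmark numbers above measure exactly this enqueue/dequeue pair, in-kernel and with the producer and consumer indexes on separate cache lines.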
Re: [PATCH RESEND 7/8] pipe: account to kmemcg
On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote: ... > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, > > + struct pipe_buffer *buf) > > +{ > > + struct page *page = buf->page; > > + > > + if (page_count(page) == 1) { > > This looks racy : some cpu could have temporarily elevated page count. All pipe operations (pipe_buf_operations->get, ->release, ->steal) are supposed to be called under pipe_lock. So, if we see a pipe_buffer->page with refcount of 1 in ->steal, that means that we are the only its user and it can't be spliced to another pipe. In fact, I just copied the code from generic_pipe_buf_steal, adding kmemcg related checks along the way, so it should be fine. Thanks, Vladimir > > > + if (memcg_kmem_enabled()) { > > + memcg_kmem_uncharge(page, 0); > > + __ClearPageKmemcg(page); > > + } > > + __SetPageLocked(page); > > + return 0; > > + } > > + return 1; > > +}
Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> > Unix sockets can consume a significant amount of system memory, hence
> > they should be accounted to kmemcg.
> >
> > Since unix socket buffers are always allocated from process context,
> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> > sock->sk_allocation mask.
>
> I have two questions :
>
> 1) What happens when a buffer, allocated from socket lands in a
> different socket , maybe owned by another user/process.
>
> Who owns it now, in term of kmemcg accounting ?

We never move memcg charges. E.g. if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first. Or if two processes are working with the
same directory tree, inodes and dentries will be charged to the first
user. The same is fair for unix socket buffers - they will be charged to
the sender.

>
> 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on a
4 core x 2 HT box. The results are below:

 # clients      bandwidth (10^6bits/sec)
                  base            patched
     1       67643 +-  725     64874 +-  353    - 4.0 %
     4      193585 +- 2516    186715 +- 1460    - 3.5 %
     8      194820 +-  377    187443 +- 1229    - 3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).

Thanks,
Vladimir
Re: Davicom DM9162 PHY supported in the kernel?
On Tue, May 24, 2016 at 05:09:35PM +0100, Amr Bekhit wrote: > Hi Andrew, > > > How about adding a printk() in genphy_read_status(). > > Would you be able to point me to some more information about these > status bits, please? You can get the datasheet from here: http://www.davicom.com.tw/userfile/24247/DM9162_DM9162I-12-MCO-DS-F01_08062014.pdf I would start by looking at the BMCR register. The genphy_read_status() function should get called about once per second, so that is an O.K. place to add debug prints, and there is example code for reading MII_BMCR. Andrew
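To make sense of the value read from MII_BMCR, the bit masks from include/uapi/linux/mii.h are what matters. A small userspace decoder for a raw BMCR value captured by such a debug print might look like this (the `decode_bmcr` helper name is made up for illustration; the mask values are the standard ones from mii.h):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit masks as defined in include/uapi/linux/mii.h */
#define BMCR_FULLDPLX	0x0100	/* Full duplex */
#define BMCR_ANRESTART	0x0200	/* Auto negotiation restart */
#define BMCR_ISOLATE	0x0400	/* Isolate data paths from MII */
#define BMCR_PDOWN	0x0800	/* Enable low power state */
#define BMCR_ANENABLE	0x1000	/* Enable auto negotiation */
#define BMCR_SPEED100	0x2000	/* Select 100Mbps */
#define BMCR_LOOPBACK	0x4000	/* TXD loopback */
#define BMCR_RESET	0x8000	/* Reset to default state */

/* Hypothetical helper: render the interesting BMCR bits into buf. */
static void decode_bmcr(unsigned int bmcr, char *buf, size_t len)
{
	snprintf(buf, len, "aneg=%d speed=%s duplex=%s pdown=%d reset=%d",
		 !!(bmcr & BMCR_ANENABLE),
		 (bmcr & BMCR_SPEED100) ? "100" : "10",
		 (bmcr & BMCR_FULLDPLX) ? "full" : "half",
		 !!(bmcr & BMCR_PDOWN),
		 !!(bmcr & BMCR_RESET));
}
```

In the kernel itself a `phydev_info(phydev, "BMCR=0x%04x\n", bmcr)` style print after the `phy_read(phydev, MII_BMCR)` in genphy_read_status() would give you the raw value to feed through something like this.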
Re: Davicom DM9162 PHY supported in the kernel?
Hi Andrew, > How about adding a printk() in genphy_read_status(). Would you be able to point me to some more information about these status bits, please? I'm struggling to find information about what they are, where I read them from, etc. I'm happy to write some debug printk's in the code, I just don't know where to get information about these status bits.
[PATCH net 0/2] Fix spinlock usage in HWBM
Hi, these two patches fix spinlock related issues introduced in v4.6. They have been reported by Russell King and Jean-Jacques Hiblot. Thanks to them, Gregory Gregory CLEMENT (2): net: mvneta: Fix lacking spinlock initialization net: hwbm: Fix unbalanced spinlock in error case drivers/net/ethernet/marvell/mvneta_bm.c | 1 + net/core/hwbm.c | 3 +++ 2 files changed, 4 insertions(+) -- 2.5.0
[PATCH net 1/2] net: mvneta: Fix lacking spinlock initialization
The spinlock used by the hwbm functions must be initialized by the
network driver. This commit fixes this lack and the following errors
when lockdep is enabled:

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
[] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[] (show_stack) from [] (dump_stack+0xb4/0xe0)
[] (dump_stack) from [] (__lock_acquire+0x1f58/0x2060)
[] (__lock_acquire) from [] (lock_acquire+0xa4/0xd0)
[] (lock_acquire) from [] (_raw_spin_lock_irqsave+0x54/0x68)
[] (_raw_spin_lock_irqsave) from [] (hwbm_pool_add+0x1c/0xdc)
[] (hwbm_pool_add) from [] (mvneta_bm_pool_use+0x338/0x490)
[] (mvneta_bm_pool_use) from [] (mvneta_probe+0x654/0x1284)
[] (mvneta_probe) from [] (platform_drv_probe+0x4c/0xb0)
[] (platform_drv_probe) from [] (driver_probe_device+0x214/0x2c0)
[] (driver_probe_device) from [] (__driver_attach+0xc0/0xc4)
[] (__driver_attach) from [] (bus_for_each_dev+0x68/0x9c)
[] (bus_for_each_dev) from [] (bus_add_driver+0x1a0/0x218)
[] (bus_add_driver) from [] (driver_register+0x78/0xf8)
[] (driver_register) from [] (do_one_initcall+0x90/0x1dc)
[] (do_one_initcall) from [] (kernel_init_freeable+0x15c/0x1fc)
[] (kernel_init_freeable) from [] (kernel_init+0x8/0x114)
[] (kernel_init) from [] (ret_from_fork+0x14/0x24)

Fixes: baa11ebc0c76 ("net: mvneta: Use the new hwbm framework")
Reported-by: Russell King
Cc:
Signed-off-by: Gregory CLEMENT
---
 drivers/net/ethernet/marvell/mvneta_bm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
index 01fccec632ec..466939f8f0cf 100644
--- a/drivers/net/ethernet/marvell/mvneta_bm.c
+++ b/drivers/net/ethernet/marvell/mvneta_bm.c
@@ -189,6 +189,7 @@ struct mvneta_bm_pool *mvneta_bm_pool_use(struct mvneta_bm *priv, u8 pool_id,
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	hwbm_pool->construct = mvneta_bm_construct;
 	hwbm_pool->priv = new_pool;
+	spin_lock_init(&hwbm_pool->lock);
 
 	/* Create new pool */
 	err = mvneta_bm_pool_create(priv, new_pool);
--
2.5.0
[PATCH net 2/2] net: hwbm: Fix unbalanced spinlock in error case
When hwbm_pool_add exited in error the spinlock was not released. This
patch fixes this issue.

Fixes: 8cb2d8bf57e6 ("net: add a hardware buffer management helper API")
Reported-by: Jean-Jacques Hiblot
Cc:
Signed-off-by: Gregory CLEMENT
---
 net/core/hwbm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/hwbm.c b/net/core/hwbm.c
index 941c28486896..2cab489ae62e 100644
--- a/net/core/hwbm.c
+++ b/net/core/hwbm.c
@@ -55,18 +55,21 @@ int hwbm_pool_add(struct hwbm_pool *bm_pool, unsigned int buf_num, gfp_t gfp)
 	spin_lock_irqsave(&bm_pool->lock, flags);
 	if (bm_pool->buf_num == bm_pool->size) {
 		pr_warn("pool already filled\n");
+		spin_unlock_irqrestore(&bm_pool->lock, flags);
 		return bm_pool->buf_num;
 	}
 
 	if (buf_num + bm_pool->buf_num > bm_pool->size) {
 		pr_warn("cannot allocate %d buffers for pool\n", buf_num);
+		spin_unlock_irqrestore(&bm_pool->lock, flags);
 		return 0;
 	}
 
 	if ((buf_num + bm_pool->buf_num) < bm_pool->buf_num) {
 		pr_warn("Adding %d buffers to the %d current buffers will overflow\n",
 			buf_num, bm_pool->buf_num);
+		spin_unlock_irqrestore(&bm_pool->lock, flags);
 		return 0;
 	}
--
2.5.0
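A common way to make this class of bug structurally impossible is to funnel every return path through a single unlock point, reached by goto, rather than unlocking before each early return. A userspace sketch of that shape, with a toy lock standing in for the driver spinlock (names and the lock type are illustrative, not the actual hwbm code):

```c
#include <assert.h>

/* Toy lock that checks balance: lock() and unlock() must alternate. */
struct toy_lock { int held; };

static void toy_lock(struct toy_lock *l)   { assert(!l->held); l->held = 1; }
static void toy_unlock(struct toy_lock *l) { assert(l->held);  l->held = 0; }

struct pool {
	struct toy_lock lock;
	unsigned int buf_num;
	unsigned int size;
};

/* Add buf_num buffers. Every return path funnels through the single
 * unlock at 'out', so the lock can never leak on an error path. */
static int pool_add(struct pool *p, unsigned int buf_num)
{
	int ret = 0;

	toy_lock(&p->lock);

	if (p->buf_num == p->size) {
		ret = p->buf_num;   /* pool already filled */
		goto out;
	}
	if (buf_num + p->buf_num > p->size)
		goto out;           /* cannot fit: add nothing, return 0 */

	p->buf_num += buf_num;
	ret = buf_num;
out:
	toy_unlock(&p->lock);
	return ret;
}
```

The fix above instead adds an unlock to each early return, which keeps the diff minimal but leaves future early returns free to reintroduce the leak.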
kernel panic with kernel 3.14.70, LVS on keepalived restart
Dear LVS users / netdev readers, today I've got a pretty peculiar problem. I've been running 3.14.48 (and some earlier 3.14 kernels) for a long time now in an LVS / keepalived driven loadbalancing cluster. See below for more detail on the setup. Today I started to upgrade to the current 3.14.70 kernel. At first glance everything seems fine, I can failover _to_ the box with the new kernel, and traffic is flowing fine. However, when I then switch BACK to a different box, the 3.14.70 kernel crashes. I've got an incomplete console dump (IPMI avi capture, single frame...) you can see here: https://plus.google.com/u/0/photos/photo/114613285248943487324/6288270822391242162 I usually manually failover by (having some automation) fiddle with keepalived.conf VRRP priority settings, then restart the daemon. The issue / reboot only manifests when I RESTART the ACTIVE keepalived. It always happens, then, with 3.14.70. That never happened before. On the other hand just reloading keepalived, with the prio-modified config, works fine! As I don't remember why I restarted instead of reloading, I can for now change my automation easily - but the issue is weird anyway. More info on the setup: 1) kernel is vanilla 3.14.70 (and was vanilla 3.14.48 without the issue), with a single (self written) patch to bonding applied (see http://permalink.gmane.org/gmane.linux.network/316758). Unfortunately I cannot live without that patch, i.e. can't try to reproduce with a pure vanilla vanilla kernel. 2) keepalived is 1.2.13 3) config uses "use_vmac" / "vmac_xmit_base", on multiple interfaces, i.e. MACVLAN interfaces on top of: 4) "normal" interfaces are both bridge-over-VLAN-over-LACP-bond-over-eth and ARP-bond-over-VLAN-over-eth 5) there is active conntracking including conntrackd (but excluding LVS state), LVS loadbalancing of some 15k pps, LVS sync, and heavy iptables use including ipset matching, going on. Just for completeness. Anybody got any idea what the root cause might be? 
best regards Patrick
Re: bpf: use-after-free in array_map_alloc
Hello, On Tue, May 24, 2016 at 10:40:54AM +0200, Vlastimil Babka wrote: > [+CC Marco who reported the CVE, forgot that earlier] > > On 05/23/2016 11:35 PM, Tejun Heo wrote: > > Hello, > > > > Can you please test whether this patch resolves the issue? While > > adding support for atomic allocations, I reduced alloc_mutex covered > > region too much. > > > > Thanks. > > Ugh, this makes the code even more head-spinning than it was. Locking-wise, it isn't complicated. It used to be a single mutex protecting everything. Atomic alloc support required putting core allocation parts under spinlock. It is messy because the two paths are mixed in the same function. If we break out the core part to a separate function and let the sleepable path call into that, it should look okay, but that's for another patch. Also, I think protecting chunk's lifetime w/ alloc_mutex is making it a bit nasty. Maybe we should do per-chunk "extending" completion and let pcpu_alloc_mutex just protect populating chunks. > > @@ -435,6 +435,8 @@ static int pcpu_extend_area_map(struct pcpu_chunk > > *chunk, int new_alloc) > > size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); > > unsigned long flags; > > > > + lockdep_assert_held(&pcpu_alloc_mutex); > > I don't see where the mutex gets locked when called via > pcpu_map_extend_workfn? (except via the new cancel_work_sync() call below?) Ah, right. > Also what protects chunks with scheduled work items from being removed? cancel_work_sync(), which now obviously should be called outside alloc_mutex. > > @@ -895,6 +897,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t > > align, bool reserved, > > return NULL; > > } > > > > + if (!is_atomic) > > + mutex_lock(&pcpu_alloc_mutex); > > BTW I noticed that > bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; > > this is too pessimistic IMHO. Reclaim is possible even without __GFP_FS and > __GFP_IO. Could you just use gfpflags_allow_blocking(gfp) here? 
vmalloc hardcodes GFP_KERNEL, so getting more relaxed doesn't buy us much. Thanks. -- tejun
Re: [RFC PATCH 7/7] tou: Support for GSO
On Mon, May 23, 2016 at 3:48 PM, Tom Herbert wrote: > Add SKB_GSO_TOU. In udp[64]_ufo_fragment check for SKB_GSO_TOU. If this > is set call skb_udp_tou_segment. skb_udp_tou_segment is very similar > to skb_udp_tunnel_segment except that we only need to deal with the > L4 headers. > > Signed-off-by: Tom Herbert > --- > include/linux/skbuff.h | 2 + > include/net/udp.h| 2 + > net/ipv4/fou.c | 2 + > net/ipv4/ip_output.c | 2 + > net/ipv4/udp_offload.c | 164 > +-- > net/ipv6/inet6_connection_sock.c | 3 + > net/ipv6/udp_offload.c | 128 +++--- > 7 files changed, 236 insertions(+), 67 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 65968a9..b57e484 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -482,6 +482,8 @@ enum { > SKB_GSO_PARTIAL = 1 << 13, > > SKB_GSO_TUNNEL_REMCSUM = 1 << 14, > + > + SKB_GSO_TOU = 1 << 15, > }; > So where do you add the netdev feature bit? From what I can tell that was overlooked and as a result devices that support FCoE CRC will end up corrupting TOU frames because netif_gso_ok currently ands the two together. Also I am pretty sure we can offload this on the Intel NICs using the GSO partial approach as we can just stuff the UDP header into the space that we would use for IPv4 options or IPv6 extension headers and it shouldn't complain. - Alex
Re: [PATCH net 3/3] Documentation: networking: dsa: Describe port_vlan_filtering
Hi Vivien, Florian, Tue, May 24, 2016 at 05:32:33PM IDT, vivien.dide...@savoirfairelinux.com wrote: >Hi Florian, > >Florian Fainelli writes: > >> Described what the port_vlan_filtering function is supposed to >> accomplish. >> >> Fixes: fb2dabad69f0 ("net: dsa: support VLAN filtering switchdev attr") >> Signed-off-by: Florian Fainelli >> --- >> Documentation/networking/dsa/dsa.txt | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/Documentation/networking/dsa/dsa.txt >> b/Documentation/networking/dsa/dsa.txt >> index 411b57fd73aa..a42fd2ec32a5 100644 >> --- a/Documentation/networking/dsa/dsa.txt >> +++ b/Documentation/networking/dsa/dsa.txt >> @@ -535,6 +535,12 @@ Bridge layer >> Bridge VLAN filtering >> - >> >> +- port_vlan_filtering: bridge layer function invoked when the bridge gets >> + configured for turning on or off VLAN filtering. If nothing specific >> needs to >> + be done at the hardware level, 0 must be returned. When VLAN filtering is >> + turned on, the hardware must be programmed with rejecting non-802.1Q >> frames, >> + when turned off the switch must accept any 802.1Q frames. > >Note that port_vlan_filtering is optional so a driver don't need to >implement it if nothing specific needs to be done at the hardware level. > >Also I'd think that with VLAN filtering on, the hardware must not reject >untagged frames, only 802.1Q frames which don't respect the programmed >VLAN rules. With VLAN filtering on I believe you only need to reject untagged frames when there's no PVID on the port. See __allowed_ingress() in net/bridge/br_vlan.c
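The ingress semantics being discussed can be condensed into a toy decision function. This is only an illustration of the behaviour described in this thread (filtering off accepts everything; untagged frames need a PVID; tagged frames need VLAN membership) and not the actual __allowed_ingress() logic, which also handles priority-tagged frames and per-VLAN port state:

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PVID 0  /* VID 0 is never a valid PVID */

/* Decide whether an ingressing frame is accepted on a bridge port.
 * vid == 0 models an untagged frame. */
static bool allowed_ingress(bool vlan_filtering, unsigned int pvid,
			    const unsigned int *member_vids, int n_vids,
			    unsigned int vid)
{
	int i;

	if (!vlan_filtering)
		return true;          /* filtering off: accept everything */

	if (vid == 0)                 /* untagged: needs a PVID to classify it */
		return pvid != NO_PVID;

	for (i = 0; i < n_vids; i++)
		if (member_vids[i] == vid)
			return true;  /* tagged and port is a member of the VLAN */

	return false;
}
```

This matches Ido's point: with filtering on, untagged frames are only rejected when the port has no PVID to classify them into.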
Re: [RFC] net: remove busylock
On Tue, 2016-05-24 at 13:50 +, David Laight wrote: > From: Jesper Dangaard Brouer > > Sent: 20 May 2016 18:50 > ... > > If would be cool if you could run a test with removed busylock and > > allow HTB to bulk dequeue. > > (Without having looked ) > Could you have two queues and separate queue and dequeue locks. > > The enqueue code would acquire the enqueue lock and add the packet > to the first queue. > > The dequeue code would acquire the dequeue lock and try to remove > a packet from the 2nd queue, if nothing present it would acquire > the enqueue lock and move the entire 1st queue to the 2nd queue. > > The obvious downside is two lock/unlocks for single packet dequeue. > If you can guarantee a single dequeue thread the 2nd lock isn't needed. > Not with HTB or any 'complex' qdisc hierarchy, where we have many 'queues' and strict limits (packets per qdisc)
Re: [PATCH net 3/3] Documentation: networking: dsa: Describe port_vlan_filtering
Hi Florian, Florian Fainelli writes: > Described what the port_vlan_filtering function is supposed to > accomplish. > > Fixes: fb2dabad69f0 ("net: dsa: support VLAN filtering switchdev attr") > Signed-off-by: Florian Fainelli > --- > Documentation/networking/dsa/dsa.txt | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/Documentation/networking/dsa/dsa.txt > b/Documentation/networking/dsa/dsa.txt > index 411b57fd73aa..a42fd2ec32a5 100644 > --- a/Documentation/networking/dsa/dsa.txt > +++ b/Documentation/networking/dsa/dsa.txt > @@ -535,6 +535,12 @@ Bridge layer > Bridge VLAN filtering > - > > +- port_vlan_filtering: bridge layer function invoked when the bridge gets > + configured for turning on or off VLAN filtering. If nothing specific needs > to > + be done at the hardware level, 0 must be returned. When VLAN filtering is > + turned on, the hardware must be programmed with rejecting non-802.1Q > frames, > + when turned off the switch must accept any 802.1Q frames. Note that port_vlan_filtering is optional so a driver don't need to implement it if nothing specific needs to be done at the hardware level. Also I'd think that with VLAN filtering on, the hardware must not reject untagged frames, only 802.1Q frames which don't respect the programmed VLAN rules. Thanks, Vivien
[PATCH v2] tipc: fix potential null pointer dereferences in some compat functions
Before calling the nla_parse_nested function, make sure the pointer to
the attribute is not null. This patch fixes several potential null
pointer dereference vulnerabilities in the tipc netlink functions.

Signed-off-by: Baozeng Ding
---
v2: declare local variable as reverse christmas tree format and make
    the commit log fit in 80 columns
---
 net/tipc/netlink_compat.c | 111 ++
 1 file changed, 93 insertions(+), 18 deletions(-)

diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c
index d7d050f..72a1c8f 100644
--- a/net/tipc/netlink_compat.c
+++ b/net/tipc/netlink_compat.c
@@ -346,9 +346,15 @@ static int tipc_nl_compat_bearer_dump(struct tipc_nl_compat_msg *msg,
 				      struct nlattr **attrs)
 {
 	struct nlattr *bearer[TIPC_NLA_BEARER_MAX + 1];
+	int err;
+
+	if (!attrs[TIPC_NLA_BEARER])
+		return -EINVAL;
 
-	nla_parse_nested(bearer, TIPC_NLA_BEARER_MAX, attrs[TIPC_NLA_BEARER],
-			 NULL);
+	err = nla_parse_nested(bearer, TIPC_NLA_BEARER_MAX,
+			       attrs[TIPC_NLA_BEARER], NULL);
+	if (err)
+		return err;
 
 	return tipc_add_tlv(msg->rep, TIPC_TLV_BEARER_NAME,
 			    nla_data(bearer[TIPC_NLA_BEARER_NAME]),
@@ -460,14 +466,31 @@ static int tipc_nl_compat_link_stat_dump(struct tipc_nl_compat_msg *msg,
 	struct nlattr *link[TIPC_NLA_LINK_MAX + 1];
 	struct nlattr *prop[TIPC_NLA_PROP_MAX + 1];
 	struct nlattr *stats[TIPC_NLA_STATS_MAX + 1];
+	int err;
 
-	nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK], NULL);
+	if (!attrs[TIPC_NLA_LINK])
+		return -EINVAL;
 
-	nla_parse_nested(prop, TIPC_NLA_PROP_MAX, link[TIPC_NLA_LINK_PROP],
-			 NULL);
+	err = nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK],
+			       NULL);
+	if (err)
+		return err;
+
+	if (!link[TIPC_NLA_LINK_PROP])
+		return -EINVAL;
 
-	nla_parse_nested(stats, TIPC_NLA_STATS_MAX, link[TIPC_NLA_LINK_STATS],
-			 NULL);
+	err = nla_parse_nested(prop, TIPC_NLA_PROP_MAX,
+			       link[TIPC_NLA_LINK_PROP], NULL);
+	if (err)
+		return err;
+
+	if (!link[TIPC_NLA_LINK_STATS])
+		return -EINVAL;
+
+	err = nla_parse_nested(stats, TIPC_NLA_STATS_MAX,
+			       link[TIPC_NLA_LINK_STATS], NULL);
+	if (err)
+		return err;
 
 	name = (char *)TLV_DATA(msg->req);
 	if (strcmp(name, nla_data(link[TIPC_NLA_LINK_NAME])) != 0)
@@ -569,8 +592,15 @@ static int tipc_nl_compat_link_dump(struct tipc_nl_compat_msg *msg,
 {
 	struct nlattr *link[TIPC_NLA_LINK_MAX + 1];
 	struct tipc_link_info link_info;
+	int err;
 
-	nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK], NULL);
+	if (!attrs[TIPC_NLA_LINK])
+		return -EINVAL;
+
+	err = nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK],
+			       NULL);
+	if (err)
+		return err;
 
 	link_info.dest = nla_get_flag(link[TIPC_NLA_LINK_DEST]);
 	link_info.up = htonl(nla_get_flag(link[TIPC_NLA_LINK_UP]));
@@ -758,12 +788,23 @@ static int tipc_nl_compat_name_table_dump(struct tipc_nl_compat_msg *msg,
 	u32 node, depth, type, lowbound, upbound;
 	static const char * const scope_str[] = {"", " zone", " cluster",
 						 " node"};
+	int err;
 
-	nla_parse_nested(nt, TIPC_NLA_NAME_TABLE_MAX,
-			 attrs[TIPC_NLA_NAME_TABLE], NULL);
+	if (!attrs[TIPC_NLA_NAME_TABLE])
+		return -EINVAL;
 
-	nla_parse_nested(publ, TIPC_NLA_PUBL_MAX, nt[TIPC_NLA_NAME_TABLE_PUBL],
-			 NULL);
+	err = nla_parse_nested(nt, TIPC_NLA_NAME_TABLE_MAX,
+			       attrs[TIPC_NLA_NAME_TABLE], NULL);
+	if (err)
+		return err;
+
+	if (!nt[TIPC_NLA_NAME_TABLE_PUBL])
+		return -EINVAL;
+
+	err = nla_parse_nested(publ, TIPC_NLA_PUBL_MAX,
+			       nt[TIPC_NLA_NAME_TABLE_PUBL], NULL);
+	if (err)
+		return err;
 
 	ntq = (struct tipc_name_table_query *)TLV_DATA(msg->req);
 
@@ -815,8 +856,15 @@ static int __tipc_nl_compat_publ_dump(struct tipc_nl_compat_msg *msg,
 {
 	u32 type, lower, upper;
 	struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1];
+	int err;
 
-	nla_parse_nested(publ, TIPC_NLA_PUBL_MAX, attrs[TIPC_NLA_PUBL], NULL);
+	if (!attrs[TIPC_NLA_PUBL])
+		return -EINVAL;
+
+	err = nla_parse_nested(publ, TIPC_NLA_PUBL_MAX, attrs[TIPC_NLA_PUBL],
+			       NULL);
+	if (err)
+		return err;
Re: [PATCH net 1/3] Documentation: networking: dsa: Remove poll_link description
Florian Fainelli writes: > This function has been removed in 4baee937b8d5 ("net: dsa: remove DSA > link polling") in favor of using the PHYLIB polling mechanism. > > Signed-off-by: Florian Fainelli Reviewed-by: Vivien Didelot
Re: [PATCH net 2/3] Documentation: networking: dsa: Remove priv_size description
Florian Fainelli writes: > We no longer have a priv_size structure member since 5feebd0a8a79 ("net: > dsa: Remove allocation of driver private memory") > > Signed-off-by: Florian Fainelli Reviewed-by: Vivien Didelot
RE: [RFC] net: remove busylock
From: Jesper Dangaard Brouer > Sent: 20 May 2016 18:50 ... > If would be cool if you could run a test with removed busylock and > allow HTB to bulk dequeue. (Without having looked ) Could you have two queues and separate queue and dequeue locks. The enqueue code would acquire the enqueue lock and add the packet to the first queue. The dequeue code would acquire the dequeue lock and try to remove a packet from the 2nd queue, if nothing present it would acquire the enqueue lock and move the entire 1st queue to the 2nd queue. The obvious downside is two lock/unlocks for single packet dequeue. If you can guarantee a single dequeue thread the 2nd lock isn't needed. David
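The two-queue scheme described above can be modelled with two mutexes and two pointer lists. This is a userspace sketch of the idea only (pthread mutexes standing in for the qdisc locks, intrusive list nodes standing in for skbs), not kernel code:

```c
#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; int val; };

struct twoq {
	pthread_mutex_t enq_lock, deq_lock;
	struct node *in_head, **in_tail;  /* 1st queue: producers append here */
	struct node *out;                 /* 2nd queue: consumer pops here */
};

static void twoq_init(struct twoq *q)
{
	pthread_mutex_init(&q->enq_lock, NULL);
	pthread_mutex_init(&q->deq_lock, NULL);
	q->in_head = NULL;
	q->in_tail = &q->in_head;
	q->out = NULL;
}

/* Enqueue: only the enqueue lock is taken. */
static void twoq_enqueue(struct twoq *q, struct node *n)
{
	n->next = NULL;
	pthread_mutex_lock(&q->enq_lock);
	*q->in_tail = n;
	q->in_tail = &n->next;
	pthread_mutex_unlock(&q->enq_lock);
}

/* Dequeue: takes the enqueue lock only when the 2nd queue runs dry,
 * splicing the entire 1st queue over in one go. */
static struct node *twoq_dequeue(struct twoq *q)
{
	struct node *n;

	pthread_mutex_lock(&q->deq_lock);
	if (!q->out) {
		pthread_mutex_lock(&q->enq_lock);
		q->out = q->in_head;   /* move whole pending list, in order */
		q->in_head = NULL;
		q->in_tail = &q->in_head;
		pthread_mutex_unlock(&q->enq_lock);
	}
	n = q->out;
	if (n)
		q->out = n->next;
	pthread_mutex_unlock(&q->deq_lock);
	return n;
}
```

As noted, producers and the consumer only contend on enq_lock when the second queue is empty; as Eric points out in his reply, this does not map directly onto HTB-style hierarchies with many queues and per-qdisc limits.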
Re: [PATCH iproute2 v3 5/5] ip: add MACsec support
Hello Stephen, 2016-05-23, 16:21:42 -0700, Stephen Hemminger wrote: > On Wed, 18 May 2016 17:35:13 +0200 > Sabrina Dubroca wrote: > > > + > > +static void print_rx_sc(const char *prefix, __u64 sci, __u8 active, struct > > rtattr *rxsc_stats, struct rtattr *sa) > > +{ > > Overall, this looks fine, but could you break some of the really long lines. > In general, I like iproute2 to follow kernel style, and in general stick to > the 80 column recommendation where it makes sense. Splitting strings or > stuff in loops may not make sense. There are several places in this code > longer than 100 chars. Ugh, yeah, I forgot to clean this up, sorry. I'll do that and resend. Thanks, -- Sabrina
Re: [PATCH iproute2 v3 1/5] add missing if_macsec.h header from kernel
2016-05-23, 16:07:20 -0700, Stephen Hemminger wrote: > On Wed, 18 May 2016 17:35:09 +0200 > Sabrina Dubroca wrote: > > > Signed-off-by: Sabrina Dubroca > > Acked-by: Phil Sutter > > This header already exists in master branch of iproute2, with current version > from upstream kernel (with PAD values). It wasn't when I posted this ;) Excellent, I'll drop this patch from v4. Thanks, -- Sabrina
Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote: > Unix sockets can consume a significant amount of system memory, hence > they should be accounted to kmemcg. > > Since unix socket buffers are always allocated from process context, > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in > sock->sk_allocation mask. I have two questions: 1) What happens when a buffer allocated from one socket lands in a different socket, maybe owned by another user/process? Who owns it now, in terms of kmemcg accounting? 2) Has the performance impact been evaluated? Thanks.
Re: [PATCH RESEND 7/8] pipe: account to kmemcg
On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote: > Pipes can consume a significant amount of system memory, hence they > should be accounted to kmemcg. > > This patch marks pipe_inode_info and anonymous pipe buffer page > allocations as __GFP_ACCOUNT so that they would be charged to kmemcg. > Note, since a pipe buffer page can be "stolen" and get reused for other > purposes, including mapping to userspace, we clear PageKmemcg thus > resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which > is introduced by this patch. > > Signed-off-by: Vladimir Davydov > Cc: Alexander Viro > --- > fs/pipe.c | 32 ++-- > 1 file changed, 26 insertions(+), 6 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 0d3f5165cb0b..4b32928f5426 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info > *pipe, > put_page(page); > } > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, > +struct pipe_buffer *buf) > +{ > + struct page *page = buf->page; > + > + if (page_count(page) == 1) { This looks racy : some cpu could have temporarily elevated page count. > + if (memcg_kmem_enabled()) { > + memcg_kmem_uncharge(page, 0); > + __ClearPageKmemcg(page); > + } > + __SetPageLocked(page); > + return 0; > + } > + return 1; > +} > +
Re: [iproute2 PATCH v2 1/1] vv4.6.0
More git foo. I have no idea how this went out ;-> Please ignore this but not the other. cheers, jamal

On 16-05-24 07:52 AM, Jamal Hadi Salim wrote:

From: Stephen Hemminger
---
 include/SNAPSHOT.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index 9220f77..cc67058 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "160314";
+static const char SNAPSHOT[] = "160518";
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote: > On Mon, 23 May 2016 23:52:47 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > > On Mon, 23 May 2016 13:43:46 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > Add ringtest based unit test for skb array. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > tools/virtio/ringtest/skb_array.c | 167 > > > > ++ > > > > tools/virtio/ringtest/Makefile| 4 +- > > > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > > "virtio_ring_inorder", I manually applied it. > > > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > > BUT how do I use it??? (the README is not helpful) > > > > > > What is the "output", are there any performance measurement results? > > > > First, if it completes successfully this means it completed > > a ton of cycles without errors. It caches any missing barriers > > which aren't nops on your system. > > I applied these patches on net-next (at commit 07b75260e) and the > skb_array test program never terminates. Strangely if I use your git > tree[1] (on branch vhost) the program does terminate... I didn't spot > the difference. > > > Second - use perf. > > I do like perf, but it does not answer my questions about the > performance of this queue. I will code something up in my own > framework[2] to answer my own performance questions. > > Like what is be minimum overhead (in cycles) achievable with this type > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) > for fastpath usage. Actually there is, kind of, a way to find out with my tool if you have an HT CPU. When you do run-on-all.sh it will pin consumer to the last CPU, then run producer on all of them. Look for the number for the HT pair - this shares cache between producer and consumer. 
This is not the same as doing produce + consume on the same CPU but it's close enough I think. To measure overhead I guess I should build a NOP tool that does not actually produce or consume anything. Will do. > Then I also want to know how this performs when two CPUs are involved. > As this is also a primary use-case, for you when sending packets into a > guest. > > > > E.g. simple perf stat will measure how long does it take to execute. > > there's a script that runs it on different CPUs, > > so I normally do: > > > > sh run-on-all.sh perf stat -r 5 ./skb_array > > I recommend documenting this in the README file in the same dir ;-) > > [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost > [2] https://github.com/netoptimizer/prototype-kernel > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
[PULL] vhost: cleanups and fixes
The following changes since commit 2dcd0af568b0cf583645c8a317dd12e344b1c72a:

  Linux 4.6 (2016-05-15 15:43:13 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to bb991288728e6a47a6f0fac6a4e9dfaeecc27956:

  ringtest: pass buf != NULL (2016-05-22 19:44:14 +0300)

virtio: patches for 4.7

Looks like a quiet cycle for virtio. There's a new inorder option for the ringtest tool, and a bugfix for balloon for ppc platforms when using virtio 1 mode.

Signed-off-by: Michael S. Tsirkin

Michael S. Tsirkin (3):
      virtio: add inorder option
      virtio_balloon: fix PFN format for virtio-1
      ringtest: pass buf != NULL

 drivers/virtio/virtio_balloon.c             | 20 +++-
 tools/virtio/ringtest/main.c                |  2 +-
 tools/virtio/ringtest/virtio_ring_0_9.c     | 49 -
 tools/virtio/ringtest/virtio_ring_inorder.c |  2 ++
 tools/virtio/ringtest/Makefile              |  5 ++-
 5 files changed, 67 insertions(+), 11 deletions(-)
 create mode 100644 tools/virtio/ringtest/virtio_ring_inorder.c
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote: > On Mon, 23 May 2016 23:52:47 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > > On Mon, 23 May 2016 13:43:46 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > Add ringtest based unit test for skb array. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > tools/virtio/ringtest/skb_array.c | 167 > > > > ++ > > > > tools/virtio/ringtest/Makefile| 4 +- > > > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > > "virtio_ring_inorder", I manually applied it. > > > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > > BUT how do I use it??? (the README is not helpful) > > > > > > What is the "output", are there any performance measurement results? > > > > First, if it completes successfully this means it completed > > a ton of cycles without errors. It caches any missing barriers > > which aren't nops on your system. > > I applied these patches on net-next (at commit 07b75260e) and the > skb_array test program never terminates. Strangely if I use your git > tree[1] (on branch vhost) the program does terminate... I didn't spot > the difference. Oh, that's my bad. You need ringtest: pass buf != NULL just a stub pointer for now. Signed-off-by: Michael S. Tsirkin from my tree. -- MST
[iproute2 PATCH v2 1/1] tc simple action: bug fix
From: Jamal Hadi Salim

Failed compile

m_simple.c: In function ‘parse_simple’:
m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args]
   *argv);
      ^
m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable]

Reported-by: Daniel Borkmann
Signed-off-by: Jamal Hadi Salim
---
 tc/m_simple.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tc/m_simple.c b/tc/m_simple.c
index feba61b..27b3e5e 100644
--- a/tc/m_simple.c
+++ b/tc/m_simple.c
@@ -100,7 +100,7 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 	struct tc_defact sel = {};
 	int argc = *argc_p;
 	char **argv = *argv_p;
-	int ok = 0, maybe_bind = 0;
+	int ok = 0;
 	struct rtattr *tail;
 	char *simpdata = NULL;
@@ -150,7 +150,7 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 	if (matches(*argv, "index") == 0) {
 		NEXT_ARG();
 		if (get_u32(&sel.index, *argv, 10)) {
-			fprintf(stderr, "simple: Illegal \"index\"\n",
+			fprintf(stderr, "simple: Illegal \"index\" (%s)\n",
 				*argv);
 			return -1;
 		}
@@ -171,7 +171,6 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 		return -1;
 	}
-	sel.action = TC_ACT_PIPE;
 	tail = NLMSG_TAIL(n);
-- 
1.9.1
[iproute2 PATCH v2 1/1] vv4.6.0
From: Stephen Hemminger
---
 include/SNAPSHOT.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index 9220f77..cc67058 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "160314";
+static const char SNAPSHOT[] = "160518";
-- 
1.9.1
Re: [iproute2 1/1] tc simple action: bug fix
On 16-05-24 06:30 AM, Daniel Borkmann wrote: On 05/24/2016 03:04 AM, Jamal Hadi Salim wrote: From: Jamal Hadi Salim Failed compile m_simple.c: In function ‘parse_simple’: m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args] *argv); ^ m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable] Btw, that last one is still not addressed. Sigh. It is fixed in my tree. I have a few patches that I was planning to send later, some of which touch intersecting files, and I am trying to be clever with git. Any tips? ;-> Will send v2. cheers, jamal
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote: > On Mon, 23 May 2016 23:52:47 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > > On Mon, 23 May 2016 13:43:46 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > Add ringtest based unit test for skb array. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > tools/virtio/ringtest/skb_array.c | 167 > > > > ++ > > > > tools/virtio/ringtest/Makefile| 4 +- > > > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > > "virtio_ring_inorder", I manually applied it. > > > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > > BUT how do I use it??? (the README is not helpful) > > > > > > What is the "output", are there any performance measurement results? > > > > First, if it completes successfully this means it completed > > a ton of cycles without errors. It caches any missing barriers > > which aren't nops on your system. > > I applied these patches on net-next (at commit 07b75260e) and the > skb_array test program never terminates. Strangely if I use your git > tree[1] (on branch vhost) the program does terminate... I didn't spot > the difference. Disassemble the binaries and compare? Should be identical. Or attach gdb and look at array.producer and array.consumer. > > Second - use perf. > > I do like perf, but it does not answer my questions about the > performance of this queue. I will code something up in my own > framework[2] to answer my own performance questions. Sounds good. > Like what is be minimum overhead (in cycles) achievable with this type > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) > for fastpath usage. Interesting. > Then I also want to know how this performs when two CPUs are involved. This has flags to pin threads to different CPUs. > As this is also a primary use-case, for you when sending packets into a > guest. 
> That's absolutely the primary usecase. Was designed with this in mind. > > > E.g. simple perf stat will measure how long does it take to execute. > > there's a script that runs it on different CPUs, > > so I normally do: > > > > sh run-on-all.sh perf stat -r 5 ./skb_array > > I recommend documenting this in the README file in the same dir ;-) Good idea. Will do. > [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost > [2] https://github.com/netoptimizer/prototype-kernel > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
Re: [iproute2 1/1] tc simple action: bug fix
On 05/24/2016 03:04 AM, Jamal Hadi Salim wrote: From: Jamal Hadi Salim Failed compile m_simple.c: In function ‘parse_simple’: m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args] *argv); ^ m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable] Btw, that last one is still not addressed.
Re: [PATCH v5 2/2] skb_array: ring test
On Mon, 23 May 2016 23:52:47 +0300 "Michael S. Tsirkin" wrote: > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > On Mon, 23 May 2016 13:43:46 +0300 > > "Michael S. Tsirkin" wrote: > > > > > Add ringtest based unit test for skb array. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > tools/virtio/ringtest/skb_array.c | 167 > > > ++ > > > tools/virtio/ringtest/Makefile| 4 +- > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > "virtio_ring_inorder", I manually applied it. > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > BUT how do I use it??? (the README is not helpful) > > > > What is the "output", are there any performance measurement results? > First, if it completes successfully this means it completed > a ton of cycles without errors. It caches any missing barriers > which aren't nops on your system. I applied these patches on net-next (at commit 07b75260e) and the skb_array test program never terminates. Strangely if I use your git tree[1] (on branch vhost) the program does terminate... I didn't spot the difference. > Second - use perf. I do like perf, but it does not answer my questions about the performance of this queue. I will code something up in my own framework[2] to answer my own performance questions. Like: what is the minimum overhead (in cycles) achievable with this type of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) for fastpath usage. Then I also want to know how this performs when two CPUs are involved. As this is also a primary use-case for you when sending packets into a guest. > E.g. simple perf stat will measure how long does it take to execute. 
> there's a script that runs it on different CPUs, > so I normally do: > > sh run-on-all.sh perf stat -r 5 ./skb_array I recommend documenting this in the README file in the same dir ;-) [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost [2] https://github.com/netoptimizer/prototype-kernel -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
Unix sockets can consume a significant amount of system memory, hence they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context, all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in sock->sk_allocation mask.

Signed-off-by: Vladimir Davydov
Cc: "David S. Miller"
---
 net/unix/af_unix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80aa6a3e6817..022bdd3ab7d9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	lockdep_set_class(&sk->sk_receive_queue.lock,
 				&af_unix_sk_receive_queue_lock_key);
 
+	sk->sk_allocation	= GFP_KERNEL_ACCOUNT;
 	sk->sk_write_space	= unix_write_space;
 	sk->sk_max_ack_backlog	= net->unx.sysctl_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
-- 
2.1.4
[PATCH net 1/1] qed: Reset the enable flag for eth protocol.
This patch fixes the coding error in determining the enable flag for the application/protocol. The enable flag should be set for all protocols but the eth.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Yuval Mintz
---
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index a06d19a..21ec1c2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -222,7 +222,7 @@ qed_dcbx_process_tlv(struct qed_hwfn *p_hwfn,
 		 * indication, but we only got here if there was an
 		 * app tlv for the protocol, so dcbx must be enabled.
 		 */
-		enable = !!(type == DCBX_PROTOCOL_ETH);
+		enable = !(type == DCBX_PROTOCOL_ETH);
 		qed_dcbx_update_app_info(p_data, p_hwfn, enable, true,
 					 priority, tc, type);
-- 
1.8.3.1
[RFC PATCH V3 0/3] basic device IOTLB support
This patch tries to implement a device IOTLB for vhost. It can be used in co-operation with a userspace IOMMU implementation (qemu) for a secure DMA environment (DMAR) in the guest.

The idea is simple. When vhost hits an IOTLB miss, it requests the assistance of userspace to do the translation. This is done through:

- When there's an IOTLB miss, vhost notifies userspace through the vhost_net fd, and userspace then reads the fault address, size and access type from the vhost fd.
- Userspace writes the translation result back to the vhost fd; vhost can then update its IOTLB.

The code was optimized for fixed-mapping users, e.g. dpdk in the guest. It will be slow if dynamic mappings are used in the guest. We could do optimizations on top.

The code was designed to be architecture independent. It should be easily ported to any architecture.

Stress tested with l2fwd/vfio in the guest with 4K/2M/1G page sizes. In the 1G hugepage case, a 100% TLB hit rate was observed.

Changes from V2:
- introduce memory accessors for vhost
- switch from ioctls to ordinary file read/write for iotlb misses and updates
- do not assume virtqueues are virtually mapped contiguously; all virtqueue accesses are done through the IOTLB
- verify memory access during IOTLB update and fail early
- introduce a module parameter for the size of the IOTLB

Changes from V1:
- support any size/range of updating and invalidation through introducing the interval tree.
- convert from per-device iotlb requests to per-virtqueue iotlb requests; this solves the possible deadlock in V1.
- read/write permission check support.

Please review.

Jason Wang (3):
  vhost: introduce vhost memory accessors
  vhost: convert pre sorted vhost memory array to interval tree
  vhost: device IOTLB API

 drivers/vhost/net.c        | 63 +++-
 drivers/vhost/vhost.c      | 760 ++---
 drivers/vhost/vhost.h      | 60 +++-
 include/uapi/linux/vhost.h | 28 ++
 4 files changed, 790 insertions(+), 121 deletions(-)

-- 
2.7.4
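The miss/service protocol sketched in the cover letter boils down to a small userspace loop: read a fault, translate, write the result back. A hedged illustration — `struct iotlb_miss`, `struct iotlb_update` and the identity `translate()` are stand-ins invented for this example, not the actual uapi this series adds:

```c
#include <stdint.h>
#include <assert.h>
#include <unistd.h>
#include <sys/socket.h>

/* Illustrative stand-ins -- NOT the message layout defined by the patch. */
struct iotlb_miss {
	uint64_t iova;
	uint64_t size;
	uint8_t perm;
};

struct iotlb_update {
	uint64_t iova;
	uint64_t size;
	uint64_t uaddr;
	uint8_t perm;
};

/* Toy translation: identity-map the iova into a fixed user region. A real
 * IOMMU implementation would walk its shadow page tables here. */
static uint64_t translate(uint64_t iova)
{
	return 0x7f0000000000ULL + iova;
}

/* One round of the service loop: read one fault record from the vhost fd
 * and write the translated mapping back to update the device IOTLB. */
static int service_one_miss(int vhost_fd)
{
	struct iotlb_miss miss;
	struct iotlb_update upd = { 0 };

	if (read(vhost_fd, &miss, sizeof(miss)) != (ssize_t)sizeof(miss))
		return -1;
	upd.iova = miss.iova;
	upd.size = miss.size;
	upd.perm = miss.perm;
	upd.uaddr = translate(miss.iova);
	if (write(vhost_fd, &upd, sizeof(upd)) != (ssize_t)sizeof(upd))
		return -1;
	return 0;
}
```

For fixed-mapping guests like dpdk, each iova is faulted once and then served from the in-kernel interval tree, which is why the cover letter reports a 100% hit rate with 1G hugepages.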
[RFC PATCH V3 2/3] vhost: convert pre sorted vhost memory array to interval tree
Current pre-sorted memory region array has some limitations for future device IOTLB conversion: 1) need extra work for adding and removing a single region, and it's expected to be slow because of sorting or memory re-allocation. 2) need extra work of removing a large range which may intersect several regions with different size. 3) need trick for a replacement policy like LRU To overcome the above shortcomings, this patch convert it to interval tree which can easily address the above issue with almost no extra work. The patch could be used for: - Extend the current API and only let the userspace to send diffs of memory table. - Simplify Device IOTLB implementation. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 8 +-- drivers/vhost/vhost.c | 182 -- drivers/vhost/vhost.h | 27 ++-- 3 files changed, 128 insertions(+), 89 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index beaeb17..a584239 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1037,20 +1037,20 @@ static long vhost_net_reset_owner(struct vhost_net *n) struct socket *tx_sock = NULL; struct socket *rx_sock = NULL; long err; - struct vhost_memory *memory; + struct vhost_umem *umem; mutex_lock(&n->dev.mutex); err = vhost_dev_check_owner(&n->dev); if (err) goto done; - memory = vhost_dev_reset_owner_prepare(); - if (!memory) { + umem = vhost_dev_reset_owner_prepare(); + if (!umem) { err = -ENOMEM; goto done; } vhost_net_stop(n, &tx_sock, &rx_sock); vhost_net_flush(n); - vhost_dev_reset_owner(&n->dev, memory); + vhost_dev_reset_owner(&n->dev, umem); vhost_net_vq_reset(n); done: mutex_unlock(&n->dev.mutex); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9f2a63a..166e779 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "vhost.h" @@ -42,6 +43,10 @@ enum { #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num]) #define vhost_avail_event(vq) ((__virtio16 __user 
*)&vq->used->ring[vq->num]) +INTERVAL_TREE_DEFINE(struct vhost_umem_node, +rb, __u64, __subtree_last, +START, LAST, , vhost_umem_interval_tree); + #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY static void vhost_disable_cross_endian(struct vhost_virtqueue *vq) { @@ -300,10 +305,10 @@ static void vhost_vq_reset(struct vhost_dev *dev, vq->call_ctx = NULL; vq->call = NULL; vq->log_ctx = NULL; - vq->memory = NULL; vhost_reset_is_le(vq); vhost_disable_cross_endian(vq); vq->busyloop_timeout = 0; + vq->umem = NULL; } static int vhost_worker(void *data) @@ -407,7 +412,7 @@ void vhost_dev_init(struct vhost_dev *dev, mutex_init(&dev->mutex); dev->log_ctx = NULL; dev->log_file = NULL; - dev->memory = NULL; + dev->umem = NULL; dev->mm = NULL; spin_lock_init(&dev->work_lock); INIT_LIST_HEAD(&dev->work_list); @@ -512,27 +517,36 @@ err_mm: } EXPORT_SYMBOL_GPL(vhost_dev_set_owner); -struct vhost_memory *vhost_dev_reset_owner_prepare(void) +static void *vhost_kvzalloc(unsigned long size) { - return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL); + void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); + + if (!n) + n = vzalloc(size); + return n; +} + +struct vhost_umem *vhost_dev_reset_owner_prepare(void) +{ + return vhost_kvzalloc(sizeof(struct vhost_umem)); } EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare); /* Caller should have device mutex */ -void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory) +void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem) { int i; vhost_dev_cleanup(dev, true); /* Restore memory to default empty mapping. */ - memory->nregions = 0; - dev->memory = memory; + INIT_LIST_HEAD(&umem->umem_list); + dev->umem = umem; /* We don't need VQ locks below since vhost_dev_cleanup makes sure * VQs aren't running. 
*/ for (i = 0; i < dev->nvqs; ++i) - dev->vqs[i]->memory = memory; + dev->vqs[i]->umem = umem; } EXPORT_SYMBOL_GPL(vhost_dev_reset_owner); @@ -549,6 +563,21 @@ void vhost_dev_stop(struct vhost_dev *dev) } EXPORT_SYMBOL_GPL(vhost_dev_stop); +static void vhost_umem_clean(struct vhost_umem *umem) +{ + struct vhost_umem_node *node, *tmp; + + if (!umem) + return; + + list_for_each_entry_safe(node, tmp, &umem->umem_list, link) { + vhost_umem_interval_tree_remove(node, &umem->umem_tree); + list_del(&node->link); +
[RFC PATCH V3 1/3] vhost: introduce vhost memory accessors
This patch introduces vhost memory accessors which were just wrappers for userspace address access helpers. This is a requirement for vhost device iotlb implementation which will add iotlb translations in those accessors. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 1 + drivers/vhost/vhost.c | 50 +++--- 2 files changed, 36 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f744eeb..beaeb17 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -985,6 +985,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) vhost_net_disable_vq(n, vq); vq->private_data = sock; + /* FIXME: iotlb prefetch here? */ r = vhost_vq_init_access(vq); if (r) goto err_used; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 669fef1..9f2a63a 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -651,6 +651,22 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem, return 1; } +#define vhost_put_user(vq, x, ptr) __put_user(x, ptr) + +static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to, + const void *from, unsigned size) +{ + return __copy_to_user(to, from, size); +} + +#define vhost_get_user(vq, x, ptr) __get_user(x, ptr) + +static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to, + void *from, unsigned size) +{ + return __copy_from_user(to, from, size); +} + static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num, struct vring_desc __user *desc, struct vring_avail __user *avail, @@ -1156,7 +1172,8 @@ EXPORT_SYMBOL_GPL(vhost_log_write); static int vhost_update_used_flags(struct vhost_virtqueue *vq) { void __user *used; - if (__put_user(cpu_to_vhost16(vq, vq->used_flags), &vq->used->flags) < 0) + if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags), + &vq->used->flags) < 0) return -EFAULT; if (unlikely(vq->log_used)) { /* Make sure the flag is seen before log. 
*/ @@ -1174,7 +1191,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq) static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event) { - if (__put_user(cpu_to_vhost16(vq, vq->avail_idx), vhost_avail_event(vq))) + if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx), + vhost_avail_event(vq))) return -EFAULT; if (unlikely(vq->log_used)) { void __user *used; @@ -1212,7 +1230,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq) r = -EFAULT; goto err; } - r = __get_user(last_used_idx, &vq->used->idx); + r = vhost_get_user(vq, last_used_idx, &vq->used->idx); if (r) goto err; vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx); @@ -1392,7 +1410,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Check it isn't doing very strange things with descriptor numbers. */ last_avail_idx = vq->last_avail_idx; - if (unlikely(__get_user(avail_idx, &vq->avail->idx))) { + if (unlikely(vhost_get_user(vq, avail_idx, &vq->avail->idx))) { vq_err(vq, "Failed to access avail idx at %p\n", &vq->avail->idx); return -EFAULT; @@ -1414,8 +1432,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Grab the next descriptor number they're advertising, and increment * the index we've seen. 
*/ - if (unlikely(__get_user(ring_head, - &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) { + if (unlikely(vhost_get_user(vq, ring_head, +&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) { vq_err(vq, "Failed to read head: idx %d address %p\n", last_avail_idx, &vq->avail->ring[last_avail_idx % vq->num]); @@ -1450,7 +1468,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, i, vq->num, head); return -EINVAL; } - ret = __copy_from_user(&desc, vq->desc + i, sizeof desc); + ret = vhost_copy_from_user(vq, &desc, vq->desc + i, + sizeof desc); if (unlikely(ret)) { vq_err(vq, "Failed to get descriptor: idx %d addr %p\n", i, vq->desc + i); @@ -1538,15 +1557,15 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq, start = vq->last_used_idx & (vq->num - 1); used = vq->used->ring + start; if (count == 1)
[RFC PATCH V3 3/3] vhost: device IOTLB API
This patch tries to implement a device IOTLB for vhost. It can be used in co-operation with a userspace (qemu) implementation of DMA remapping. The idea is simple: cache the translations in a software device IOTLB (an interval tree) in vhost, and make the vhost_net file descriptor readable and writable by a process. When vhost hits an IOTLB miss, the fault address, size and access type can be read from the file. After userspace finishes the translation, it writes the translated address to the vhost_net file to update the device IOTLB. When the device IOTLB (VHOST_F_DEVICE_IOTLB) is enabled, all vq addresses set by ioctl are treated as iovas instead of virtual addresses, and all accesses can only be done through the IOTLB instead of direct userspace memory access. Before each round of vq processing, all vq metadata is prefetched into the device IOTLB to make sure no translation fault happens during vq processing. In most cases virtqueues are mapped contiguously even in virtual address space, so the IOTLB translation for the virtqueue itself may be a little slower. We can add a fast path on top of this patch. 
Signed-off-by: Jason Wang --- drivers/vhost/net.c| 54 - drivers/vhost/vhost.c | 586 + drivers/vhost/vhost.h | 35 ++- include/uapi/linux/vhost.h | 28 +++ 4 files changed, 656 insertions(+), 47 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index a584239..0f45e3d 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -61,7 +61,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;" enum { VHOST_NET_FEATURES = VHOST_FEATURES | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) | -(1ULL << VIRTIO_NET_F_MRG_RXBUF) +(1ULL << VIRTIO_NET_F_MRG_RXBUF) | +(1ULL << VHOST_F_DEVICE_IOTLB) }; enum { @@ -308,7 +309,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net, { unsigned long uninitialized_var(endtime); int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), - out_num, in_num, NULL, NULL); + out_num, in_num, NULL, NULL, + VHOST_ACCESS_RO); if (r == vq->num && vq->busyloop_timeout) { preempt_disable(); @@ -318,7 +320,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net, cpu_relax_lowlatency(); preempt_enable(); r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), - out_num, in_num, NULL, NULL); + out_num, in_num, NULL, NULL, + VHOST_ACCESS_RO); } return r; @@ -351,6 +354,9 @@ static void handle_tx(struct vhost_net *net) if (!sock) goto out; + if (!vq_iotlb_prefetch(vq)) + goto out; + vhost_disable_notify(&net->dev, vq); hdr_size = nvq->vhost_hlen; @@ -538,7 +544,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq, } r = vhost_get_vq_desc(vq, vq->iov + seg, ARRAY_SIZE(vq->iov) - seg, &out, - &in, log, log_num); + &in, log, log_num, VHOST_ACCESS_WO); if (unlikely(r < 0)) goto err; @@ -612,6 +618,10 @@ static void handle_rx(struct vhost_net *net) sock = vq->private_data; if (!sock) goto out; + + if (!vq_iotlb_prefetch(vq)) + goto out; + vhost_disable_notify(&net->dev, vq); vhost_hlen = nvq->vhost_hlen; @@ -1085,6 +1095,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features) mutex_unlock(&n->dev.mutex); 
return -EFAULT; } + if ((features & (1ULL << VHOST_F_DEVICE_IOTLB))) { + if (vhost_init_device_iotlb(&n->dev, true)) + return -EFAULT; + } + for (i = 0; i < VHOST_NET_VQ_MAX; ++i) { mutex_lock(&n->vqs[i].vq.mutex); n->vqs[i].vq.acked_features = features; @@ -1167,9 +1182,40 @@ static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl, } #endif +static ssize_t vhost_net_chr_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *file = iocb->ki_filp; + struct vhost_net *n = file->private_data; + struct vhost_dev *dev = &n->dev; + int noblock = file->f_flags & O_NONBLOCK; + + return vhost_chr_read_iter(dev, to, noblock); +} + +static ssize_t vhost_net_chr_write_iter(struct kiocb *iocb, + struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct vhost_net *n = file->private_data; + struc
[PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg
Page tables can bite a relatively big chunk off system memory and their allocations are easy to trigger from userspace, so they should be accounted to kmemcg. This patch marks page table allocations as __GFP_ACCOUNT for x86. Note we must not charge allocations of kernel page tables, because they can be shared among processes from different cgroups so accounting them to a particular one can pin other cgroups for indefinitely long. So we clear __GFP_ACCOUNT flag if a page table is allocated for the kernel. Signed-off-by: Vladimir Davydov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" --- arch/x86/include/asm/pgalloc.h | 12 ++-- arch/x86/mm/pgtable.c | 11 --- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index bf7f8b55b0f9..2f531633cb16 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -81,7 +81,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { struct page *page; - page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0); + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO; + + if (mm == &init_mm) + gfp &= ~__GFP_ACCOUNT; + page = alloc_pages(gfp, 0); if (!page) return NULL; if (!pgtable_pmd_page_ctor(page)) { @@ -125,7 +129,11 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) { - return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT); + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT; + + if (mm == &init_mm) + gfp &= ~__GFP_ACCOUNT; + return (pud_t *)get_zeroed_page(gfp); } static inline void pud_free(struct mm_struct *mm, pud_t *pud) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 4eb287e25043..421ac6b74d11 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -6,7 +6,8 @@ #include #include 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO +#define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | __GFP_REPEAT | \ +__GFP_ZERO) #ifdef CONFIG_HIGHPTE #define PGALLOC_USER_GFP __GFP_HIGHMEM @@ -18,7 +19,7 @@ gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP; pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return (pte_t *)__get_free_page(PGALLOC_GFP); + return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT); } pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address) @@ -207,9 +208,13 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; bool failed = false; + gfp_t gfp = PGALLOC_GFP; + + if (mm == &init_mm) + gfp &= ~__GFP_ACCOUNT; for(i = 0; i < PREALLOCATED_PMDS; i++) { - pmd_t *pmd = (pmd_t *)__get_free_page(PGALLOC_GFP); + pmd_t *pmd = (pmd_t *)__get_free_page(gfp); if (!pmd) failed = true; if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) { -- 2.1.4
[PATCH RESEND 1/8] mm: remove pointless struct in struct page definition
... to reduce indentation level thus leaving more space for comments. Signed-off-by: Vladimir Davydov --- include/linux/mm_types.h | 68 +++- 1 file changed, 32 insertions(+), 36 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d553855503e6..3cc5977a9cab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -60,51 +60,47 @@ struct page { }; /* Second double word */ - struct { - union { - pgoff_t index; /* Our offset within mapping. */ - void *freelist; /* sl[aou]b first free object */ - /* page_deferred_list().prev-- second tail page */ - }; + union { + pgoff_t index; /* Our offset within mapping. */ + void *freelist; /* sl[aou]b first free object */ + /* page_deferred_list().prev-- second tail page */ + }; - union { + union { #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) - /* Used for cmpxchg_double in slub */ - unsigned long counters; + /* Used for cmpxchg_double in slub */ + unsigned long counters; #else - /* -* Keep _refcount separate from slub cmpxchg_double -* data. As the rest of the double word is protected by -* slab_lock but _refcount is not. -*/ - unsigned counters; + /* +* Keep _refcount separate from slub cmpxchg_double data. +* As the rest of the double word is protected by slab_lock +* but _refcount is not. +*/ + unsigned counters; #endif + struct { - struct { - - union { - /* -* Count of ptes mapped in mms, to show -* when page is mapped & limit reverse -* map searches. -*/ - atomic_t _mapcount; - - struct { /* SLUB */ - unsigned inuse:16; - unsigned objects:15; - unsigned frozen:1; - }; - int units; /* SLOB */ - }; + union { /* -* Usage count, *USE WRAPPER FUNCTION* -* when manual accounting. See page_ref.h +* Count of ptes mapped in mms, to show when +* page is mapped & limit reverse map searches. 
*/ - atomic_t _refcount; + atomic_t _mapcount; + + unsigned int active;/* SLAB */ + struct {/* SLUB */ + unsigned inuse:16; + unsigned objects:15; + unsigned frozen:1; + }; + int units; /* SLOB */ }; - unsigned int active;/* SLAB */ + /* +* Usage count, *USE WRAPPER FUNCTION* when manual +* accounting. See page_ref.h +*/ + atomic_t _refcount; }; }; -- 2.1.4
Re: [PATCH 8/8] af_unix: charge buffers to kmemcg
[adding netdev to Cc] On Mon, May 23, 2016 at 01:20:29PM +0300, Vladimir Davydov wrote: > Unix sockets can consume a significant amount of system memory, hence > they should be accounted to kmemcg. > > Since unix socket buffers are always allocated from process context, > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in > sock->sk_allocation mask. > > Signed-off-by: Vladimir Davydov > Cc: "David S. Miller" > --- > net/unix/af_unix.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index 80aa6a3e6817..022bdd3ab7d9 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct > socket *sock, int kern) > lockdep_set_class(&sk->sk_receive_queue.lock, > &af_unix_sk_receive_queue_lock_key); > > + sk->sk_allocation = GFP_KERNEL_ACCOUNT; > sk->sk_write_space = unix_write_space; > sk->sk_max_ack_backlog = net->unx.sysctl_max_dgram_qlen; > sk->sk_destruct = unix_sock_destructor;
[PATCH] brcmfmac: fix setting AP channel with new firmwares
Firmware for new chipsets is based on a new major version of code internally maintained at Broadcom. E.g. brcmfmac4366b-pcie.bin (used for BCM4366B1) is based on 10.10.69.3309 while brcmfmac43602-pcie.ap.bin was based on 7.35.177.56. Currently setting AP 5 GHz channel doesn't work reliably with BCM4366B1. When setting e.g. 36 control channel with VHT80 (center channel 42) firmware may randomly pick one of: 1) 52 control channel with 58 as center one 2) 100 control channel with 106 as center one 3) 116 control channel with 122 as center one 4) 149 control channel with 155 as center one It seems new firmwares require setting AP mode (BRCMF_C_SET_AP) before specifying a channel. Changing an order of firmware calls fixes the problem. This fix was verified with BCM4366B1 and tested for regressions on BCM43602. It's unclear if it's needed (or correct at all) for P2P interfaces so it leaves this code unaffected. Signed-off-by: Rafał Miłecki --- .../net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c index 299a404..3d09d23 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c @@ -4423,7 +4423,7 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct net_device *ndev, struct brcmf_join_params join_params; enum nl80211_iftype dev_role; struct brcmf_fil_bss_enable_le bss_enable; - u16 chanspec; + u16 chanspec = chandef_to_chanspec(&cfg->d11inf, &settings->chandef); bool mbss; int is_11d; @@ -4499,16 +4499,16 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct net_device *ndev, brcmf_config_ap_mgmt_ie(ifp->vif, &settings->beacon); - if (!mbss) { - chanspec = chandef_to_chanspec(&cfg->d11inf, - &settings->chandef); + if (dev_role == NL80211_IFTYPE_P2P_GO) { err = brcmf_fil_iovar_int_set(ifp, "chanspec", 
chanspec); if (err < 0) { brcmf_err("Set Channel failed: chspec=%d, %d\n", chanspec, err); goto exit; } + } + if (!mbss) { if (is_11d != ifp->vif->is_11d) { err = brcmf_fil_cmd_int_set(ifp, BRCMF_C_SET_REGULATORY, is_11d); @@ -4565,6 +4565,14 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct net_device *ndev, brcmf_err("setting AP mode failed %d\n", err); goto exit; } + if (!mbss) { + err = brcmf_fil_iovar_int_set(ifp, "chanspec", chanspec); + if (err < 0) { + brcmf_err("Set Channel failed: chspec=%d, %d\n", + chanspec, err); + goto exit; + } + } err = brcmf_fil_cmd_int_set(ifp, BRCMF_C_UP, 1); if (err < 0) { brcmf_err("BRCMF_C_UP error (%d)\n", err); -- 1.8.4.5
[PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users
- Add a proper comment to page->_mapcount. - Introduce a macro for generating helper functions. - Place all special page->_mapcount values next to each other so that readers can see all possible values and so we don't get duplicates. Signed-off-by: Vladimir Davydov --- include/linux/mm_types.h | 5 include/linux/page-flags.h | 73 -- scripts/tags.sh| 3 ++ 3 files changed, 40 insertions(+), 41 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3cc5977a9cab..16bdef7943e3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -85,6 +85,11 @@ struct page { /* * Count of ptes mapped in mms, to show when * page is mapped & limit reverse map searches. +* +* Extra information about page type may be +* stored here for pages that are never mapped, +* in which case the value MUST BE <= -2. +* See page-flags.h for more details. */ atomic_t _mapcount; diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e5a32445f930..9940ade6a25e 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -593,54 +593,45 @@ TESTPAGEFLAG_FALSE(DoubleMap) #endif /* - * PageBuddy() indicate that the page is free and in the buddy system - * (see mm/page_alloc.c). - * - * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to - * -2 so that an underflow of the page_mapcount() won't be mistaken - * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very - * efficiently by most CPU architectures. + * For pages that are never mapped to userspace, page->mapcount may be + * used for storing extra information about page type. Any value used + * for this purpose must be <= -2, but it's better start not too close + * to -2 so that an underflow of the page_mapcount() won't be mistaken + * for a special page. 
*/ -#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) - -static inline int PageBuddy(struct page *page) -{ - return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE; +#define PAGE_MAPCOUNT_OPS(uname, lname) \ +static __always_inline int Page##uname(struct page *page) \ +{ \ + return atomic_read(&page->_mapcount) == \ + PAGE_##lname##_MAPCOUNT_VALUE; \ +} \ +static __always_inline void __SetPage##uname(struct page *page) \ +{ \ + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); \ + atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE);\ +} \ +static __always_inline void __ClearPage##uname(struct page *page) \ +{ \ + VM_BUG_ON_PAGE(!Page##uname(page), page); \ + atomic_set(&page->_mapcount, -1); \ } -static inline void __SetPageBuddy(struct page *page) -{ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); - atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE); -} +/* + * PageBuddy() indicate that the page is free and in the buddy system + * (see mm/page_alloc.c). + */ +#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) +PAGE_MAPCOUNT_OPS(Buddy, BUDDY) -static inline void __ClearPageBuddy(struct page *page) -{ - VM_BUG_ON_PAGE(!PageBuddy(page), page); - atomic_set(&page->_mapcount, -1); -} +/* + * PageBalloon() is set on pages that are on the balloon page list + * (see mm/balloon_compaction.c). 
+ */ +#define PAGE_BALLOON_MAPCOUNT_VALUE(-256) +PAGE_MAPCOUNT_OPS(Balloon, BALLOON) extern bool is_free_buddy_page(struct page *page); -#define PAGE_BALLOON_MAPCOUNT_VALUE (-256) - -static inline int PageBalloon(struct page *page) -{ - return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE; -} - -static inline void __SetPageBalloon(struct page *page) -{ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); - atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE); -} - -static inline void __ClearPageBalloon(struct page *page) -{ - VM_BUG_ON_PAGE(!PageBalloon(page), page); - atomic_set(&page->_mapcount, -1); -} - /* * If network-based swap is enabled, sl*b must keep track of whether pages * were allocated from pfmemalloc reserves. diff --git a/scrip
[PATCH RESEND 0/8] More stuff to charge to kmemcg
[resending with all relevant lists in Cc] Hi, This patch set implements per kmemcg accounting of page tables (x86-only), pipe buffers, and unix socket buffers. Basically, this is v2 of my earlier attempt [1], addressing comments by Andrew, namely: lack of comments to non-standard _mapcount usage, extra overhead even when kmemcg is unused, wrong handling of stolen pipe buffer pages. Patches 1-3 are just cleanups that are not supposed to introduce any functional changes. Patches 4 and 5 move charge/uncharge to generic page allocator paths for the sake of accounting pipe and unix socket buffers. Patches 6-8 make x86 page tables, pipe buffers, and unix socket buffers accountable. [1] http://lkml.kernel.org/r/%3ccover.1443262808.git.vdavy...@parallels.com%3E Thanks, Vladimir Davydov (8): mm: remove pointless struct in struct page definition mm: clean up non-standard page->_mapcount users mm: memcontrol: cleanup kmem charge functions mm: charge/uncharge kmemcg from generic page allocator paths mm: memcontrol: teach uncharge_list to deal with kmem pages arch: x86: charge page tables to kmemcg pipe: account to kmemcg af_unix: charge buffers to kmemcg arch/x86/include/asm/pgalloc.h | 12 - arch/x86/mm/pgtable.c | 11 ++-- fs/pipe.c | 32 --- include/linux/gfp.h| 10 +--- include/linux/memcontrol.h | 103 +++- include/linux/mm_types.h | 73 - include/linux/page-flags.h | 78 +-- kernel/fork.c | 6 +-- mm/memcontrol.c| 117 - mm/page_alloc.c| 63 +- mm/slab.h | 16 -- mm/slab_common.c | 2 +- mm/slub.c | 6 +-- mm/vmalloc.c | 6 +-- net/unix/af_unix.c | 1 + scripts/tags.sh| 3 ++ 16 files changed, 245 insertions(+), 294 deletions(-) -- 2.1.4
[PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. 
Signed-off-by: Vladimir Davydov --- include/linux/gfp.h| 10 +-- include/linux/page-flags.h | 7 + kernel/fork.c | 6 ++--- mm/page_alloc.c| 66 +- mm/slab_common.c | 2 +- mm/slub.c | 6 ++--- mm/vmalloc.c | 6 ++--- 7 files changed, 31 insertions(+), 72 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 570383a41853..c29e9d347bc6 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -78,8 +78,7 @@ struct vm_area_struct; * __GFP_THISNODE forces the allocation to be satisified from the requested * node with no fallbacks or placement policy enforcements. * - * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg (only relevant - * to kmem allocations). + * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg. */ #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) #define __GFP_WRITE((__force gfp_t)___GFP_WRITE) @@ -486,10 +485,6 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ alloc_pages_vma(gfp_mask, 0, vma, addr, node, false) -extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); -extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, - unsigned int order); - extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); extern unsigned long get_zeroed_page(gfp_t gfp_mask); @@ -513,9 +508,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc, unsigned int fragsz, gfp_t gfp_mask); extern void __free_page_frag(void *addr); -extern void __free_kmem_pages(struct page *page, unsigned int order); -extern void free_kmem_pages(unsigned long addr, unsigned int order); - #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 9940ade6a25e..b51e75a47e82 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -630,6 +630,13 @@ PAGE_MAPCOUNT_OPS(Buddy, BUDDY) #define 
PAGE_BALLOON_MAPCOUNT_VALUE(-256) PAGE_MAPCOUNT_OPS(Balloon, BALLOON) +/* + * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on + * pages allocated with __GFP_ACCOUNT. It gets cleared on page free. + */ +#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512) +PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG) + extern bool is_free_buddy_page(struct page *page); /* diff --git a/kernel/fork.c b/kernel/fork.c index 66cc2e0e137e..3f3c30f80786 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -162,8 +162,8 @@ void __weak arch_release_thread_info(struct thread_info *ti) static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { - struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP, - THREAD_SIZE_ORDER); + struct page *page = alloc_pages_node(node, THREADINFO_GFP, +THREAD_SIZE_ORDER); if (page) memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK, @@ -178,7 +178,7 @@ static inline void free_thread_info(struct thread
[PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages
Page table pages are batched-freed in release_pages on most architectures. If we want to charge them to kmemcg (this is what is done later in this series), we need to teach mem_cgroup_uncharge_list to handle kmem pages. Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 42 -- 1 file changed, 24 insertions(+), 18 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 482b4a0c97e4..89a421ee4713 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5432,15 +5432,18 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg, static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, unsigned long nr_anon, unsigned long nr_file, - unsigned long nr_huge, struct page *dummy_page) + unsigned long nr_huge, unsigned long nr_kmem, + struct page *dummy_page) { - unsigned long nr_pages = nr_anon + nr_file; + unsigned long nr_pages = nr_anon + nr_file + nr_kmem; unsigned long flags; if (!mem_cgroup_is_root(memcg)) { page_counter_uncharge(&memcg->memory, nr_pages); if (do_memsw_account()) page_counter_uncharge(&memcg->memsw, nr_pages); + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem) + page_counter_uncharge(&memcg->kmem, nr_kmem); memcg_oom_recover(memcg); } @@ -5463,6 +5466,7 @@ static void uncharge_list(struct list_head *page_list) unsigned long nr_anon = 0; unsigned long nr_file = 0; unsigned long nr_huge = 0; + unsigned long nr_kmem = 0; unsigned long pgpgout = 0; struct list_head *next; struct page *page; @@ -5473,8 +5477,6 @@ static void uncharge_list(struct list_head *page_list) */ next = page_list->next; do { - unsigned int nr_pages = 1; - page = list_entry(next, struct page, lru); next = page->lru.next; @@ -5493,31 +5495,35 @@ static void uncharge_list(struct list_head *page_list) if (memcg != page->mem_cgroup) { if (memcg) { uncharge_batch(memcg, pgpgout, nr_anon, nr_file, - nr_huge, page); - pgpgout = nr_anon = nr_file = nr_huge = 0; + nr_huge, nr_kmem, page); + pgpgout = nr_anon = nr_file = + nr_huge = 
nr_kmem = 0; } memcg = page->mem_cgroup; } - if (PageTransHuge(page)) { - nr_pages <<= compound_order(page); - VM_BUG_ON_PAGE(!PageTransHuge(page), page); - nr_huge += nr_pages; - } + if (!PageKmemcg(page)) { + unsigned int nr_pages = 1; - if (PageAnon(page)) - nr_anon += nr_pages; - else - nr_file += nr_pages; + if (PageTransHuge(page)) { + nr_pages <<= compound_order(page); + VM_BUG_ON_PAGE(!PageTransHuge(page), page); + nr_huge += nr_pages; + } + if (PageAnon(page)) + nr_anon += nr_pages; + else + nr_file += nr_pages; + pgpgout++; + } else + nr_kmem += 1 << compound_order(page); page->mem_cgroup = NULL; - - pgpgout++; } while (next != page_list); if (memcg) uncharge_batch(memcg, pgpgout, nr_anon, nr_file, - nr_huge, page); + nr_huge, nr_kmem, page); } /** -- 2.1.4
[PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions
- Handle memcg_kmem_enabled check out to the caller. This reduces the number of function definitions making the code easier to follow. At the same time it doesn't result in code bloat, because all of these functions are used only in one or two places. - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't have to dive deep into memcg implementation to see which allocations are charged and which are not. - Refresh comments. Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 103 +++-- mm/memcontrol.c| 75 - mm/page_alloc.c| 9 ++-- mm/slab.h | 16 +-- 4 files changed, 80 insertions(+), 123 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index a805474df4ab..2d03975c7dc0 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -754,6 +754,13 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) } #endif +struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); +void memcg_kmem_put_cache(struct kmem_cache *cachep); +int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, + struct mem_cgroup *memcg); +int memcg_kmem_charge(struct page *page, gfp_t gfp, int order); +void memcg_kmem_uncharge(struct page *page, int order); + #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) extern struct static_key_false memcg_kmem_enabled_key; @@ -775,22 +782,6 @@ static inline bool memcg_kmem_enabled(void) } /* - * In general, we'll do everything in our power to not incur in any overhead - * for non-memcg users for the kmem functions. Not even a function call, if we - * can avoid it. - * - * Therefore, we'll inline all those functions so that in the best case, we'll - * see that kmemcg is off for everybody and proceed quickly. If it is on, - * we'll still do most of the flag checking inline. We check a lot of - * conditions, but because they are pretty simple, they are expected to be - * fast. 
- */ -int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, - struct mem_cgroup *memcg); -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order); -void __memcg_kmem_uncharge(struct page *page, int order); - -/* * helper for accessing a memcg's index. It will be used as an index in the * child cache array in kmem_cache, and also to derive its name. This function * will return -1 when this is not a kmem-limited memcg. @@ -800,67 +791,6 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg) return memcg ? memcg->kmemcg_id : -1; } -struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp); -void __memcg_kmem_put_cache(struct kmem_cache *cachep); - -static inline bool __memcg_kmem_bypass(void) -{ - if (!memcg_kmem_enabled()) - return true; - if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) - return true; - return false; -} - -/** - * memcg_kmem_charge: charge a kmem page - * @page: page to charge - * @gfp: reclaim mode - * @order: allocation order - * - * Returns 0 on success, an error code on failure. - */ -static __always_inline int memcg_kmem_charge(struct page *page, -gfp_t gfp, int order) -{ - if (__memcg_kmem_bypass()) - return 0; - if (!(gfp & __GFP_ACCOUNT)) - return 0; - return __memcg_kmem_charge(page, gfp, order); -} - -/** - * memcg_kmem_uncharge: uncharge a kmem page - * @page: page to uncharge - * @order: allocation order - */ -static __always_inline void memcg_kmem_uncharge(struct page *page, int order) -{ - if (memcg_kmem_enabled()) - __memcg_kmem_uncharge(page, order); -} - -/** - * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation - * @cachep: the original global kmem cache - * - * All memory allocated from a per-memcg cache is charged to the owner memcg. 
- */ -static __always_inline struct kmem_cache * -memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) -{ - if (__memcg_kmem_bypass()) - return cachep; - return __memcg_kmem_get_cache(cachep, gfp); -} - -static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) -{ - if (memcg_kmem_enabled()) - __memcg_kmem_put_cache(cachep); -} - /** * memcg_kmem_update_page_stat - update kmem page state statistics * @page: the page @@ -883,15 +813,6 @@ static inline bool memcg_kmem_enabled(void) return false; } -static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) -{ - return 0; -} - -static inline void memcg_kmem_uncharge(struct page *page, int order) -{ -} - static inline int memcg_cache_id(struct mem_cgroup *memcg) { return -1; @@ -905,16 +826,6 @@ static i
[PATCH RESEND 7/8] pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they should be accounted to kmemcg. This patch marks pipe_inode_info and anonymous pipe buffer page allocations as __GFP_ACCOUNT so that they would be charged to kmemcg. Note, since a pipe buffer page can be "stolen" and get reused for other purposes, including mapping to userspace, we clear PageKmemcg thus resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which is introduced by this patch. Signed-off-by: Vladimir Davydov Cc: Alexander Viro --- fs/pipe.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 0d3f5165cb0b..4b32928f5426 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info *pipe, put_page(page); } +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + struct page *page = buf->page; + + if (page_count(page) == 1) { + if (memcg_kmem_enabled()) { + memcg_kmem_uncharge(page, 0); + __ClearPageKmemcg(page); + } + __SetPageLocked(page); + return 0; + } + return 1; +} + /** * generic_pipe_buf_steal - attempt to take ownership of a &pipe_buffer * @pipe: the pipe that the buffer belongs to @@ -219,7 +236,7 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { .can_merge = 1, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, - .steal = generic_pipe_buf_steal, + .steal = anon_pipe_buf_steal, .get = generic_pipe_buf_get, }; @@ -227,7 +244,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = { .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, - .steal = generic_pipe_buf_steal, + .steal = anon_pipe_buf_steal, .get = generic_pipe_buf_get, }; @@ -405,7 +422,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) int copied; if (!page) { - page = alloc_page(GFP_HIGHUSER); + page = 
alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); if (unlikely(!page)) { ret = ret ? : -ENOMEM; break; @@ -611,7 +628,7 @@ struct pipe_inode_info *alloc_pipe_info(void) { struct pipe_inode_info *pipe; - pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL); + pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT); if (pipe) { unsigned long pipe_bufs = PIPE_DEF_BUFFERS; struct user_struct *user = get_current_user(); @@ -619,7 +636,9 @@ struct pipe_inode_info *alloc_pipe_info(void) if (!too_many_pipe_buffers_hard(user)) { if (too_many_pipe_buffers_soft(user)) pipe_bufs = 1; - pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * pipe_bufs, GFP_KERNEL); + pipe->bufs = kcalloc(pipe_bufs, +sizeof(struct pipe_buffer), +GFP_KERNEL_ACCOUNT); } if (pipe->bufs) { @@ -1010,7 +1029,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long nr_pages) if (nr_pages < pipe->nrbufs) return -EBUSY; - bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN); + bufs = kcalloc(nr_pages, sizeof(*bufs), + GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (unlikely(!bufs)) return -ENOMEM; -- 2.1.4
Re: bpf: use-after-free in array_map_alloc
[+CC Marco who reported the CVE, forgot that earlier] On 05/23/2016 11:35 PM, Tejun Heo wrote: Hello, Can you please test whether this patch resolves the issue? While adding support for atomic allocations, I reduced alloc_mutex covered region too much. Thanks. Ugh, this makes the code even more head-spinning than it was. diff --git a/mm/percpu.c b/mm/percpu.c index 0c59684..bd2df70 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -162,7 +162,7 @@ static struct pcpu_chunk *pcpu_reserved_chunk; static int pcpu_reserved_chunk_limit; static DEFINE_SPINLOCK(pcpu_lock);/* all internal data structures */ -static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */ +static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map extension */ static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */ @@ -435,6 +435,8 @@ static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc) size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); unsigned long flags; + lockdep_assert_held(&pcpu_alloc_mutex); I don't see where the mutex gets locked when called via pcpu_map_extend_workfn? (except via the new cancel_work_sync() call below?) Also what protects chunks with scheduled work items from being removed? + new = pcpu_mem_zalloc(new_size); if (!new) return -ENOMEM; @@ -895,6 +897,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, return NULL; } + if (!is_atomic) + mutex_lock(&pcpu_alloc_mutex); BTW I noticed that bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; this is too pessimistic IMHO. Reclaim is possible even without __GFP_FS and __GFP_IO. Could you just use gfpflags_allow_blocking(gfp) here? 
> +
>  	spin_lock_irqsave(&pcpu_lock, flags);
>  
>  	/* serve reserved allocations from the reserved chunk if available */
> @@ -967,12 +972,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		if (is_atomic)
>  			goto fail;
>  
> -		mutex_lock(&pcpu_alloc_mutex);
> +		lockdep_assert_held(&pcpu_alloc_mutex);
>  
>  		if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
>  			chunk = pcpu_create_chunk();
>  			if (!chunk) {
> -				mutex_unlock(&pcpu_alloc_mutex);
>  				err = "failed to allocate new chunk";
>  				goto fail;
>  			}
> @@ -983,7 +987,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		spin_lock_irqsave(&pcpu_lock, flags);
>  	}
> -	mutex_unlock(&pcpu_alloc_mutex);
>  	goto restart;
>  
>  area_found:
> @@ -993,8 +996,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	if (!is_atomic) {
>  		int page_start, page_end, rs, re;
>  
> -		mutex_lock(&pcpu_alloc_mutex);
> -
>  		page_start = PFN_DOWN(off);
>  		page_end = PFN_UP(off + size);
> @@ -1005,7 +1006,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		spin_lock_irqsave(&pcpu_lock, flags);
>  		if (ret) {
> -			mutex_unlock(&pcpu_alloc_mutex);
>  			pcpu_free_area(chunk, off, &occ_pages);
>  			err = "failed to populate";
>  			goto fail_unlock;
> @@ -1045,6 +1045,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		/* see the flag handling in pcpu_blance_workfn() */
>  		pcpu_atomic_alloc_failed = true;
>  		pcpu_schedule_balance_work();
> +	} else {
> +		mutex_unlock(&pcpu_alloc_mutex);
>  	}
>  	return NULL;
>  }
> @@ -1137,6 +1139,8 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	list_for_each_entry_safe(chunk, next, &to_free, list) {
>  		int rs, re;
>  
> +		cancel_work_sync(&chunk->map_extend_work);

This deserves some comment?

> +
>  		pcpu_for_each_pop_region(chunk, rs, re, 0, pcpu_unit_pages) {
>  			pcpu_depopulate_chunk(chunk, rs, re);
>  			spin_lock_irq(&pcpu_lock);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majord...@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
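The pessimism Vlastimil points out can be seen in a small userspace sketch. The flag bit values below are arbitrary illustrations, not the kernel's real `gfp.h` constants; the two predicates mirror the shape of `(gfp & GFP_KERNEL) != GFP_KERNEL` versus `gfpflags_allow_blocking()` (which in the kernel tests `__GFP_DIRECT_RECLAIM`):

```c
#include <stdbool.h>

/* Illustrative GFP-style flag bits -- values are made up for this
 * sketch and do not match the kernel's gfp.h definitions. */
#define F_DIRECT_RECLAIM 0x01u
#define F_KSWAPD_RECLAIM 0x02u
#define F_IO             0x04u
#define F_FS             0x08u
#define F_HIGH           0x10u

#define GFP_KERNEL (F_DIRECT_RECLAIM | F_KSWAPD_RECLAIM | F_IO | F_FS)
#define GFP_NOFS   (F_DIRECT_RECLAIM | F_KSWAPD_RECLAIM | F_IO)
#define GFP_ATOMIC (F_HIGH | F_KSWAPD_RECLAIM)

/* The patch's test: treat anything short of the full GFP_KERNEL mask
 * as atomic -- this wrongly classifies GFP_NOFS as non-blocking-only. */
bool is_atomic_strict(unsigned gfp)
{
	return (gfp & GFP_KERNEL) != GFP_KERNEL;
}

/* The suggested test: blocking is allowed whenever direct reclaim is,
 * regardless of whether FS/IO reclaim is also permitted. */
bool allows_blocking(unsigned gfp)
{
	return (gfp & F_DIRECT_RECLAIM) != 0;
}
```

With these definitions, `GFP_NOFS` is counted as atomic by the strict test even though it may block and reclaim, which is exactly the over-conservative behavior being questioned.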
Re: [PATCH 0/2 (linux-stable-4.5.y)] net: stmmac: MDIO fixes
On 5/23/2016 5:36 PM, Greg KH wrote:
> On Mon, May 23, 2016 at 03:17:41PM +0200, Giuseppe Cavallaro wrote:
>> These patches port some recent fixes to the linux-4.5.y stable branch.
>> They fix the MDIO settings, trying to cover all the possible scenarios
>> where the stmmac is connected to a real transceiver or a switch.
>>
>> These patches are needed for kernel 4.5, where Lime2, BananaPi and
>> Cubieboard2 show problems when using the stmmac:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=823493
>
> What are the git commit ids for these patches in Linus's tree?

Commits are:

  stmmac: fix MDIO settings
  a7657f128c279ae5796ab2ca7d04a7819f4259f0

and

  Revert "stmmac: Fix 'eth0: No PHY found' regression"
  d7e944c8ddc0983640a9a32868fb217485d12ca2

Regards
Peppe

> thanks,
> greg k-h