[PATCH v2] net: alx: use custom skb allocator
This patch follows Eric Dumazet's commit 7b70176421 for the Atheros atl1c
driver to fix exactly the same bug in the alx driver: the network link is
lost 1-5 minutes after the device is brought up.

My laptop, a Lenovo Y580 with an Atheros AR8161 ethernet device, hit this
problem with kernel 4.4, and it is cured by Jarod Wilson's commit c406700c
for the alx driver, which was merged in 4.5. But there are still some alx
devices that can't function well even with Jarod's patch, while this patch
makes them work fine. More details on
https://bugzilla.kernel.org/show_bug.cgi?id=70761

Debugging shows the issue is very likely related to the RX DMA address,
specifically one ending in 0x...f80: if the RX buffer gets an 0x...f80
address several times, there will be an RX overflow error and the device
will stop working.

For kernel 4.5.0 with Jarod's patch, which works fine with my
AR8161/Lenovo Y580, if I change __netdev_alloc_skb --> __alloc_page_frag()
so that the allocated buffer gets an address ending in 0x...f80, then the
same error happens. If I make it 0x...f40 or 0x...fc0, everything is still
fine. So I tend to believe that the 0x...f80 address causes the silicon to
behave abnormally.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=70761
Cc: Eric Dumazet
Cc: Johannes Berg
Cc: Jarod Wilson
Signed-off-by: Feng Tang
Tested-by: Ole Lukoie
---
 drivers/net/ethernet/atheros/alx/alx.h  |  4 +++
 drivers/net/ethernet/atheros/alx/main.c | 48 -
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/atheros/alx/alx.h b/drivers/net/ethernet/atheros/alx/alx.h
index 8fc93c5..d02c424 100644
--- a/drivers/net/ethernet/atheros/alx/alx.h
+++ b/drivers/net/ethernet/atheros/alx/alx.h
@@ -96,6 +96,10 @@ struct alx_priv {
 	unsigned int rx_ringsz;
 	unsigned int rxbuf_size;
 
+	struct page *rx_page;
+	unsigned int rx_page_offset;
+	unsigned int rx_frag_size;
+
 	struct napi_struct napi;
 	struct alx_tx_queue txq;
 	struct alx_rx_queue rxq;
diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index 9fe8b5e..c98acdc 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -70,6 +70,35 @@ static void alx_free_txbuf(struct alx_priv *alx, int entry)
 	}
 }
 
+static struct sk_buff *alx_alloc_skb(struct alx_priv *alx, gfp_t gfp)
+{
+	struct sk_buff *skb;
+	struct page *page;
+
+	if (alx->rx_frag_size > PAGE_SIZE)
+		return __netdev_alloc_skb(alx->dev, alx->rxbuf_size, gfp);
+
+	page = alx->rx_page;
+	if (!page) {
+		alx->rx_page = page = alloc_page(gfp);
+		if (unlikely(!page))
+			return NULL;
+		alx->rx_page_offset = 0;
+	}
+
+	skb = build_skb(page_address(page) + alx->rx_page_offset,
+			alx->rx_frag_size);
+	if (likely(skb)) {
+		alx->rx_page_offset += alx->rx_frag_size;
+		if (alx->rx_page_offset >= PAGE_SIZE)
+			alx->rx_page = NULL;
+		else
+			get_page(page);
+	}
+	return skb;
+}
+
+
 static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 {
 	struct alx_rx_queue *rxq = &alx->rxq;
@@ -86,7 +115,7 @@ static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 	while (!cur_buf->skb && next != rxq->read_idx) {
 		struct alx_rfd *rfd = &rxq->rfd[cur];
 
-		skb = __netdev_alloc_skb(alx->dev, alx->rxbuf_size, gfp);
+		skb = alx_alloc_skb(alx, gfp);
 		if (!skb)
 			break;
 		dma = dma_map_single(&alx->hw.pdev->dev,
@@ -124,6 +153,7 @@ static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 		alx_write_mem16(&alx->hw, ALX_RFD_PIDX, cur);
 	}
 
+
 	return count;
 }
 
@@ -592,6 +622,11 @@ static void alx_free_rings(struct alx_priv *alx)
 	kfree(alx->txq.bufs);
 	kfree(alx->rxq.bufs);
 
+	if (alx->rx_page) {
+		put_page(alx->rx_page);
+		alx->rx_page = NULL;
+	}
+
 	dma_free_coherent(&alx->hw.pdev->dev,
 			  alx->descmem.size,
 			  alx->descmem.virt,
@@ -646,6 +681,7 @@ static int alx_request_irq(struct alx_priv *alx)
 				  alx->dev->name, alx);
 		if (!err)
 			goto out;
+		/* fall back to legacy interrupt */
 		pci_disable_msi(alx->hw.pdev);
 	}
 
@@ -689,6 +725,7 @@ static int alx_init_sw(struct alx_priv *alx)
 	struct pci_dev *pdev = alx->hw.pdev;
 	struct alx_hw *hw = &alx->hw;
 	int err;
+	unsigned int head_size;
 
 	err = alx_identify_hw(alx);
 	if (err) {
@@ -704,7 +741,12 @@ static int alx_init_sw(struct alx
Re: [PATCH] net: alx: use custom skb allocator
Hi Jarod,

On Fri, May 20, 2016 at 02:26:57PM -0400, Jarod Wilson wrote:
> On Fri, May 20, 2016 at 03:56:23PM +0800, Feng Tang wrote:
> > Hi,
> >
> > Please ignore this patch.
> >
> > I found the problem and made the patch with kernel 4.4 with Ubuntu 12.04
> > on Lenovo Y580.
> >
> > After submitting the patch, I tried to google the datasheet for the
> > Atheros 8161 dev, and found there was already a kernel bugzilla
> > https://bugzilla.kernel.org/show_bug.cgi?id=70761
> > that had a fix from Jarod Wilson which got merged in v4.5 (commit
> > c406700cd)
> >
> > Per my test, the new 4.6.0 kernel works fine with Jarod's patch.
>
> Might not hurt to get this some testing by people for whom my patch didn't
> fix things. It seems to help for a fair number of people, but not
> everyone, so perhaps this would be of use as well.

Thanks for the suggestion, and you are right. Some users whose AR8161
can't be cured by your patch report that the device works with my patch,
though one reported that TX is slower while RX is not (which I can't
reproduce on my Lenovo Y580).
https://bugzilla.kernel.org/show_bug.cgi?id=70761

So I think this patch is better off merged, and I will send out a v2.

I did some more debugging; the issue looks very likely related to the RX
DMA address, specifically one ending in 0x...f80: if the RX buffer gets an
0x...f80 address several times, there will be an RX overflow error and the
card will stop working.

For kernel 4.5.0, which merged Jarod's patch and works fine with my
AR8161/Lenovo Y580, if I change __netdev_alloc_skb --> __alloc_page_frag()
so that the allocated buffer gets an address ending in 0x...f80, then the
same error happens. If I make it 0x...f40 or 0x...fc0, everything is still
fine. So I tend to believe that the 0x...f80 address causes the silicon to
behave abnormally.

Thanks,
Feng

> Personally, my own alx-driven hardware didn't have a problem to begin
> with, so I'm not able to do more than make sure things don't regress
> w/already functioning hardware.
> > On Fri, May 20, 2016 at 01:41:03PM +0800, Feng Tang wrote:
> > > This patch follows Eric Dumazet's commit 7b70176421 to fix exactly
> > > the same bug in the alx driver, that the network link will
> > > be lost in 1-5 minutes after the device is up.
> > >
> > > Following is a git log from Eric's 7b70176421:
> > >
> > > "We had reports ( https://bugzilla.kernel.org/show_bug.cgi?id=54021 )
> > > that using high order pages for skb allocations is problematic for atl1c
> > >
> > > We do not know exactly what the problem is, but we suspect that crossing
> > > 4K pages is not well supported by this hardware.
> > >
> > > Use a custom allocator, using page allocator and 2K fragments for
> > > optimal stack behavior. We might make this allocator generic
> > > in future kernels."
> > >
> > > And my debug shows the same suspect: most of the errors happen
> > > when there is an RX buffer address ending in 0x..f80; hopefully
> > > this will get noticed and fixed on the silicon side.
> > >
> > > My victim is a Lenovo Y580 laptop with an Atheros ALX AR8161 ethernet
> > > device (PCI ID 1969:1091); with this patch the ethernet dev
> > > works just fine.
> > >
> > > Signed-off-by: Feng Tang
Re: [RFC 08/12] ipv6: introduce neighbour discovery ops
Hi,

Alexander Aring wrote:
> This patch introduces a neighbour discovery ops callback structure. The
> idea is to separate the handling for 6LoWPAN into the 6lowpan module.
>
> These callbacks offer 6lowpan different handling, such as 802.15.4 short
> address handling or RFC 6775 (Neighbor Discovery Optimization for IPv6
> over 6LoWPANs).
>
> Cc: David S. Miller
> Cc: Alexey Kuznetsov
> Cc: James Morris
> Cc: Hideaki YOSHIFUJI
> Cc: Patrick McHardy
> Signed-off-by: Alexander Aring
> ---
>  include/linux/netdevice.h |   5 ++
>  include/net/ndisc.h       | 176 +-
>  net/ipv6/addrconf.c       |   9 ++-
>  net/ipv6/ndisc.c          | 119 +--
>  net/ipv6/route.c          |  14 ++--
>  5 files changed, 275 insertions(+), 48 deletions(-)
>
> @@ -205,6 +376,9 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target);
>  int ndisc_mc_map(const struct in6_addr *addr, char *buf, struct net_device *dev,
>  		 int dir);
>
> +void ndisc_neigh_update(const struct net_device *dev, struct neighbour *neigh,
> +			const u8 *lladdr, u8 new, u32 flags, u8 icmp6_type,
> +			struct ndisc_options *ndopts);
>

I prefer ndisc_update().
>  /*
>   *	IGMP
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 393cdbf..4506cac 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -2531,7 +2531,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
>
>  	if (pinfo->autoconf && in6_dev->cnf.autoconf) {
>  		struct in6_addr addr;
> -		bool tokenized = false;
> +		bool tokenized = false, dev_addr_generated = false;
>
>  		if (pinfo->prefix_len == 64) {
>  			memcpy(&addr, &pinfo->prefix, 8);
> @@ -2551,6 +2551,8 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
>  				 ipv6_inherit_eui64(addr.s6_addr + 8, in6_dev)) {
>  				in6_dev_put(in6_dev);
>  				return;
> +			} else {
> +				dev_addr_generated = true;
>  			}
>  			goto ok;
>  		}
> @@ -2564,6 +2566,11 @@ ok:
>  					  addr_type, addr_flags, sllao,
>  					  tokenized, valid_lft,
>  					  prefered_lft);
> +		ndisc_ops_prefix_rcv_add_addr(net, dev, pinfo, in6_dev, &addr,
> +					      addr_type, addr_flags, sllao,
> +					      tokenized, valid_lft,
> +					      prefered_lft,
> +					      dev_addr_generated);
>  	}
>  	inet6_prefix_notify(RTM_NEWPREFIX, in6_dev, pinfo);
>  	in6_dev_put(in6_dev);
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index d794d64..99fd53c 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -191,24 +191,28 @@ static struct nd_opt_hdr *ndisc_next_option(struct nd_opt_hdr *cur,
>  	return cur <= end && cur->nd_opt_type == type ? cur : NULL;
>  }
>
> -static inline int ndisc_is_useropt(struct nd_opt_hdr *opt)
> +static inline int ndisc_is_useropt(const struct net_device *dev,
> +				   struct nd_opt_hdr *opt)
>  {
>  	return opt->nd_opt_type == ND_OPT_RDNSS ||
> -		opt->nd_opt_type == ND_OPT_DNSSL;
> +		opt->nd_opt_type == ND_OPT_DNSSL ||
> +		ndisc_ops_is_useropt(dev, opt->nd_opt_type);
>  }
>
> -static struct nd_opt_hdr *ndisc_next_useropt(struct nd_opt_hdr *cur,
> +static struct nd_opt_hdr *ndisc_next_useropt(const struct net_device *dev,
> +					     struct nd_opt_hdr *cur,
>  					     struct nd_opt_hdr *end)
>  {
>  	if (!cur || !end || cur >= end)
>  		return NULL;
>  	do {
>  		cur = ((void *)cur) + (cur->nd_opt_len << 3);
> -	} while (cur < end && !ndisc_is_useropt(cur));
> -	return cur <= end && ndisc_is_useropt(cur) ? cur : NULL;
> +	} while (cur < end && !ndisc_is_useropt(dev, cur));
> +	return cur <= end && ndisc_is_useropt(dev, cur) ? cur : NULL;
>  }
>
> -struct ndisc_options *ndisc_parse_options(u8 *opt, int opt_len,
> +struct ndisc_options *ndisc_parse_options(const struct net_device *dev,
> +					  u8 *opt, int opt_len,
>  					  struct ndisc_options *ndopts)
>  {
>  	struct nd_opt_hdr *nd_opt = (struct nd_opt_hdr *)opt;
> @@ -223,6 +227,8 @@ struct ndisc_options *ndisc_parse_options(u8 *opt, int opt_len,
>  		l = nd_opt->nd_opt_len << 3;
>  		if (opt_len < l || l == 0)
>  			return NULL;
> +
Re: [RFC 04/12] ndisc: get rid off dev parameter in ndisc_opt_addr_space
Alexander Aring wrote:
> This patch removes the net_device parameter from the ndisc_opt_addr_space
> function. This is useful for calling such functionality from code which
> doesn't depend on a dev parameter. For the existing functionality which
> does depend on the dev parameter, we introduce ndisc_dev_opt_addr_space
> as an easy replacement for the ndisc_opt_addr_space function.
>
> Cc: David S. Miller
> Cc: Alexey Kuznetsov
> Cc: James Morris
> Cc: Hideaki YOSHIFUJI
> Cc: Patrick McHardy
> Signed-off-by: Alexander Aring
> ---
>  include/net/ndisc.h | 13 +
>  net/ipv6/ndisc.c    | 12 ++--
>  2 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/ndisc.h b/include/net/ndisc.h
> index 2d8edaa..dbc8d01 100644
> --- a/include/net/ndisc.h
> +++ b/include/net/ndisc.h
> @@ -127,10 +127,15 @@ static inline int ndisc_addr_option_pad(unsigned short type)
>  	}
>  }
>
> -static inline int ndisc_opt_addr_space(struct net_device *dev)
> +static inline int ndisc_opt_addr_space(unsigned char addr_len, int pad)
>  {
> -	return NDISC_OPT_SPACE(dev->addr_len +
> -			       ndisc_addr_option_pad(dev->type));
> +	return NDISC_OPT_SPACE(addr_len + pad);
> +}
> +
> +static inline int ndisc_dev_opt_addr_space(const struct net_device *dev)
> +{
> +	return ndisc_opt_addr_space(dev->addr_len,
> +				    ndisc_addr_option_pad(dev->type));
> +}
>

I prefer not to change existing functions such as ndisc_opt_addr_space(),
and to name the new functions __ndisc_opt_addr_space() etc.

Plus, my original thought (when I implemented these functions) was to have
a per-net_device ndisc_opt_addr_space(), ndisc_opt_addr_data() etc.
What do you think of that?
>  static inline u8 *ndisc_opt_addr_data(struct nd_opt_hdr *p,
>
> @@ -139,7 +144,7 @@ static inline u8 *ndisc_opt_addr_data(struct nd_opt_hdr *p,
>  	u8 *lladdr = (u8 *)(p + 1);
>  	int lladdrlen = p->nd_opt_len << 3;
>  	int prepad = ndisc_addr_option_pad(dev->type);
> -	if (lladdrlen != ndisc_opt_addr_space(dev))
> +	if (lladdrlen != ndisc_opt_addr_space(dev->addr_len, prepad))
>  		return NULL;
>  	return lladdr + prepad;
>  }
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index c245895..2241f06 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -152,9 +152,9 @@ EXPORT_SYMBOL_GPL(nd_tbl);
>
>  static void ndisc_fill_addr_option(struct sk_buff *skb, int type, void *data)
>  {
> -	int pad = ndisc_addr_option_pad(skb->dev->type);
> +	int pad = ndisc_addr_option_pad(skb->dev->type);
>  	int data_len = skb->dev->addr_len;
> -	int space = ndisc_opt_addr_space(skb->dev);
> +	int space = ndisc_opt_addr_space(data_len, pad);
>  	u8 *opt = skb_put(skb, space);
>
>  	opt[0] = type;
> @@ -509,7 +509,7 @@ void ndisc_send_na(struct net_device *dev, const struct in6_addr *daddr,
>  	if (!dev->addr_len)
>  		inc_opt = 0;
>  	if (inc_opt)
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>
>  	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
>  	if (!skb)
> @@ -574,7 +574,7 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
>  	if (ipv6_addr_any(saddr))
>  		inc_opt = false;
>  	if (inc_opt)
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>
>  	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
>  	if (!skb)
> @@ -626,7 +626,7 @@ void ndisc_send_rs(struct net_device *dev, const struct in6_addr *saddr,
>  	}
>  #endif
>  	if (send_sllao)
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>
>  	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
>  	if (!skb)
> @@ -1563,7 +1563,7 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
>  		memcpy(ha_buf, neigh->ha, dev->addr_len);
>  		read_unlock_bh(&neigh->lock);
>  		ha = ha_buf;
> -		optlen += ndisc_opt_addr_space(dev);
> +		optlen += ndisc_dev_opt_addr_space(dev);
>  	} else
>  		read_unlock_bh(&neigh->lock);
>

-- 
Hideaki Yoshifuji
Technical Division, MIRACLE LINUX CORPORATION
Re: [RFC 00/12] 6lowpan: introduce 6lowpan-nd
Hi.

Alexander Aring wrote:
> Hi,
>
> this patch series introduces the ndisc ops callback structure to add
> different handling for IPv6 neighbour discovery cache functionality. It
> implements at first the two following use-cases:
>
>  - 6CO handling as userspace option (for all 6LoWPAN layers, BTLE/802.15.4) [0]
>  - short address handling for 802.15.4 6LoWPAN only [1]
>
> Since my last patch series, I completely changed the whole ndisc_ops
> callback structure so that it does not replace the whole ndisc
> functionality at the recv/send level of NS/NA/RS/RA, as in my previous
> patch series "6lowpan: introduce basic 6lowpan-nd". It now adds
> different handling at a very low level of the ndisc functionality.

Thank you for your work! It looks much better now.
Some comments will follow.

-- 
Hideaki Yoshifuji
Technical Division, MIRACLE LINUX CORPORATION
[PATCH net v2 1/3] Documentation: networking: dsa: Remove poll_link description
This function has been removed in 4baee937b8d5 ("net: dsa: remove DSA link
polling") in favor of using the PHYLIB polling mechanism.

Reviewed-by: Vivien Didelot
Signed-off-by: Florian Fainelli
---
 Documentation/networking/dsa/dsa.txt | 5 -
 1 file changed, 5 deletions(-)

diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 631b0f7ae16f..8303eb8ced79 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -416,11 +416,6 @@ PHY devices and link management
   to the switch port MDIO registers. If unavailable return a negative
   error code.
 
-- poll_link: Function invoked by DSA to query the link state of the switch
-  builtin Ethernet PHYs, per port. This function is responsible for calling
-  netif_carrier_{on,off} when appropriate, and can be used to poll all ports
-  in a single call. Executes from workqueue context.
-
 - adjust_link: Function invoked by the PHY library when a slave network device
   is attached to a PHY device. This function is responsible for appropriately
   configuring the switch port link parameters: speed, duplex, pause based on
-- 
2.7.4
[PATCH net v2 2/3] Documentation: networking: dsa: Remove priv_size description
We no longer have a priv_size structure member since 5feebd0a8a79 ("net:
dsa: Remove allocation of driver private memory").

Reviewed-by: Vivien Didelot
Signed-off-by: Florian Fainelli
---
 Documentation/networking/dsa/dsa.txt | 2 --
 1 file changed, 2 deletions(-)

diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 8303eb8ced79..411b57fd73aa 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -369,8 +369,6 @@ does not allocate any driver private context space.
 Switch configuration
 
-- priv_size: additional size needed by the switch driver for its private
-  context
-
 - tag_protocol: this is to indicate what kind of tagging protocol is
   supported, should be a valid value from the dsa_tag_protocol enum
-- 
2.7.4
[PATCH net v2 3/3] Documentation: networking: dsa: Describe port_vlan_filtering
Described what the port_vlan_filtering function is supposed to accomplish.

Fixes: fb2dabad69f0 ("net: dsa: support VLAN filtering switchdev attr")
Signed-off-by: Florian Fainelli
---
 Documentation/networking/dsa/dsa.txt | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 411b57fd73aa..9d05ed7f7da5 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -535,6 +535,16 @@ Bridge layer
 Bridge VLAN filtering
 
+- port_vlan_filtering: bridge layer function invoked when the bridge gets
+  configured for turning on or off VLAN filtering. If nothing specific needs
+  to be done at the hardware level, this callback does not need to be
+  implemented. When VLAN filtering is turned on, the hardware must be
+  programmed with rejecting 802.1Q frames which have VLAN IDs outside of the
+  programmed allowed VLAN ID map/rules. If there is no PVID programmed into
+  the switch port, untagged frames must be rejected as well. When turned off
+  the switch must accept any 802.1Q frames irrespective of their VLAN ID, and
+  untagged frames are allowed.
+
 - port_vlan_prepare: bridge layer function invoked when the bridge prepares
   the configuration of a VLAN on the given port. If the operation is not
   supported by the hardware, this function should return -EOPNOTSUPP to
   inform the bridge
-- 
2.7.4
[PATCH net v2 0/3] Documentation: dsa: misc fixes
Hi David,

Here are some miscellaneous documentation fixes for DSA. I targeted "net"
because these are not functional code changes, but they are still
documentation fixes per se.

Changes in v2:
- reword what port_vlan_filtering is about based on feedback from
  Vivien and Ido

Florian Fainelli (3):
  Documentation: networking: dsa: Remove poll_link description
  Documentation: networking: dsa: Remove priv_size description
  Documentation: networking: dsa: Describe port_vlan_filtering

 Documentation/networking/dsa/dsa.txt | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

-- 
2.7.4
Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer
Francois, Shuyu, this is the patch with the discussed changes. Shuyu, it
would be great if you could test this one. If it passes and there are no
further objections I will resend it as a regular patch (including commit
message, etc.) to the mailing list.

diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index a3a9392..ec656b3 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -153,18 +153,29 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 {
 	struct arc_emac_priv *priv = netdev_priv(ndev);
 	struct net_device_stats *stats = &ndev->stats;
+	unsigned int curr = priv->txbd_curr;
 	unsigned int i;
 
+	/* Make sure buffers and txbd_curr are consistent */
+	smp_rmb();
+
 	for (i = 0; i < TX_BD_NUM; i++) {
 		unsigned int *txbd_dirty = &priv->txbd_dirty;
 		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
 		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
-		struct sk_buff *skb = tx_buff->skb;
-		unsigned int info = le32_to_cpu(txbd->info);
+		unsigned int info;
+		struct sk_buff *skb;
 
-		if ((info & FOR_EMAC) || !txbd->data || !skb)
+		if (*txbd_dirty == curr)
 			break;
 
+		info = le32_to_cpu(txbd->info);
+
+		if (info & FOR_EMAC)
+			break;
+
+		skb = tx_buff->skb;
+
 		if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
 			stats->tx_errors++;
 			stats->tx_dropped++;
@@ -195,8 +206,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
 	}
 
-	/* Ensure that txbd_dirty is visible to tx() before checking
-	 * for queue stopped.
+	/* Ensure that txbd_dirty is visible to tx() and we see the most recent
+	 * value for txbd_curr.
 	 */
 	smp_mb();
 
@@ -680,27 +691,24 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
 	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
 	priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
-
-	/* Make sure pointer to data buffer is set */
-	wmb();
+	priv->tx_buff[*txbd_curr].skb = skb;
 
 	skb_tx_timestamp(skb);
 
 	*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-	/* Make sure info word is set */
+	/* 1. Make sure that with respect to tx_clean everything is set up
+	 * properly before we advance txbd_curr.
+	 * 2. Make sure writes to DMA descriptors are completed before we inform
+	 * the hardware.
+	 */
 	wmb();
 
-	priv->tx_buff[*txbd_curr].skb = skb;
-
 	/* Increment index to point to the next BD */
 	*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-	/* Ensure that tx_clean() sees the new txbd_curr before
-	 * checking the queue status. This prevents an unneeded wake
-	 * of the queue in tx_clean().
-	 */
-	smp_mb();
+	/* Ensure tx_clean() sees the updated value of txbd_curr */
+	smp_wmb();
 
 	if (!arc_emac_tx_avail(priv)) {
 		netif_stop_queue(ndev);
-- 
1.9.1
Re: [ethtool 0/3][pull request] Intel Wired LAN Driver Updates 2016-05-03
On Wed, 2016-05-04 at 09:44 -0700, Jeff Kirsher wrote:
> This series contains updates to ixgbe in ethtool.
>
> Preethi adds missing device IDs and mac_type definitions, also updates
> the display registers for x550, x550em_x/a, and cleans up the format
> string storage by taking advantage of "for" loops.
>
> The following are changes since commit deb1c6613ec14fd828d321e38c7bea45fe559bd5:
>   Release version 4.5.
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/ethtool master
>
> Preethi Banala (3):
>   ethtool/ixgbe: Add device ID and mac_type definitions
>   ethtool/ixgbe: Correct offsets and support x550, x550em_x, x550em_a
>   ethtool/ixgbe: Reduce format string storage
>
>  ixgbe.c | 173 +++---
>  1 file changed, 95 insertions(+), 78 deletions(-)
>

Is Ben still maintaining ethtool? I ask because I have this series, which
I sent out earlier this month with no word, and I know there is at least
one other ethtool patch series that has had no response or committal from
whoever is maintaining ethtool.

I know we discussed at the last netconf that we should look at possibly a
new tool to address the shortcomings of ethtool, but I was not aware we
had abandoned maintaining the current ethtool already, before any
replacement tool has been developed.
Re: [PATCH net v2] fou: avoid using sk_user_data before it is initialised
From: Cong Wang
Date: Mon, 23 May 2016 14:05:17 -0700

> On Mon, May 23, 2016 at 1:59 PM, David Miller wrote:
>> From: Simon Horman
>> Date: Fri, 20 May 2016 14:57:17 +0900
>>
>>> During initialisation sk->sk_user_data should not be used before
>>> it is initialised.
>>>
>>> Found by bisection after noticing the following:
>> ...
>>> Fixes: d92283e338f6 ("fou: change to use UDP socket GRO")
>>> Signed-off-by: Simon Horman
>>> ---
>>> v2
>>> * Updated implementation to simply access fou->protocol directly
>>>   as suggested by Tom Herbert and Cong Wang
>>
>> I think this was resolved in another way meanwhile. If a fix is still
>> needed, please respin.
>
> This fix is only needed for -stable to backport, because Tom's
> fix is within a series of patches, which makes it hard to backport.

Ok, queued up for -stable.
Re: [net PATCH v2 1/1] net sched actions: policer missing timestamp processing
From: Jamal Hadi Salim
Date: Mon, 23 May 2016 21:07:20 -0400

> From: Jamal Hadi Salim
>
> Policer was not dumping or updating timestamps.
>
> Signed-off-by: Jamal Hadi Salim

Applied.
Re: [RFC PATCH 1/7] fou: Get net from sock_net if dev_net unavailable
On Tue, May 24, 2016 at 3:01 PM, David Miller wrote:
> From: Tom Herbert
> Date: Mon, 23 May 2016 15:48:20 -0700
>
>> diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
>> index 5f9207c..96260c6 100644
>> --- a/net/ipv4/fou.c
>> +++ b/net/ipv4/fou.c
>> @@ -807,13 +807,20 @@ int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
>>  			 u8 *protocol, __be16 *sport, int type)
>>  {
>>  	int err;
>> +	struct net *net;
>>
>
> Please order local variables from longest to shortest line.
>
>>  	err = iptunnel_handle_offloads(skb, type);
>>  	if (err)
>>  		return err;
>>
>> -	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
>> -						skb, 0, 0, false);
>> +	if (skb->dev)
>> +		net = dev_net(skb->dev);
>> +	else if (skb->sk)
>> +		net = sock_net(skb->sk);
>
> This is getting ridiculous. Why not just put the net namespace pointer
> into the tunnel encap object?

Thanks. In this case it will be easier to just pass net in as an argument.

Tom
[PATCH v2] Make builds default to quiet mode
Similar to the Linux kernel and perf, add infrastructure to reduce the
amount of output tossed at a user during a build. Full build output can
be obtained with 'make V=1'.

Builds go from:

  make[1]: Leaving directory `/home/dsa/iproute2.git/lib'
  make[1]: Entering directory `/home/dsa/iproute2.git/ip'
  gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ip.o ip.c
  gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ipaddress.o ipaddress.c

to:

  ...
      AR       libutil.a

  ip
      CC       ip.o
      CC       ipaddress.o
  ...

Signed-off-by: David Ahern
---
v2
- changed 'echo "\n$$i"' to 'echo; echo $$i' per Andreas' comment

 Makefile         |  6 +-
 bridge/Makefile  |  1 +
 configure        | 32 
 devlink/Makefile |  1 +
 genl/Makefile    |  1 +
 ip/Makefile      |  2 ++
 lib/Makefile     |  4 ++--
 misc/Makefile    | 12 +++-
 tc/Makefile      | 15 ---
 tipc/Makefile    |  1 +
 10 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/Makefile b/Makefile
index eb571a5accf8..15c81ecfdca3 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,7 @@
+ifndef VERBOSE
+MAKEFLAGS += --no-print-directory
+endif
+
 PREFIX?=/usr
 LIBDIR?=$(PREFIX)/lib
 SBINDIR?=/sbin
@@ -50,7 +54,7 @@ LDLIBS += $(LIBNETLINK)
 all: Config
 	@set -e; \
 	for i in $(SUBDIRS); \
-	do $(MAKE) $(MFLAGS) -C $$i; done
+	do echo; echo $$i; $(MAKE) $(MFLAGS) -C $$i; done
 
 Config:
 	sh configure $(KERNEL_INCLUDE)
diff --git a/bridge/Makefile b/bridge/Makefile
index 98007530240a..7203f70bc510 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -9,6 +9,7 @@ endif
 all: bridge
 
 bridge: $(BROBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 bridge $(DESTDIR)$(SBINDIR)
diff --git a/configure b/configure
index d2540b0d3219..60eb6b51a571 100755
--- a/configure
+++ b/configure
@@ -317,7 +317,35 @@ EOF
 	rm -f $TMPDIR/dbtest.c $TMPDIR/dbtest
 }
 
+quiet_config()
+{
+	cat > Config
+
 check_toolchain
 echo "TC schedulers"
@@ -357,3 +385,7 @@ echo
 echo -n "docs:"
 check_docs
 echo
+
+echo >> Config
+echo "%.o: %.c" >> Config
+echo '	$(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) -c -o $@ $<' >> Config
diff --git a/devlink/Makefile b/devlink/Makefile
index 3fdaa6904ee1..7256c28708eb 100644
--- a/devlink/Makefile
+++ b/devlink/Makefile
@@ -12,6 +12,7 @@ endif
 all: $(TARGETS) $(LIBS)
 
 devlink: $(DEVLINKOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
diff --git a/genl/Makefile b/genl/Makefile
index 03d1f26a1cee..f5a0bfe42aff 100644
--- a/genl/Makefile
+++ b/genl/Makefile
@@ -20,6 +20,7 @@ endif
 all: genl
 
 genl: $(GENLOBJ) $(LIBNETLINK) $(LIBUTIL) $(GENLLIB)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 genl $(DESTDIR)$(SBINDIR)
diff --git a/ip/Makefile b/ip/Makefile
index f3d298739cac..a7f9c1101798 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -24,8 +24,10 @@ TARGETS=ip rtmon
 all: $(TARGETS) $(SCRIPTS)
 
 ip: $(IPOBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 rtmon: $(RTMONOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
diff --git a/lib/Makefile b/lib/Makefile
index 9d1307dd4df2..52e016db50b6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -15,10 +15,10 @@ NLOBJ=libgenl.o ll_map.o libnetlink.o
 all: libnetlink.a libutil.a
 
 libnetlink.a: $(NLOBJ)
-	$(AR) rcs $@ $(NLOBJ)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 libutil.a: $(UTILOBJ) $(ADDLIB)
-	$(AR) rcs $@ $(UTILOBJ) $(ADDLIB)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 install:
diff --git a/misc/Makefile b/misc/Makefile
index f50e7403a33b..72807678054b 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -21,23 +21,25 @@ endif
 all: $(TARGETS)
 
 ss: $(SSOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 nstat: nstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
 
 ifstat: ifstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
 
 rtacct: rtacct.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtac
Re: [RFC PATCH 2/7] tou: Base infrastructure for Transport over UDP
From: Tom Herbert
Date: Mon, 23 May 2016 15:48:21 -0700

> +int tou_encap_setsockopt(struct sock *sk, char __user *optval, int optlen,
> +			 bool is_ipv6)
> +{
> +	struct tou_encap te;
> +	struct ip_tunnel_encap encap;
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct ip_tunnel_encap *e = inet->tou_encap;

This doesn't compile, because you don't add the tou_encap member to
inet_sock until patch #3.
Re: [RFC PATCH 1/7] fou: Get net from sock_net if dev_net unavailable
From: Tom Herbert
Date: Mon, 23 May 2016 15:48:20 -0700

> diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
> index 5f9207c..96260c6 100644
> --- a/net/ipv4/fou.c
> +++ b/net/ipv4/fou.c
> @@ -807,13 +807,20 @@ int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
>  			u8 *protocol, __be16 *sport, int type)
>  {
>  	int err;
> +	struct net *net;
>

Please order local variables from longest to shortest line.

>  	err = iptunnel_handle_offloads(skb, type);
>  	if (err)
>  		return err;
>
> -	*sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev),
> -						skb, 0, 0, false);
> +	if (skb->dev)
> +		net = dev_net(skb->dev);
> +	else if (skb->sk)
> +		net = sock_net(skb->sk);

This is getting ridiculous. Why not just put the net namespace pointer
into the tunnel encap object?
Re: rtk8168 driver help needed
On 05/19/2016 08:31 PM, Francois Romieu wrote: > Murali Karicheri : > [...] >> Do you know what could be wrong with rtk8168? > > Hardly. > > What do the device registers (ethtool -d) and device stats (ethtool -S) > look like ? Will dump at the next opportunity and respond > > (please trim useless material from previous mail) > Ok. Will keep that in mind next time. -- Murali Karicheri Linux Kernel, Keystone
Re: [PATCH net] net_sched: avoid too many hrtimer_start() calls
From: Eric Dumazet Date: Mon, 23 May 2016 14:24:56 -0700 > From: Eric Dumazet > > I found a serious performance bug in packet schedulers using hrtimers. > > sch_htb and sch_fq are definitely impacted by this problem. > > We constantly rearm high resolution timers if some packets are throttled > in one (or more) class, and other packets are flying through qdisc on > another (non throttled) class. > > hrtimer_start() does not have the mod_timer() trick of doing nothing if > expires value does not change : > > if (timer_pending(timer) && > timer->expires == expires) > return 1; > > This issue is particularly visible when multiple cpus can queue/dequeue > packets on the same qdisc, as hrtimer code has to lock a remote base. > > I used following fix : > > 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding > it. > > 2) Cache watchdog prior expiration. hrtimer might provide this, but I > prefer to not rely on some hrtimer internal. > > Signed-off-by: Eric Dumazet This looks fine, applied, thanks Eric.
Re: [PATCH] net/qlge: Avoids recursive EEH error
From: Gavin Shan Date: Mon, 23 May 2016 11:58:28 +1000 > One timer, whose handler keeps reading on MMIO register for EEH > core to detect error in time, is started when the PCI device driver > is loaded. MMIO register can't be accessed during PE reset in EEH > recovery. Otherwise, the unexpected recursive error is triggered. > The timer isn't closed that time if the interface isn't brought > up. So the unexpected recursive error is seen during EEH recovery > when the interface is down. > > This avoids the unexpected recursive EEH error by closing the timer > in qlge_io_error_detected() before EEH PE reset unconditionally. The > timer is started unconditionally after EEH PE reset in qlge_io_resume(). > Also, the timer should be closed unconditionally when the device is > removed from the system permanently in qlge_io_error_detected(). > > Reported-by: Shriya R. Kulkarni > Signed-off-by: Gavin Shan Applied, thanks.
Re: [PATCH v2 2/2] ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path.
From: Haishuang Yan Date: Sat, 21 May 2016 18:17:35 +0800 > In gre6 xmit path, we are sending a GRE packet, so set fl6 proto > to IPPROTO_GRE properly. > > Signed-off-by: Haishuang Yan Applied.
Re: [PATCH v2 1/2] ip6_gre: Fix MTU setting for ip6gretap
From: Haishuang Yan Date: Sat, 21 May 2016 18:17:34 +0800 > When creating an ip6gretap interface with an unreachable route, > the MTU is about 14 bytes larger than what was needed. ... > Because rt is not NULL here, dev->mtu will subtract the ethernet > header length later. But when rt is NULL, it simply returns, so > dev->mtu doesn't update correctly in this situation. > > This patch first verifies that dev->type is ARPHRD_ETHER for the ip6gretap > interface, and then decreases the mtu as early as possible. > > Signed-off-by: Haishuang Yan Applied.
[PATCH V2] brcmfmac: print error if p2p_ifadd firmware command fails
This is helpful for debugging, without this all I was getting from "iw" command on device with BCM43602 was: > command failed: Too many open files in system (-23) Signed-off-by: Rafał Miłecki --- V2: s/in/if/ in commit message --- drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c index 1652a48..f7b7e29 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c @@ -2031,7 +2031,7 @@ static int brcmf_p2p_request_p2p_if(struct brcmf_p2p_info *p2p, err = brcmf_fil_iovar_data_set(ifp, "p2p_ifadd", &if_request, sizeof(if_request)); if (err) - return err; + brcmf_err("p2p_ifadd failed %d\n", err); return err; } -- 1.8.4.5
[PATCH] brcmfmac: print error in p2p_ifadd firmware command fails
This is helpful for debugging, without this all I was getting from "iw" command on device with BCM43602 was: > command failed: Too many open files in system (-23) Signed-off-by: Rafał Miłecki --- drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c index 1652a48..f7b7e29 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c @@ -2031,7 +2031,7 @@ static int brcmf_p2p_request_p2p_if(struct brcmf_p2p_info *p2p, err = brcmf_fil_iovar_data_set(ifp, "p2p_ifadd", &if_request, sizeof(if_request)); if (err) - return err; + brcmf_err("p2p_ifadd failed %d\n", err); return err; } -- 1.8.4.5
Re: bpf: use-after-free in array_map_alloc
On Tue, May 24, 2016 at 12:04 PM, Tejun Heo wrote: > Hello, > > Alexei, can you please verify this patch? Map extension got rolled > into balance work so that there's no sync issues between the two async > operations. Tests look good. No UAF, and the basic bpf tests that exercise per-cpu maps are fine. > > Thanks. > > Index: work/mm/percpu.c > === > --- work.orig/mm/percpu.c > +++ work/mm/percpu.c > @@ -112,7 +112,7 @@ struct pcpu_chunk { > int map_used; /* # of map entries used > before the sentry */ > int map_alloc; /* # of map entries allocated > */ > int *map; /* allocation map */ > - struct work_struct map_extend_work;/* async ->map[] extension */ > + struct list_headmap_extend_list;/* on pcpu_map_extend_chunks > */ > > void*data; /* chunk data */ > int first_free; /* no free below this */ > @@ -162,10 +162,13 @@ static struct pcpu_chunk *pcpu_reserved_ > static int pcpu_reserved_chunk_limit; > > static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ > -static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */ > +static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map > ext */ > > static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */ > > +/* chunks which need their map areas extended, protected by pcpu_lock */ > +static LIST_HEAD(pcpu_map_extend_chunks); > + > /* > * The number of empty populated pages, protected by pcpu_lock. The > * reserved chunk doesn't contribute to the count. 
> @@ -395,13 +398,19 @@ static int pcpu_need_to_extend(struct pc > { > int margin, new_alloc; > > + lockdep_assert_held(&pcpu_lock); > + > if (is_atomic) { > margin = 3; > > if (chunk->map_alloc < > - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW && > - pcpu_async_enabled) > - schedule_work(&chunk->map_extend_work); > + chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { > + if (list_empty(&chunk->map_extend_list)) { > + list_add_tail(&chunk->map_extend_list, > + &pcpu_map_extend_chunks); > + pcpu_schedule_balance_work(); > + } > + } > } else { > margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; > } > @@ -435,6 +444,8 @@ static int pcpu_extend_area_map(struct p > size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); > unsigned long flags; > > + lockdep_assert_held(&pcpu_alloc_mutex); > + > new = pcpu_mem_zalloc(new_size); > if (!new) > return -ENOMEM; > @@ -467,20 +478,6 @@ out_unlock: > return 0; > } > > -static void pcpu_map_extend_workfn(struct work_struct *work) > -{ > - struct pcpu_chunk *chunk = container_of(work, struct pcpu_chunk, > - map_extend_work); > - int new_alloc; > - > - spin_lock_irq(&pcpu_lock); > - new_alloc = pcpu_need_to_extend(chunk, false); > - spin_unlock_irq(&pcpu_lock); > - > - if (new_alloc) > - pcpu_extend_area_map(chunk, new_alloc); > -} > - > /** > * pcpu_fit_in_area - try to fit the requested allocation in a candidate area > * @chunk: chunk the candidate area belongs to > @@ -740,7 +737,7 @@ static struct pcpu_chunk *pcpu_alloc_chu > chunk->map_used = 1; > > INIT_LIST_HEAD(&chunk->list); > - INIT_WORK(&chunk->map_extend_work, pcpu_map_extend_workfn); > + INIT_LIST_HEAD(&chunk->map_extend_list); > chunk->free_size = pcpu_unit_size; > chunk->contig_hint = pcpu_unit_size; > > @@ -895,6 +892,9 @@ static void __percpu *pcpu_alloc(size_t > return NULL; > } > > + if (!is_atomic) > + mutex_lock(&pcpu_alloc_mutex); > + > spin_lock_irqsave(&pcpu_lock, flags); > > /* serve reserved allocations from the reserved chunk if available */ > @@ -967,12 +967,9 @@ 
restart: > if (is_atomic) > goto fail; > > - mutex_lock(&pcpu_alloc_mutex); > - > if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) { > chunk = pcpu_create_chunk(); > if (!chunk) { > - mutex_unlock(&pcpu_alloc_mutex); > err = "failed to allocate new chunk"; > goto fail; > } > @@ -983,7 +980,6 @@ restart: > spin_lock_irqsave(&pcpu_lock, flags); > } > > - mutex_unlock(&pcpu_alloc_mutex); > goto restart; > > area_found: > @@ -993,8 +989,6 @@ area_found: >
Re: ixgbe: ksoftirqd consumes 100% CPU w/ ~50 TCP conns
On Tue, 24 May 2016 12:46:56 -0700 Alexander Duyck wrote: > I'm guessing the issue is lock contention on the IOMMU resource table. > I resolved most of that for the Rx side back when we implemented the > Rx page reuse but the Tx side still has to perform a DMA mapping for > each individual buffer. Depending on the needs of the user if they > still need the IOMMU enabled for use with something like KVM one thing > they may try doing is use the kernel parameter "iommu=pt" to allow > host devices to access memory without the penalty for having to > allocate/free resources and still provide guests with IOMMU isolation. Listen to Alex, he knows what he is talking about. My longer-term plan for getting rid of the dma_map/unmap overhead is to _keep_ the pages DMA mapped and recycle them back via page-pool. Details in my slides, see slide 5: http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf Alex's RX recycle trick for the Intel drivers is described on slide 14. It seems like, in your use-case, the pages might be held "too" long for the RX recycling trick to work. If you want to understand the IOMMU problem in detail, I recommend reading the article "True IOMMU Protection from DMA Attacks" http://www.cs.technion.ac.il/~mad/publications/asplos2016-iommu.pdf (My solution is different, but they describe the problem very well) --Jesper > On Tue, May 24, 2016 at 9:40 AM, Brandon Philips wrote: > > Hello Everyone- > > > > So we tracked it down to IOMMU causing CPU affinity getting broken[1]. > > Can we provide any further details or is this a known issue? 
> > > > Thank You, > > > > Brandon > > > > [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219866601 > > > > On Tue, May 17, 2016 at 12:44 PM, Brandon Philips wrote: > >> Hello ixgbe team- > >> > >> With Linux v4.6 and the ixgbe driver (details below) a user is reporting > >> ksoftirqd consuming 100% of the CPU on all cores after a moderate ~20-50 > >> number of TCP connections. They are unable to reproduce this issue with > >> Cisco hardware. > >> > >> With Kernel v3.19 they cannot reproduce[1] the issue. Disabling IOMMU > >> (intel_iommu=off) does "fix" the issue[2]. > >> > >> Thank You, > >> > >> Brandon > >> > >> [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219157803 > >> [2] https://github.com/coreos/bugs/issues/1275#issuecomment-219819986 > >> > >> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) > >> ethtool -i eno1 > >> driver: ixgbe > >> version: 4.0.1-k > >> firmware-version: 0x84e0 > >> bus-info: :06:00.0 > >> supports-statistics: yes > >> supports-test: yes > >> supports-eeprom-access: yes > >> supports-register-dump: yes > >> supports-priv-flags: no > >> > >> CPU > >> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote: > > On Tue, 24 May 2016 12:28:09 +0200 > Jesper Dangaard Brouer wrote: > > > I do like perf, but it does not answer my questions about the > > performance of this queue. I will code something up in my own > > framework[2] to answer my own performance questions. > > > > Like what is be minimum overhead (in cycles) achievable with this type > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) > > for fastpath usage. > > Coded it up here: > https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184 > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c > > This is a really fake benchmark, but it sort of shows the minimum > overhead achievable with this type of queue, where it is the same > CPU enqueuing and dequeuing, and cache is guaranteed to be hot. > > Measured on a i7-4790K CPU @ 4.00GHz, the average cost of > enqueue+dequeue of a single object is around 102 cycles(tsc). > > To compare this with below, where enq and deq is measured separately: > 102 / 2 = 51 cycles > > > > Then I also want to know how this performs when two CPUs are involved. > > As this is also a primary use-case, for you when sending packets into a > > guest. > > Coded it up here: > https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c > > This parallel benchmark try to keep two (or more) CPUs busy enqueuing or > dequeuing on the same skb_array queue. It prefills the queue, > and stops the test as soon as queue is empty or full, or > completes a number of "loops"/cycles. 
> > For two CPUs the results are really good: > enqueue: 54 cycles(tsc) > dequeue: 53 cycles(tsc) > > Going to 4 CPUs, things break down (but it was not primary use-case?): > CPU(0) 927 cycles(tsc) enqueue > CPU(1) 921 cycles(tsc) dequeue > CPU(2) 927 cycles(tsc) enqueue > CPU(3) 898 cycles(tsc) dequeue It's mostly the spinlock contention I guess. Maybe we don't need fair spinlocks in this case. Try replacing spinlocks with simple cmpxchg and see what happens?
Re: [PATCH RESEND 7/8] pipe: account to kmemcg
On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote: > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote: > ... > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, > > > +struct pipe_buffer *buf) > > > +{ > > > + struct page *page = buf->page; > > > + > > > + if (page_count(page) == 1) { > > > > This looks racy : some cpu could have temporarily elevated page count. > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page > with refcount of 1 in ->steal, that means that we are the only its user > and it can't be spliced to another pipe. > > In fact, I just copied the code from generic_pipe_buf_steal, adding > kmemcg related checks along the way, so it should be fine. So you guarantee that no other cpu might have done get_page_unless_zero() right before this test ?
Re: ixgbe: ksoftirqd consumes 100% CPU w/ ~50 TCP conns
I'm guessing the issue is lock contention on the IOMMU resource table. I resolved most of that for the Rx side back when we implemented the Rx page reuse but the Tx side still has to perform a DMA mapping for each individual buffer. Depending on the needs of the user if they still need the IOMMU enabled for use with something like KVM one thing they may try doing is use the kernel parameter "iommu=pt" to allow host devices to access memory without the penalty for having to allocate/free resources and still provide guests with IOMMU isolation. - Alex On Tue, May 24, 2016 at 9:40 AM, Brandon Philips wrote: > Hello Everyone- > > So we tracked it down to IOMMU causing CPU affinity getting broken[1]. > Can we provide any further details or is this a known issue? > > Thank You, > > Brandon > > [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219866601 > > On Tue, May 17, 2016 at 12:44 PM, Brandon Philips wrote: >> Hello ixgbe team- >> >> With Linux v4.6 and the ixgbe driver (details below) a user is reporting >> ksoftirqd consuming 100% of the CPU on all cores after a moderate ~20-50 >> number of TCP connections. They are unable to reproduce this issue with >> Cisco hardware. >> >> With Kernel v3.19 they cannot reproduce[1] the issue. Disabling IOMMU >> (intel_iommu=off) does "fix" the issue[2]. >> >> Thank You, >> >> Brandon >> >> [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219157803 >> [2] https://github.com/coreos/bugs/issues/1275#issuecomment-219819986 >> >> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) >> ethtool -i eno1 >> driver: ixgbe >> version: 4.0.1-k >> firmware-version: 0x84e0 >> bus-info: :06:00.0 >> supports-statistics: yes >> supports-test: yes >> supports-eeprom-access: yes >> supports-register-dump: yes >> supports-priv-flags: no >> >> CPU >> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer
Lino Sanfilippo : [...] > I don't agree here. A dma_wmb would have an effect on "data" and "info", yes, > but it would have absolutely no effect on skb_tx_timestamp(), since there > is no dma access involved here. In fact skb_tx_timestamp() could probably > even be reordered to appear after the dma_wmb. > > Anyway, there is the wmb() directly after the assignment to "info". _This_ > barrier should ensure that skb_tx_timestamp() (along with a flush of data > and info to DMA) is executed before "txbd_curr" is advanced. > This means that the corresponding skb can't be freed prematurely by tx_clean(). The concern here is about sending adequate PTP payload on the network. skb_tx_timestamp() is used for network clock synchronization. Some extra information must be transmitted. Be it through direct payload change or through indirect control, it _is_ related to dma. Several (most ?) skb_tx_timestamp() misuses blur the picture: CPU vs device sync is of course way below the radar when the driver crashes because of a plain use-after-free skb_tx_timestamp() :o/ -- Ueimor
Re: bpf: use-after-free in array_map_alloc
Hello, Alexei, can you please verify this patch? Map extension got rolled into balance work so that there's no sync issues between the two async operations. Thanks. Index: work/mm/percpu.c === --- work.orig/mm/percpu.c +++ work/mm/percpu.c @@ -112,7 +112,7 @@ struct pcpu_chunk { int map_used; /* # of map entries used before the sentry */ int map_alloc; /* # of map entries allocated */ int *map; /* allocation map */ - struct work_struct map_extend_work;/* async ->map[] extension */ + struct list_headmap_extend_list;/* on pcpu_map_extend_chunks */ void*data; /* chunk data */ int first_free; /* no free below this */ @@ -162,10 +162,13 @@ static struct pcpu_chunk *pcpu_reserved_ static int pcpu_reserved_chunk_limit; static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ -static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */ +static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */ +/* chunks which need their map areas extended, protected by pcpu_lock */ +static LIST_HEAD(pcpu_map_extend_chunks); + /* * The number of empty populated pages, protected by pcpu_lock. The * reserved chunk doesn't contribute to the count. 
@@ -395,13 +398,19 @@ static int pcpu_need_to_extend(struct pc { int margin, new_alloc; + lockdep_assert_held(&pcpu_lock); + if (is_atomic) { margin = 3; if (chunk->map_alloc < - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW && - pcpu_async_enabled) - schedule_work(&chunk->map_extend_work); + chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { + if (list_empty(&chunk->map_extend_list)) { + list_add_tail(&chunk->map_extend_list, + &pcpu_map_extend_chunks); + pcpu_schedule_balance_work(); + } + } } else { margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; } @@ -435,6 +444,8 @@ static int pcpu_extend_area_map(struct p size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); unsigned long flags; + lockdep_assert_held(&pcpu_alloc_mutex); + new = pcpu_mem_zalloc(new_size); if (!new) return -ENOMEM; @@ -467,20 +478,6 @@ out_unlock: return 0; } -static void pcpu_map_extend_workfn(struct work_struct *work) -{ - struct pcpu_chunk *chunk = container_of(work, struct pcpu_chunk, - map_extend_work); - int new_alloc; - - spin_lock_irq(&pcpu_lock); - new_alloc = pcpu_need_to_extend(chunk, false); - spin_unlock_irq(&pcpu_lock); - - if (new_alloc) - pcpu_extend_area_map(chunk, new_alloc); -} - /** * pcpu_fit_in_area - try to fit the requested allocation in a candidate area * @chunk: chunk the candidate area belongs to @@ -740,7 +737,7 @@ static struct pcpu_chunk *pcpu_alloc_chu chunk->map_used = 1; INIT_LIST_HEAD(&chunk->list); - INIT_WORK(&chunk->map_extend_work, pcpu_map_extend_workfn); + INIT_LIST_HEAD(&chunk->map_extend_list); chunk->free_size = pcpu_unit_size; chunk->contig_hint = pcpu_unit_size; @@ -895,6 +892,9 @@ static void __percpu *pcpu_alloc(size_t return NULL; } + if (!is_atomic) + mutex_lock(&pcpu_alloc_mutex); + spin_lock_irqsave(&pcpu_lock, flags); /* serve reserved allocations from the reserved chunk if available */ @@ -967,12 +967,9 @@ restart: if (is_atomic) goto fail; - mutex_lock(&pcpu_alloc_mutex); - if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) { chunk = 
pcpu_create_chunk(); if (!chunk) { - mutex_unlock(&pcpu_alloc_mutex); err = "failed to allocate new chunk"; goto fail; } @@ -983,7 +980,6 @@ restart: spin_lock_irqsave(&pcpu_lock, flags); } - mutex_unlock(&pcpu_alloc_mutex); goto restart; area_found: @@ -993,8 +989,6 @@ area_found: if (!is_atomic) { int page_start, page_end, rs, re; - mutex_lock(&pcpu_alloc_mutex); - page_start = PFN_DOWN(off); page_end = PFN_UP(off + size); @@ -1005,7 +999,6 @@ area_found: spin_lock_irqsave(&pcpu_lock, flags); if (ret) { -
[PATCH 4.2.y-ckt 48/53] VSOCK: do not disconnect socket when peer has shutdown SEND only
4.2.8-ckt11 -stable review patch. If anyone has any objections, please let me know. ---8< From: Ian Campbell [ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ] The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell Cc: "David S. Miller" Cc: Stefan Hajnoczi Cc: Claudio Imbrenda Cc: Andy King Cc: Dmitry Torokhov Cc: Jorgen Hansen Cc: Adit Ranadive Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller Signed-off-by: Kamal Mostafa --- net/vmw_vsock/af_vsock.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index df5fc6b..d68716c 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -1795,27 +1795,8 @@ vsock_stream_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, else if (sk->sk_shutdown & RCV_SHUTDOWN) err = 0; - if (copied > 0) { - /* We only do these additional bookkeeping/notification steps -* if we actually copied something out of the queue pair -* instead of just peeking ahead. -*/ - - if (!(flags & MSG_PEEK)) { - /* If the other side has shutdown for sending and there -* is nothing more to read, then modify the socket -* state. -*/ - if (vsk->peer_shutdown & SEND_SHUTDOWN) { - if (vsock_stream_has_data(vsk) <= 0) { - sk->sk_state = SS_UNCONNECTED; - sock_set_flag(sk, SOCK_DONE); - sk->sk_state_change(sk); - } - } - } + if (copied > 0) err = copied; - } out_wait: finish_wait(sk_sleep(sk), &wait); -- 2.7.4
[PATCH net] sfc: on MC reset, clear PIO buffer linkage in TXQs
Otherwise, if we fail to allocate new PIO buffers, our TXQs will try to use the old ones, which aren't there any more. Fixes: 183233bec810 "sfc: Allocate and link PIO buffers; map them with write-combining" Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/ef10.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c index 1681084..1f30912 100644 --- a/drivers/net/ethernet/sfc/ef10.c +++ b/drivers/net/ethernet/sfc/ef10.c @@ -619,6 +619,17 @@ fail: return rc; } +static void efx_ef10_forget_old_piobufs(struct efx_nic *efx) +{ + struct efx_channel *channel; + struct efx_tx_queue *tx_queue; + + /* All our existing PIO buffers went away */ + efx_for_each_channel(channel, efx) + efx_for_each_channel_tx_queue(tx_queue, channel) + tx_queue->piobuf = NULL; +} + #else /* !EFX_USE_PIO */ static int efx_ef10_alloc_piobufs(struct efx_nic *efx, unsigned int n) @@ -635,6 +646,10 @@ static void efx_ef10_free_piobufs(struct efx_nic *efx) { } +static void efx_ef10_forget_old_piobufs(struct efx_nic *efx) +{ +} + #endif /* EFX_USE_PIO */ static void efx_ef10_remove(struct efx_nic *efx) @@ -1018,6 +1033,7 @@ static void efx_ef10_reset_mc_allocations(struct efx_nic *efx) nic_data->must_realloc_vis = true; nic_data->must_restore_filters = true; nic_data->must_restore_piobufs = true; + efx_ef10_forget_old_piobufs(efx); nic_data->rx_rss_context = EFX_EF10_RSS_CONTEXT_INVALID; /* Driver-created vswitches and vports must be re-created */
[4.2.y-ckt stable] Patch "VSOCK: do not disconnect socket when peer has shutdown SEND only" has been added to the 4.2.y-ckt tree
This is a note to let you know that I have just added a patch titled VSOCK: do not disconnect socket when peer has shutdown SEND only to the linux-4.2.y-queue branch of the 4.2.y-ckt extended stable tree which can be found at: https://git.launchpad.net/~canonical-kernel/linux/+git/linux-stable-ckt/log/?h=linux-4.2.y-queue This patch is scheduled to be released in version 4.2.8-ckt11. If you, or anyone else, feels it should not be added to this tree, please reply to this email. For more information about the 4.2.y-ckt tree, see https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable Thanks. -Kamal ---8< >From 89cf2ec58bee58f349e0cf4c1b77dc53490e6949 Mon Sep 17 00:00:00 2001 From: Ian Campbell Date: Wed, 4 May 2016 14:21:53 +0100 Subject: VSOCK: do not disconnect socket when peer has shutdown SEND only [ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ] The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell Cc: "David S. Miller" Cc: Stefan Hajnoczi Cc: Claudio Imbrenda Cc: Andy King Cc: Dmitry Torokhov Cc: Jorgen Hansen Cc: Adit Ranadive Cc: netdev@vger.kernel.org Signed-off-by: David S. 
Miller Signed-off-by: Kamal Mostafa --- net/vmw_vsock/af_vsock.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index df5fc6b..d68716c 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -1795,27 +1795,8 @@ vsock_stream_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, else if (sk->sk_shutdown & RCV_SHUTDOWN) err = 0; - if (copied > 0) { - /* We only do these additional bookkeeping/notification steps -* if we actually copied something out of the queue pair -* instead of just peeking ahead. -*/ - - if (!(flags & MSG_PEEK)) { - /* If the other side has shutdown for sending and there -* is nothing more to read, then modify the socket -* state. -*/ - if (vsk->peer_shutdown & SEND_SHUTDOWN) { - if (vsock_stream_has_data(vsk) <= 0) { - sk->sk_state = SS_UNCONNECTED; - sock_set_flag(sk, SOCK_DONE); - sk->sk_state_change(sk); - } - } - } + if (copied > 0) err = copied; - } out_wait: finish_wait(sk_sleep(sk), &wait); -- 2.7.4
Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches
On Tue, May 24, 2016 at 10:14 AM, Jeff Kirsher wrote: > On Wed, 2016-05-18 at 21:10 -0700, David Miller wrote: >> From: Jeff Kirsher >> Date: Wed, 18 May 2016 14:27:58 -0700 >> >> > On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote: >> >> This patch series is meant to be applied after: >> >> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 >> >> tunneling >> >> >> >> The first patch addresses an issue we already resolved in the GREv4 >> and >> >> is >> >> now present in GREv6 with the introduction of FOU/GUE for IPv6 based >> GRE >> >> tunnels. >> >> >> >> The second patch goes through and enables IPv6 tunnel offloads for the >> >> Intel >> >> NICs that already support the IPv4 based IP-in-IP tunnel offloads. I >> >> have >> >> only done a bit of touch testing but have seen ~20 Gb/s over an i40e >> >> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is >> still >> >> passing traffic at around the same rate. I plan to do further testing >> >> but >> >> with these patches present it should enable a wider audience to be >> able >> >> to >> >> test the new features introduced in Tom's patchset with hardware >> >> offloads. >> >> >> >> --- >> >> >> >> Alexander Duyck (2): >> >> ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled >> >> with FOU/GUE >> >> intel: Add support for IPv6 IP-in-IP offload >> > >> > Dave, I have this series added to my queue. >> >> Why would you if it depends upon Tom's series, as mentioned above, which >> isn't even in my tree yet? > > I was not able to apply his patches to my dev-queue due to conflicts and > from discussion with Alex, I made the mistake in assuming the issues I saw > was because it did not have Tom's driver changes. I thought Dave had already applied the patches to his tree? If so there shouldn't be a need for you to pull them in and apply them. That might also be the source of the conflicts if they are already in the tree. - Alex
[PATCH iproute2] Make builds default to quiet mode
Similar to the Linux kernel and perf, add infrastructure to reduce the
amount of output tossed to a user during a build. Full build output can
be obtained with 'make V=1'.

Builds go from:

make[1]: Leaving directory `/home/dsa/iproute2.git/lib'
make[1]: Entering directory `/home/dsa/iproute2.git/ip'
gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ip.o ip.c
gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ipaddress.o ipaddress.c

to:

...
  AR       libutil.a

ip
  CC       ip.o
  CC       ipaddress.o
...

Signed-off-by: David Ahern
---
 Makefile         |  6 +-
 bridge/Makefile  |  1 +
 configure        | 32
 devlink/Makefile |  1 +
 genl/Makefile    |  1 +
 ip/Makefile      |  2 ++
 lib/Makefile     |  4 ++--
 misc/Makefile    | 12 +++-
 tc/Makefile      | 15 ---
 tipc/Makefile    |  1 +
 10 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/Makefile b/Makefile
index 7f79cdd67f2d..99a1a6f31c0d 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,7 @@
+ifndef VERBOSE
+MAKEFLAGS += --no-print-directory
+endif
+
 PREFIX?=/usr
 LIBDIR?=$(PREFIX)/lib
 SBINDIR?=/sbin
@@ -50,7 +54,7 @@ LDLIBS += $(LIBNETLINK)
 all: Config
 	@set -e; \
 	for i in $(SUBDIRS); \
-	do $(MAKE) $(MFLAGS) -C $$i; done
+	do echo "\n$$i"; $(MAKE) $(MFLAGS) -C $$i; done
 
 Config:
 	sh configure $(KERNEL_INCLUDE)

diff --git a/bridge/Makefile b/bridge/Makefile
index 98007530240a..7203f70bc510 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -9,6 +9,7 @@ endif
 all: bridge
 
 bridge: $(BROBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 bridge $(DESTDIR)$(SBINDIR)

diff --git a/configure b/configure
index d2540b0d3219..60eb6b51a571 100755
--- a/configure
+++ b/configure
@@ -317,7 +317,35 @@ EOF
 	rm -f $TMPDIR/dbtest.c $TMPDIR/dbtest
 }
 
+quiet_config()
+{
+	cat> Config
+
 check_toolchain
 
 echo "TC schedulers"
@@ -357,3 +385,7 @@ echo
 echo -n "docs:"
 check_docs
 echo
+
+echo >> Config
+echo "%.o: %.c" >> Config
+echo '	$(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) -c -o $@ $<' >> Config

diff --git a/devlink/Makefile b/devlink/Makefile
index 3fdaa6904ee1..7256c28708eb 100644
--- a/devlink/Makefile
+++ b/devlink/Makefile
@@ -12,6 +12,7 @@ endif
 all: $(TARGETS) $(LIBS)
 
 devlink: $(DEVLINKOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)

diff --git a/genl/Makefile b/genl/Makefile
index 03d1f26a1cee..f5a0bfe42aff 100644
--- a/genl/Makefile
+++ b/genl/Makefile
@@ -20,6 +20,7 @@ endif
 all: genl
 
 genl: $(GENLOBJ) $(LIBNETLINK) $(LIBUTIL) $(GENLLIB)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 genl $(DESTDIR)$(SBINDIR)

diff --git a/ip/Makefile b/ip/Makefile
index 2194ed9642eb..bb85fe3e6610 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -24,8 +24,10 @@ TARGETS=ip rtmon
 all: $(TARGETS) $(SCRIPTS)
 
 ip: $(IPOBJ) $(LIBNETLINK)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 rtmon: $(RTMONOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 install: all
 	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)

diff --git a/lib/Makefile b/lib/Makefile
index 9d1307dd4df2..52e016db50b6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -15,10 +15,10 @@ NLOBJ=libgenl.o ll_map.o libnetlink.o
 all: libnetlink.a libutil.a
 
 libnetlink.a: $(NLOBJ)
-	$(AR) rcs $@ $(NLOBJ)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 libutil.a: $(UTILOBJ) $(ADDLIB)
-	$(AR) rcs $@ $(UTILOBJ) $(ADDLIB)
+	$(QUIET_AR)$(AR) rcs $@ $^
 
 install:

diff --git a/misc/Makefile b/misc/Makefile
index f50e7403a33b..72807678054b 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -21,23 +21,25 @@ endif
 all: $(TARGETS)
 
 ss: $(SSOBJ)
+	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 nstat: nstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LIBNETLINK) -lm
 
 ifstat: ifstat.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LIBNETLINK) -lm
 
 rtacct: rtacct.c
-	$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LIBNETLINK) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LIBNETLINK) -lm
 
 arpd: arpd.c
-	$(CC) $(CFLAGS) -I$(DBM_INC
Re: [PATCH iproute2] Make builds default to quiet mode
David Ahern writes: > @@ -50,7 +54,7 @@ LDLIBS += $(LIBNETLINK) > all: Config > @set -e; \ > for i in $(SUBDIRS); \ > - do $(MAKE) $(MFLAGS) -C $$i; done > + do echo "\n$$i"; $(MAKE) $(MFLAGS) -C $$i; done echo "\n" is not portable. Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
Re: [net-next PATCH 0/2] Follow-ups for GUEoIPv6 patches
On Wed, 2016-05-18 at 21:10 -0700, David Miller wrote: > From: Jeff Kirsher > Date: Wed, 18 May 2016 14:27:58 -0700 > > > On Wed, 2016-05-18 at 10:44 -0700, Alexander Duyck wrote: > >> This patch series is meant to be applied after: > >> [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 > >> tunneling > >> > >> The first patch addresses an issue we already resolved in the GREv4 > and > >> is > >> now present in GREv6 with the introduction of FOU/GUE for IPv6 based > GRE > >> tunnels. > >> > >> The second patch goes through and enables IPv6 tunnel offloads for the > >> Intel > >> NICs that already support the IPv4 based IP-in-IP tunnel offloads. I > >> have > >> only done a bit of touch testing but have seen ~20 Gb/s over an i40e > >> interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is > still > >> passing traffic at around the same rate. I plan to do further testing > >> but > >> with these patches present it should enable a wider audience to be > able > >> to > >> test the new features introduced in Tom's patchset with hardware > >> offloads. > >> > >> --- > >> > >> Alexander Duyck (2): > >> ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled > >> with FOU/GUE > >> intel: Add support for IPv6 IP-in-IP offload > > > > Dave, I have this series added to my queue. > > Why would you if it depends upon Tom's series, as mentioned above, which > isn't even in my tree yet? I was not able to apply his patches to my dev-queue due to conflicts and from discussion with Alex, I made the mistake in assuming the issues I saw was because it did not have Tom's driver changes. signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH 7/7] tou: Support for GSO
On Tue, May 24, 2016 at 7:59 AM, Alexander Duyck wrote: > On Mon, May 23, 2016 at 3:48 PM, Tom Herbert wrote: >> Add SKB_GSO_TOU. In udp[64]_ufo_fragment check for SKB_GSO_TOU. If this >> is set call skb_udp_tou_segment. skb_udp_tou_segment is very similar >> to skb_udp_tunnel_segment except that we only need to deal with the >> L4 headers. >> >> Signed-off-by: Tom Herbert >> --- >> include/linux/skbuff.h | 2 + >> include/net/udp.h| 2 + >> net/ipv4/fou.c | 2 + >> net/ipv4/ip_output.c | 2 + >> net/ipv4/udp_offload.c | 164 >> +-- >> net/ipv6/inet6_connection_sock.c | 3 + >> net/ipv6/udp_offload.c | 128 +++--- >> 7 files changed, 236 insertions(+), 67 deletions(-) >> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h >> index 65968a9..b57e484 100644 >> --- a/include/linux/skbuff.h >> +++ b/include/linux/skbuff.h >> @@ -482,6 +482,8 @@ enum { >> SKB_GSO_PARTIAL = 1 << 13, >> >> SKB_GSO_TUNNEL_REMCSUM = 1 << 14, >> + >> + SKB_GSO_TOU = 1 << 15, >> }; >> > > So where do you add the netdev feature bit? From what I can tell that > was overlooked and as a result devices that support FCoE CRC will end > up corrupting TOU frames because netif_gso_ok currently ands the two > together. > An obvious omission, thanks for pointing it out. > Also I am pretty sure we can offload this on the Intel NICs using the > GSO partial approach as we can just stuff the UDP header into the > space that we would use for IPv4 options or IPv6 extension headers and > it shouldn't complain. > That would be cool! > - Alex
Re: ixgbe: ksoftirqd consumes 100% CPU w/ ~50 TCP conns
Hello Everyone- So we tracked it down to IOMMU causing CPU affinity getting broken[1]. Can we provide any further details or is this a known issue? Thank You, Brandon [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219866601 On Tue, May 17, 2016 at 12:44 PM, Brandon Philips wrote: > Hello ixgbe team- > > With Linux v4.6 and the ixgbe driver (details below) a user is reporting > ksoftirqd consuming 100% of the CPU on all cores after a moderate ~20-50 > number of TCP connections. They are unable to reproduce this issue with > Cisco hardware. > > With Kernel v3.19 they cannot reproduce[1] the issue. Disabling IOMMU > (intel_iommu=off) does "fix" the issue[2]. > > Thank You, > > Brandon > > [1] https://github.com/coreos/bugs/issues/1275#issuecomment-219157803 > [2] https://github.com/coreos/bugs/issues/1275#issuecomment-219819986 > > Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) > ethtool -i eno1 > driver: ixgbe > version: 4.0.1-k > firmware-version: 0x84e0 > bus-info: :06:00.0 > supports-statistics: yes > supports-test: yes > supports-eeprom-access: yes > supports-register-dump: yes > supports-priv-flags: no > > CPU > Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, 24 May 2016 12:28:09 +0200 Jesper Dangaard Brouer wrote:

> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.
>
> Like what is the minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c

This is a really fake benchmark, but it sort of shows the minimum
overhead achievable with this type of queue, where it is the same CPU
enqueuing and dequeuing, and cache is guaranteed to be hot.

Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
enqueue+dequeue of a single object is around 102 cycles(tsc).

To compare this with below, where enq and deq is measured separately:
 102 / 2 = 51 cycles

> Then I also want to know how this performs when two CPUs are involved.
> As this is also a primary use-case, for you when sending packets into a
> guest.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c

This parallel benchmark tries to keep two (or more) CPUs busy enqueuing
or dequeuing on the same skb_array queue. It prefills the queue, and
stops the test as soon as the queue is empty or full, or completes a
number of "loops"/cycles.

For two CPUs the results are really good:
 enqueue: 54 cycles(tsc)
 dequeue: 53 cycles(tsc)

Going to 4 CPUs, things break down (but it was not the primary use-case?):
 CPU(0) 927 cycles(tsc) enqueue
 CPU(1) 921 cycles(tsc) dequeue
 CPU(2) 927 cycles(tsc) enqueue
 CPU(3) 898 cycles(tsc) dequeue

Next on my todo-list is to implement the same tests for e.g. alf_queue,
so we can compare the concurrency part (which is the important part).
But FYI I'll be busy the next days at conf http://fosd2016.itu.dk/

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
[2] https://github.com/netoptimizer/prototype-kernel
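For readers without the prototype-kernel framework handy, the kind of array-based FIFO being benchmarked here can be sketched in plain userspace C. This is only an illustration of the concept (one pointer slot per entry, NULL meaning "empty", as in ptr_ring/skb_array); the real skb_array adds locking, batching, and cache-line placement that this sketch omits:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8  /* must be a power of two */

struct ring {
	void *slot[RING_SIZE];
	unsigned int head;   /* next slot to enqueue into */
	unsigned int tail;   /* next slot to dequeue from */
};

/* Enqueue one pointer; returns 0 on success, -1 if the ring is full.
 * A NULL slot means "empty", so NULL itself cannot be stored. */
static int ring_enqueue(struct ring *r, void *obj)
{
	unsigned int idx = r->head & (RING_SIZE - 1);

	if (r->slot[idx])
		return -1;       /* consumer has not freed this slot yet */
	r->slot[idx] = obj;
	r->head++;
	return 0;
}

/* Dequeue one pointer; returns NULL if the ring is empty. */
static void *ring_dequeue(struct ring *r)
{
	unsigned int idx = r->tail & (RING_SIZE - 1);
	void *obj = r->slot[idx];

	if (obj) {
		r->slot[idx] = NULL;  /* hand the slot back to the producer */
		r->tail++;
	}
	return obj;
}
```

The benchmark numbers above measure exactly this enqueue/dequeue pair, in-kernel and with the producer and consumer indexes on separate cache lines.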
Re: [PATCH RESEND 7/8] pipe: account to kmemcg
On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote: ... > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, > > + struct pipe_buffer *buf) > > +{ > > + struct page *page = buf->page; > > + > > + if (page_count(page) == 1) { > > This looks racy : some cpu could have temporarily elevated page count. All pipe operations (pipe_buf_operations->get, ->release, ->steal) are supposed to be called under pipe_lock. So, if we see a pipe_buffer->page with refcount of 1 in ->steal, that means that we are the only its user and it can't be spliced to another pipe. In fact, I just copied the code from generic_pipe_buf_steal, adding kmemcg related checks along the way, so it should be fine. Thanks, Vladimir > > > + if (memcg_kmem_enabled()) { > > + memcg_kmem_uncharge(page, 0); > > + __ClearPageKmemcg(page); > > + } > > + __SetPageLocked(page); > > + return 0; > > + } > > + return 1; > > +}
Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> > Unix sockets can consume a significant amount of system memory, hence
> > they should be accounted to kmemcg.
> >
> > Since unix socket buffers are always allocated from process context,
> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> > sock->sk_allocation mask.
>
> I have two questions :
>
> 1) What happens when a buffer, allocated from socket lands in a
> different socket , maybe owned by another user/process.
>
> Who owns it now, in term of kmemcg accounting ?

We never move memcg charges. E.g. if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first. Or if two processes are working with the
same directory tree, inodes and dentries will be charged to the first
user. The same is fair for unix socket buffers - they will be charged to
the sender.

>
> 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on a
4 core x 2 HT box. The results are below:

 # clients      bandwidth (10^6bits/sec)
                  base            patched
     1       67643 +-  725     64874 +-  353    - 4.0 %
     4      193585 +- 2516    186715 +- 1460    - 3.5 %
     8      194820 +-  377    187443 +- 1229    - 3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).

Thanks,
Vladimir
Re: Davicom DM9162 PHY supported in the kernel?
On Tue, May 24, 2016 at 05:09:35PM +0100, Amr Bekhit wrote: > Hi Andrew, > > > How about adding a printk() in genphy_read_status(). > > Would you be able to point me to some more information about these > status bits, please? You can get the datasheet from here: http://www.davicom.com.tw/userfile/24247/DM9162_DM9162I-12-MCO-DS-F01_08062014.pdf I would start by looking at the BMCR register. The genphy_read_status() function should get called about once per second, so that is an O.K. place to add debug prints, and there is example code for reading MII_BMCR. Andrew
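To make sense of the value read from MII_BMCR, the bit masks from include/uapi/linux/mii.h are what matters. A small userspace decoder for a raw BMCR value captured by such a debug print might look like this (the `decode_bmcr` helper name is made up for illustration; the mask values are the standard ones from mii.h):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit masks as defined in include/uapi/linux/mii.h */
#define BMCR_FULLDPLX	0x0100	/* Full duplex */
#define BMCR_ANRESTART	0x0200	/* Auto negotiation restart */
#define BMCR_ISOLATE	0x0400	/* Isolate data paths from MII */
#define BMCR_PDOWN	0x0800	/* Enable low power state */
#define BMCR_ANENABLE	0x1000	/* Enable auto negotiation */
#define BMCR_SPEED100	0x2000	/* Select 100Mbps */
#define BMCR_LOOPBACK	0x4000	/* TXD loopback */
#define BMCR_RESET	0x8000	/* Reset to default state */

/* Hypothetical helper: render the interesting BMCR bits into buf. */
static void decode_bmcr(unsigned int bmcr, char *buf, size_t len)
{
	snprintf(buf, len, "aneg=%d speed=%s duplex=%s pdown=%d reset=%d",
		 !!(bmcr & BMCR_ANENABLE),
		 (bmcr & BMCR_SPEED100) ? "100" : "10",
		 (bmcr & BMCR_FULLDPLX) ? "full" : "half",
		 !!(bmcr & BMCR_PDOWN),
		 !!(bmcr & BMCR_RESET));
}
```

In the kernel itself a `phydev_info(phydev, "BMCR=0x%04x\n", bmcr)` style print after the `phy_read(phydev, MII_BMCR)` in genphy_read_status() would give you the raw value to feed through something like this.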
Re: Davicom DM9162 PHY supported in the kernel?
Hi Andrew, > How about adding a printk() in genphy_read_status(). Would you be able to point me to some more information about these status bits, please? I'm struggling to find information about what they are, where I read them from, etc. I'm happy to write some debug printk's in the code, I just don't know where to get information about these status bits.
[PATCH net 0/2] Fix spinlock usage in HWBM
Hi, these two patches fix spinlock related issues introduced in v4.6. They have been reported by Russell King and Jean-Jacques Hiblot. Thanks to them, Gregory Gregory CLEMENT (2): net: mvneta: Fix lacking spinlock initialization net: hwbm: Fix unbalanced spinlock in error case drivers/net/ethernet/marvell/mvneta_bm.c | 1 + net/core/hwbm.c | 3 +++ 2 files changed, 4 insertions(+) -- 2.5.0
[PATCH net 1/2] net: mvneta: Fix lacking spinlock initialization
The spinlock used by the hwbm functions must be initialized by the
network driver. This commit fixes this lack and the following errors
when lockdep is enabled:

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
[] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[] (show_stack) from [] (dump_stack+0xb4/0xe0)
[] (dump_stack) from [] (__lock_acquire+0x1f58/0x2060)
[] (__lock_acquire) from [] (lock_acquire+0xa4/0xd0)
[] (lock_acquire) from [] (_raw_spin_lock_irqsave+0x54/0x68)
[] (_raw_spin_lock_irqsave) from [] (hwbm_pool_add+0x1c/0xdc)
[] (hwbm_pool_add) from [] (mvneta_bm_pool_use+0x338/0x490)
[] (mvneta_bm_pool_use) from [] (mvneta_probe+0x654/0x1284)
[] (mvneta_probe) from [] (platform_drv_probe+0x4c/0xb0)
[] (platform_drv_probe) from [] (driver_probe_device+0x214/0x2c0)
[] (driver_probe_device) from [] (__driver_attach+0xc0/0xc4)
[] (__driver_attach) from [] (bus_for_each_dev+0x68/0x9c)
[] (bus_for_each_dev) from [] (bus_add_driver+0x1a0/0x218)
[] (bus_add_driver) from [] (driver_register+0x78/0xf8)
[] (driver_register) from [] (do_one_initcall+0x90/0x1dc)
[] (do_one_initcall) from [] (kernel_init_freeable+0x15c/0x1fc)
[] (kernel_init_freeable) from [] (kernel_init+0x8/0x114)
[] (kernel_init) from [] (ret_from_fork+0x14/0x24)

Fixes: baa11ebc0c76 ("net: mvneta: Use the new hwbm framework")
Reported-by: Russell King
Cc:
Signed-off-by: Gregory CLEMENT
---
 drivers/net/ethernet/marvell/mvneta_bm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
index 01fccec632ec..466939f8f0cf 100644
--- a/drivers/net/ethernet/marvell/mvneta_bm.c
+++ b/drivers/net/ethernet/marvell/mvneta_bm.c
@@ -189,6 +189,7 @@ struct mvneta_bm_pool *mvneta_bm_pool_use(struct mvneta_bm *priv, u8 pool_id,
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	hwbm_pool->construct = mvneta_bm_construct;
 	hwbm_pool->priv = new_pool;
+	spin_lock_init(&hwbm_pool->lock);
 
 	/* Create new pool */
 	err = mvneta_bm_pool_create(priv, new_pool);
--
2.5.0
[PATCH net 2/2] net: hwbm: Fix unbalanced spinlock in error case
When hwbm_pool_add exited in error the spinlock was not released. This
patch fixes this issue.

Fixes: 8cb2d8bf57e6 ("net: add a hardware buffer management helper API")
Reported-by: Jean-Jacques Hiblot
Cc:
Signed-off-by: Gregory CLEMENT
---
 net/core/hwbm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/hwbm.c b/net/core/hwbm.c
index 941c28486896..2cab489ae62e 100644
--- a/net/core/hwbm.c
+++ b/net/core/hwbm.c
@@ -55,18 +55,21 @@ int hwbm_pool_add(struct hwbm_pool *bm_pool, unsigned int buf_num, gfp_t gfp)
 	spin_lock_irqsave(&bm_pool->lock, flags);
 	if (bm_pool->buf_num == bm_pool->size) {
 		pr_warn("pool already filled\n");
+		spin_unlock_irqrestore(&bm_pool->lock, flags);
 		return bm_pool->buf_num;
 	}
 
 	if (buf_num + bm_pool->buf_num > bm_pool->size) {
 		pr_warn("cannot allocate %d buffers for pool\n", buf_num);
+		spin_unlock_irqrestore(&bm_pool->lock, flags);
 		return 0;
 	}
 
 	if ((buf_num + bm_pool->buf_num) < bm_pool->buf_num) {
 		pr_warn("Adding %d buffers to the %d current buffers will overflow\n",
 			buf_num, bm_pool->buf_num);
+		spin_unlock_irqrestore(&bm_pool->lock, flags);
 		return 0;
 	}
--
2.5.0
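A common way to make this class of bug structurally impossible is to funnel every return path through a single unlock point, reached by goto, rather than unlocking before each early return. A userspace sketch of that shape, with a toy lock standing in for the driver spinlock (names and the lock type are illustrative, not the actual hwbm code):

```c
#include <assert.h>

/* Toy lock that checks balance: lock() and unlock() must alternate. */
struct toy_lock { int held; };

static void toy_lock(struct toy_lock *l)   { assert(!l->held); l->held = 1; }
static void toy_unlock(struct toy_lock *l) { assert(l->held);  l->held = 0; }

struct pool {
	struct toy_lock lock;
	unsigned int buf_num;
	unsigned int size;
};

/* Add buf_num buffers. Every return path funnels through the single
 * unlock at 'out', so the lock can never leak on an error path. */
static int pool_add(struct pool *p, unsigned int buf_num)
{
	int ret = 0;

	toy_lock(&p->lock);

	if (p->buf_num == p->size) {
		ret = p->buf_num;   /* pool already filled */
		goto out;
	}
	if (buf_num + p->buf_num > p->size)
		goto out;           /* cannot fit: add nothing, return 0 */

	p->buf_num += buf_num;
	ret = buf_num;
out:
	toy_unlock(&p->lock);
	return ret;
}
```

The fix above instead adds an unlock to each early return, which keeps the diff minimal but leaves future early returns free to reintroduce the leak.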
kernel panic with kernel 3.14.70, LVS on keepalived restart
Dear LVS users / netdev readers, today I've got a pretty peculiar problem. I've been running 3.14.48 (and some earlier 3.14 kernels) for a long time now in an LVS / keepalived driven loadbalancing cluster. See below for more detail on the setup. Today I started to upgrade to the current 3.14.70 kernel. At first glance everything seems fine, I can failover _to_ the box with the new kernel, and traffic is flowing fine. However, when I then switch BACK to a different box, the 3.14.70 kernel crashes. I've got an incomplete console dump (IPMI avi capture, single frame...) you can see here: https://plus.google.com/u/0/photos/photo/114613285248943487324/6288270822391242162 I usually manually failover by (having some automation) fiddle with keepalived.conf VRRP priority settings, then restart the daemon. The issue / reboot only manifests when I RESTART the ACTIVE keepalived. It always happens, then, with 3.14.70. That never happened before. On the other hand just reloading keepalived, with the prio-modified config, works fine! As I don't remember why I restarted instead of reloading, I can for now change my automation easily - but the issue is weird anyway. More info on the setup: 1) kernel is vanilla 3.14.70 (and was vanilla 3.14.48 without the issue), with a single (self written) patch to bonding applied (see http://permalink.gmane.org/gmane.linux.network/316758). Unfortunately I cannot live without that patch, i.e. can't try to reproduce with a pure vanilla vanilla kernel. 2) keepalived is 1.2.13 3) config uses "use_vmac" / "vmac_xmit_base", on multiple interfaces, i.e. MACVLAN interfaces on top of: 4) "normal" interfaces are both bridge-over-VLAN-over-LACP-bond-over-eth and ARP-bond-over-VLAN-over-eth 5) there is active conntracking including conntrackd (but excluding LVS state), LVS loadbalancing of some 15k pps, LVS sync, and heavy iptables use including ipset matching, going on. Just for completeness. Anybody got any idea what the root cause might be? 
best regards Patrick
Re: bpf: use-after-free in array_map_alloc
Hello, On Tue, May 24, 2016 at 10:40:54AM +0200, Vlastimil Babka wrote: > [+CC Marco who reported the CVE, forgot that earlier] > > On 05/23/2016 11:35 PM, Tejun Heo wrote: > > Hello, > > > > Can you please test whether this patch resolves the issue? While > > adding support for atomic allocations, I reduced alloc_mutex covered > > region too much. > > > > Thanks. > > Ugh, this makes the code even more head-spinning than it was. Locking-wise, it isn't complicated. It used to be a single mutex protecting everything. Atomic alloc support required putting core allocation parts under spinlock. It is messy because the two paths are mixed in the same function. If we break out the core part to a separate function and let the sleepable path call into that, it should look okay, but that's for another patch. Also, I think protecting chunk's lifetime w/ alloc_mutex is making it a bit nasty. Maybe we should do per-chunk "extending" completion and let pcpu_alloc_mutex just protect populating chunks. > > @@ -435,6 +435,8 @@ static int pcpu_extend_area_map(struct pcpu_chunk > > *chunk, int new_alloc) > > size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); > > unsigned long flags; > > > > + lockdep_assert_held(&pcpu_alloc_mutex); > > I don't see where the mutex gets locked when called via > pcpu_map_extend_workfn? (except via the new cancel_work_sync() call below?) Ah, right. > Also what protects chunks with scheduled work items from being removed? cancel_work_sync(), which now obviously should be called outside alloc_mutex. > > @@ -895,6 +897,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t > > align, bool reserved, > > return NULL; > > } > > > > + if (!is_atomic) > > + mutex_lock(&pcpu_alloc_mutex); > > BTW I noticed that > bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; > > this is too pessimistic IMHO. Reclaim is possible even without __GFP_FS and > __GFP_IO. Could you just use gfpflags_allow_blocking(gfp) here? 
vmalloc hardcodes GFP_KERNEL, so getting more relaxed doesn't buy us much. Thanks. -- tejun
Re: [RFC PATCH 7/7] tou: Support for GSO
On Mon, May 23, 2016 at 3:48 PM, Tom Herbert wrote: > Add SKB_GSO_TOU. In udp[64]_ufo_fragment check for SKB_GSO_TOU. If this > is set call skb_udp_tou_segment. skb_udp_tou_segment is very similar > to skb_udp_tunnel_segment except that we only need to deal with the > L4 headers. > > Signed-off-by: Tom Herbert > --- > include/linux/skbuff.h | 2 + > include/net/udp.h| 2 + > net/ipv4/fou.c | 2 + > net/ipv4/ip_output.c | 2 + > net/ipv4/udp_offload.c | 164 > +-- > net/ipv6/inet6_connection_sock.c | 3 + > net/ipv6/udp_offload.c | 128 +++--- > 7 files changed, 236 insertions(+), 67 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 65968a9..b57e484 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -482,6 +482,8 @@ enum { > SKB_GSO_PARTIAL = 1 << 13, > > SKB_GSO_TUNNEL_REMCSUM = 1 << 14, > + > + SKB_GSO_TOU = 1 << 15, > }; > So where do you add the netdev feature bit? From what I can tell that was overlooked and as a result devices that support FCoE CRC will end up corrupting TOU frames because netif_gso_ok currently ands the two together. Also I am pretty sure we can offload this on the Intel NICs using the GSO partial approach as we can just stuff the UDP header into the space that we would use for IPv4 options or IPv6 extension headers and it shouldn't complain. - Alex
Re: [PATCH net 3/3] Documentation: networking: dsa: Describe port_vlan_filtering
Hi Vivien, Florian, Tue, May 24, 2016 at 05:32:33PM IDT, vivien.dide...@savoirfairelinux.com wrote: >Hi Florian, > >Florian Fainelli writes: > >> Described what the port_vlan_filtering function is supposed to >> accomplish. >> >> Fixes: fb2dabad69f0 ("net: dsa: support VLAN filtering switchdev attr") >> Signed-off-by: Florian Fainelli >> --- >> Documentation/networking/dsa/dsa.txt | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/Documentation/networking/dsa/dsa.txt >> b/Documentation/networking/dsa/dsa.txt >> index 411b57fd73aa..a42fd2ec32a5 100644 >> --- a/Documentation/networking/dsa/dsa.txt >> +++ b/Documentation/networking/dsa/dsa.txt >> @@ -535,6 +535,12 @@ Bridge layer >> Bridge VLAN filtering >> - >> >> +- port_vlan_filtering: bridge layer function invoked when the bridge gets >> + configured for turning on or off VLAN filtering. If nothing specific >> needs to >> + be done at the hardware level, 0 must be returned. When VLAN filtering is >> + turned on, the hardware must be programmed with rejecting non-802.1Q >> frames, >> + when turned off the switch must accept any 802.1Q frames. > >Note that port_vlan_filtering is optional so a driver don't need to >implement it if nothing specific needs to be done at the hardware level. > >Also I'd think that with VLAN filtering on, the hardware must not reject >untagged frames, only 802.1Q frames which don't respect the programmed >VLAN rules. With VLAN filtering on I believe you only need to reject untagged frames when there's no PVID on the port. See __allowed_ingress() in net/bridge/br_vlan.c
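The ingress semantics being discussed can be condensed into a toy decision function. This is only an illustration of the behaviour described in this thread (filtering off accepts everything; untagged frames need a PVID; tagged frames need VLAN membership) and not the actual __allowed_ingress() logic, which also handles priority-tagged frames and per-VLAN port state:

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PVID 0  /* VID 0 is never a valid PVID */

/* Decide whether an ingressing frame is accepted on a bridge port.
 * vid == 0 models an untagged frame. */
static bool allowed_ingress(bool vlan_filtering, unsigned int pvid,
			    const unsigned int *member_vids, int n_vids,
			    unsigned int vid)
{
	int i;

	if (!vlan_filtering)
		return true;          /* filtering off: accept everything */

	if (vid == 0)                 /* untagged: needs a PVID to classify it */
		return pvid != NO_PVID;

	for (i = 0; i < n_vids; i++)
		if (member_vids[i] == vid)
			return true;  /* tagged and port is a member of the VLAN */

	return false;
}
```

This matches Ido's point: with filtering on, untagged frames are only rejected when the port has no PVID to classify them into.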
Re: [RFC] net: remove busylock
On Tue, 2016-05-24 at 13:50 +, David Laight wrote: > From: Jesper Dangaard Brouer > > Sent: 20 May 2016 18:50 > ... > > If would be cool if you could run a test with removed busylock and > > allow HTB to bulk dequeue. > > (Without having looked ) > Could you have two queues and separate queue and dequeue locks. > > The enqueue code would acquire the enqueue lock and add the packet > to the first queue. > > The dequeue code would acquire the dequeue lock and try to remove > a packet from the 2nd queue, if nothing present it would acquire > the enqueue lock and move the entire 1st queue to the 2nd queue. > > The obvious downside is two lock/unlocks for single packet dequeue. > If you can guarantee a single dequeue thread the 2nd lock isn't needed. > Not with HTB or any 'complex' qdisc hierarchy, where we have many 'queues' and strict limits (packets per qdisc)
Re: [PATCH net 3/3] Documentation: networking: dsa: Describe port_vlan_filtering
Hi Florian, Florian Fainelli writes: > Described what the port_vlan_filtering function is supposed to > accomplish. > > Fixes: fb2dabad69f0 ("net: dsa: support VLAN filtering switchdev attr") > Signed-off-by: Florian Fainelli > --- > Documentation/networking/dsa/dsa.txt | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/Documentation/networking/dsa/dsa.txt > b/Documentation/networking/dsa/dsa.txt > index 411b57fd73aa..a42fd2ec32a5 100644 > --- a/Documentation/networking/dsa/dsa.txt > +++ b/Documentation/networking/dsa/dsa.txt > @@ -535,6 +535,12 @@ Bridge layer > Bridge VLAN filtering > - > > +- port_vlan_filtering: bridge layer function invoked when the bridge gets > + configured for turning on or off VLAN filtering. If nothing specific needs > to > + be done at the hardware level, 0 must be returned. When VLAN filtering is > + turned on, the hardware must be programmed with rejecting non-802.1Q > frames, > + when turned off the switch must accept any 802.1Q frames. Note that port_vlan_filtering is optional so a driver don't need to implement it if nothing specific needs to be done at the hardware level. Also I'd think that with VLAN filtering on, the hardware must not reject untagged frames, only 802.1Q frames which don't respect the programmed VLAN rules. Thanks, Vivien
[PATCH v2] tipc: fix potential null pointer dereferences in some compat functions
Before calling the nla_parse_nested function, make sure the pointer to
the attribute is not null. This patch fixes several potential null
pointer dereference vulnerabilities in the tipc netlink functions.

Signed-off-by: Baozeng Ding
---
v2: declare local variable as reverse christmas tree format and make
    the commit log fit in 80 columns
---
 net/tipc/netlink_compat.c | 111 ++
 1 file changed, 93 insertions(+), 18 deletions(-)

diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c
index d7d050f..72a1c8f 100644
--- a/net/tipc/netlink_compat.c
+++ b/net/tipc/netlink_compat.c
@@ -346,9 +346,15 @@ static int tipc_nl_compat_bearer_dump(struct tipc_nl_compat_msg *msg,
 				      struct nlattr **attrs)
 {
 	struct nlattr *bearer[TIPC_NLA_BEARER_MAX + 1];
+	int err;
+
+	if (!attrs[TIPC_NLA_BEARER])
+		return -EINVAL;
 
-	nla_parse_nested(bearer, TIPC_NLA_BEARER_MAX, attrs[TIPC_NLA_BEARER],
-			 NULL);
+	err = nla_parse_nested(bearer, TIPC_NLA_BEARER_MAX,
+			       attrs[TIPC_NLA_BEARER], NULL);
+	if (err)
+		return err;
 
 	return tipc_add_tlv(msg->rep, TIPC_TLV_BEARER_NAME,
 			    nla_data(bearer[TIPC_NLA_BEARER_NAME]),
@@ -460,14 +466,31 @@ static int tipc_nl_compat_link_stat_dump(struct tipc_nl_compat_msg *msg,
 	struct nlattr *link[TIPC_NLA_LINK_MAX + 1];
 	struct nlattr *prop[TIPC_NLA_PROP_MAX + 1];
 	struct nlattr *stats[TIPC_NLA_STATS_MAX + 1];
+	int err;
 
-	nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK], NULL);
+	if (!attrs[TIPC_NLA_LINK])
+		return -EINVAL;
 
-	nla_parse_nested(prop, TIPC_NLA_PROP_MAX, link[TIPC_NLA_LINK_PROP],
-			 NULL);
+	err = nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK],
+			       NULL);
+	if (err)
+		return err;
+
+	if (!link[TIPC_NLA_LINK_PROP])
+		return -EINVAL;
 
-	nla_parse_nested(stats, TIPC_NLA_STATS_MAX, link[TIPC_NLA_LINK_STATS],
-			 NULL);
+	err = nla_parse_nested(prop, TIPC_NLA_PROP_MAX,
+			       link[TIPC_NLA_LINK_PROP], NULL);
+	if (err)
+		return err;
+
+	if (!link[TIPC_NLA_LINK_STATS])
+		return -EINVAL;
+
+	err = nla_parse_nested(stats, TIPC_NLA_STATS_MAX,
+			       link[TIPC_NLA_LINK_STATS], NULL);
+	if (err)
+		return err;
 
 	name = (char *)TLV_DATA(msg->req);
 	if (strcmp(name, nla_data(link[TIPC_NLA_LINK_NAME])) != 0)
@@ -569,8 +592,15 @@ static int tipc_nl_compat_link_dump(struct tipc_nl_compat_msg *msg,
 {
 	struct nlattr *link[TIPC_NLA_LINK_MAX + 1];
 	struct tipc_link_info link_info;
+	int err;
 
-	nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK], NULL);
+	if (!attrs[TIPC_NLA_LINK])
+		return -EINVAL;
+
+	err = nla_parse_nested(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK],
+			       NULL);
+	if (err)
+		return err;
 
 	link_info.dest = nla_get_flag(link[TIPC_NLA_LINK_DEST]);
 	link_info.up = htonl(nla_get_flag(link[TIPC_NLA_LINK_UP]));
@@ -758,12 +788,23 @@ static int tipc_nl_compat_name_table_dump(struct tipc_nl_compat_msg *msg,
 	u32 node, depth, type, lowbound, upbound;
 	static const char * const scope_str[] = {"", " zone", " cluster",
 						 " node"};
+	int err;
 
-	nla_parse_nested(nt, TIPC_NLA_NAME_TABLE_MAX,
-			 attrs[TIPC_NLA_NAME_TABLE], NULL);
+	if (!attrs[TIPC_NLA_NAME_TABLE])
+		return -EINVAL;
 
-	nla_parse_nested(publ, TIPC_NLA_PUBL_MAX, nt[TIPC_NLA_NAME_TABLE_PUBL],
-			 NULL);
+	err = nla_parse_nested(nt, TIPC_NLA_NAME_TABLE_MAX,
+			       attrs[TIPC_NLA_NAME_TABLE], NULL);
+	if (err)
+		return err;
+
+	if (!nt[TIPC_NLA_NAME_TABLE_PUBL])
+		return -EINVAL;
+
+	err = nla_parse_nested(publ, TIPC_NLA_PUBL_MAX,
+			       nt[TIPC_NLA_NAME_TABLE_PUBL], NULL);
+	if (err)
+		return err;
 
 	ntq = (struct tipc_name_table_query *)TLV_DATA(msg->req);
 
@@ -815,8 +856,15 @@ static int __tipc_nl_compat_publ_dump(struct tipc_nl_compat_msg *msg,
 {
 	u32 type, lower, upper;
 	struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1];
+	int err;
 
-	nla_parse_nested(publ, TIPC_NLA_PUBL_MAX, attrs[TIPC_NLA_PUBL], NULL);
+	if (!attrs[TIPC_NLA_PUBL])
+		return -EINVAL;
+
+	err = nla_parse_nested(publ, TIPC_NLA_PUBL_MAX, attrs[TIPC_NLA_PUBL],
+			       NULL);
+	if (err)
+		return err;
Re: [PATCH net 1/3] Documentation: networking: dsa: Remove poll_link description
Florian Fainelli writes: > This function has been removed in 4baee937b8d5 ("net: dsa: remove DSA > link polling") in favor of using the PHYLIB polling mechanism. > > Signed-off-by: Florian Fainelli Reviewed-by: Vivien Didelot
Re: [PATCH net 2/3] Documentation: networking: dsa: Remove priv_size description
Florian Fainelli writes: > We no longer have a priv_size structure member since 5feebd0a8a79 ("net: > dsa: Remove allocation of driver private memory") > > Signed-off-by: Florian Fainelli Reviewed-by: Vivien Didelot
RE: [RFC] net: remove busylock
From: Jesper Dangaard Brouer > Sent: 20 May 2016 18:50 ... > If would be cool if you could run a test with removed busylock and > allow HTB to bulk dequeue. (Without having looked ) Could you have two queues and separate queue and dequeue locks. The enqueue code would acquire the enqueue lock and add the packet to the first queue. The dequeue code would acquire the dequeue lock and try to remove a packet from the 2nd queue, if nothing present it would acquire the enqueue lock and move the entire 1st queue to the 2nd queue. The obvious downside is two lock/unlocks for single packet dequeue. If you can guarantee a single dequeue thread the 2nd lock isn't needed. David
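The two-queue scheme described above can be modelled with two mutexes and two pointer lists. This is a userspace sketch of the idea only (pthread mutexes standing in for the qdisc locks, intrusive list nodes standing in for skbs), not kernel code:

```c
#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; int val; };

struct twoq {
	pthread_mutex_t enq_lock, deq_lock;
	struct node *in_head, **in_tail;  /* 1st queue: producers append here */
	struct node *out;                 /* 2nd queue: consumer pops here */
};

static void twoq_init(struct twoq *q)
{
	pthread_mutex_init(&q->enq_lock, NULL);
	pthread_mutex_init(&q->deq_lock, NULL);
	q->in_head = NULL;
	q->in_tail = &q->in_head;
	q->out = NULL;
}

/* Enqueue: only the enqueue lock is taken. */
static void twoq_enqueue(struct twoq *q, struct node *n)
{
	n->next = NULL;
	pthread_mutex_lock(&q->enq_lock);
	*q->in_tail = n;
	q->in_tail = &n->next;
	pthread_mutex_unlock(&q->enq_lock);
}

/* Dequeue: takes the enqueue lock only when the 2nd queue runs dry,
 * splicing the entire 1st queue over in one go. */
static struct node *twoq_dequeue(struct twoq *q)
{
	struct node *n;

	pthread_mutex_lock(&q->deq_lock);
	if (!q->out) {
		pthread_mutex_lock(&q->enq_lock);
		q->out = q->in_head;   /* move whole pending list, in order */
		q->in_head = NULL;
		q->in_tail = &q->in_head;
		pthread_mutex_unlock(&q->enq_lock);
	}
	n = q->out;
	if (n)
		q->out = n->next;
	pthread_mutex_unlock(&q->deq_lock);
	return n;
}
```

As noted, producers and the consumer only contend on enq_lock when the second queue is empty; as Eric points out in his reply, this does not map directly onto HTB-style hierarchies with many queues and per-qdisc limits.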
Re: [PATCH iproute2 v3 5/5] ip: add MACsec support
Hello Stephen, 2016-05-23, 16:21:42 -0700, Stephen Hemminger wrote: > On Wed, 18 May 2016 17:35:13 +0200 > Sabrina Dubroca wrote: > > > + > > +static void print_rx_sc(const char *prefix, __u64 sci, __u8 active, struct > > rtattr *rxsc_stats, struct rtattr *sa) > > +{ > > Overall, this looks fine, but could you break some of the really long lines. > In general, I like iproute2 to follow kernel style, and in general stick to > the 80 column recommendation where it makes sense. Splitting strings or > stuff in loops may not make sense. There are several places in this code > longer than 100 chars. Ugh, yeah, I forgot to clean this up, sorry. I'll do that and resend. Thanks, -- Sabrina
Re: [PATCH iproute2 v3 1/5] add missing if_macsec.h header from kernel
2016-05-23, 16:07:20 -0700, Stephen Hemminger wrote: > On Wed, 18 May 2016 17:35:09 +0200 > Sabrina Dubroca wrote: > > > Signed-off-by: Sabrina Dubroca > > Acked-by: Phil Sutter > > This header already exists in master branch of iproute2, with current version > from upstream kernel (with PAD values). It wasn't when I posted this ;) Excellent, I'll drop this patch from v4. Thanks, -- Sabrina
Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote: > Unix sockets can consume a significant amount of system memory, hence > they should be accounted to kmemcg. > > Since unix socket buffers are always allocated from process context, > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in > sock->sk_allocation mask. I have two questions: 1) What happens when a buffer allocated from one socket lands in a different socket, maybe owned by another user/process? Who owns it now, in terms of kmemcg accounting? 2) Has the performance impact been evaluated? Thanks.
Re: [PATCH RESEND 7/8] pipe: account to kmemcg
On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote: > Pipes can consume a significant amount of system memory, hence they > should be accounted to kmemcg. > > This patch marks pipe_inode_info and anonymous pipe buffer page > allocations as __GFP_ACCOUNT so that they would be charged to kmemcg. > Note, since a pipe buffer page can be "stolen" and get reused for other > purposes, including mapping to userspace, we clear PageKmemcg thus > resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which > is introduced by this patch. > > Signed-off-by: Vladimir Davydov > Cc: Alexander Viro > --- > fs/pipe.c | 32 ++-- > 1 file changed, 26 insertions(+), 6 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 0d3f5165cb0b..4b32928f5426 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info > *pipe, > put_page(page); > } > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, > +struct pipe_buffer *buf) > +{ > + struct page *page = buf->page; > + > + if (page_count(page) == 1) { This looks racy : some cpu could have temporarily elevated page count. > + if (memcg_kmem_enabled()) { > + memcg_kmem_uncharge(page, 0); > + __ClearPageKmemcg(page); > + } > + __SetPageLocked(page); > + return 0; > + } > + return 1; > +} > +
Re: [iproute2 PATCH v2 1/1] vv4.6.0
More git foo. I have no idea how this went out ;-> Please ignore this but not the other. cheers, jamal

On 16-05-24 07:52 AM, Jamal Hadi Salim wrote:

From: Stephen Hemminger
---
 include/SNAPSHOT.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index 9220f77..cc67058 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "160314";
+static const char SNAPSHOT[] = "160518";
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote: > On Mon, 23 May 2016 23:52:47 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > > On Mon, 23 May 2016 13:43:46 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > Add ringtest based unit test for skb array. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > tools/virtio/ringtest/skb_array.c | 167 > > > > ++ > > > > tools/virtio/ringtest/Makefile| 4 +- > > > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > > "virtio_ring_inorder", I manually applied it. > > > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > > BUT how do I use it??? (the README is not helpful) > > > > > > What is the "output", are there any performance measurement results? > > > > First, if it completes successfully this means it completed > > a ton of cycles without errors. It caches any missing barriers > > which aren't nops on your system. > > I applied these patches on net-next (at commit 07b75260e) and the > skb_array test program never terminates. Strangely if I use your git > tree[1] (on branch vhost) the program does terminate... I didn't spot > the difference. > > > Second - use perf. > > I do like perf, but it does not answer my questions about the > performance of this queue. I will code something up in my own > framework[2] to answer my own performance questions. > > Like what is be minimum overhead (in cycles) achievable with this type > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) > for fastpath usage. Actually there is, kind of, a way to find out with my tool if you have an HT CPU. When you do run-on-all.sh it will pin consumer to the last CPU, then run producer on all of them. Look for the number for the HT pair - this shares cache between producer and consumer. 
This is not the same as doing produce + consume on the same CPU but it's close enough I think. To measure overhead I guess I should build a NOP tool that does not actually produce or consume anything. Will do. > Then I also want to know how this performs when two CPUs are involved. > As this is also a primary use-case, for you when sending packets into a > guest. > > > > E.g. simple perf stat will measure how long does it take to execute. > > there's a script that runs it on different CPUs, > > so I normally do: > > > > sh run-on-all.sh perf stat -r 5 ./skb_array > > I recommend documenting this in the README file in the same dir ;-) > > [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost > [2] https://github.com/netoptimizer/prototype-kernel > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
[PULL] vhost: cleanups and fixes
The following changes since commit 2dcd0af568b0cf583645c8a317dd12e344b1c72a:

  Linux 4.6 (2016-05-15 15:43:13 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to bb991288728e6a47a6f0fac6a4e9dfaeecc27956:

  ringtest: pass buf != NULL (2016-05-22 19:44:14 +0300)

virtio: patches for 4.7

Looks like a quiet cycle for virtio. There's a new inorder option for the ringtest tool, and a bugfix for balloon for ppc platforms when using virtio 1 mode.

Signed-off-by: Michael S. Tsirkin

Michael S. Tsirkin (3):
      virtio: add inorder option
      virtio_balloon: fix PFN format for virtio-1
      ringtest: pass buf != NULL

 drivers/virtio/virtio_balloon.c             | 20 +++-
 tools/virtio/ringtest/main.c                |  2 +-
 tools/virtio/ringtest/virtio_ring_0_9.c     | 49 -
 tools/virtio/ringtest/virtio_ring_inorder.c |  2 ++
 tools/virtio/ringtest/Makefile              |  5 ++-
 5 files changed, 67 insertions(+), 11 deletions(-)
 create mode 100644 tools/virtio/ringtest/virtio_ring_inorder.c
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote: > On Mon, 23 May 2016 23:52:47 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > > On Mon, 23 May 2016 13:43:46 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > Add ringtest based unit test for skb array. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > tools/virtio/ringtest/skb_array.c | 167 > > > > ++ > > > > tools/virtio/ringtest/Makefile| 4 +- > > > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > > "virtio_ring_inorder", I manually applied it. > > > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > > BUT how do I use it??? (the README is not helpful) > > > > > > What is the "output", are there any performance measurement results? > > > > First, if it completes successfully this means it completed > > a ton of cycles without errors. It caches any missing barriers > > which aren't nops on your system. > > I applied these patches on net-next (at commit 07b75260e) and the > skb_array test program never terminates. Strangely if I use your git > tree[1] (on branch vhost) the program does terminate... I didn't spot > the difference. Oh, that's my bad. You need ringtest: pass buf != NULL just a stub pointer for now. Signed-off-by: Michael S. Tsirkin from my tree. -- MST
[iproute2 PATCH v2 1/1] tc simple action: bug fix
From: Jamal Hadi Salim

Failed compile

m_simple.c: In function ‘parse_simple’:
m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args]
   *argv);
      ^
m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable]

Reported-by: Daniel Borkmann
Signed-off-by: Jamal Hadi Salim
---
 tc/m_simple.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tc/m_simple.c b/tc/m_simple.c
index feba61b..27b3e5e 100644
--- a/tc/m_simple.c
+++ b/tc/m_simple.c
@@ -100,7 +100,7 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 	struct tc_defact sel = {};
 	int argc = *argc_p;
 	char **argv = *argv_p;
-	int ok = 0, maybe_bind = 0;
+	int ok = 0;
 	struct rtattr *tail;
 	char *simpdata = NULL;
@@ -150,7 +150,7 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 	if (matches(*argv, "index") == 0) {
 		NEXT_ARG();
 		if (get_u32(&sel.index, *argv, 10)) {
-			fprintf(stderr, "simple: Illegal \"index\"\n",
+			fprintf(stderr, "simple: Illegal \"index\" (%s)\n",
 				*argv);
 			return -1;
 		}
@@ -171,7 +171,6 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 		return -1;
 	}
-	sel.action = TC_ACT_PIPE;
 	tail = NLMSG_TAIL(n);
-- 
1.9.1
[iproute2 PATCH v2 1/1] vv4.6.0
From: Stephen Hemminger
---
 include/SNAPSHOT.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/SNAPSHOT.h b/include/SNAPSHOT.h
index 9220f77..cc67058 100644
--- a/include/SNAPSHOT.h
+++ b/include/SNAPSHOT.h
@@ -1 +1 @@
-static const char SNAPSHOT[] = "160314";
+static const char SNAPSHOT[] = "160518";
-- 
1.9.1
Re: [iproute2 1/1] tc simple action: bug fix
On 16-05-24 06:30 AM, Daniel Borkmann wrote: On 05/24/2016 03:04 AM, Jamal Hadi Salim wrote: From: Jamal Hadi Salim Failed compile m_simple.c: In function ‘parse_simple’: m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args] *argv); ^ m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable] Btw, that last one is still not addressed. Sigh. It is fixed in my tree. I have a few patches that I was planning to send later, some of which touch intersecting files, and I am trying to be clever with git. Any tips? ;-> Will send v2. cheers, jamal
Re: [PATCH v5 2/2] skb_array: ring test
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote: > On Mon, 23 May 2016 23:52:47 +0300 > "Michael S. Tsirkin" wrote: > > > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > > On Mon, 23 May 2016 13:43:46 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > Add ringtest based unit test for skb array. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > tools/virtio/ringtest/skb_array.c | 167 > > > > ++ > > > > tools/virtio/ringtest/Makefile| 4 +- > > > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > > "virtio_ring_inorder", I manually applied it. > > > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > > BUT how do I use it??? (the README is not helpful) > > > > > > What is the "output", are there any performance measurement results? > > > > First, if it completes successfully this means it completed > > a ton of cycles without errors. It caches any missing barriers > > which aren't nops on your system. > > I applied these patches on net-next (at commit 07b75260e) and the > skb_array test program never terminates. Strangely if I use your git > tree[1] (on branch vhost) the program does terminate... I didn't spot > the difference. Disassemble the binaries and compare? Should be identical. Or attach gdb and look at array.producer and array.consumer. > > Second - use perf. > > I do like perf, but it does not answer my questions about the > performance of this queue. I will code something up in my own > framework[2] to answer my own performance questions. Sounds good. > Like what is be minimum overhead (in cycles) achievable with this type > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) > for fastpath usage. Interesting. > Then I also want to know how this performs when two CPUs are involved. This has flags to pin threads to different CPUs. > As this is also a primary use-case, for you when sending packets into a > guest. 
> That's absolutely the primary usecase. Was designed with this in mind. > > > E.g. simple perf stat will measure how long does it take to execute. > > there's a script that runs it on different CPUs, > > so I normally do: > > > > sh run-on-all.sh perf stat -r 5 ./skb_array > > I recommend documenting this in the README file in the same dir ;-) Good idea. Will do. > [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost > [2] https://github.com/netoptimizer/prototype-kernel > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
Re: [iproute2 1/1] tc simple action: bug fix
On 05/24/2016 03:04 AM, Jamal Hadi Salim wrote: From: Jamal Hadi Salim Failed compile m_simple.c: In function ‘parse_simple’: m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args] *argv); ^ m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable] Btw, that last one is still not addressed.
Re: [PATCH v5 2/2] skb_array: ring test
On Mon, 23 May 2016 23:52:47 +0300 "Michael S. Tsirkin" wrote: > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote: > > On Mon, 23 May 2016 13:43:46 +0300 > > "Michael S. Tsirkin" wrote: > > > > > Add ringtest based unit test for skb array. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > tools/virtio/ringtest/skb_array.c | 167 > > > ++ > > > tools/virtio/ringtest/Makefile| 4 +- > > > > Patch didn't apply cleanly to Makefile, as you also seems to have > > "virtio_ring_inorder", I manually applied it. > > > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array", > > BUT how do I use it??? (the README is not helpful) > > > > What is the "output", are there any performance measurement results? > First, if it completes successfully this means it completed > a ton of cycles without errors. It caches any missing barriers > which aren't nops on your system. I applied these patches on net-next (at commit 07b75260e) and the skb_array test program never terminates. Strangely if I use your git tree[1] (on branch vhost) the program does terminate... I didn't spot the difference. > Second - use perf. I do like perf, but it does not answer my questions about the performance of this queue. I will code something up in my own framework[2] to answer my own performance questions. Like: what is the minimum overhead (in cycles) achievable with this type of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot) for fastpath usage. Then I also want to know how this performs when two CPUs are involved. As this is also a primary use-case for you when sending packets into a guest. > E.g. simple perf stat will measure how long does it take to execute. 
> there's a script that runs it on different CPUs, > so I normally do: > > sh run-on-all.sh perf stat -r 5 ./skb_array I recommend documenting this in the README file in the same dir ;-) [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost [2] https://github.com/netoptimizer/prototype-kernel -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
Unix sockets can consume a significant amount of system memory, hence they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context, all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in sock->sk_allocation mask.

Signed-off-by: Vladimir Davydov
Cc: "David S. Miller"
---
 net/unix/af_unix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80aa6a3e6817..022bdd3ab7d9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	lockdep_set_class(&sk->sk_receive_queue.lock,
 				&af_unix_sk_receive_queue_lock_key);
 
+	sk->sk_allocation	= GFP_KERNEL_ACCOUNT;
 	sk->sk_write_space	= unix_write_space;
 	sk->sk_max_ack_backlog	= net->unx.sysctl_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
-- 
2.1.4
[PATCH net 1/1] qed: Reset the enable flag for eth protocol.
This patch fixes the coding error in determining the enable flag for the application/protocol. The enable flag should be set for all protocols but the eth.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Yuval Mintz
---
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index a06d19a..21ec1c2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -222,7 +222,7 @@ qed_dcbx_process_tlv(struct qed_hwfn *p_hwfn,
 		 * indication, but we only got here if there was an
 		 * app tlv for the protocol, so dcbx must be enabled.
 		 */
-		enable = !!(type == DCBX_PROTOCOL_ETH);
+		enable = !(type == DCBX_PROTOCOL_ETH);
 		qed_dcbx_update_app_info(p_data, p_hwfn, enable, true,
 					 priority, tc, type);
-- 
1.8.3.1
[RFC PATCH V3 0/3] basic device IOTLB support
This patch tries to implement a device IOTLB for vhost. It can be used in co-operation with a userspace IOMMU implementation (qemu) for a secure DMA environment (DMAR) in the guest.

The idea is simple. When vhost hits an IOTLB miss, it requests the assistance of userspace to do the translation. This is done through:

- When there's an IOTLB miss, vhost notifies userspace through the vhost_net fd, and userspace then reads the fault address, size and access type from the vhost fd.
- Userspace writes the translation result back to the vhost fd; vhost can then update its IOTLB.

The code was optimized for fixed-mapping users, e.g. dpdk in the guest. It will be slow if dynamic mappings are used in the guest. We could do optimizations on top.

The code was designed to be architecture independent. It should be easily ported to any architecture.

Stress tested with l2fwd/vfio in the guest with 4K/2M/1G page sizes. In the 1G hugepage case, a 100% TLB hit rate was observed.

Changes from V2:
- introduce memory accessors for vhost
- switch from ioctls to ordinary file read/write for iotlb misses and updates
- do not assume virtqueues are virtually mapped contiguously; all virtqueue accesses are done through the IOTLB
- verify memory access during IOTLB update and fail early
- introduce a module parameter for the size of the IOTLB

Changes from V1:
- support any size/range of updating and invalidation through introducing the interval tree.
- convert from per-device iotlb requests to per-virtqueue iotlb requests; this solves the possible deadlock in V1.
- read/write permission check support.

Please review.

Jason Wang (3):
  vhost: introduce vhost memory accessors
  vhost: convert pre sorted vhost memory array to interval tree
  vhost: device IOTLB API

 drivers/vhost/net.c        | 63 +++-
 drivers/vhost/vhost.c      | 760 ++---
 drivers/vhost/vhost.h      | 60 +++-
 include/uapi/linux/vhost.h | 28 ++
 4 files changed, 790 insertions(+), 121 deletions(-)

-- 
2.7.4
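The miss/service protocol sketched in the cover letter boils down to a small userspace loop: read a fault, translate, write the result back. A hedged illustration — `struct iotlb_miss`, `struct iotlb_update` and the identity `translate()` are stand-ins invented for this example, not the actual uapi this series adds:

```c
#include <stdint.h>
#include <assert.h>
#include <unistd.h>
#include <sys/socket.h>

/* Illustrative stand-ins -- NOT the message layout defined by the patch. */
struct iotlb_miss {
	uint64_t iova;
	uint64_t size;
	uint8_t perm;
};

struct iotlb_update {
	uint64_t iova;
	uint64_t size;
	uint64_t uaddr;
	uint8_t perm;
};

/* Toy translation: identity-map the iova into a fixed user region. A real
 * IOMMU implementation would walk its shadow page tables here. */
static uint64_t translate(uint64_t iova)
{
	return 0x7f0000000000ULL + iova;
}

/* One round of the service loop: read one fault record from the vhost fd
 * and write the translated mapping back to update the device IOTLB. */
static int service_one_miss(int vhost_fd)
{
	struct iotlb_miss miss;
	struct iotlb_update upd = { 0 };

	if (read(vhost_fd, &miss, sizeof(miss)) != (ssize_t)sizeof(miss))
		return -1;
	upd.iova = miss.iova;
	upd.size = miss.size;
	upd.perm = miss.perm;
	upd.uaddr = translate(miss.iova);
	if (write(vhost_fd, &upd, sizeof(upd)) != (ssize_t)sizeof(upd))
		return -1;
	return 0;
}
```

For fixed-mapping guests like dpdk, each iova is faulted once and then served from the in-kernel interval tree, which is why the cover letter reports a 100% hit rate with 1G hugepages.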
[RFC PATCH V3 2/3] vhost: convert pre sorted vhost memory array to interval tree
Current pre-sorted memory region array has some limitations for future device IOTLB conversion: 1) need extra work for adding and removing a single region, and it's expected to be slow because of sorting or memory re-allocation. 2) need extra work of removing a large range which may intersect several regions with different size. 3) need trick for a replacement policy like LRU To overcome the above shortcomings, this patch convert it to interval tree which can easily address the above issue with almost no extra work. The patch could be used for: - Extend the current API and only let the userspace to send diffs of memory table. - Simplify Device IOTLB implementation. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 8 +-- drivers/vhost/vhost.c | 182 -- drivers/vhost/vhost.h | 27 ++-- 3 files changed, 128 insertions(+), 89 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index beaeb17..a584239 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1037,20 +1037,20 @@ static long vhost_net_reset_owner(struct vhost_net *n) struct socket *tx_sock = NULL; struct socket *rx_sock = NULL; long err; - struct vhost_memory *memory; + struct vhost_umem *umem; mutex_lock(&n->dev.mutex); err = vhost_dev_check_owner(&n->dev); if (err) goto done; - memory = vhost_dev_reset_owner_prepare(); - if (!memory) { + umem = vhost_dev_reset_owner_prepare(); + if (!umem) { err = -ENOMEM; goto done; } vhost_net_stop(n, &tx_sock, &rx_sock); vhost_net_flush(n); - vhost_dev_reset_owner(&n->dev, memory); + vhost_dev_reset_owner(&n->dev, umem); vhost_net_vq_reset(n); done: mutex_unlock(&n->dev.mutex); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9f2a63a..166e779 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "vhost.h" @@ -42,6 +43,10 @@ enum { #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num]) #define vhost_avail_event(vq) ((__virtio16 __user 
*)&vq->used->ring[vq->num]) +INTERVAL_TREE_DEFINE(struct vhost_umem_node, +rb, __u64, __subtree_last, +START, LAST, , vhost_umem_interval_tree); + #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY static void vhost_disable_cross_endian(struct vhost_virtqueue *vq) { @@ -300,10 +305,10 @@ static void vhost_vq_reset(struct vhost_dev *dev, vq->call_ctx = NULL; vq->call = NULL; vq->log_ctx = NULL; - vq->memory = NULL; vhost_reset_is_le(vq); vhost_disable_cross_endian(vq); vq->busyloop_timeout = 0; + vq->umem = NULL; } static int vhost_worker(void *data) @@ -407,7 +412,7 @@ void vhost_dev_init(struct vhost_dev *dev, mutex_init(&dev->mutex); dev->log_ctx = NULL; dev->log_file = NULL; - dev->memory = NULL; + dev->umem = NULL; dev->mm = NULL; spin_lock_init(&dev->work_lock); INIT_LIST_HEAD(&dev->work_list); @@ -512,27 +517,36 @@ err_mm: } EXPORT_SYMBOL_GPL(vhost_dev_set_owner); -struct vhost_memory *vhost_dev_reset_owner_prepare(void) +static void *vhost_kvzalloc(unsigned long size) { - return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL); + void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); + + if (!n) + n = vzalloc(size); + return n; +} + +struct vhost_umem *vhost_dev_reset_owner_prepare(void) +{ + return vhost_kvzalloc(sizeof(struct vhost_umem)); } EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare); /* Caller should have device mutex */ -void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory) +void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem) { int i; vhost_dev_cleanup(dev, true); /* Restore memory to default empty mapping. */ - memory->nregions = 0; - dev->memory = memory; + INIT_LIST_HEAD(&umem->umem_list); + dev->umem = umem; /* We don't need VQ locks below since vhost_dev_cleanup makes sure * VQs aren't running. 
*/ for (i = 0; i < dev->nvqs; ++i) - dev->vqs[i]->memory = memory; + dev->vqs[i]->umem = umem; } EXPORT_SYMBOL_GPL(vhost_dev_reset_owner); @@ -549,6 +563,21 @@ void vhost_dev_stop(struct vhost_dev *dev) } EXPORT_SYMBOL_GPL(vhost_dev_stop); +static void vhost_umem_clean(struct vhost_umem *umem) +{ + struct vhost_umem_node *node, *tmp; + + if (!umem) + return; + + list_for_each_entry_safe(node, tmp, &umem->umem_list, link) { + vhost_umem_interval_tree_remove(node, &umem->umem_tree); + list_del(&node->link); +
[RFC PATCH V3 1/3] vhost: introduce vhost memory accessors
This patch introduces vhost memory accessors which were just wrappers for userspace address access helpers. This is a requirement for vhost device iotlb implementation which will add iotlb translations in those accessors. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 1 + drivers/vhost/vhost.c | 50 +++--- 2 files changed, 36 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f744eeb..beaeb17 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -985,6 +985,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) vhost_net_disable_vq(n, vq); vq->private_data = sock; + /* FIXME: iotlb prefetch here? */ r = vhost_vq_init_access(vq); if (r) goto err_used; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 669fef1..9f2a63a 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -651,6 +651,22 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem, return 1; } +#define vhost_put_user(vq, x, ptr) __put_user(x, ptr) + +static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to, + const void *from, unsigned size) +{ + return __copy_to_user(to, from, size); +} + +#define vhost_get_user(vq, x, ptr) __get_user(x, ptr) + +static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to, + void *from, unsigned size) +{ + return __copy_from_user(to, from, size); +} + static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num, struct vring_desc __user *desc, struct vring_avail __user *avail, @@ -1156,7 +1172,8 @@ EXPORT_SYMBOL_GPL(vhost_log_write); static int vhost_update_used_flags(struct vhost_virtqueue *vq) { void __user *used; - if (__put_user(cpu_to_vhost16(vq, vq->used_flags), &vq->used->flags) < 0) + if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags), + &vq->used->flags) < 0) return -EFAULT; if (unlikely(vq->log_used)) { /* Make sure the flag is seen before log. 
*/ @@ -1174,7 +1191,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq) static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event) { - if (__put_user(cpu_to_vhost16(vq, vq->avail_idx), vhost_avail_event(vq))) + if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx), + vhost_avail_event(vq))) return -EFAULT; if (unlikely(vq->log_used)) { void __user *used; @@ -1212,7 +1230,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq) r = -EFAULT; goto err; } - r = __get_user(last_used_idx, &vq->used->idx); + r = vhost_get_user(vq, last_used_idx, &vq->used->idx); if (r) goto err; vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx); @@ -1392,7 +1410,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Check it isn't doing very strange things with descriptor numbers. */ last_avail_idx = vq->last_avail_idx; - if (unlikely(__get_user(avail_idx, &vq->avail->idx))) { + if (unlikely(vhost_get_user(vq, avail_idx, &vq->avail->idx))) { vq_err(vq, "Failed to access avail idx at %p\n", &vq->avail->idx); return -EFAULT; @@ -1414,8 +1432,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Grab the next descriptor number they're advertising, and increment * the index we've seen. 
*/ - if (unlikely(__get_user(ring_head, - &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) { + if (unlikely(vhost_get_user(vq, ring_head, +&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) { vq_err(vq, "Failed to read head: idx %d address %p\n", last_avail_idx, &vq->avail->ring[last_avail_idx % vq->num]); @@ -1450,7 +1468,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, i, vq->num, head); return -EINVAL; } - ret = __copy_from_user(&desc, vq->desc + i, sizeof desc); + ret = vhost_copy_from_user(vq, &desc, vq->desc + i, + sizeof desc); if (unlikely(ret)) { vq_err(vq, "Failed to get descriptor: idx %d addr %p\n", i, vq->desc + i); @@ -1538,15 +1557,15 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq, start = vq->last_used_idx & (vq->num - 1); used = vq->used->ring + start; if (count == 1)
[RFC PATCH V3 3/3] vhost: device IOTLB API
This patch tries to implement a device IOTLB for vhost. It can be used in co-operation with a userspace (qemu) implementation of DMA remapping. The idea is simple: cache the translations in a software device IOTLB (an interval tree) in vhost, and make the vhost_net file descriptor readable and writable by a process. When vhost hits an IOTLB miss, the fault address, size and access type can be read from the file. After userspace finishes the translation, it writes the translated address to the vhost_net file to update the device IOTLB. When the device IOTLB (VHOST_F_DEVICE_IOTLB) is enabled, all vq addresses set by ioctl are treated as iovas instead of virtual addresses, and all accesses can only be done through the IOTLB instead of direct userspace memory access. Before each round of vq processing, all vq metadata is prefetched into the device IOTLB to make sure no translation fault happens during vq processing. In most cases virtqueues are mapped contiguously even in virtual address space, so the IOTLB translation for the virtqueue itself may be a little slower. We can add a fast path on top of this patch. 
Signed-off-by: Jason Wang --- drivers/vhost/net.c| 54 - drivers/vhost/vhost.c | 586 + drivers/vhost/vhost.h | 35 ++- include/uapi/linux/vhost.h | 28 +++ 4 files changed, 656 insertions(+), 47 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index a584239..0f45e3d 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -61,7 +61,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;" enum { VHOST_NET_FEATURES = VHOST_FEATURES | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) | -(1ULL << VIRTIO_NET_F_MRG_RXBUF) +(1ULL << VIRTIO_NET_F_MRG_RXBUF) | +(1ULL << VHOST_F_DEVICE_IOTLB) }; enum { @@ -308,7 +309,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net, { unsigned long uninitialized_var(endtime); int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), - out_num, in_num, NULL, NULL); + out_num, in_num, NULL, NULL, + VHOST_ACCESS_RO); if (r == vq->num && vq->busyloop_timeout) { preempt_disable(); @@ -318,7 +320,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net, cpu_relax_lowlatency(); preempt_enable(); r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), - out_num, in_num, NULL, NULL); + out_num, in_num, NULL, NULL, + VHOST_ACCESS_RO); } return r; @@ -351,6 +354,9 @@ static void handle_tx(struct vhost_net *net) if (!sock) goto out; + if (!vq_iotlb_prefetch(vq)) + goto out; + vhost_disable_notify(&net->dev, vq); hdr_size = nvq->vhost_hlen; @@ -538,7 +544,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq, } r = vhost_get_vq_desc(vq, vq->iov + seg, ARRAY_SIZE(vq->iov) - seg, &out, - &in, log, log_num); + &in, log, log_num, VHOST_ACCESS_WO); if (unlikely(r < 0)) goto err; @@ -612,6 +618,10 @@ static void handle_rx(struct vhost_net *net) sock = vq->private_data; if (!sock) goto out; + + if (!vq_iotlb_prefetch(vq)) + goto out; + vhost_disable_notify(&net->dev, vq); vhost_hlen = nvq->vhost_hlen; @@ -1085,6 +1095,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features) mutex_unlock(&n->dev.mutex); 
return -EFAULT; } + if ((features & (1ULL << VHOST_F_DEVICE_IOTLB))) { + if (vhost_init_device_iotlb(&n->dev, true)) + return -EFAULT; + } + for (i = 0; i < VHOST_NET_VQ_MAX; ++i) { mutex_lock(&n->vqs[i].vq.mutex); n->vqs[i].vq.acked_features = features; @@ -1167,9 +1182,40 @@ static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl, } #endif +static ssize_t vhost_net_chr_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *file = iocb->ki_filp; + struct vhost_net *n = file->private_data; + struct vhost_dev *dev = &n->dev; + int noblock = file->f_flags & O_NONBLOCK; + + return vhost_chr_read_iter(dev, to, noblock); +} + +static ssize_t vhost_net_chr_write_iter(struct kiocb *iocb, + struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct vhost_net *n = file->private_data; + struc
[PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg
Page tables can bite a relatively big chunk off system memory and their allocations are easy to trigger from userspace, so they should be accounted to kmemcg. This patch marks page table allocations as __GFP_ACCOUNT for x86. Note we must not charge allocations of kernel page tables, because they can be shared among processes from different cgroups so accounting them to a particular one can pin other cgroups for indefinitely long. So we clear __GFP_ACCOUNT flag if a page table is allocated for the kernel. Signed-off-by: Vladimir Davydov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" --- arch/x86/include/asm/pgalloc.h | 12 ++-- arch/x86/mm/pgtable.c | 11 --- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index bf7f8b55b0f9..2f531633cb16 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -81,7 +81,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { struct page *page; - page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0); + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO; + + if (mm == &init_mm) + gfp &= ~__GFP_ACCOUNT; + page = alloc_pages(gfp, 0); if (!page) return NULL; if (!pgtable_pmd_page_ctor(page)) { @@ -125,7 +129,11 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) { - return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT); + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT; + + if (mm == &init_mm) + gfp &= ~__GFP_ACCOUNT; + return (pud_t *)get_zeroed_page(gfp); } static inline void pud_free(struct mm_struct *mm, pud_t *pud) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 4eb287e25043..421ac6b74d11 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -6,7 +6,8 @@ #include #include 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO +#define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | __GFP_REPEAT | \ +__GFP_ZERO) #ifdef CONFIG_HIGHPTE #define PGALLOC_USER_GFP __GFP_HIGHMEM @@ -18,7 +19,7 @@ gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP; pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return (pte_t *)__get_free_page(PGALLOC_GFP); + return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT); } pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address) @@ -207,9 +208,13 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; bool failed = false; + gfp_t gfp = PGALLOC_GFP; + + if (mm == &init_mm) + gfp &= ~__GFP_ACCOUNT; for(i = 0; i < PREALLOCATED_PMDS; i++) { - pmd_t *pmd = (pmd_t *)__get_free_page(PGALLOC_GFP); + pmd_t *pmd = (pmd_t *)__get_free_page(gfp); if (!pmd) failed = true; if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) { -- 2.1.4
[PATCH RESEND 1/8] mm: remove pointless struct in struct page definition
... to reduce indentation level thus leaving more space for comments. Signed-off-by: Vladimir Davydov --- include/linux/mm_types.h | 68 +++- 1 file changed, 32 insertions(+), 36 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d553855503e6..3cc5977a9cab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -60,51 +60,47 @@ struct page { }; /* Second double word */ - struct { - union { - pgoff_t index; /* Our offset within mapping. */ - void *freelist; /* sl[aou]b first free object */ - /* page_deferred_list().prev-- second tail page */ - }; + union { + pgoff_t index; /* Our offset within mapping. */ + void *freelist; /* sl[aou]b first free object */ + /* page_deferred_list().prev-- second tail page */ + }; - union { + union { #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) - /* Used for cmpxchg_double in slub */ - unsigned long counters; + /* Used for cmpxchg_double in slub */ + unsigned long counters; #else - /* -* Keep _refcount separate from slub cmpxchg_double -* data. As the rest of the double word is protected by -* slab_lock but _refcount is not. -*/ - unsigned counters; + /* +* Keep _refcount separate from slub cmpxchg_double data. +* As the rest of the double word is protected by slab_lock +* but _refcount is not. +*/ + unsigned counters; #endif + struct { - struct { - - union { - /* -* Count of ptes mapped in mms, to show -* when page is mapped & limit reverse -* map searches. -*/ - atomic_t _mapcount; - - struct { /* SLUB */ - unsigned inuse:16; - unsigned objects:15; - unsigned frozen:1; - }; - int units; /* SLOB */ - }; + union { /* -* Usage count, *USE WRAPPER FUNCTION* -* when manual accounting. See page_ref.h +* Count of ptes mapped in mms, to show when +* page is mapped & limit reverse map searches. 
*/ - atomic_t _refcount; + atomic_t _mapcount; + + unsigned int active;/* SLAB */ + struct {/* SLUB */ + unsigned inuse:16; + unsigned objects:15; + unsigned frozen:1; + }; + int units; /* SLOB */ }; - unsigned int active;/* SLAB */ + /* +* Usage count, *USE WRAPPER FUNCTION* when manual +* accounting. See page_ref.h +*/ + atomic_t _refcount; }; }; -- 2.1.4
Re: [PATCH 8/8] af_unix: charge buffers to kmemcg
[adding netdev to Cc] On Mon, May 23, 2016 at 01:20:29PM +0300, Vladimir Davydov wrote: > Unix sockets can consume a significant amount of system memory, hence > they should be accounted to kmemcg. > > Since unix socket buffers are always allocated from process context, > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in > sock->sk_allocation mask. > > Signed-off-by: Vladimir Davydov > Cc: "David S. Miller" > --- > net/unix/af_unix.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index 80aa6a3e6817..022bdd3ab7d9 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct > socket *sock, int kern) > lockdep_set_class(&sk->sk_receive_queue.lock, > &af_unix_sk_receive_queue_lock_key); > > + sk->sk_allocation = GFP_KERNEL_ACCOUNT; > sk->sk_write_space = unix_write_space; > sk->sk_max_ack_backlog = net->unx.sysctl_max_dgram_qlen; > sk->sk_destruct = unix_sock_destructor;
[PATCH] brcmfmac: fix setting AP channel with new firmwares
Firmware for new chipsets is based on a new major version of code internally maintained at Broadcom. E.g. brcmfmac4366b-pcie.bin (used for BCM4366B1) is based on 10.10.69.3309 while brcmfmac43602-pcie.ap.bin was based on 7.35.177.56. Currently setting AP 5 GHz channel doesn't work reliably with BCM4366B1. When setting e.g. 36 control channel with VHT80 (center channel 42) firmware may randomly pick one of: 1) 52 control channel with 58 as center one 2) 100 control channel with 106 as center one 3) 116 control channel with 122 as center one 4) 149 control channel with 155 as center one It seems new firmwares require setting AP mode (BRCMF_C_SET_AP) before specifying a channel. Changing an order of firmware calls fixes the problem. This fix was verified with BCM4366B1 and tested for regressions on BCM43602. It's unclear if it's needed (or correct at all) for P2P interfaces so it leaves this code unaffected. Signed-off-by: Rafał Miłecki --- .../net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c index 299a404..3d09d23 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c @@ -4423,7 +4423,7 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct net_device *ndev, struct brcmf_join_params join_params; enum nl80211_iftype dev_role; struct brcmf_fil_bss_enable_le bss_enable; - u16 chanspec; + u16 chanspec = chandef_to_chanspec(&cfg->d11inf, &settings->chandef); bool mbss; int is_11d; @@ -4499,16 +4499,16 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct net_device *ndev, brcmf_config_ap_mgmt_ie(ifp->vif, &settings->beacon); - if (!mbss) { - chanspec = chandef_to_chanspec(&cfg->d11inf, - &settings->chandef); + if (dev_role == NL80211_IFTYPE_P2P_GO) { err = brcmf_fil_iovar_int_set(ifp, "chanspec", 
chanspec); if (err < 0) { brcmf_err("Set Channel failed: chspec=%d, %d\n", chanspec, err); goto exit; } + } + if (!mbss) { if (is_11d != ifp->vif->is_11d) { err = brcmf_fil_cmd_int_set(ifp, BRCMF_C_SET_REGULATORY, is_11d); @@ -4565,6 +4565,14 @@ brcmf_cfg80211_start_ap(struct wiphy *wiphy, struct net_device *ndev, brcmf_err("setting AP mode failed %d\n", err); goto exit; } + if (!mbss) { + err = brcmf_fil_iovar_int_set(ifp, "chanspec", chanspec); + if (err < 0) { + brcmf_err("Set Channel failed: chspec=%d, %d\n", + chanspec, err); + goto exit; + } + } err = brcmf_fil_cmd_int_set(ifp, BRCMF_C_UP, 1); if (err < 0) { brcmf_err("BRCMF_C_UP error (%d)\n", err); -- 1.8.4.5
[PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users
- Add a proper comment to page->_mapcount. - Introduce a macro for generating helper functions. - Place all special page->_mapcount values next to each other so that readers can see all possible values and so we don't get duplicates. Signed-off-by: Vladimir Davydov --- include/linux/mm_types.h | 5 include/linux/page-flags.h | 73 -- scripts/tags.sh| 3 ++ 3 files changed, 40 insertions(+), 41 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3cc5977a9cab..16bdef7943e3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -85,6 +85,11 @@ struct page { /* * Count of ptes mapped in mms, to show when * page is mapped & limit reverse map searches. +* +* Extra information about page type may be +* stored here for pages that are never mapped, +* in which case the value MUST BE <= -2. +* See page-flags.h for more details. */ atomic_t _mapcount; diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e5a32445f930..9940ade6a25e 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -593,54 +593,45 @@ TESTPAGEFLAG_FALSE(DoubleMap) #endif /* - * PageBuddy() indicate that the page is free and in the buddy system - * (see mm/page_alloc.c). - * - * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to - * -2 so that an underflow of the page_mapcount() won't be mistaken - * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very - * efficiently by most CPU architectures. + * For pages that are never mapped to userspace, page->mapcount may be + * used for storing extra information about page type. Any value used + * for this purpose must be <= -2, but it's better start not too close + * to -2 so that an underflow of the page_mapcount() won't be mistaken + * for a special page. 
*/ -#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) - -static inline int PageBuddy(struct page *page) -{ - return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE; +#define PAGE_MAPCOUNT_OPS(uname, lname) \ +static __always_inline int Page##uname(struct page *page) \ +{ \ + return atomic_read(&page->_mapcount) == \ + PAGE_##lname##_MAPCOUNT_VALUE; \ +} \ +static __always_inline void __SetPage##uname(struct page *page) \ +{ \ + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); \ + atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE);\ +} \ +static __always_inline void __ClearPage##uname(struct page *page) \ +{ \ + VM_BUG_ON_PAGE(!Page##uname(page), page); \ + atomic_set(&page->_mapcount, -1); \ } -static inline void __SetPageBuddy(struct page *page) -{ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); - atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE); -} +/* + * PageBuddy() indicate that the page is free and in the buddy system + * (see mm/page_alloc.c). + */ +#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) +PAGE_MAPCOUNT_OPS(Buddy, BUDDY) -static inline void __ClearPageBuddy(struct page *page) -{ - VM_BUG_ON_PAGE(!PageBuddy(page), page); - atomic_set(&page->_mapcount, -1); -} +/* + * PageBalloon() is set on pages that are on the balloon page list + * (see mm/balloon_compaction.c). 
+ */ +#define PAGE_BALLOON_MAPCOUNT_VALUE(-256) +PAGE_MAPCOUNT_OPS(Balloon, BALLOON) extern bool is_free_buddy_page(struct page *page); -#define PAGE_BALLOON_MAPCOUNT_VALUE (-256) - -static inline int PageBalloon(struct page *page) -{ - return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE; -} - -static inline void __SetPageBalloon(struct page *page) -{ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); - atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE); -} - -static inline void __ClearPageBalloon(struct page *page) -{ - VM_BUG_ON_PAGE(!PageBalloon(page), page); - atomic_set(&page->_mapcount, -1); -} - /* * If network-based swap is enabled, sl*b must keep track of whether pages * were allocated from pfmemalloc reserves. diff --git a/scrip
[PATCH RESEND 0/8] More stuff to charge to kmemcg
[resending with all relevant lists in Cc] Hi, This patch set implements per kmemcg accounting of page tables (x86-only), pipe buffers, and unix socket buffers. Basically, this is v2 of my earlier attempt [1], addressing comments by Andrew, namely: lack of comments to non-standard _mapcount usage, extra overhead even when kmemcg is unused, wrong handling of stolen pipe buffer pages. Patches 1-3 are just cleanups that are not supposed to introduce any functional changes. Patches 4 and 5 move charge/uncharge to generic page allocator paths for the sake of accounting pipe and unix socket buffers. Patches 6-8 make x86 page tables, pipe buffers, and unix socket buffers accountable. [1] http://lkml.kernel.org/r/%3ccover.1443262808.git.vdavy...@parallels.com%3E Thanks, Vladimir Davydov (8): mm: remove pointless struct in struct page definition mm: clean up non-standard page->_mapcount users mm: memcontrol: cleanup kmem charge functions mm: charge/uncharge kmemcg from generic page allocator paths mm: memcontrol: teach uncharge_list to deal with kmem pages arch: x86: charge page tables to kmemcg pipe: account to kmemcg af_unix: charge buffers to kmemcg arch/x86/include/asm/pgalloc.h | 12 - arch/x86/mm/pgtable.c | 11 ++-- fs/pipe.c | 32 --- include/linux/gfp.h| 10 +--- include/linux/memcontrol.h | 103 +++- include/linux/mm_types.h | 73 - include/linux/page-flags.h | 78 +-- kernel/fork.c | 6 +-- mm/memcontrol.c| 117 - mm/page_alloc.c| 63 +- mm/slab.h | 16 -- mm/slab_common.c | 2 +- mm/slub.c | 6 +-- mm/vmalloc.c | 6 +-- net/unix/af_unix.c | 1 + scripts/tags.sh| 3 ++ 16 files changed, 245 insertions(+), 294 deletions(-) -- 2.1.4
[PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. 
Signed-off-by: Vladimir Davydov --- include/linux/gfp.h| 10 +-- include/linux/page-flags.h | 7 + kernel/fork.c | 6 ++--- mm/page_alloc.c| 66 +- mm/slab_common.c | 2 +- mm/slub.c | 6 ++--- mm/vmalloc.c | 6 ++--- 7 files changed, 31 insertions(+), 72 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 570383a41853..c29e9d347bc6 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -78,8 +78,7 @@ struct vm_area_struct; * __GFP_THISNODE forces the allocation to be satisified from the requested * node with no fallbacks or placement policy enforcements. * - * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg (only relevant - * to kmem allocations). + * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg. */ #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) #define __GFP_WRITE((__force gfp_t)___GFP_WRITE) @@ -486,10 +485,6 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ alloc_pages_vma(gfp_mask, 0, vma, addr, node, false) -extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); -extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, - unsigned int order); - extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); extern unsigned long get_zeroed_page(gfp_t gfp_mask); @@ -513,9 +508,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc, unsigned int fragsz, gfp_t gfp_mask); extern void __free_page_frag(void *addr); -extern void __free_kmem_pages(struct page *page, unsigned int order); -extern void free_kmem_pages(unsigned long addr, unsigned int order); - #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 9940ade6a25e..b51e75a47e82 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -630,6 +630,13 @@ PAGE_MAPCOUNT_OPS(Buddy, BUDDY) #define 
PAGE_BALLOON_MAPCOUNT_VALUE(-256) PAGE_MAPCOUNT_OPS(Balloon, BALLOON) +/* + * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on + * pages allocated with __GFP_ACCOUNT. It gets cleared on page free. + */ +#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512) +PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG) + extern bool is_free_buddy_page(struct page *page); /* diff --git a/kernel/fork.c b/kernel/fork.c index 66cc2e0e137e..3f3c30f80786 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -162,8 +162,8 @@ void __weak arch_release_thread_info(struct thread_info *ti) static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { - struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP, - THREAD_SIZE_ORDER); + struct page *page = alloc_pages_node(node, THREADINFO_GFP, +THREAD_SIZE_ORDER); if (page) memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK, @@ -178,7 +178,7 @@ static inline void free_thread_info(struct thread
[PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages
Page table pages are batched-freed in release_pages on most architectures. If we want to charge them to kmemcg (this is what is done later in this series), we need to teach mem_cgroup_uncharge_list to handle kmem pages. Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 42 -- 1 file changed, 24 insertions(+), 18 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 482b4a0c97e4..89a421ee4713 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5432,15 +5432,18 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg, static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, unsigned long nr_anon, unsigned long nr_file, - unsigned long nr_huge, struct page *dummy_page) + unsigned long nr_huge, unsigned long nr_kmem, + struct page *dummy_page) { - unsigned long nr_pages = nr_anon + nr_file; + unsigned long nr_pages = nr_anon + nr_file + nr_kmem; unsigned long flags; if (!mem_cgroup_is_root(memcg)) { page_counter_uncharge(&memcg->memory, nr_pages); if (do_memsw_account()) page_counter_uncharge(&memcg->memsw, nr_pages); + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem) + page_counter_uncharge(&memcg->kmem, nr_kmem); memcg_oom_recover(memcg); } @@ -5463,6 +5466,7 @@ static void uncharge_list(struct list_head *page_list) unsigned long nr_anon = 0; unsigned long nr_file = 0; unsigned long nr_huge = 0; + unsigned long nr_kmem = 0; unsigned long pgpgout = 0; struct list_head *next; struct page *page; @@ -5473,8 +5477,6 @@ static void uncharge_list(struct list_head *page_list) */ next = page_list->next; do { - unsigned int nr_pages = 1; - page = list_entry(next, struct page, lru); next = page->lru.next; @@ -5493,31 +5495,35 @@ static void uncharge_list(struct list_head *page_list) if (memcg != page->mem_cgroup) { if (memcg) { uncharge_batch(memcg, pgpgout, nr_anon, nr_file, - nr_huge, page); - pgpgout = nr_anon = nr_file = nr_huge = 0; + nr_huge, nr_kmem, page); + pgpgout = nr_anon = nr_file = + nr_huge = 
nr_kmem = 0; } memcg = page->mem_cgroup; } - if (PageTransHuge(page)) { - nr_pages <<= compound_order(page); - VM_BUG_ON_PAGE(!PageTransHuge(page), page); - nr_huge += nr_pages; - } + if (!PageKmemcg(page)) { + unsigned int nr_pages = 1; - if (PageAnon(page)) - nr_anon += nr_pages; - else - nr_file += nr_pages; + if (PageTransHuge(page)) { + nr_pages <<= compound_order(page); + VM_BUG_ON_PAGE(!PageTransHuge(page), page); + nr_huge += nr_pages; + } + if (PageAnon(page)) + nr_anon += nr_pages; + else + nr_file += nr_pages; + pgpgout++; + } else + nr_kmem += 1 << compound_order(page); page->mem_cgroup = NULL; - - pgpgout++; } while (next != page_list); if (memcg) uncharge_batch(memcg, pgpgout, nr_anon, nr_file, - nr_huge, page); + nr_huge, nr_kmem, page); } /** -- 2.1.4
[PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions
- Handle memcg_kmem_enabled check out to the caller. This reduces the number of function definitions making the code easier to follow. At the same time it doesn't result in code bloat, because all of these functions are used only in one or two places. - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't have to dive deep into memcg implementation to see which allocations are charged and which are not. - Refresh comments. Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 103 +++-- mm/memcontrol.c| 75 - mm/page_alloc.c| 9 ++-- mm/slab.h | 16 +-- 4 files changed, 80 insertions(+), 123 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index a805474df4ab..2d03975c7dc0 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -754,6 +754,13 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) } #endif +struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); +void memcg_kmem_put_cache(struct kmem_cache *cachep); +int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, + struct mem_cgroup *memcg); +int memcg_kmem_charge(struct page *page, gfp_t gfp, int order); +void memcg_kmem_uncharge(struct page *page, int order); + #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) extern struct static_key_false memcg_kmem_enabled_key; @@ -775,22 +782,6 @@ static inline bool memcg_kmem_enabled(void) } /* - * In general, we'll do everything in our power to not incur in any overhead - * for non-memcg users for the kmem functions. Not even a function call, if we - * can avoid it. - * - * Therefore, we'll inline all those functions so that in the best case, we'll - * see that kmemcg is off for everybody and proceed quickly. If it is on, - * we'll still do most of the flag checking inline. We check a lot of - * conditions, but because they are pretty simple, they are expected to be - * fast. 
- */ -int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, - struct mem_cgroup *memcg); -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order); -void __memcg_kmem_uncharge(struct page *page, int order); - -/* * helper for accessing a memcg's index. It will be used as an index in the * child cache array in kmem_cache, and also to derive its name. This function * will return -1 when this is not a kmem-limited memcg. @@ -800,67 +791,6 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg) return memcg ? memcg->kmemcg_id : -1; } -struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp); -void __memcg_kmem_put_cache(struct kmem_cache *cachep); - -static inline bool __memcg_kmem_bypass(void) -{ - if (!memcg_kmem_enabled()) - return true; - if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) - return true; - return false; -} - -/** - * memcg_kmem_charge: charge a kmem page - * @page: page to charge - * @gfp: reclaim mode - * @order: allocation order - * - * Returns 0 on success, an error code on failure. - */ -static __always_inline int memcg_kmem_charge(struct page *page, -gfp_t gfp, int order) -{ - if (__memcg_kmem_bypass()) - return 0; - if (!(gfp & __GFP_ACCOUNT)) - return 0; - return __memcg_kmem_charge(page, gfp, order); -} - -/** - * memcg_kmem_uncharge: uncharge a kmem page - * @page: page to uncharge - * @order: allocation order - */ -static __always_inline void memcg_kmem_uncharge(struct page *page, int order) -{ - if (memcg_kmem_enabled()) - __memcg_kmem_uncharge(page, order); -} - -/** - * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation - * @cachep: the original global kmem cache - * - * All memory allocated from a per-memcg cache is charged to the owner memcg. 
- */ -static __always_inline struct kmem_cache * -memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) -{ - if (__memcg_kmem_bypass()) - return cachep; - return __memcg_kmem_get_cache(cachep, gfp); -} - -static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) -{ - if (memcg_kmem_enabled()) - __memcg_kmem_put_cache(cachep); -} - /** * memcg_kmem_update_page_stat - update kmem page state statistics * @page: the page @@ -883,15 +813,6 @@ static inline bool memcg_kmem_enabled(void) return false; } -static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) -{ - return 0; -} - -static inline void memcg_kmem_uncharge(struct page *page, int order) -{ -} - static inline int memcg_cache_id(struct mem_cgroup *memcg) { return -1; @@ -905,16 +826,6 @@ static i
[PATCH RESEND 7/8] pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they should be accounted to kmemcg. This patch marks pipe_inode_info and anonymous pipe buffer page allocations as __GFP_ACCOUNT so that they would be charged to kmemcg. Note, since a pipe buffer page can be "stolen" and get reused for other purposes, including mapping to userspace, we clear PageKmemcg thus resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which is introduced by this patch. Signed-off-by: Vladimir Davydov Cc: Alexander Viro --- fs/pipe.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 0d3f5165cb0b..4b32928f5426 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info *pipe, put_page(page); } +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + struct page *page = buf->page; + + if (page_count(page) == 1) { + if (memcg_kmem_enabled()) { + memcg_kmem_uncharge(page, 0); + __ClearPageKmemcg(page); + } + __SetPageLocked(page); + return 0; + } + return 1; +} + /** * generic_pipe_buf_steal - attempt to take ownership of a &pipe_buffer * @pipe: the pipe that the buffer belongs to @@ -219,7 +236,7 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { .can_merge = 1, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, - .steal = generic_pipe_buf_steal, + .steal = anon_pipe_buf_steal, .get = generic_pipe_buf_get, }; @@ -227,7 +244,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = { .can_merge = 0, .confirm = generic_pipe_buf_confirm, .release = anon_pipe_buf_release, - .steal = generic_pipe_buf_steal, + .steal = anon_pipe_buf_steal, .get = generic_pipe_buf_get, }; @@ -405,7 +422,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) int copied; if (!page) { - page = alloc_page(GFP_HIGHUSER); + page = 
alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); if (unlikely(!page)) { ret = ret ? : -ENOMEM; break; @@ -611,7 +628,7 @@ struct pipe_inode_info *alloc_pipe_info(void) { struct pipe_inode_info *pipe; - pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL); + pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT); if (pipe) { unsigned long pipe_bufs = PIPE_DEF_BUFFERS; struct user_struct *user = get_current_user(); @@ -619,7 +636,9 @@ struct pipe_inode_info *alloc_pipe_info(void) if (!too_many_pipe_buffers_hard(user)) { if (too_many_pipe_buffers_soft(user)) pipe_bufs = 1; - pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * pipe_bufs, GFP_KERNEL); + pipe->bufs = kcalloc(pipe_bufs, +sizeof(struct pipe_buffer), +GFP_KERNEL_ACCOUNT); } if (pipe->bufs) { @@ -1010,7 +1029,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long nr_pages) if (nr_pages < pipe->nrbufs) return -EBUSY; - bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN); + bufs = kcalloc(nr_pages, sizeof(*bufs), + GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (unlikely(!bufs)) return -ENOMEM; -- 2.1.4
Re: bpf: use-after-free in array_map_alloc
[+CC Marco who reported the CVE, forgot that earlier] On 05/23/2016 11:35 PM, Tejun Heo wrote: Hello, Can you please test whether this patch resolves the issue? While adding support for atomic allocations, I reduced alloc_mutex covered region too much. Thanks. Ugh, this makes the code even more head-spinning than it was. diff --git a/mm/percpu.c b/mm/percpu.c index 0c59684..bd2df70 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -162,7 +162,7 @@ static struct pcpu_chunk *pcpu_reserved_chunk; static int pcpu_reserved_chunk_limit; static DEFINE_SPINLOCK(pcpu_lock);/* all internal data structures */ -static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */ +static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map extension */ static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */ @@ -435,6 +435,8 @@ static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc) size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); unsigned long flags; + lockdep_assert_held(&pcpu_alloc_mutex); I don't see where the mutex gets locked when called via pcpu_map_extend_workfn? (except via the new cancel_work_sync() call below?) Also what protects chunks with scheduled work items from being removed? + new = pcpu_mem_zalloc(new_size); if (!new) return -ENOMEM; @@ -895,6 +897,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, return NULL; } + if (!is_atomic) + mutex_lock(&pcpu_alloc_mutex); BTW I noticed that bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; this is too pessimistic IMHO. Reclaim is possible even without __GFP_FS and __GFP_IO. Could you just use gfpflags_allow_blocking(gfp) here? 
> +
>  	spin_lock_irqsave(&pcpu_lock, flags);
>  
>  	/* serve reserved allocations from the reserved chunk if available */
> @@ -967,12 +972,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		if (is_atomic)
>  			goto fail;
>  
> -		mutex_lock(&pcpu_alloc_mutex);
> +		lockdep_assert_held(&pcpu_alloc_mutex);
>  
>  		if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
>  			chunk = pcpu_create_chunk();
>  			if (!chunk) {
> -				mutex_unlock(&pcpu_alloc_mutex);
>  				err = "failed to allocate new chunk";
>  				goto fail;
>  			}
> @@ -983,7 +987,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		spin_lock_irqsave(&pcpu_lock, flags);
>  	}
> -	mutex_unlock(&pcpu_alloc_mutex);
>  	goto restart;
>  
>  area_found:
> @@ -993,8 +996,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	if (!is_atomic) {
>  		int page_start, page_end, rs, re;
>  
> -		mutex_lock(&pcpu_alloc_mutex);
> -
>  		page_start = PFN_DOWN(off);
>  		page_end = PFN_UP(off + size);
> @@ -1005,7 +1006,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		spin_lock_irqsave(&pcpu_lock, flags);
>  		if (ret) {
> -			mutex_unlock(&pcpu_alloc_mutex);
>  			pcpu_free_area(chunk, off, &occ_pages);
>  			err = "failed to populate";
>  			goto fail_unlock;
> @@ -1045,6 +1045,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		/* see the flag handling in pcpu_blance_workfn() */
>  		pcpu_atomic_alloc_failed = true;
>  		pcpu_schedule_balance_work();
> +	} else {
> +		mutex_unlock(&pcpu_alloc_mutex);
>  	}
>  	return NULL;
>  }
> @@ -1137,6 +1139,8 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	list_for_each_entry_safe(chunk, next, &to_free, list) {
>  		int rs, re;
>  
> +		cancel_work_sync(&chunk->map_extend_work);

This deserves some comment?

> +
>  		pcpu_for_each_pop_region(chunk, rs, re, 0, pcpu_unit_pages) {
>  			pcpu_depopulate_chunk(chunk, rs, re);
>  			spin_lock_irq(&pcpu_lock);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majord...@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
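The pessimism Vlastimil points out can be seen in a small userspace sketch. The flag bit values below are arbitrary illustrations, not the kernel's real `gfp.h` constants; the two predicates mirror the shape of `(gfp & GFP_KERNEL) != GFP_KERNEL` versus `gfpflags_allow_blocking()` (which in the kernel tests `__GFP_DIRECT_RECLAIM`):

```c
#include <stdbool.h>

/* Illustrative GFP-style flag bits -- values are made up for this
 * sketch and do not match the kernel's gfp.h definitions. */
#define F_DIRECT_RECLAIM 0x01u
#define F_KSWAPD_RECLAIM 0x02u
#define F_IO             0x04u
#define F_FS             0x08u
#define F_HIGH           0x10u

#define GFP_KERNEL (F_DIRECT_RECLAIM | F_KSWAPD_RECLAIM | F_IO | F_FS)
#define GFP_NOFS   (F_DIRECT_RECLAIM | F_KSWAPD_RECLAIM | F_IO)
#define GFP_ATOMIC (F_HIGH | F_KSWAPD_RECLAIM)

/* The patch's test: treat anything short of the full GFP_KERNEL mask
 * as atomic -- this wrongly classifies GFP_NOFS as non-blocking-only. */
bool is_atomic_strict(unsigned gfp)
{
	return (gfp & GFP_KERNEL) != GFP_KERNEL;
}

/* The suggested test: blocking is allowed whenever direct reclaim is,
 * regardless of whether FS/IO reclaim is also permitted. */
bool allows_blocking(unsigned gfp)
{
	return (gfp & F_DIRECT_RECLAIM) != 0;
}
```

With these definitions, `GFP_NOFS` is counted as atomic by the strict test even though it may block and reclaim, which is exactly the over-conservative behavior being questioned.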
Re: [PATCH 0/2 (linux-stable-4.5.y)] net: stmmac: MDIO fixes
On 5/23/2016 5:36 PM, Greg KH wrote:
> On Mon, May 23, 2016 at 03:17:41PM +0200, Giuseppe Cavallaro wrote:
>> These patches port some recent fixes to the linux-4.5.y stable branch.
>> They fix the MDIO settings, trying to cover all the possible scenarios
>> where the stmmac is connected to a real transceiver or a switch.
>>
>> These patches are needed for kernel 4.5, where Lime2, BananaPi and
>> Cubieboard2 show problems when using the stmmac:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=823493
>
> What are the git commit ids for these patches in Linus's tree?

Commits are:

  stmmac: fix MDIO settings
  a7657f128c279ae5796ab2ca7d04a7819f4259f0

and

  Revert "stmmac: Fix 'eth0: No PHY found' regression"
  d7e944c8ddc0983640a9a32868fb217485d12ca2

Regards
Peppe

> thanks,
> greg k-h