Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-20 Thread Eric Dumazet
On Thu, 2013-06-20 at 13:14 +0800, Jason Wang wrote:

> A drawback of NETIF_F_LLTX is that we may contend on qdisc lock
> especially when we have a huge number of tx queues.

For your information, the loopback driver is LLTX, and there is no qdisc
on it unless you specifically add one.

Once you add a qdisc on a device, you hit the typical contention on a
spinlock. But it's hard to design a parallel qdisc; so far nobody has
done that.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Jason Wang
On 06/19/2013 05:56 PM, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:
>
>> Well KVM supports up to 160 VCPUs on x86.
>>
>> Creating a queue per CPU is very reasonable, and
>> assuming cache line size of 64 bytes, netdev_queue seems to be 320
>> bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
>> I agree most people don't have such systems yet, but
>> they do exist.
> Even so, it will just work, like a fork() is likely to work, even if a
> process needs order-1 allocation for kernel stack.
>
> Some drivers still use order-10 allocations with kmalloc(), and nobody
> complained yet.
>
> We had complaints about the mlx4 driver lately, only because kmalloc()
> now gives a warning if allocations above MAX_ORDER are attempted.
>
> Having a single pointer means that we can:
>
> - Attempt a regular kmalloc() call; it will work most of the time.
> - Fall back to vmalloc() _if_ kmalloc() failed.
>
> Frankly, if you want one tx queue per cpu, I would rather use
> NETIF_F_LLTX, like some other virtual devices.

A drawback of NETIF_F_LLTX is that we may contend on qdisc lock
especially when we have a huge number of tx queues.
>
> This way, you can have real per cpu memory, with proper NUMA affinity.
>
>
>



Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2013 at 08:58:31AM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 18:40 +0300, Michael S. Tsirkin wrote:
> 
> > That's a good trick too - vmalloc memory is a bit slower
> > on x86 since it's not using a huge page, but that's only
> > when we have lots of CPUs/queues...
> > 
> > Short term - how about switching to vmalloc if > 32 queues?
> 
> You do not have to switch "if > 32 queues"
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa007db..473cc9e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -130,6 +130,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/vmalloc.h>
>  
>  #include "net-sysfs.h"
>  
> @@ -5253,6 +5254,14 @@ static void netdev_init_one_queue(struct net_device 
> *dev,
>  #endif
>  }
>  
> +static void netif_free_tx_queues(struct net_device *dev)
> +{
> + if (is_vmalloc_addr(dev->_tx))
> + vfree(dev->_tx);
> + else
> + kfree(dev->_tx);
> +}
> +
>  static int netif_alloc_netdev_queues(struct net_device *dev)
>  {
>   unsigned int count = dev->num_tx_queues;
> @@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device 
> *dev)
>   BUG_ON(count < 1);
>  
>   tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);

As we have explicit code to recover, maybe set __GFP_NOWARN?
If we are out of memory, vzalloc() will warn too; we don't need two
warnings.

> - if (!tx)
> - return -ENOMEM;
> -
> + if (!tx) {
> + tx = vzalloc(count * sizeof(struct netdev_queue));
> + if (!tx)
> + return -ENOMEM;
> + }
>   dev->_tx = tx;
>  
>   netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
> @@ -5811,7 +5822,7 @@ free_all:
>  
>  free_pcpu:
>   free_percpu(dev->pcpu_refcnt);
> - kfree(dev->_tx);
> + netif_free_tx_queues(dev);
>  #ifdef CONFIG_RPS
>   kfree(dev->_rx);
>  #endif
> @@ -5836,7 +5847,7 @@ void free_netdev(struct net_device *dev)
>  
>   release_net(dev_net(dev));
>  
> - kfree(dev->_tx);
> + netif_free_tx_queues(dev);
>  #ifdef CONFIG_RPS
>   kfree(dev->_rx);
>  #endif
> 


RE: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 17:06 +0100, David Laight wrote:

> Given the number of places I've seen this code added, why
> not put it in a general helper?

Good luck with that. We had many attempts in the past.

> I also thought that malloc() with GFP_KERNEL would sleep.
> Under what conditions does it fail instead?

For example, on a 32-bit kernel: LOW memory exhaustion/fragmentation.

vmalloc() has an extra 128MB of virtual address space (on i386 at least,
with the standard 3G/1G split).





RE: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread David Laight
> +static void netif_free_tx_queues(struct net_device *dev)
> +{
> + if (is_vmalloc_addr(dev->_tx))
> + vfree(dev->_tx);
> + else
> + kfree(dev->_tx);
> +}
> +
>  static int netif_alloc_netdev_queues(struct net_device *dev)
>  {
>   unsigned int count = dev->num_tx_queues;
> @@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device 
> *dev)
>   BUG_ON(count < 1);
> 
>   tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);
> - if (!tx)
> - return -ENOMEM;
> -
> + if (!tx) {
> + tx = vzalloc(count * sizeof(struct netdev_queue));
> + if (!tx)
> + return -ENOMEM;
> + }
>   dev->_tx = tx;

Given the number of places I've seen this code added, why
not put it in a general helper?

I also thought that kmalloc() with GFP_KERNEL would sleep.
Under what conditions does it fail instead?

David



Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 18:40 +0300, Michael S. Tsirkin wrote:

> That's a good trick too - vmalloc memory is a bit slower
> on x86 since it's not using a huge page, but that's only
> when we have lots of CPUs/queues...
> 
> Short term - how about switching to vmalloc if > 32 queues?

You do not have to switch "if > 32 queues"

diff --git a/net/core/dev.c b/net/core/dev.c
index fa007db..473cc9e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
 #include 
 #include 
 #include 
+#include <linux/vmalloc.h>
 
 #include "net-sysfs.h"
 
@@ -5253,6 +5254,14 @@ static void netdev_init_one_queue(struct net_device *dev,
 #endif
 }
 
+static void netif_free_tx_queues(struct net_device *dev)
+{
+   if (is_vmalloc_addr(dev->_tx))
+   vfree(dev->_tx);
+   else
+   kfree(dev->_tx);
+}
+
 static int netif_alloc_netdev_queues(struct net_device *dev)
 {
unsigned int count = dev->num_tx_queues;
@@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device 
*dev)
BUG_ON(count < 1);
 
tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);
-   if (!tx)
-   return -ENOMEM;
-
+   if (!tx) {
+   tx = vzalloc(count * sizeof(struct netdev_queue));
+   if (!tx)
+   return -ENOMEM;
+   }
dev->_tx = tx;
 
netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
@@ -5811,7 +5822,7 @@ free_all:
 
 free_pcpu:
free_percpu(dev->pcpu_refcnt);
-   kfree(dev->_tx);
+   netif_free_tx_queues(dev);
 #ifdef CONFIG_RPS
kfree(dev->_rx);
 #endif
@@ -5836,7 +5847,7 @@ void free_netdev(struct net_device *dev)
 
release_net(dev_net(dev));
 
-   kfree(dev->_tx);
+   netif_free_tx_queues(dev);
 #ifdef CONFIG_RPS
kfree(dev->_rx);
 #endif




Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2013 at 02:56:03AM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:
> 
> > Well KVM supports up to 160 VCPUs on x86.
> > 
> > Creating a queue per CPU is very reasonable, and
> > assuming cache line size of 64 bytes, netdev_queue seems to be 320
> > bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
> > I agree most people don't have such systems yet, but
> > they do exist.
> 
> Even so, it will just work, like a fork() is likely to work, even if a
> process needs order-1 allocation for kernel stack.
> 
> Some drivers still use order-10 allocations with kmalloc(), and nobody
> complained yet.
> 
> We had complaints about the mlx4 driver lately, only because kmalloc()
> now gives a warning if allocations above MAX_ORDER are attempted.
> 
> Having a single pointer means that we can:
>
> - Attempt a regular kmalloc() call; it will work most of the time.
> - Fall back to vmalloc() _if_ kmalloc() failed.

That's a good trick too - vmalloc memory is a bit slower
on x86 since it's not using a huge page, but that's only
when we have lots of CPUs/queues...

Short term - how about switching to vmalloc if > 32 queues?

> Frankly, if you want one tx queue per cpu, I would rather use
> NETIF_F_LLTX, like some other virtual devices.
> 
> This way, you can have real per cpu memory, with proper NUMA affinity.
> 
> 


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2013 at 02:56:03AM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:
> 
> > Well KVM supports up to 160 VCPUs on x86.
> > 
> > Creating a queue per CPU is very reasonable, and
> > assuming cache line size of 64 bytes, netdev_queue seems to be 320
> > bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
> > I agree most people don't have such systems yet, but
> > they do exist.
> 
> Even so, it will just work, like a fork() is likely to work, even if a
> process needs order-1 allocation for kernel stack.
> Some drivers still use order-10 allocations with kmalloc(), and nobody
> complained yet.
> 
> We had complaints about the mlx4 driver lately, only because kmalloc()
> now gives a warning if allocations above MAX_ORDER are attempted.
> 
> Having a single pointer means that we can:
>
> - Attempt a regular kmalloc() call; it will work most of the time.
> - Fall back to vmalloc() _if_ kmalloc() failed.

Most drivers create devices at boot time, when this is more likely to work.
What makes tun (and macvlan) a bit special is that the device is created
from userspace.  Virt setups create/destroy them all the
time.

> 
> Frankly, if you want one tx queue per cpu, I would rather use
> NETIF_F_LLTX, like some other virtual devices.
> 
> This way, you can have real per cpu memory, with proper NUMA affinity.
> 

Hmm good point, worth looking at.

Thanks,
-- 
MST


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:

> Well KVM supports up to 160 VCPUs on x86.
> 
> Creating a queue per CPU is very reasonable, and
> assuming cache line size of 64 bytes, netdev_queue seems to be 320
> bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
> I agree most people don't have such systems yet, but
> they do exist.

Even so, it will just work, like a fork() is likely to work, even if a
process needs order-1 allocation for kernel stack.

Some drivers still use order-10 allocations with kmalloc(), and nobody
complained yet.

We had complaints about the mlx4 driver lately, only because kmalloc()
now gives a warning if allocations above MAX_ORDER are attempted.

Having a single pointer means that we can:

- Attempt a regular kmalloc() call; it will work most of the time.
- Fall back to vmalloc() _if_ kmalloc() failed.

Frankly, if you want one tx queue per cpu, I would rather use
NETIF_F_LLTX, like some other virtual devices.

This way, you can have real per cpu memory, with proper NUMA affinity.





Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Tue, Jun 18, 2013 at 11:31:58PM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 13:40 +0800, Jason Wang wrote:
> > Currently, we use kcalloc to allocate rx/tx queues for a net device,
> > which could easily lead to a high-order memory allocation request when
> > initializing a multiqueue net device. We can simply avoid this by
> > switching to a flex array, which always allocates at order zero.
> > 
> > Signed-off-by: Jason Wang 
> > ---
> >  include/linux/netdevice.h |   13 ++
> >  net/core/dev.c|   57 
> > 
> >  net/core/net-sysfs.c  |   15 +++
> >  3 files changed, 58 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 09b4188..c0b5d04 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -32,6 +32,7 @@
> >  #include <linux/atomic.h>
> >  #include <asm/cache.h>
> >  #include <asm/byteorder.h>
> > +#include <linux/flex_array.h>
> >  
> >  #include <linux/percpu.h>
> >  #include <linux/rculist.h>
> > @@ -1230,7 +1231,7 @@ struct net_device {
> >  
> > 
> >  #ifdef CONFIG_RPS
> > -   struct netdev_rx_queue  *_rx;
> > +   struct flex_array   *_rx;
> >  
> > /* Number of RX queues allocated at register_netdev() time */
> > unsigned intnum_rx_queues;
> > @@ -1250,7 +1251,7 @@ struct net_device {
> >  /*
> >   * Cache lines mostly used on transmit path
> >   */
> > -   struct netdev_queue *_tx cacheline_aligned_in_smp;
> > +   struct flex_array   *_tx cacheline_aligned_in_smp;
> >  
> 
> Using flex_array and adding overhead in this super critical part of
> network stack, only to avoid order-1 allocations done in GFP_KERNEL
> context is simply insane.
> 
> We can revisit this in 2050 if we ever need order-4 allocations or so,
> and still use 4K pages.
> 
> 

Well KVM supports up to 160 VCPUs on x86.

Creating a queue per CPU is very reasonable, and
assuming cache line size of 64 bytes, netdev_queue seems to be 320
bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
I agree most people don't have such systems yet, but
they do exist.

We can cut the size of netdev_queue by moving kobj - which does not seem
to be used on the data path - out into a separate structure. kobj is 64
bytes in size, so that leaves exactly 256 bytes.
That will get us an order-3 allocation, and there's some padding there,
so we won't immediately increase it the moment we add some fields.

Comments on this idea?

Instead of always using a flex array, we could have
+   struct netdev_queue *_tx; /* Used with small # of queues */
+#ifdef CONFIG_NETDEV_HUGE_NUMBER_OR_QUEUES
+   struct flex_array *_tx_large; /* Used with large # of queues */
+#endif

And fix wrappers to use _tx if not NULL, otherwise _tx_large.

If configured in, it's an extra branch on data path but probably less
costly than the extra indirection.

-- 
MST


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Jason Wang
On 06/19/2013 02:31 PM, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 13:40 +0800, Jason Wang wrote:
>> Currently, we use kcalloc to allocate rx/tx queues for a net device,
>> which could easily lead to a high-order memory allocation request when
>> initializing a multiqueue net device. We can simply avoid this by
>> switching to a flex array, which always allocates at order zero.
>>
>> Signed-off-by: Jason Wang 
>> ---
>>  include/linux/netdevice.h |   13 ++
>>  net/core/dev.c|   57 
>> 
>>  net/core/net-sysfs.c  |   15 +++
>>  3 files changed, 58 insertions(+), 27 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 09b4188..c0b5d04 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -32,6 +32,7 @@
>>  #include <linux/atomic.h>
>>  #include <asm/cache.h>
>>  #include <asm/byteorder.h>
>> +#include <linux/flex_array.h>
>>  
>>  #include <linux/percpu.h>
>>  #include <linux/rculist.h>
>> @@ -1230,7 +1231,7 @@ struct net_device {
>>  
>>
>>  #ifdef CONFIG_RPS
>> -struct netdev_rx_queue  *_rx;
>> +struct flex_array   *_rx;
>>  
>>  /* Number of RX queues allocated at register_netdev() time */
>>  unsigned intnum_rx_queues;
>> @@ -1250,7 +1251,7 @@ struct net_device {
>>  /*
>>   * Cache lines mostly used on transmit path
>>   */
>> -struct netdev_queue *_tx cacheline_aligned_in_smp;
>> +struct flex_array   *_tx cacheline_aligned_in_smp;
>>  
> Using flex_array and adding overhead in this super critical part of
> network stack, only to avoid order-1 allocations done in GFP_KERNEL
> context is simply insane.

Yes, and I also missed the fact that this is a GFP_KERNEL allocation.
> We can revisit this in 2050 if we ever need order-4 allocations or so,
> and still use 4K pages.
>
>

Will drop this patch, thanks.


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 13:40 +0800, Jason Wang wrote:
> Currently, we use kcalloc to allocate rx/tx queues for a net device,
> which could easily lead to a high-order memory allocation request when
> initializing a multiqueue net device. We can simply avoid this by
> switching to a flex array, which always allocates at order zero.
> 
> Signed-off-by: Jason Wang 
> ---
>  include/linux/netdevice.h |   13 ++
>  net/core/dev.c|   57 
>  net/core/net-sysfs.c  |   15 +++
>  3 files changed, 58 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 09b4188..c0b5d04 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -32,6 +32,7 @@
>  #include <linux/atomic.h>
>  #include <asm/cache.h>
>  #include <asm/byteorder.h>
> +#include <linux/flex_array.h>
>  
>  #include <linux/percpu.h>
>  #include <linux/rculist.h>
> @@ -1230,7 +1231,7 @@ struct net_device {
>  
> 
>  #ifdef CONFIG_RPS
> - struct netdev_rx_queue  *_rx;
> + struct flex_array   *_rx;
>  
>   /* Number of RX queues allocated at register_netdev() time */
>   unsigned intnum_rx_queues;
> @@ -1250,7 +1251,7 @@ struct net_device {
>  /*
>   * Cache lines mostly used on transmit path
>   */
> - struct netdev_queue *_tx cacheline_aligned_in_smp;
> + struct flex_array   *_tx cacheline_aligned_in_smp;
>  

Using flex_array and adding overhead in this super critical part of
network stack, only to avoid order-1 allocations done in GFP_KERNEL
context is simply insane.

We can revisit this in 2050 if we ever need order-4 allocations or so,
and still use 4K pages.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 13:40 +0800, Jason Wang wrote:
 Currently, we use kcalloc to allocate rx/tx queues for a net device which 
 could
 be easily lead to a high order memory allocation request when initializing a
 multiqueue net device. We can simply avoid this by switching to use flex array
 which always allocate at order zero.
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  include/linux/netdevice.h |   13 ++
  net/core/dev.c|   57 
  net/core/net-sysfs.c  |   15 +++
  3 files changed, 58 insertions(+), 27 deletions(-)
 
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index 09b4188..c0b5d04 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -32,6 +32,7 @@
  #include linux/atomic.h
  #include asm/cache.h
  #include asm/byteorder.h
 +#include linux/flex_array.h
  
  #include linux/percpu.h
  #include linux/rculist.h
 @@ -1230,7 +1231,7 @@ struct net_device {
  
 
  #ifdef CONFIG_RPS
 - struct netdev_rx_queue  *_rx;
 + struct flex_array   *_rx;
  
   /* Number of RX queues allocated at register_netdev() time */
   unsigned intnum_rx_queues;
 @@ -1250,7 +1251,7 @@ struct net_device {
  /*
   * Cache lines mostly used on transmit path
   */
 - struct netdev_queue *_tx cacheline_aligned_in_smp;
 + struct flex_array   *_tx cacheline_aligned_in_smp;
  

Using flex_array and adding overhead in this super critical part of
network stack, only to avoid order-1 allocations done in GFP_KERNEL
context is simply insane.

We can revisit this in 2050 if we ever need order-4 allocations or so,
and still use 4K pages.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Jason Wang
On 06/19/2013 02:31 PM, Eric Dumazet wrote:
 On Wed, 2013-06-19 at 13:40 +0800, Jason Wang wrote:
 Currently, we use kcalloc to allocate rx/tx queues for a net device which 
 could
 be easily lead to a high order memory allocation request when initializing a
 multiqueue net device. We can simply avoid this by switching to use flex 
 array
 which always allocate at order zero.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  include/linux/netdevice.h |   13 ++
  net/core/dev.c|   57 
 
  net/core/net-sysfs.c  |   15 +++
  3 files changed, 58 insertions(+), 27 deletions(-)

 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index 09b4188..c0b5d04 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -32,6 +32,7 @@
  #include linux/atomic.h
  #include asm/cache.h
  #include asm/byteorder.h
 +#include linux/flex_array.h
  
  #include linux/percpu.h
  #include linux/rculist.h
 @@ -1230,7 +1231,7 @@ struct net_device {
  

  #ifdef CONFIG_RPS
 -struct netdev_rx_queue  *_rx;
 +struct flex_array   *_rx;
  
  /* Number of RX queues allocated at register_netdev() time */
  unsigned intnum_rx_queues;
 @@ -1250,7 +1251,7 @@ struct net_device {
  /*
   * Cache lines mostly used on transmit path
   */
 -struct netdev_queue *_tx cacheline_aligned_in_smp;
 +struct flex_array   *_tx cacheline_aligned_in_smp;
  
 Using flex_array and adding overhead in this super critical part of
 network stack, only to avoid order-1 allocations done in GFP_KERNEL
 context is simply insane.

Yes, and I also miss the fact of GFP_KERNEL allocation.
 We can revisit this in 2050 if we ever need order-4 allocations or so,
 and still use 4K pages.



Will drop this patch, thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Tue, Jun 18, 2013 at 11:31:58PM -0700, Eric Dumazet wrote:
 On Wed, 2013-06-19 at 13:40 +0800, Jason Wang wrote:
  Currently, we use kcalloc to allocate rx/tx queues for a net device which 
  could
  be easily lead to a high order memory allocation request when initializing a
  multiqueue net device. We can simply avoid this by switching to use flex 
  array
  which always allocate at order zero.
  
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   include/linux/netdevice.h |   13 ++
   net/core/dev.c|   57 
  
   net/core/net-sysfs.c  |   15 +++
   3 files changed, 58 insertions(+), 27 deletions(-)
  
  diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
  index 09b4188..c0b5d04 100644
  --- a/include/linux/netdevice.h
  +++ b/include/linux/netdevice.h
  @@ -32,6 +32,7 @@
   #include linux/atomic.h
   #include asm/cache.h
   #include asm/byteorder.h
  +#include linux/flex_array.h
   
   #include linux/percpu.h
   #include linux/rculist.h
  @@ -1230,7 +1231,7 @@ struct net_device {
   
  
   #ifdef CONFIG_RPS
  -   struct netdev_rx_queue  *_rx;
  +   struct flex_array   *_rx;
   
  /* Number of RX queues allocated at register_netdev() time */
  unsigned intnum_rx_queues;
  @@ -1250,7 +1251,7 @@ struct net_device {
   /*
* Cache lines mostly used on transmit path
*/
  -   struct netdev_queue *_tx cacheline_aligned_in_smp;
  +   struct flex_array   *_tx cacheline_aligned_in_smp;
   
 
 Using flex_array and adding overhead in this super critical part of
 network stack, only to avoid order-1 allocations done in GFP_KERNEL
 context is simply insane.
 
 We can revisit this in 2050 if we ever need order-4 allocations or so,
 and still use 4K pages.
 
 

Well KVM supports up to 160 VCPUs on x86.

Creating a queue per CPU is very reasonable, and
assuming cache line size of 64 bytes, netdev_queue seems to be 320
bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
I agree most people don't have such systems yet, but
they do exist.

We can cut the size of netdev_queue, moving out kobj - which
does not seem to be used on data path to a separate structure.
It's 64 byte in size so exactly 256 bytes.
That will get us an order-3 allocation, and there's
some padding there so we won't immediately increase it
the moment we add some fields.

Comments on this idea?

Instead of always using a flex array, we could have
+   struct netdev_queue *_tx; /* Used with small # of queues */
+#ifdef CONFIG_NETDEV_HUGE_NUMBER_OR_QUEUES
+   struct flex_array *_tx_large; /* Used with large # of queues */
+#endif

And fix wrappers to use _tx if not NULL, otherwise _tx_large.

If configured in, it's an extra branch on data path but probably less
costly than the extra indirection.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:

 Well KVM supports up to 160 VCPUs on x86.
 
 Creating a queue per CPU is very reasonable, and
 assuming cache line size of 64 bytes, netdev_queue seems to be 320
 bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
 I agree most people don't have such systems yet, but
 they do exist.

Even so, it will just work, like a fork() is likely to work, even if a
process needs order-1 allocation for kernel stack.

Some drivers still use order-10 allocations with kmalloc(), and nobody
complained yet.

We had complains with mlx4 driver lately only bcause kmalloc() now gives
a warning if allocations above MAX_ORDER are attempted.

Having a single pointer means that we can :

- Attempts a regular kmalloc() call, it will work most of the time.
- fallback to vmalloc() _if_ kmalloc() failed.

Frankly, if you want one tx queue per cpu, I would rather use
NETIF_F_LLTX, like some other virtual devices.

This way, you can have real per cpu memory, with proper NUMA affinity.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2013 at 02:56:03AM -0700, Eric Dumazet wrote:
 On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:
 
  Well KVM supports up to 160 VCPUs on x86.
  
  Creating a queue per CPU is very reasonable, and
  assuming cache line size of 64 bytes, netdev_queue seems to be 320
  bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
  I agree most people don't have such systems yet, but
  they do exist.
 
 Even so, it will just work, like a fork() is likely to work, even if a
 process needs order-1 allocation for kernel stack.
 Some drivers still use order-10 allocations with kmalloc(), and nobody
 complained yet.
 
 We had complains with mlx4 driver lately only bcause kmalloc() now gives
 a warning if allocations above MAX_ORDER are attempted.
 
 Having a single pointer means that we can :
 
 - Attempts a regular kmalloc() call, it will work most of the time.
 - fallback to vmalloc() _if_ kmalloc() failed.

Most drivers create devices at boot time, when this is more likely to work.
What makes tun (and macvlan) a bit special is that the device is created
from userspace.  Virt setups create/destroy them all the
time.

 
 Frankly, if you want one tx queue per cpu, I would rather use
 NETIF_F_LLTX, like some other virtual devices.
 
 This way, you can have real per cpu memory, with proper NUMA affinity.
 

Hmm good point, worth looking at.

Thanks,
-- 
MST
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2013 at 02:56:03AM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:
> 
> > Well KVM supports up to 160 VCPUs on x86.
> > 
> > Creating a queue per CPU is very reasonable, and
> > assuming cache line size of 64 bytes, netdev_queue seems to be 320
> > bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
> > I agree most people don't have such systems yet, but
> > they do exist.
> 
> Even so, it will just work, like a fork() is likely to work, even if a
> process needs order-1 allocation for kernel stack.
> 
> Some drivers still use order-10 allocations with kmalloc(), and nobody
> complained yet.
> 
> We had complaints with mlx4 driver lately only because kmalloc() now gives
> a warning if allocations above MAX_ORDER are attempted.
> 
> Having a single pointer means that we can:
> 
> - Attempt a regular kmalloc() call; it will work most of the time.
> - fall back to vmalloc() _if_ kmalloc() failed.

That's a good trick too - vmalloc memory is a bit slower
on x86 since it's not using a huge page, but that's only
when we have lots of CPUs/queues...

Short term - how about switching to vmalloc if > 32 queues?

> Frankly, if you want one tx queue per cpu, I would rather use
> NETIF_F_LLTX, like some other virtual devices.
> 
> This way, you can have real per cpu memory, with proper NUMA affinity.
> 
 


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 18:40 +0300, Michael S. Tsirkin wrote:

> That's a good trick too - vmalloc memory is a bit slower
> on x86 since it's not using a huge page, but that's only
> when we have lots of CPUs/queues...
> 
> Short term - how about switching to vmalloc if > 32 queues?

You do not have to switch if > 32 queues

diff --git a/net/core/dev.c b/net/core/dev.c
index fa007db..473cc9e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
 #include <linux/cpu_rmap.h>
 #include <linux/static_key.h>
 #include <linux/hashtable.h>
+#include <linux/vmalloc.h>
 
 #include "net-sysfs.h"
 
@@ -5253,6 +5254,14 @@ static void netdev_init_one_queue(struct net_device *dev,
 #endif
 }
 
+static void netif_free_tx_queues(struct net_device *dev)
+{
+	if (is_vmalloc_addr(dev->_tx))
+		vfree(dev->_tx);
+	else
+		kfree(dev->_tx);
+}
+
 static int netif_alloc_netdev_queues(struct net_device *dev)
 {
 	unsigned int count = dev->num_tx_queues;
@@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	BUG_ON(count < 1);
 
 	tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);
-	if (!tx)
-		return -ENOMEM;
-
+	if (!tx) {
+		tx = vzalloc(count * sizeof(struct netdev_queue));
+		if (!tx)
+			return -ENOMEM;
+	}
 	dev->_tx = tx;
 
 	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
@@ -5811,7 +5822,7 @@ free_all:
 
 free_pcpu:
 	free_percpu(dev->pcpu_refcnt);
-	kfree(dev->_tx);
+	netif_free_tx_queues(dev);
 #ifdef CONFIG_RPS
 	kfree(dev->_rx);
 #endif
@@ -5836,7 +5847,7 @@ void free_netdev(struct net_device *dev)
 
 	release_net(dev_net(dev));
 
-	kfree(dev->_tx);
+	netif_free_tx_queues(dev);
 #ifdef CONFIG_RPS
 	kfree(dev->_rx);
 #endif




RE: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Eric Dumazet
On Wed, 2013-06-19 at 17:06 +0100, David Laight wrote:

> Given the number of places I've seen this code added, why
> not put it in a general helper?

Good luck with that. We had many attempts in the past.

> I also thought that malloc() with GFP_KERNEL would sleep.
> Under what conditions does it fail instead?

For example on 32 bit kernel, LOW memory exhaustion/fragmentation.

vmalloc() has an extra 128MB virtual space (on i386 at least with
standard 3G/1G split)





RE: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread David Laight
> +static void netif_free_tx_queues(struct net_device *dev)
> +{
> +	if (is_vmalloc_addr(dev->_tx))
> +		vfree(dev->_tx);
> +	else
> +		kfree(dev->_tx);
> +}
> +
>  static int netif_alloc_netdev_queues(struct net_device *dev)
>  {
>  	unsigned int count = dev->num_tx_queues;
> @@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
>  	BUG_ON(count < 1);
>  
>  	tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);
> -	if (!tx)
> -		return -ENOMEM;
> -
> +	if (!tx) {
> +		tx = vzalloc(count * sizeof(struct netdev_queue));
> +		if (!tx)
> +			return -ENOMEM;
> +	}
>  	dev->_tx = tx;

Given the number of places I've seen this code added, why
not put it in a general helper?

I also thought that malloc() with GFP_KERNEL would sleep.
Under what conditions does it fail instead?

David
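The generic helper David asks about could look roughly like the sketch below (the names netif_kvzalloc/netif_kvfree are hypothetical; later mainline kernels gained kvmalloc()/kvfree() with this same try-kmalloc-then-vmalloc behavior):

```c
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Sketch of a generic "contiguous first, virtual fallback" allocator. */
static void *netif_kvzalloc(size_t size)
{
	/* __GFP_NOWARN: we recover from failure ourselves, so suppress
	 * the kmalloc failure warning, as suggested in the thread. */
	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

	if (!p)
		p = vzalloc(size);	/* virtually contiguous fallback */
	return p;
}

static void netif_kvfree(void *p)
{
	/* is_vmalloc_addr() tells which allocator provided the memory */
	if (is_vmalloc_addr(p))
		vfree(p);
	else
		kfree(p);
}
```

Answering David's second question: GFP_KERNEL may sleep, but it can still fail, e.g. on a 32-bit kernel when LOW memory is exhausted or fragmented, as Eric notes above.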



Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2013 at 08:58:31AM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 18:40 +0300, Michael S. Tsirkin wrote:
> 
> > That's a good trick too - vmalloc memory is a bit slower
> > on x86 since it's not using a huge page, but that's only
> > when we have lots of CPUs/queues...
> > 
> > Short term - how about switching to vmalloc if > 32 queues?
> 
> You do not have to switch if > 32 queues
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa007db..473cc9e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -130,6 +130,7 @@
>  #include <linux/cpu_rmap.h>
>  #include <linux/static_key.h>
>  #include <linux/hashtable.h>
> +#include <linux/vmalloc.h>
>  
>  #include "net-sysfs.h"
>  
> @@ -5253,6 +5254,14 @@ static void netdev_init_one_queue(struct net_device *dev,
>  #endif
>  }
>  
> +static void netif_free_tx_queues(struct net_device *dev)
> +{
> +	if (is_vmalloc_addr(dev->_tx))
> +		vfree(dev->_tx);
> +	else
> +		kfree(dev->_tx);
> +}
> +
>  static int netif_alloc_netdev_queues(struct net_device *dev)
>  {
>  	unsigned int count = dev->num_tx_queues;
> @@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
>  	BUG_ON(count < 1);
>  
>  	tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);

As we have explicit code to recover, maybe set __GFP_NOWARN?
If we are out of memory, vzalloc will warn too, we don't
need two warnings.

> -	if (!tx)
> -		return -ENOMEM;
> -
> +	if (!tx) {
> +		tx = vzalloc(count * sizeof(struct netdev_queue));
> +		if (!tx)
> +			return -ENOMEM;
> +	}
>  	dev->_tx = tx;
>  
>  	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
> @@ -5811,7 +5822,7 @@ free_all:
>  
>  free_pcpu:
>  	free_percpu(dev->pcpu_refcnt);
> -	kfree(dev->_tx);
> +	netif_free_tx_queues(dev);
>  #ifdef CONFIG_RPS
>  	kfree(dev->_rx);
>  #endif
> @@ -5836,7 +5847,7 @@ void free_netdev(struct net_device *dev)
>  
>  	release_net(dev_net(dev));
>  
> -	kfree(dev->_tx);
> +	netif_free_tx_queues(dev);
>  #ifdef CONFIG_RPS
>  	kfree(dev->_rx);
>  #endif
 


Re: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread Jason Wang
On 06/19/2013 05:56 PM, Eric Dumazet wrote:
> On Wed, 2013-06-19 at 12:11 +0300, Michael S. Tsirkin wrote:
>
>> Well KVM supports up to 160 VCPUs on x86.
>>
>> Creating a queue per CPU is very reasonable, and
>> assuming cache line size of 64 bytes, netdev_queue seems to be 320
>> bytes, that's 320*160 = 51200. So 12.5 pages, order-4 allocation.
>> I agree most people don't have such systems yet, but
>> they do exist.
> Even so, it will just work, like a fork() is likely to work, even if a
> process needs order-1 allocation for kernel stack.
>
> Some drivers still use order-10 allocations with kmalloc(), and nobody
> complained yet.
>
> We had complaints with mlx4 driver lately only because kmalloc() now gives
> a warning if allocations above MAX_ORDER are attempted.
>
> Having a single pointer means that we can:
>
> - Attempt a regular kmalloc() call; it will work most of the time.
> - fall back to vmalloc() _if_ kmalloc() failed.
>
> Frankly, if you want one tx queue per cpu, I would rather use
> NETIF_F_LLTX, like some other virtual devices.

A drawback of NETIF_F_LLTX is that we may contend on qdisc lock
especially when we have a huge number of tx queues.

> This way, you can have real per cpu memory, with proper NUMA affinity.
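A per-cpu layout along the lines Eric suggests might be sketched as follows (hypothetical structure and function names, not the actual tun code):

```c
#include <linux/percpu.h>
#include <linux/skbuff.h>

/* Hypothetical per-cpu tx state for an LLTX virtual device. */
struct vdev_txq {
	struct sk_buff_head q;
};

static int vdev_alloc_txqs(struct vdev_txq __percpu **out)
{
	int cpu;
	struct vdev_txq __percpu *txqs = alloc_percpu(struct vdev_txq);

	if (!txqs)
		return -ENOMEM;

	/* alloc_percpu() hands out NUMA-local memory for each CPU's
	 * instance, giving the affinity mentioned above. */
	for_each_possible_cpu(cpu)
		skb_queue_head_init(&per_cpu_ptr(txqs, cpu)->q);

	*out = txqs;
	return 0;
}
```

With NETIF_F_LLTX the core does not take the device tx lock, so per-cpu queues avoid cross-CPU cacheline bouncing on the fast path; but as noted above, a qdisc attached on top still serializes everything on the qdisc spinlock.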



