Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Mel Gorman
On Mon, Oct 21, 2019 at 10:01:24AM -0400, Qian Cai wrote:
> Warnings from linux-next,
> 
> [   14.265911][  T659] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:935
> [   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
> 659, name: pgdatinit8
> [   14.266044][  T659] 1 lock held by pgdatinit8/659:

Fixed in v2 posted this morning. It should hit linux-next eventually.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 10:01:24, Qian Cai wrote:
> 
> 
> > On Oct 21, 2019, at 5:48 AM, Mel Gorman  wrote:
[...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c0b2e0306720..f972076d0f6b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
> > /* Block until all are initialised */
> > wait_for_completion(&pgdat_init_all_done_comp);
> > 
> > +   /*
> > +* The number of managed pages has changed due to the initialisation
> > +* so the pcpu batch and high limits need to be updated or the limits
> > +* will be artificially small.
> > +*/
> > +   for_each_populated_zone(zone)
> > +   zone_pcp_update(zone);
> > +
> > /*
> >  * We initialized the rest of the deferred pages.  Permanently disable
> >  * on-demand struct page initialization.
> > -- 
> > 2.16.4
> > 
> > 
> 
> Warnings from linux-next,
> 
> [   14.265911][  T659] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:935
> [   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
> 659, name: pgdatinit8
> [   14.266044][  T659] 1 lock held by pgdatinit8/659:
> [   14.266075][  T659]  #0: c000201ffca87b40 
> (&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c

This is really surprising to say the least. I do not see any spinlock
held here. Besides that we do sleep in wait_for_completion already.
Is it possible that the patch has been misplaced? zone_pcp_update is
called from page_alloc_init_late which is a different context than
deferred_init_memmap which runs in a separate kthread.

> [   14.266160][  T659] irq event stamp: 26
> [   14.266194][  T659] hardirqs last  enabled at (25): [] 
> _raw_spin_unlock_irq+0x44/0x80
> [   14.266246][  T659] hardirqs last disabled at (26): [] 
> _raw_spin_lock_irqsave+0x3c/0xa0
> [   14.266299][  T659] softirqs last  enabled at (0): [] 
> copy_process+0x720/0x19b0
> [   14.266339][  T659] softirqs last disabled at (0): [<>] 0x0
> [   14.266400][  T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted 
> 5.4.0-rc4-next-20191021 #1
> [   14.266462][  T659] Call Trace:
> [   14.266494][  T659] [c0003d8efae0] [c0921cf4] 
> dump_stack+0xe8/0x164 (unreliable)
> [   14.266538][  T659] [c0003d8efb30] [c0157c54] 
> ___might_sleep+0x334/0x370
> [   14.266577][  T659] [c0003d8efbb0] [c094a784] 
> __mutex_lock+0x84/0xb20
> [   14.266627][  T659] [c0003d8efcc0] [c0954038] 
> zone_pcp_update+0x34/0x64
> [   14.266677][  T659] [c0003d8efcf0] [c0b9e6bc] 
> deferred_init_memmap+0x1b8/0x26c
> [   14.266740][  T659] [c0003d8efdb0] [c0149528] 
> kthread+0x1a8/0x1b0
> [   14.266792][  T659] [c0003d8efe20] [c000b748] 
> ret_from_kernel_thread+0x5c/0x74
> [   14.268288][  T659] node 8 initialised, 1879186 pages in 12200ms
> [   14.268527][  T659] pgdatinit8 (659) used greatest stack depth: 27984 
> bytes left
> [   15.589983][  T658] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:935
> [   15.590041][  T658] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
> 658, name: pgdatinit0
> [   15.590078][  T658] 1 lock held by pgdatinit0/658:
> [   15.590108][  T658]  #0: c01fff5c7b40 
> (&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c
> [   15.590192][  T658] irq event stamp: 18
> [   15.590224][  T658] hardirqs last  enabled at (17): [] 
> _raw_spin_unlock_irqrestore+0x94/0xd0
> [   15.590283][  T658] hardirqs last disabled at (18): [] 
> _raw_spin_lock_irqsave+0x3c/0xa0
> [   15.590332][  T658] softirqs last  enabled at (0): [] 
> copy_process+0x720/0x19b0
> [   15.590379][  T658] softirqs last disabled at (0): [<>] 0x0
> [   15.590414][  T658] CPU: 8 PID: 658 Comm: pgdatinit0 Tainted: GW   
>   5.4.0-rc4-next-20191021 #1
> [   15.590460][  T658] Call Trace:
> [   15.590491][  T658] [c0003d8cfae0] [c0921cf4] 
> dump_stack+0xe8/0x164 (unreliable)
> [   15.590541][  T658] [c0003d8cfb30] [c0157c54] 
> ___might_sleep+0x334/0x370
> [   15.590588][  T658] [c0003d8cfbb0] [c094a784] 
> __mutex_lock+0x84/0xb20
> [   15.590643][  T658] [c0003d8cfcc0] [c0954038] 
> zone_pcp_update+0x34/0x64
> [   15.590689][  T658] [c0003d8cfcf0] [c0b9e6bc] 
> deferred_init_memmap+0x1b8/0x26c
> [   15.590739][  T658] [c0003d8cfdb0] [c0149528] 
> kthread+0x1a8/0x1b0
> [   15.590790][  T658] [c0003d8cfe20] [c000b748] 
> ret_from_kernel_thread+0x5c/0x74

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Qian Cai



> On Oct 21, 2019, at 5:48 AM, Mel Gorman  wrote:
> 
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>5.4.0-rc3  5.4.0-rc3
>  vanilla   resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>5.4.0-rc3    5.4.0-rc3
> vanilla resetpcpu-v2
> Duration User   39766.24 49221.79
> Duration System 44298.10 13361.67
> Duration Elapsed  519.11   388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, this was the breakdown of batch and high values over all zones:
> 
>256   batch: 1
>256   batch: 63
>512   batch: 7
>256   high:  0
>256   high:  378
>512   high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the
> patch:
> 
>256   batch: 1
>768   batch: 63
>256   high:  0
>768   high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman 
> ---
> mm/page_alloc.c | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>   /* Block until all are initialised */
>   wait_for_completion(&pgdat_init_all_done_comp);
> 
> + /*
> +  * The number of managed pages has changed due to the initialisation
> +  * so the pcpu batch and high limits need to be updated or the limits
> +  * will be artificially small.
> +  */
> + for_each_populated_zone(zone)
> + zone_pcp_update(zone);
> +
>   /*
>* We initialized the rest of the deferred pages.  Permanently disable
>* on-demand struct page initialization.
> -- 
> 2.16.4
> 
> 

Warnings from linux-next,

[   14.265911][  T659] BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:935
[   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
659, name: pgdatinit8
[   14.266044][  T659] 1 lock held by pgdatinit8/659:
[   14.266075][  T659]  #0: c000201ffca87b40 
(&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c
[   14.266160][  T659] irq event stamp: 26
[   14.266194][  T659] hardirqs last  enabled at (25): [] 
_raw_spin_unlock_irq+0x44/0x80
[   14.266246][  T659] hardirqs last disabled at (26): [] 
_raw_spin_lock_irqsave+0x3c/0xa0
[   14.266299][  T659] softirqs last  enabled at (0): [] 
copy_process+0x720/0x19b0
[   14.266339][  T659] softirqs last disabled at (0): [<>] 0x0
[   14.266400][  T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted 
5.4.0-rc4-next-20191021 #1
[   14.266462][  T659] Call Trace:
[   14.266494][  T659] [c0003d8efae0] [c0921cf4] 
dump_stack+0xe8/0x164 (unreliable)
[   14.266538][  T659] [c0003d8efb30] [c0157c54] 
___might_sleep+0x334/0x370
[   14.266577][  T659] [c0003d8efbb0] [c094a784] 
__mutex_lock+0x84/0xb20
[   14.266627][  T659] [c0003d8efcc0] [c0954038] 
zone_pcp_update+0x34/0x64
[   14.266677][  T659] [c0003d8efcf0] [c0b9e6bc] 
deferred_init_memmap+0x1b8/0x26c
[   14.266740][  T659] [c0003d8efdb0] [c0149528] 
kthread+0x1a8/0x1b0
[   14.266792][  T659] [c0003d8efe20] [c000b748] 
ret_from_kernel_thread+0x5c/0x74

Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Vlastimil Babka
On 10/21/19 11:48 AM, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
> 5.4.0-rc3  5.4.0-rc3
>   vanilla   resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>    5.4.0-rc3    5.4.0-rc3
>  vanilla resetpcpu-v2
> Duration User   39766.24 49221.79
> Duration System 44298.10 13361.67
> Duration Elapsed  519.11   388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, this was the breakdown of batch and high values over all zones:
> 
> 256   batch: 1
> 256   batch: 63
> 512   batch: 7
> 256   high:  0
> 256   high:  378
> 512   high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the
> patch:
> 
> 256   batch: 1
> 768   batch: 63
> 256   high:  0
> 768   high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman 

Acked-by: Vlastimil Babka 

> ---
>  mm/page_alloc.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>   /* Block until all are initialised */
>   wait_for_completion(&pgdat_init_all_done_comp);
>  
> + /*
> +  * The number of managed pages has changed due to the initialisation
> +  * so the pcpu batch and high limits need to be updated or the limits
> +  * will be artificially small.
> +  */
> + for_each_populated_zone(zone)
> + zone_pcp_update(zone);
> +
>   /*
>* We initialized the rest of the deferred pages.  Permanently disable
>* on-demand struct page initialization.
> 



Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 10:48:06, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
> 5.4.0-rc3  5.4.0-rc3
>   vanilla   resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>    5.4.0-rc3    5.4.0-rc3
>  vanilla resetpcpu-v2
> Duration User   39766.24 49221.79
> Duration System 44298.10 13361.67
> Duration Elapsed  519.11   388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, this was the breakdown of batch and high values over all zones:
> 
> 256   batch: 1
> 256   batch: 63
> 512   batch: 7
> 256   high:  0
> 256   high:  378
> 512   high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the
> patch:
> 
> 256   batch: 1
> 768   batch: 63
> 256   high:  0
> 768   high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman 

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>   /* Block until all are initialised */
>   wait_for_completion(&pgdat_init_all_done_comp);
>  
> + /*
> +  * The number of managed pages has changed due to the initialisation
> +  * so the pcpu batch and high limits need to be updated or the limits
> +  * will be artificially small.
> +  */
> + for_each_populated_zone(zone)
> + zone_pcp_update(zone);
> +
>   /*
>* We initialized the rest of the deferred pages.  Permanently disable
>* on-demand struct page initialization.
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs