Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
On Mon, Oct 21, 2019 at 10:01:24AM -0400, Qian Cai wrote:
> Warnings from linux-next,
> 
> [ 14.265911][ T659] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:935
> [ 14.265992][ T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
> 659, name: pgdatinit8
> [ 14.266044][ T659] 1 lock held by pgdatinit8/659:

Fixed in v2 posted this morning. It should hit linux-next eventually.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
On Mon 21-10-19 10:01:24, Qian Cai wrote:
> 
> > On Oct 21, 2019, at 5:48 AM, Mel Gorman wrote:
[...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c0b2e0306720..f972076d0f6b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
> >  	/* Block until all are initialised */
> >  	wait_for_completion(&pgdat_init_all_done_comp);
> >  
> > +	/*
> > +	 * The number of managed pages has changed due to the initialisation
> > +	 * so the pcpu batch and high limits need to be updated or the limits
> > +	 * will be artificially small.
> > +	 */
> > +	for_each_populated_zone(zone)
> > +		zone_pcp_update(zone);
> > +
> >  	/*
> >  	 * We initialized the rest of the deferred pages. Permanently disable
> >  	 * on-demand struct page initialization.
> > -- 
> > 2.16.4
> 
> Warnings from linux-next,
> 
> [ 14.265911][ T659] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:935
> [ 14.265992][ T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
> 659, name: pgdatinit8
> [ 14.266044][ T659] 1 lock held by pgdatinit8/659:
> [ 14.266075][ T659] #0: c000201ffca87b40
> (&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c

This is really surprising to say the least. I do not see any spinlock
held here. Besides that we do sleep in wait_for_completion already. Is
it possible that the patch has been misplaced? zone_pcp_update is called
from page_alloc_init_late, which is a different context than
deferred_init_memmap, which runs in a separate kthread.
> [ 14.266160][ T659] irq event stamp: 26
> [ 14.266194][ T659] hardirqs last enabled at (25): []
> _raw_spin_unlock_irq+0x44/0x80
> [ 14.266246][ T659] hardirqs last disabled at (26): []
> _raw_spin_lock_irqsave+0x3c/0xa0
> [ 14.266299][ T659] softirqs last enabled at (0): []
> copy_process+0x720/0x19b0
> [ 14.266339][ T659] softirqs last disabled at (0): [<>] 0x0
> [ 14.266400][ T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted
> 5.4.0-rc4-next-20191021 #1
> [ 14.266462][ T659] Call Trace:
> [ 14.266494][ T659] [c0003d8efae0] [c0921cf4]
> dump_stack+0xe8/0x164 (unreliable)
> [ 14.266538][ T659] [c0003d8efb30] [c0157c54]
> ___might_sleep+0x334/0x370
> [ 14.266577][ T659] [c0003d8efbb0] [c094a784]
> __mutex_lock+0x84/0xb20
> [ 14.266627][ T659] [c0003d8efcc0] [c0954038]
> zone_pcp_update+0x34/0x64
> [ 14.266677][ T659] [c0003d8efcf0] [c0b9e6bc]
> deferred_init_memmap+0x1b8/0x26c
> [ 14.266740][ T659] [c0003d8efdb0] [c0149528]
> kthread+0x1a8/0x1b0
> [ 14.266792][ T659] [c0003d8efe20] [c000b748]
> ret_from_kernel_thread+0x5c/0x74
> [ 14.268288][ T659] node 8 initialised, 1879186 pages in 12200ms
> [ 14.268527][ T659] pgdatinit8 (659) used greatest stack depth: 27984
> bytes left
> [ 15.589983][ T658] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:935
> [ 15.590041][ T658] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
> 658, name: pgdatinit0
> [ 15.590078][ T658] 1 lock held by pgdatinit0/658:
> [ 15.590108][ T658] #0: c01fff5c7b40
> (&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c
> [ 15.590192][ T658] irq event stamp: 18
> [ 15.590224][ T658] hardirqs last enabled at (17): []
> _raw_spin_unlock_irqrestore+0x94/0xd0
> [ 15.590283][ T658] hardirqs last disabled at (18): []
> _raw_spin_lock_irqsave+0x3c/0xa0
> [ 15.590332][ T658] softirqs last enabled at (0): []
> copy_process+0x720/0x19b0
> [ 15.590379][ T658] softirqs last disabled at (0): [<>] 0x0
> [ 15.590414][ T658] CPU: 8 PID: 658 Comm: pgdatinit0 Tainted: G        W
> 5.4.0-rc4-next-20191021 #1
> [ 15.590460][ T658] Call Trace:
> [ 15.590491][ T658] [c0003d8cfae0] [c0921cf4]
> dump_stack+0xe8/0x164 (unreliable)
> [ 15.590541][ T658] [c0003d8cfb30] [c0157c54]
> ___might_sleep+0x334/0x370
> [ 15.590588][ T658] [c0003d8cfbb0] [c094a784]
> __mutex_lock+0x84/0xb20
> [ 15.590643][ T658] [c0003d8cfcc0] [c0954038]
> zone_pcp_update+0x34/0x64
> [ 15.590689][ T658] [c0003d8cfcf0] [c0b9e6bc]
> deferred_init_memmap+0x1b8/0x26c
> [ 15.590739][ T658] [c0003d8cfdb0] [c0149528]
> kthread+0x1a8/0x1b0
> [ 15.590790][ T658] [c0003d8cfe20] [c000b748]
> ret_from_kernel_thread+0x5c/0x74

-- 
Michal Hocko
SUSE Labs
Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
> On Oct 21, 2019, at 5:48 AM, Mel Gorman wrote:
> 
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                5.4.0-rc3              5.4.0-rc3
>                                  vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                    5.4.0-rc3    5.4.0-rc3
>                      vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, this was the breakdown of batch and high values over
> all zones:
> 
>     256  batch: 1
>     256  batch: 63
>     512  batch: 7
>     256  high:  0
>     256  high:  378
>     512  high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the
> patch:
> 
>     256  batch: 1
>     768  batch: 63
>     256  high:  0
>     768  high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman
> ---
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits need to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
>  	/*
>  	 * We initialized the rest of the deferred pages. Permanently disable
>  	 * on-demand struct page initialization.
> -- 
> 2.16.4
> 

Warnings from linux-next,

[ 14.265911][ T659] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:935
[ 14.265992][ T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
659, name: pgdatinit8
[ 14.266044][ T659] 1 lock held by pgdatinit8/659:
[ 14.266075][ T659] #0: c000201ffca87b40
(&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c
[ 14.266160][ T659] irq event stamp: 26
[ 14.266194][ T659] hardirqs last enabled at (25): []
_raw_spin_unlock_irq+0x44/0x80
[ 14.266246][ T659] hardirqs last disabled at (26): []
_raw_spin_lock_irqsave+0x3c/0xa0
[ 14.266299][ T659] softirqs last enabled at (0): []
copy_process+0x720/0x19b0
[ 14.266339][ T659] softirqs last disabled at (0): [<>] 0x0
[ 14.266400][ T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted
5.4.0-rc4-next-20191021 #1
[ 14.266462][ T659] Call Trace:
[ 14.266494][ T659] [c0003d8efae0] [c0921cf4]
dump_stack+0xe8/0x164 (unreliable)
[ 14.266538][ T659] [c0003d8efb30] [c0157c54]
___might_sleep+0x334/0x370
[ 14.266577][ T659] [c0003d8efbb0] [c094a784]
__mutex_lock+0x84/0xb20
[ 14.266627][ T659] [c0003d8efcc0] [c0954038]
zone_pcp_update+0x34/0x64
[ 14.266677][ T659] [c0003d8efcf0] [c0b9e6bc]
deferred_init_memmap+0x1b8/0x26c
[ 14.266740][ T659] [c0003d8efdb0] [c0149528] kt
Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
On 10/21/19 11:48 AM, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                5.4.0-rc3              5.4.0-rc3
>                                  vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                    5.4.0-rc3    5.4.0-rc3
>                      vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, this was the breakdown of batch and high values over
> all zones:
> 
>     256  batch: 1
>     256  batch: 63
>     512  batch: 7
>     256  high:  0
>     256  high:  378
>     512  high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the
> patch:
> 
>     256  batch: 1
>     768  batch: 63
>     256  high:  0
>     768  high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman

Acked-by: Vlastimil Babka

> ---
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits need to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
>  	/*
>  	 * We initialized the rest of the deferred pages. Permanently disable
>  	 * on-demand struct page initialization.
> 
Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
On Mon 21-10-19 10:48:06, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                5.4.0-rc3              5.4.0-rc3
>                                  vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                    5.4.0-rc3    5.4.0-rc3
>                      vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, this was the breakdown of batch and high values over
> all zones:
> 
>     256  batch: 1
>     256  batch: 63
>     512  batch: 7
>     256  high:  0
>     256  high:  378
>     512  high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the
> patch:
> 
>     256  batch: 1
>     768  batch: 63
>     256  high:  0
>     768  high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman

Acked-by: Michal Hocko

> ---
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits need to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
>  	/*
>  	 * We initialized the rest of the deferred pages. Permanently disable
>  	 * on-demand struct page initialization.
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs