Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-12-20 Thread Andrew Morton
On Fri, 9 Nov 2018 16:34:29 +0530 Anshuman Khandual  
wrote:

> > 
> > Do you see any problems with the patch as is?
> 
> No, this patch does remove an erroneous node-cpu map update which helps solve
> a real crash.

I think I'll take that as an ack.


Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-14 Thread Michal Hocko
On Wed 14-11-18 15:18:09, Andrew Morton wrote:
> On Wed, 14 Nov 2018 08:14:42 +0100 Michal Hocko  wrote:
> 
> > It seems there were no objections here. So can we have it in linux-next
> > for wider testing and possibly target the next merge window?
> > 
> 
> top-posting sucks!

I thought it would make your life easier in this case. Will do it
differently next time.

> I already have this queued for 4.21-rc1.

Thanks! I must have missed the mm-commit email.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-14 Thread Andrew Morton
On Wed, 14 Nov 2018 08:14:42 +0100 Michal Hocko  wrote:

> It seems there were no objections here. So can we have it in linux-next
> for wider testing and possibly target the next merge window?
> 

top-posting sucks!

I already have this queued for 4.21-rc1.


Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-13 Thread Michal Hocko
It seems there were no objections here. So can we have it in linux-next
for wider testing and possibly target the next merge window?

On Thu 08-11-18 11:04:13, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Per-cpu numa_node provides a default node for each possible cpu. The
> association gets initialized during the boot when the architecture
> specific code explores cpu->NUMA affinity. When the whole NUMA node is
> removed though we are clearing this association
> 
> try_offline_node
>   check_and_unmap_cpu_on_node
> unmap_cpu_on_node
>   numa_clear_node
> numa_set_node(cpu, NUMA_NO_NODE)
> 
> This means that whoever calls cpu_to_node for a cpu associated with such
> a node will get NUMA_NO_NODE. This is problematic for two reasons. First
> it is fragile because __alloc_pages_node would simply blow up on an
> out-of-bound access. We have encountered this when loading kvm module
> BUG: unable to handle kernel paging request at 21c0
> IP: [] __alloc_pages_nodemask+0x93/0xb70
> PGD 80ffe853e067 PUD 7336bbc067 PMD 0
> Oops:  [#1] SMP
> [...]
> CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
> 4.4.156-94.64-default #1
> task: 88727eff1880 ti: 88735449 task.ti: 88735449
> RIP: 0010:[]  [] 
> __alloc_pages_nodemask+0x93/0xb70
> RSP: 0018:887354493b40  EFLAGS: 00010202
> RAX: 21c0 RBX:  RCX: 
> RDX:  RSI: 0002 RDI: 014000c0
> RBP: 014000c0 R08:  R09: 
> R10: 88fffc89e790 R11: 00014000 R12: 0101
> R13: a0772cd4 R14: a0769ac0 R15: 
> FS:  7fdf2f2f1700() GS:88fffc88() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Stack:
>  0086 014000c014d20400 887354493bb8 882614d20f4c
>   0046 0046 810ac0c9
>  88ffe78c 009f e8ffe82d3500 88ff8ac55000
> Call Trace:
>  [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>  [] hardware_setup+0x781/0x849 [kvm_intel]
>  [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>  [] kvm_init+0x7c/0x2d0 [kvm]
>  [] vmx_init+0x1e/0x32c [kvm_intel]
>  [] do_one_initcall+0xca/0x1f0
>  [] do_init_module+0x5a/0x1d7
>  [] load_module+0x1393/0x1c90
>  [] SYSC_finit_module+0x70/0xa0
>  [] entry_SYSCALL_64_fastpath+0x1e/0xb7
> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
> 
> on an older kernel but the code is basically the same in the current
> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
> would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
> it to numa_mem_id but that is wrong as well because it would use a cpu
> affinity of the local CPU which might be quite far from the original node.
> It is also reasonable to expect that cpu_to_node will provide a sane value
> and there might be many more callers like that.
> 
> The second problem is that __register_one_node relies on cpu_to_node
> to properly associate cpus back to the node when it is onlined. We do
> not want to lose that link as there is no arch independent way to get it
> from the early boot time AFAICS.
> 
> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
> association to fix both issues. The NODE_DATA(nid) is not deallocated
> so it will stay in place and if anybody wants to allocate from that node
> then a fallback node will be used.
> 
> Thanks to Vlastimil Babka for his live system debugging skills that
> helped debugging the issue.
> 
> Debugged-by: Vlastimil Babka 
> Reported-by: Miroslav Benes 
> Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when 
> offlining the node")
> Cc: Wen Congyang 
> Cc: Tang Chen 
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> please note that I am sending this as an RFC even though this has been
> confirmed to fix the oops in kvm_intel module because I cannot simply
> tell that there are no other side effect that I do not see from the code
> reading. I would appreciate some background from people who have
> introduced this code e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear
> cpu_to_node() when offlining the node") because the changelog doesn't
> really explain the motivation much.
> 
>  mm/memory_hotplug.c | 30 +-
>  1 file changed, 1 insertion(+), 29 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2b2b3ccbbfb5..87aeafac54ee 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1753,34 +1753,6 @@ static int check_cpu_on_node(pg_data_t *pgdat)
>   return 0;
>  }
>  
> -static void unmap_cpu_on_node(pg_data_t *pgdat)
> -{
> -#ifdef 

Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-09 Thread Michal Hocko
On Fri 09-11-18 16:34:29, Anshuman Khandual wrote:
> 
> 
> On 11/09/2018 01:29 PM, Michal Hocko wrote:
> > On Fri 09-11-18 09:12:09, Anshuman Khandual wrote:
> >>
> >>
> >> On 11/08/2018 03:59 PM, Michal Hocko wrote:
> >>> [Removing Wen Congyang and Tang Chen from the CC list because their
> >>>  emails bounce. It seems that we will never learn about their motivation]
> >>>
> >>> On Thu 08-11-18 11:04:13, Michal Hocko wrote:
>  From: Michal Hocko 
> 
>  Per-cpu numa_node provides a default node for each possible cpu. The
>  association gets initialized during the boot when the architecture
>  specific code explores cpu->NUMA affinity. When the whole NUMA node is
>  removed though we are clearing this association
> 
>  try_offline_node
>    check_and_unmap_cpu_on_node
>  unmap_cpu_on_node
>    numa_clear_node
>  numa_set_node(cpu, NUMA_NO_NODE)
> 
>  This means that whoever calls cpu_to_node for a cpu associated with such
>  a node will get NUMA_NO_NODE. This is problematic for two reasons. First
>  it is fragile because __alloc_pages_node would simply blow up on an
>  out-of-bound access. We have encountered this when loading kvm module
>  BUG: unable to handle kernel paging request at 21c0
>  IP: [] __alloc_pages_nodemask+0x93/0xb70
>  PGD 80ffe853e067 PUD 7336bbc067 PMD 0
>  Oops:  [#1] SMP
>  [...]
>  CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
>  4.4.156-94.64-default #1
>  task: 88727eff1880 ti: 88735449 task.ti: 88735449
>  RIP: 0010:[]  [] 
>  __alloc_pages_nodemask+0x93/0xb70
>  RSP: 0018:887354493b40  EFLAGS: 00010202
>  RAX: 21c0 RBX:  RCX: 
>  RDX:  RSI: 0002 RDI: 014000c0
>  RBP: 014000c0 R08:  R09: 
>  R10: 88fffc89e790 R11: 00014000 R12: 0101
>  R13: a0772cd4 R14: a0769ac0 R15: 
>  FS:  7fdf2f2f1700() GS:88fffc88() 
>  knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
>  DR0:  DR1:  DR2: 
>  DR3:  DR6: fffe0ff0 DR7: 0400
>  Stack:
>   0086 014000c014d20400 887354493bb8 882614d20f4c
>    0046 0046 810ac0c9
>   88ffe78c 009f e8ffe82d3500 88ff8ac55000
>  Call Trace:
>   [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>   [] hardware_setup+0x781/0x849 [kvm_intel]
>   [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>   [] kvm_init+0x7c/0x2d0 [kvm]
>   [] vmx_init+0x1e/0x32c [kvm_intel]
>   [] do_one_initcall+0xca/0x1f0
>   [] do_init_module+0x5a/0x1d7
>   [] load_module+0x1393/0x1c90
>   [] SYSC_finit_module+0x70/0xa0
>   [] entry_SYSCALL_64_fastpath+0x1e/0xb7
>  DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
> 
>  on an older kernel but the code is basically the same in the current
>  Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
>  would recognize NUMA_NO_NODE and use alloc_pages_node which would 
>  translate
>  it to numa_mem_id but that is wrong as well because it would use a cpu
>  affinity of the local CPU which might be quite far from the original 
>  node.
> >>
> >> But then the original node is getting/already off-lined. The allocation is
> >> going to come from a different node. alloc_pages_node() at least steers the
> >> allocation away from the VM_BUG_ON() caused by NUMA_NO_NODE by replacing it
> >> with numa_mem_id().
> >>
> >> If node fallback order is important for this allocation then could it not
> >> use __alloc_pages_nodemask() directly, giving preference to its zonelist
> >> node and nodemask? Just curious.
> > 
> > How does the caller get the right node to allocate from? We do have the
> > proper zone list for the offline node so why not use it?
> I get your point. NODE_DATA() for the offlined node is still around and
> so is the proper zone list for allocation, so why should the caller work
> around the problem by building its preferred nodemask_t etc. No problem,
> I was just curious.

I thought I had made it clear in the changelog. If not, I am open to
suggestions on how to make it clearer.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-09 Thread Anshuman Khandual



On 11/09/2018 01:29 PM, Michal Hocko wrote:
> On Fri 09-11-18 09:12:09, Anshuman Khandual wrote:
>>
>>
>> On 11/08/2018 03:59 PM, Michal Hocko wrote:
>>> [Removing Wen Congyang and Tang Chen from the CC list because their
>>>  emails bounce. It seems that we will never learn about their motivation]
>>>
>>> On Thu 08-11-18 11:04:13, Michal Hocko wrote:
 From: Michal Hocko 

 Per-cpu numa_node provides a default node for each possible cpu. The
 association gets initialized during the boot when the architecture
 specific code explores cpu->NUMA affinity. When the whole NUMA node is
 removed though we are clearing this association

 try_offline_node
   check_and_unmap_cpu_on_node
 unmap_cpu_on_node
   numa_clear_node
 numa_set_node(cpu, NUMA_NO_NODE)

 This means that whoever calls cpu_to_node for a cpu associated with such
 a node will get NUMA_NO_NODE. This is problematic for two reasons. First
 it is fragile because __alloc_pages_node would simply blow up on an
 out-of-bound access. We have encountered this when loading kvm module
 BUG: unable to handle kernel paging request at 21c0
 IP: [] __alloc_pages_nodemask+0x93/0xb70
 PGD 80ffe853e067 PUD 7336bbc067 PMD 0
 Oops:  [#1] SMP
 [...]
 CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
 4.4.156-94.64-default #1
 task: 88727eff1880 ti: 88735449 task.ti: 88735449
 RIP: 0010:[]  [] 
 __alloc_pages_nodemask+0x93/0xb70
 RSP: 0018:887354493b40  EFLAGS: 00010202
 RAX: 21c0 RBX:  RCX: 
 RDX:  RSI: 0002 RDI: 014000c0
 RBP: 014000c0 R08:  R09: 
 R10: 88fffc89e790 R11: 00014000 R12: 0101
 R13: a0772cd4 R14: a0769ac0 R15: 
 FS:  7fdf2f2f1700() GS:88fffc88() 
 knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Stack:
  0086 014000c014d20400 887354493bb8 882614d20f4c
   0046 0046 810ac0c9
  88ffe78c 009f e8ffe82d3500 88ff8ac55000
 Call Trace:
  [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
  [] hardware_setup+0x781/0x849 [kvm_intel]
  [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
  [] kvm_init+0x7c/0x2d0 [kvm]
  [] vmx_init+0x1e/0x32c [kvm_intel]
  [] do_one_initcall+0xca/0x1f0
  [] do_init_module+0x5a/0x1d7
  [] load_module+0x1393/0x1c90
  [] SYSC_finit_module+0x70/0xa0
  [] entry_SYSCALL_64_fastpath+0x1e/0xb7
 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

 on an older kernel but the code is basically the same in the current
 Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
 would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
 it to numa_mem_id but that is wrong as well because it would use a cpu
 affinity of the local CPU which might be quite far from the original node.
>>
>> But then the original node is getting/already off-lined. The allocation is
>> going to come from a different node. alloc_pages_node() at least steers the
>> allocation away from the VM_BUG_ON() caused by NUMA_NO_NODE by replacing it
>> with numa_mem_id().
>>
>> If node fallback order is important for this allocation then could it not
>> use __alloc_pages_nodemask() directly, giving preference to its zonelist
>> node and nodemask? Just curious.
> 
> How does the caller get the right node to allocate from? We do have the
> proper zone list for the offline node so why not use it?
I get your point. NODE_DATA() for the offlined node is still around and
so is the proper zone list for allocation, so why should the caller work
around the problem by building its preferred nodemask_t etc. No problem,
I was just curious.

> 
 It is also reasonable to expect that cpu_to_node will provide a sane value
 and there might be many more callers like that.
>>
>> AFAICS there are two choices here. Either mark them NUMA_NO_NODE for all
>> cpus of a node going offline or keep the existing mapping in case the node
>> comes back again.
> 
> Or update the mapping to the closest node. I have chosen to keep the
> mapping because it is the easiest and the most natural one.

Agreed.

> 
 The second problem is that __register_one_node relies on cpu_to_node
 to properly associate cpus back to the node when it is onlined. We do
 not want to lose that link as there is no arch independent way to get it
 from the early boot time 

Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-08 Thread Michal Hocko
On Fri 09-11-18 09:12:09, Anshuman Khandual wrote:
> 
> 
> On 11/08/2018 03:59 PM, Michal Hocko wrote:
> > [Removing Wen Congyang and Tang Chen from the CC list because their
> >  emails bounce. It seems that we will never learn about their motivation]
> > 
> > On Thu 08-11-18 11:04:13, Michal Hocko wrote:
> >> From: Michal Hocko 
> >>
> >> Per-cpu numa_node provides a default node for each possible cpu. The
> >> association gets initialized during the boot when the architecture
> >> specific code explores cpu->NUMA affinity. When the whole NUMA node is
> >> removed though we are clearing this association
> >>
> >> try_offline_node
> >>   check_and_unmap_cpu_on_node
> >> unmap_cpu_on_node
> >>   numa_clear_node
> >> numa_set_node(cpu, NUMA_NO_NODE)
> >>
> >> This means that whoever calls cpu_to_node for a cpu associated with such
> >> a node will get NUMA_NO_NODE. This is problematic for two reasons. First
> >> it is fragile because __alloc_pages_node would simply blow up on an
> >> out-of-bound access. We have encountered this when loading kvm module
> >> BUG: unable to handle kernel paging request at 21c0
> >> IP: [] __alloc_pages_nodemask+0x93/0xb70
> >> PGD 80ffe853e067 PUD 7336bbc067 PMD 0
> >> Oops:  [#1] SMP
> >> [...]
> >> CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
> >> 4.4.156-94.64-default #1
> >> task: 88727eff1880 ti: 88735449 task.ti: 88735449
> >> RIP: 0010:[]  [] 
> >> __alloc_pages_nodemask+0x93/0xb70
> >> RSP: 0018:887354493b40  EFLAGS: 00010202
> >> RAX: 21c0 RBX:  RCX: 
> >> RDX:  RSI: 0002 RDI: 014000c0
> >> RBP: 014000c0 R08:  R09: 
> >> R10: 88fffc89e790 R11: 00014000 R12: 0101
> >> R13: a0772cd4 R14: a0769ac0 R15: 
> >> FS:  7fdf2f2f1700() GS:88fffc88() 
> >> knlGS:
> >> CS:  0010 DS:  ES:  CR0: 80050033
> >> CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
> >> DR0:  DR1:  DR2: 
> >> DR3:  DR6: fffe0ff0 DR7: 0400
> >> Stack:
> >>  0086 014000c014d20400 887354493bb8 882614d20f4c
> >>   0046 0046 810ac0c9
> >>  88ffe78c 009f e8ffe82d3500 88ff8ac55000
> >> Call Trace:
> >>  [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
> >>  [] hardware_setup+0x781/0x849 [kvm_intel]
> >>  [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
> >>  [] kvm_init+0x7c/0x2d0 [kvm]
> >>  [] vmx_init+0x1e/0x32c [kvm_intel]
> >>  [] do_one_initcall+0xca/0x1f0
> >>  [] do_init_module+0x5a/0x1d7
> >>  [] load_module+0x1393/0x1c90
> >>  [] SYSC_finit_module+0x70/0xa0
> >>  [] entry_SYSCALL_64_fastpath+0x1e/0xb7
> >> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
> >>
> >> on an older kernel but the code is basically the same in the current
> >> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
> >> would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
> >> it to numa_mem_id but that is wrong as well because it would use a cpu
> >> affinity of the local CPU which might be quite far from the original node.
> 
> But then the original node is getting/already off-lined. The allocation is
> going to come from a different node. alloc_pages_node() at least steers the
> allocation away from the VM_BUG_ON() caused by NUMA_NO_NODE by replacing it
> with numa_mem_id().
>
> If node fallback order is important for this allocation then could it not
> use __alloc_pages_nodemask() directly, giving preference to its zonelist
> node and nodemask? Just curious.

How does the caller get the right node to allocate from? We do have the
proper zone list for the offline node so why not use it?

> >> It is also reasonable to expect that cpu_to_node will provide a sane value
> >> and there might be many more callers like that.
> 
> AFAICS there are two choices here. Either mark them NUMA_NO_NODE for all
> cpus of a node going offline or keep the existing mapping in case the node
> comes back again.

Or update the mapping to the closest node. I have chosen to keep the
mapping because it is the easiest and the most natural one.
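
Purely for illustration, a hypothetical sketch of that "update the mapping
to the closest node" alternative (not what this patch does) could look like
the following; it assumes an x86-style numa_set_node() and uses
node_distance() to pick the nearest online node:

/* Hypothetical sketch only -- not part of the patch. */
static void remap_cpus_to_nearest_node(pg_data_t *pgdat)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		int nid, best = NUMA_NO_NODE;

		if (cpu_to_node(cpu) != pgdat->node_id)
			continue;

		/* pick the online node closest to the one being removed */
		for_each_online_node(nid)
			if (best == NUMA_NO_NODE ||
			    node_distance(pgdat->node_id, nid) <
			    node_distance(pgdat->node_id, best))
				best = nid;

		numa_set_node(cpu, best);
	}
}

Keeping the boot-time mapping avoids having to make this policy decision at
all, which is the point made above.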

> >> The second problem is that __register_one_node relies on cpu_to_node
> >> to properly associate cpus back to the node when it is onlined. We do
> >> not want to lose that link as there is no arch independent way to get it
> >> from the early boot time AFAICS.
> 
> Retaining the links seems to be right unless unmap_cpu_on_node() is sort
> of a weak callback letting the arch decide.
> 
> >>
> >> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
> >> association to fix both issues. The NODE_DATA(nid) is not deallocated
> Though retaining the link is a problem 

Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-08 Thread Anshuman Khandual



On 11/08/2018 03:59 PM, Michal Hocko wrote:
> [Removing Wen Congyang and Tang Chen from the CC list because their
>  emails bounce. It seems that we will never learn about their motivation]
> 
> On Thu 08-11-18 11:04:13, Michal Hocko wrote:
>> From: Michal Hocko 
>>
>> Per-cpu numa_node provides a default node for each possible cpu. The
>> association gets initialized during the boot when the architecture
>> specific code explores cpu->NUMA affinity. When the whole NUMA node is
>> removed though we are clearing this association
>>
>> try_offline_node
>>   check_and_unmap_cpu_on_node
>> unmap_cpu_on_node
>>   numa_clear_node
>> numa_set_node(cpu, NUMA_NO_NODE)
>>
>> This means that whoever calls cpu_to_node for a cpu associated with such
>> a node will get NUMA_NO_NODE. This is problematic for two reasons. First
>> it is fragile because __alloc_pages_node would simply blow up on an
>> out-of-bound access. We have encountered this when loading kvm module
>> BUG: unable to handle kernel paging request at 21c0
>> IP: [] __alloc_pages_nodemask+0x93/0xb70
>> PGD 80ffe853e067 PUD 7336bbc067 PMD 0
>> Oops:  [#1] SMP
>> [...]
>> CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
>> 4.4.156-94.64-default #1
>> task: 88727eff1880 ti: 88735449 task.ti: 88735449
>> RIP: 0010:[]  [] 
>> __alloc_pages_nodemask+0x93/0xb70
>> RSP: 0018:887354493b40  EFLAGS: 00010202
>> RAX: 21c0 RBX:  RCX: 
>> RDX:  RSI: 0002 RDI: 014000c0
>> RBP: 014000c0 R08:  R09: 
>> R10: 88fffc89e790 R11: 00014000 R12: 0101
>> R13: a0772cd4 R14: a0769ac0 R15: 
>> FS:  7fdf2f2f1700() GS:88fffc88() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> Stack:
>>  0086 014000c014d20400 887354493bb8 882614d20f4c
>>   0046 0046 810ac0c9
>>  88ffe78c 009f e8ffe82d3500 88ff8ac55000
>> Call Trace:
>>  [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>>  [] hardware_setup+0x781/0x849 [kvm_intel]
>>  [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>>  [] kvm_init+0x7c/0x2d0 [kvm]
>>  [] vmx_init+0x1e/0x32c [kvm_intel]
>>  [] do_one_initcall+0xca/0x1f0
>>  [] do_init_module+0x5a/0x1d7
>>  [] load_module+0x1393/0x1c90
>>  [] SYSC_finit_module+0x70/0xa0
>>  [] entry_SYSCALL_64_fastpath+0x1e/0xb7
>> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
>>
>> on an older kernel but the code is basically the same in the current
>> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
>> would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
>> it to numa_mem_id but that is wrong as well because it would use a cpu
>> affinity of the local CPU which might be quite far from the original node.

But then the original node is getting/already off-lined. The allocation is
going to come from a different node. alloc_pages_node() at least steers the
allocation away from the VM_BUG_ON() caused by NUMA_NO_NODE by replacing it
with numa_mem_id().

If node fallback order is important for this allocation then could it not
use __alloc_pages_nodemask() directly, giving preference to its zonelist
node and nodemask? Just curious.

>> It is also reasonable to expect that cpu_to_node will provide a sane value
>> and there might be many more callers like that.

AFAICS there are two choices here. Either mark them NUMA_NO_NODE for all
cpus of a node going offline or keep the existing mapping in case the node
comes back again.

>>
>> The second problem is that __register_one_node relies on cpu_to_node
>> to properly associate cpus back to the node when it is onlined. We do
>> not want to lose that link as there is no arch independent way to get it
>> from the early boot time AFAICS.

Retaining the links seems to be right unless unmap_cpu_on_node() is sort
of a weak callback letting the arch decide.

>>
>> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
>> association to fix both issues. The NODE_DATA(nid) is not deallocated

Though retaining the link is a problem in itself, the allocation-related
crash could be solved by exploring __alloc_pages_nodemask() options.

>> so it will stay in place and if anybody wants to allocate from that node
>> then a fallback node will be used.

Right, NODE_DATA(nid) is an advantage of retaining the link.
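
To make that advantage concrete, here is a minimal sketch (illustrative
only; the helper name is made up) of allocating "near" a given cpu: because
NODE_DATA(nid) and its zonelists survive try_offline_node(), the allocator
simply falls back to the other nodes on that zonelist.

/* Illustrative sketch: cpu_to_node() keeps returning a usable node id. */
static struct page *alloc_page_near_cpu(int cpu)
{
	int nid = cpu_to_node(cpu);	/* never NUMA_NO_NODE with this patch */

	/*
	 * Even after nid has been hot-removed, NODE_DATA(nid) stays in place,
	 * so the allocation transparently falls back to a nearby node with
	 * memory instead of oopsing on an out-of-bounds NODE_DATA() access.
	 */
	return alloc_pages_node(nid, GFP_KERNEL, 0);
}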


Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-08 Thread Michal Hocko
[Removing Wen Congyang and Tang Chen from the CC list because their
 emails bounce. It seems that we will never learn about their motivation]

On Thu 08-11-18 11:04:13, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Per-cpu numa_node provides a default node for each possible cpu. The
> association gets initialized during the boot when the architecture
> specific code explores cpu->NUMA affinity. When the whole NUMA node is
> removed though we are clearing this association
> 
> try_offline_node
>   check_and_unmap_cpu_on_node
> unmap_cpu_on_node
>   numa_clear_node
> numa_set_node(cpu, NUMA_NO_NODE)
> 
> This means that whoever calls cpu_to_node for a cpu associated with such
> a node will get NUMA_NO_NODE. This is problematic for two reasons. First
> it is fragile because __alloc_pages_node would simply blow up on an
> out-of-bound access. We have encountered this when loading kvm module
> BUG: unable to handle kernel paging request at 21c0
> IP: [] __alloc_pages_nodemask+0x93/0xb70
> PGD 80ffe853e067 PUD 7336bbc067 PMD 0
> Oops:  [#1] SMP
> [...]
> CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
> 4.4.156-94.64-default #1
> task: 88727eff1880 ti: 88735449 task.ti: 88735449
> RIP: 0010:[]  [] 
> __alloc_pages_nodemask+0x93/0xb70
> RSP: 0018:887354493b40  EFLAGS: 00010202
> RAX: 21c0 RBX:  RCX: 
> RDX:  RSI: 0002 RDI: 014000c0
> RBP: 014000c0 R08:  R09: 
> R10: 88fffc89e790 R11: 00014000 R12: 0101
> R13: a0772cd4 R14: a0769ac0 R15: 
> FS:  7fdf2f2f1700() GS:88fffc88() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Stack:
>  0086 014000c014d20400 887354493bb8 882614d20f4c
>   0046 0046 810ac0c9
>  88ffe78c 009f e8ffe82d3500 88ff8ac55000
> Call Trace:
>  [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>  [] hardware_setup+0x781/0x849 [kvm_intel]
>  [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>  [] kvm_init+0x7c/0x2d0 [kvm]
>  [] vmx_init+0x1e/0x32c [kvm_intel]
>  [] do_one_initcall+0xca/0x1f0
>  [] do_init_module+0x5a/0x1d7
>  [] load_module+0x1393/0x1c90
>  [] SYSC_finit_module+0x70/0xa0
>  [] entry_SYSCALL_64_fastpath+0x1e/0xb7
> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
> 
> on an older kernel but the code is basically the same in the current
> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
> would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
> it to numa_mem_id but that is wrong as well because it would use a cpu
> affinity of the local CPU which might be quite far from the original node.
> It is also reasonable to expect that cpu_to_node will provide a sane value
> and there might be many more callers like that.
> 
> The second problem is that __register_one_node relies on cpu_to_node
> to properly associate cpus back to the node when it is onlined. We do
> not want to lose that link as there is no arch independent way to get it
> from the early boot time AFAICS.
> 
> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
> association to fix both issues. The NODE_DATA(nid) is not deallocated
> so it will stay in place and if anybody wants to allocate from that node
> then a fallback node will be used.
> 
> Thanks to Vlastimil Babka for his live system debugging skills that
> helped debugging the issue.
> 
> Debugged-by: Vlastimil Babka 
> Reported-by: Miroslav Benes 
> Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when 
> offlining the node")
> Cc: Wen Congyang 
> Cc: Tang Chen 
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> please note that I am sending this as an RFC even though this has been
> confirmed to fix the oops in kvm_intel module because I cannot simply
> tell that there are no other side effect that I do not see from the code
> reading. I would appreciate some background from people who have
> introduced this code e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear
> cpu_to_node() when offlining the node") because the changelog doesn't
> really explain the motivation much.
> 
>  mm/memory_hotplug.c | 30 +-
>  1 file changed, 1 insertion(+), 29 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2b2b3ccbbfb5..87aeafac54ee 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1753,34 +1753,6 @@ static int check_cpu_on_node(pg_data_t *pgdat)
>   return 0;
>  }
>  
> -static void unmap_cpu_on_node(pg_data_t *pgdat)
> -{
> -#ifdef 

[RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-11-08 Thread Michal Hocko
From: Michal Hocko 

Per-cpu numa_node provides a default node for each possible cpu. The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity. When the whole NUMA node is
removed though we are clearing this association

try_offline_node
  check_and_unmap_cpu_on_node
unmap_cpu_on_node
  numa_clear_node
numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such
a node will get NUMA_NO_NODE. This is problematic for two reasons. First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access. We have encountered this when loading kvm module
BUG: unable to handle kernel paging request at 21c0
IP: [] __alloc_pages_nodemask+0x93/0xb70
PGD 80ffe853e067 PUD 7336bbc067 PMD 0
Oops:  [#1] SMP
[...]
CPU: 88 PID: 1223749 Comm: modprobe Tainted: GW  
4.4.156-94.64-default #1
task: 88727eff1880 ti: 88735449 task.ti: 88735449
RIP: 0010:[]  [] 
__alloc_pages_nodemask+0x93/0xb70
RSP: 0018:887354493b40  EFLAGS: 00010202
RAX: 21c0 RBX:  RCX: 
RDX:  RSI: 0002 RDI: 014000c0
RBP: 014000c0 R08:  R09: 
R10: 88fffc89e790 R11: 00014000 R12: 0101
R13: a0772cd4 R14: a0769ac0 R15: 
FS:  7fdf2f2f1700() GS:88fffc88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 21c0 CR3: 0077205ee000 CR4: 00360670
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Stack:
 0086 014000c014d20400 887354493bb8 882614d20f4c
  0046 0046 810ac0c9
 88ffe78c 009f e8ffe82d3500 88ff8ac55000
Call Trace:
 [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
 [] hardware_setup+0x781/0x849 [kvm_intel]
 [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
 [] kvm_init+0x7c/0x2d0 [kvm]
 [] vmx_init+0x1e/0x32c [kvm_intel]
 [] do_one_initcall+0xca/0x1f0
 [] do_init_module+0x5a/0x1d7
 [] load_module+0x1393/0x1c90
 [] SYSC_finit_module+0x70/0xa0
 [] entry_SYSCALL_64_fastpath+0x1e/0xb7
DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

on an older kernel but the code is basically the same in the current
Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
it to numa_mem_id, but that is wrong as well because it would use the node
affinity of the local CPU, which might be quite far from the original node.
It is also reasonable to expect that cpu_to_node will provide a sane value
and there might be many more callers like that.
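
To make that concrete, here is a minimal sketch of the kind of caller
described above. The helper name is hypothetical (it is not the actual
kvm_intel code), but the allocator semantics it comments on are the ones
the changelog relies on:

#include <linux/gfp.h>
#include <linux/topology.h>

static struct page *alloc_page_for_cpu(int cpu, gfp_t gfp, unsigned int order)
{
	int nid = cpu_to_node(cpu);

	/*
	 * __alloc_pages_node() trusts nid and indexes NODE_DATA(nid)
	 * directly, so NUMA_NO_NODE (-1) turns into the out-of-bounds
	 * access seen in the oops above.  Switching to alloc_pages_node()
	 * would map NUMA_NO_NODE to numa_mem_id(), i.e. the calling CPU's
	 * node, which may be far away from 'cpu'.
	 */
	return __alloc_pages_node(nid, gfp, order);
}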

The second problem is that __register_one_node relies on cpu_to_node
to properly associate cpus back to the node when it is onlined. We do
not want to lose that link, as there is no arch-independent way to recover it
after early boot AFAICS.
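
Roughly, the dependency in question looks like the sketch below. This is a
paraphrase of the node registration path in drivers/base/node.c, with a
made-up helper name, not a verbatim copy:

#include <linux/cpu.h>
#include <linux/node.h>
#include <linux/topology.h>

/* rebuild the node<->cpu sysfs links when the node is registered again */
static void relink_cpus_to_node(int nid)
{
	int cpu;

	for_each_present_cpu(cpu) {
		/* this only works if cpu_to_node() still remembers nid */
		if (cpu_to_node(cpu) == nid)
			register_cpu_under_node(cpu, nid);
	}
}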

Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues. The NODE_DATA(nid) is not deallocated
so it will stay in place and if anybody wants to allocate from that node
then a fallback node will be used.
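
In other words, a hypothetical caller after this change keeps getting a
real node id and the page allocator handles the rest via the normal
zonelist fallback (helper name made up for illustration):

#include <linux/gfp.h>
#include <linux/topology.h>

static struct page *grab_page_near_cpu(int cpu)
{
	/* keeps pointing at the boot-time node even after hot-remove */
	int nid = cpu_to_node(cpu);

	/*
	 * NODE_DATA(nid) is not freed, so this is an ordinary zonelist
	 * walk that falls back to a nearby node when nid has no memory.
	 */
	return alloc_pages_node(nid, GFP_KERNEL, 0);
}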

Thanks to Vlastimil Babka for his live system debugging skills that
helped debug the issue.

Debugged-by: Vlastimil Babka 
Reported-by: Miroslav Benes 
Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Cc: Wen Congyang 
Cc: Tang Chen 
Signed-off-by: Michal Hocko 
---

Hi,
please note that I am sending this as an RFC even though this has been
confirmed to fix the oops in the kvm_intel module, because I cannot simply
tell that there are no other side effects that I do not see from reading
the code. I would appreciate some background from the people who introduced
this code in e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear
cpu_to_node() when offlining the node"), because the changelog doesn't
really explain the motivation much.

 mm/memory_hotplug.c | 30 +-
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2b2b3ccbbfb5..87aeafac54ee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1753,34 +1753,6 @@ static int check_cpu_on_node(pg_data_t *pgdat)
return 0;
 }
 
-static void unmap_cpu_on_node(pg_data_t *pgdat)
-{
-#ifdef CONFIG_ACPI_NUMA
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   if (cpu_to_node(cpu) == pgdat->node_id)
-   numa_clear_node(cpu);
-#endif
-}
-
-static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
-{
-   int ret;
-
-   ret = check_cpu_on_node(pgdat);
-   if (ret)
-   return ret;
-
-   /*
-* the node will be offlined when 
