Re: [PATCH v3 2/3] mm: don't rely on system state to detect hot-plug operations

2020-09-15 Thread Michal Hocko
On Tue 15-09-20 11:41:42, Laurent Dufour wrote:
> In register_mem_sect_under_node() the system_state’s value is checked to
> detect whether the call is made during boot time or during a hot-plug
> operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
> because regular memory is registered at SYSTEM_SCHEDULING state. In
> addition, memory hot-plug operations can be triggered at this system state
> by ACPI [1]. So checking against the system state is not enough.
> 
> The consequence shows up on systems with interleaved node ranges like this:
>  Early memory node ranges
>    node   1: [mem 0x-0x00011fff]
>    node   2: [mem 0x00012000-0x00014fff]
>    node   1: [mem 0x00015000-0x0001]
>    node   0: [mem 0x0002-0x00048fff]
>    node   2: [mem 0x00049000-0x0007]
> 
> This can be seen on PowerPC LPAR after multiple memory hot-plug and
> hot-unplug operations are done. At the next reboot the node's memory ranges
> can be interleaved and since the call to link_mem_sections() is made in
> topology_init() while the system is in the SYSTEM_SCHEDULING state, the
> node's id is not checked, and the sections are registered to multiple nodes:
> 
> $ ls -l /sys/devices/system/memory/memory21/node*
> total 0
> lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
> lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
> 
> In that case, the system is able to boot, but if one of these memory
> blocks is later hot-unplugged and then hot-plugged, the sysfs inconsistency
> is detected and triggers a BUG_ON():
> 
> [ cut here ]
> kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
> CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
> NIP:  c0403f34 LR: c0403f2c CTR: 
> REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
> MSR:  8282b033   CR: 24000448  XER: 2004
> CFAR: c0846d20 IRQMASK: 0
> GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
> GPR04: 0227 c004805ae680  0004886f
> GPR08: 0226 0003 0002 fffd
> GPR12: 88000484 c0001ec96280  
> GPR16:   0004 0003
> GPR20: c0047814ffe0 c0077c08 0010 c13332c8
> GPR24:  c11f6cc0  
> GPR28: ffef 0001 00015000 1000
> NIP [c0403f34] add_memory_resource+0x244/0x340
> LR [c0403f2c] add_memory_resource+0x23c/0x340
> Call Trace:
> [c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 (unreliable)
> [c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
> [c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
> [c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
> [c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
> [c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
> [c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
> [c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
> [c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
> [c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
> [c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
> [c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
> [c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
> Instruction dump:
> 48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
> 78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
> ---[ end trace 562fd6c109cd0fb2 ]---
> 
> This patch addresses the root cause by not relying on the system_state
> value to detect whether the call is due to a hot-plug operation. An extra
> parameter is added to link_mem_sections() indicating whether the call
> comes from a hot-plug operation.
> 
> [1] According to Oscar Salvador, using this qemu command line, ACPI memory
> hotplug operations are raised at SYSTEM_SCHEDULING state:
> 
> $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
> -m size=$MEM,slots=255,maxmem=4294967296k  \
> -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
> -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
> -object memory-backend-ram,id=memdimm1,size=134217728 -device 
> 

Re: [PATCH v3 2/3] mm: don't rely on system state to detect hot-plug operations

2020-09-15 Thread Oscar Salvador
On Tue, Sep 15, 2020 at 11:41:42AM +0200, Laurent Dufour wrote:
> [1] According to Oscar Salvador, using this qemu command line, ACPI memory
> hotplug operations are raised at SYSTEM_SCHEDULING state:

I would like to stress that this is not the only way we can end up
hotplugging memory while state = SYSTEM_SCHEDULING.
According to David, we can end up doing this if we reboot a VM
with hotplugged memory.
(And I have seen other virtualization technologies do the same.)

 
> Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
> Signed-off-by: Laurent Dufour 
> Reviewed-by: David Hildenbrand 
> Cc: sta...@vger.kernel.org
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


[PATCH v3 2/3] mm: don't rely on system state to detect hot-plug operations

2020-09-15 Thread Laurent Dufour
In register_mem_sect_under_node() the system_state’s value is checked to
detect whether the call is made during boot time or during a hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, memory hot-plug operations can be triggered at this system state
by ACPI [1]. So checking against the system state is not enough.

The consequence shows up on systems with interleaved node ranges like this:
 Early memory node ranges
   node   1: [mem 0x-0x00011fff]
   node   2: [mem 0x00012000-0x00014fff]
   node   1: [mem 0x00015000-0x0001]
   node   0: [mem 0x0002-0x00048fff]
   node   2: [mem 0x00049000-0x0007]

This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the node's memory ranges
can be interleaved and since the call to link_mem_sections() is made in
topology_init() while the system is in the SYSTEM_SCHEDULING state, the
node's id is not checked, and the sections are registered to multiple nodes:

$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
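
The double registration shown above can be modelled in userspace. The
following is only a minimal sketch under stated assumptions, not the kernel
code: the block layout, the model_* names and the two-value state enum are
all hypothetical and only mimic the behaviour described in this changelog
(ownership is verified only when the state is "booting", and each node is
registered over its whole span).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_NODES  3
#define NR_BLOCKS 6

/* Hypothetical interleaved layout: block_nid[i] is the node owning block i. */
static const int block_nid[NR_BLOCKS] = { 1, 2, 1, 0, 0, 2 };

/* Stand-in for the two system_state values discussed in the changelog. */
enum model_state { MODEL_BOOTING, MODEL_SCHEDULING };

/*
 * Link every block in [first, last] to node nid. Block ownership is only
 * verified when state is MODEL_BOOTING, as in the check described above.
 */
static void model_link_blocks(int nid, size_t first, size_t last,
                              enum model_state state,
                              bool linked[NR_NODES][NR_BLOCKS])
{
    assert(nid >= 0 && nid < NR_NODES && first <= last && last < NR_BLOCKS);
    for (size_t blk = first; blk <= last; blk++) {
        if (state == MODEL_BOOTING && block_nid[blk] != nid)
            continue; /* block belongs to another node */
        linked[nid][blk] = true;
    }
}

/* Register each node over its whole span (lowest to highest owned block). */
static void model_register_all(enum model_state state,
                               bool linked[NR_NODES][NR_BLOCKS])
{
    memset(linked, 0, sizeof(bool) * NR_NODES * NR_BLOCKS);
    for (int nid = 0; nid < NR_NODES; nid++) {
        size_t first = NR_BLOCKS, last = 0;

        for (size_t blk = 0; blk < NR_BLOCKS; blk++) {
            if (block_nid[blk] != nid)
                continue;
            if (first == NR_BLOCKS)
                first = blk;
            last = blk;
        }
        if (first != NR_BLOCKS)
            model_link_blocks(nid, first, last, state, linked);
    }
}

/* Number of nodes a given block ended up linked to. */
static int model_count_links(bool linked[NR_NODES][NR_BLOCKS], size_t blk)
{
    int count = 0;

    for (int nid = 0; nid < NR_NODES; nid++)
        count += linked[nid][blk] ? 1 : 0;
    return count;
}
```

In this model, calling model_register_all() with MODEL_SCHEDULING leaves
block 1 (owned by node 2 but inside node 1's span) linked to both nodes,
like memory21 above, while MODEL_BOOTING leaves every block with one link.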

In that case, the system is able to boot, but if one of these memory
blocks is later hot-unplugged and then hot-plugged, the sysfs inconsistency
is detected and triggers a BUG_ON():

[ cut here ]
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP:  c0403f34 LR: c0403f2c CTR: 
REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
MSR:  8282b033   CR: 24000448  XER: 2004
CFAR: c0846d20 IRQMASK: 0
GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
GPR04: 0227 c004805ae680  0004886f
GPR08: 0226 0003 0002 fffd
GPR12: 88000484 c0001ec96280  
GPR16:   0004 0003
GPR20: c0047814ffe0 c0077c08 0010 c13332c8
GPR24:  c11f6cc0  
GPR28: ffef 0001 00015000 1000
NIP [c0403f34] add_memory_resource+0x244/0x340
LR [c0403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 (unreliable)
[c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
[c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
[c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
[c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
[c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
[c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
[c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
[c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
[c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
[c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
[c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
---[ end trace 562fd6c109cd0fb2 ]---

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An extra
parameter is added to link_mem_sections() indicating whether the call
comes from a hot-plug operation.
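
To illustrate the idea of an explicit caller-supplied parameter, here is a
minimal userspace sketch. It is not the patch itself: the ctx_* names, the
enum values and the block layout are hypothetical, chosen only to show that
the decision to verify node ownership can come from the caller rather than
from a global state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CTX_NR_NODES  3
#define CTX_NR_BLOCKS 6

/* Hypothetical interleaved layout: owner node of each memory block. */
static const int ctx_block_nid[CTX_NR_BLOCKS] = { 1, 2, 1, 0, 0, 2 };

/* Hypothetical context passed by the caller instead of a global state. */
enum ctx_context { CTX_EARLY, CTX_HOTPLUG };

/*
 * Link the blocks in [first, last] to node nid.
 * CTX_EARLY:   a node span may cover blocks of other nodes, so ownership
 *              is verified for each block.
 * CTX_HOTPLUG: the caller states the range belongs to nid, so no check.
 */
static void ctx_link_blocks(int nid, size_t first, size_t last,
                            enum ctx_context context,
                            bool linked[CTX_NR_NODES][CTX_NR_BLOCKS])
{
    assert(nid >= 0 && nid < CTX_NR_NODES);
    assert(first <= last && last < CTX_NR_BLOCKS);
    for (size_t blk = first; blk <= last; blk++) {
        if (context == CTX_EARLY && ctx_block_nid[blk] != nid)
            continue; /* block belongs to another node */
        linked[nid][blk] = true;
    }
}

/* Number of nodes a given block is linked to. */
static int ctx_count_links(bool linked[CTX_NR_NODES][CTX_NR_BLOCKS],
                           size_t blk)
{
    int count = 0;

    for (int nid = 0; nid < CTX_NR_NODES; nid++)
        count += linked[nid][blk] ? 1 : 0;
    return count;
}
```

With CTX_EARLY, registering node 1 over blocks 0-2, node 2 over blocks 1-5
and node 0 over blocks 3-4 leaves each block linked to exactly one node,
whatever the system state happens to be at the time of the call.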

[1] According to Oscar Salvador, using this qemu command line, ACPI memory
hotplug operations are raised at SYSTEM_SCHEDULING state:

$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k  \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728