Re: [v5 0/3] "Hotremove" persistent memory

2019-05-17 Thread Pavel Tatashin
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);

ACPI has exactly the same problem, so this is not a bug specific to this
series. I will submit a new version of my series with the review comments
addressed, but without a fix for this issue.

I was able to reproduce this issue on the current mainline kernel.
Also, I have been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove the
sysfs memory entries outside of the hotplug operation itself (before
hotremove, or after hotplug). But we also have to guarantee that the
hotplug/hotremove will succeed, or else reinstate the sysfs entries.
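
To illustrate the constraint (a minimal userspace sketch; the function
names below are hypothetical stand-ins, not kernel APIs), the control
flow would have to look roughly like this:

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real kernel-side operations. */
static bool unregister_sysfs_entries(void)
{
        puts("sysfs memory entries removed");
        return true;
}

static void reinstate_sysfs_entries(void)
{
        puts("sysfs memory entries reinstated");
}

static bool do_hotremove(void)
{
        /* Pretend the removal failed, to show the rollback path. */
        puts("hot-remove attempted under the hotplug locks");
        return false;
}

int main(void)
{
        /* Step 1: remove the sysfs entries outside of the hotplug
         * operation, so kn->count is never taken under
         * mem_hotplug_lock. */
        if (!unregister_sysfs_entries())
                return 1;

        /* Step 2: the actual hot-remove. */
        if (!do_hotremove()) {
                /* Step 3: the hard part described above -- on failure
                 * the sysfs entries must be reinstated. */
                reinstate_sysfs_entries();
                return 1;
        }
        return 0;
}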

QEMU script:

qemu-system-x86_64  \
-enable-kvm \
-cpu host   \
-parallel none  \
-echr 1 \
-serial none\
-chardev stdio,id=console,signal=off,mux=on \
-serial chardev:console \
-mon chardev=console\
-vga none   \
-display none   \
-kernel pmem/native/arch/x86/boot/bzImage   \
-m 8G,slots=1,maxmem=16G\
-smp 8  \
-fsdev local,id=virtfs1,path=/,security_model=none  \
-device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs\
-append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 675cc7e1 (kn->count#18){}, at: kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 53e50a99 (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]

Re: [v5 0/3] "Hotremove" persistent memory

2019-05-16 Thread David Hildenbrand
On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin wrote:
>>
>>> Hi Pavel,
>>>
>>> I am working on adding this sort of workflow to a new daxctl command
>>> (daxctl-reconfigure-device). This will allow changing the 'mode' of a
>>> dax device to kmem, onlining the resulting memory, and, with your
>>> patches, also attempting to offline the memory and change back to
>>> device-dax.
>>>
>>> While running with these patches and testing the offlining part, I ran
>>> into the lockdep report below.
>>>
>>> This is with just these three patches on top of -rc7.
>>>
>>>
>>> [  +0.004886] ======================================================
>>> [  +0.001576] WARNING: possible circular locking dependency detected
>>> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
>>> [  +0.000929] ------------------------------------------------------
>>> [  +0.000708] daxctl/22950 is trying to acquire lock:
>>> [  +0.000548] f4d397f7 (kn->count#424){}, at: kernfs_remove_by_name_ns+0x40/0x80
>>> [  +0.000922]
>>>   but task is already holding lock:
>>> [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>>
>> I have studied this issue and now have a clear understanding of why it
>> happens. I am not yet sure how to fix it, so suggestions are welcome.
>> :)
> 
> I would think that ACPI hotplug would have a similar problem, but it does this:
> 
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);
> 
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
> 

If that doesn't help, after we have

[PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling

we could probably pull the memory device removal phase out from the
mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.
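
Roughly, the nesting would then become something like the sketch below
(a userspace model with pthread locks standing in for the kernel locks;
the function names are illustrative, not the actual patched code): the
memory block devices and their kernfs nodes go away under
device_hotplug_lock alone, and only the core section teardown runs under
mem_hotplug_lock, so kn->count is never acquired while mem_hotplug_lock
is held.

#include <pthread.h>
#include <stdio.h>

/* Userspace stand-ins for the two kernel locks. */
static pthread_mutex_t device_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t mem_hotplug_lock = PTHREAD_RWLOCK_INITIALIZER;

static void remove_memory_block_devices(void)
{
        /* This is where the kernfs node locks (kn->count) are taken;
         * only device_hotplug_lock is held at this point. */
        puts("unregister memory block devices and their sysfs entries");
}

static void remove_memory_sections(void)
{
        puts("tear down memory sections and pages");
}

static void try_remove_memory_sketch(void)
{
        pthread_mutex_lock(&device_hotplug_lock);

        /* Phase 1: device/sysfs removal, outside mem_hotplug_lock. */
        remove_memory_block_devices();

        /* Phase 2: core hot-remove under mem_hotplug_lock; kn->count is
         * no longer needed here, which breaks the reported cycle. */
        pthread_rwlock_wrlock(&mem_hotplug_lock);
        remove_memory_sections();
        pthread_rwlock_unlock(&mem_hotplug_lock);

        pthread_mutex_unlock(&device_hotplug_lock);
}

int main(void)
{
        try_remove_memory_sketch();
        return 0;
}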

-- 

Thanks,

David / dhildenb


Re: [v5 0/3] "Hotremove" persistent memory

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of workflow to a new daxctl command
> > (daxctl-reconfigure-device). This will allow changing the 'mode' of a
> > dax device to kmem, onlining the resulting memory, and, with your
> > patches, also attempting to offline the memory and change back to
> > device-dax.
> >
> > While running with these patches and testing the offlining part, I ran
> > into the lockdep report below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [  +0.004886] ======================================================
> > [  +0.001576] WARNING: possible circular locking dependency detected
> > [  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
> > [  +0.000929] ------------------------------------------------------
> > [  +0.000708] daxctl/22950 is trying to acquire lock:
> > [  +0.000548] f4d397f7 (kn->count#424){}, at: kernfs_remove_by_name_ns+0x40/0x80
> > [  +0.000922]
> >   but task is already holding lock:
> > [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>
> I have studied this issue and now have a clear understanding of why it
> happens. I am not yet sure how to fix it, so suggestions are welcome.
> :)

I would think that ACPI hotplug would have a similar problem, but it does this:

acpi_unbind_memory_blocks(info);
__remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.


>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>  vfs_write
>   __vfs_write
>    kernfs_fop_write
>     kernfs_get_active
>      lock_acquire                     kn->count#122 (lock for "memory8/state" kn)
>     sysfs_kf_write
>      dev_attr_store
>       state_store
>        device_offline
>         memory_subsys_offline
>          memory_block_action
>           offline_pages
>            __offline_pages
>             percpu_down_write
>              down_write
>               lock_acquire            mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following call stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>  unbind_store
>   device_driver_detach
>    device_release_driver_internal
>     dev_dax_kmem_remove
>      remove_memory                    device_hotplug_lock
>       try_remove_memory               mem_hotplug_lock.rw_sem
>        arch_remove_memory
>         __remove_pages
>          __remove_section
>           unregister_memory_section
>            remove_memory_section      mem_sysfs_mutex
>             unregister_memory
>              device_unregister
>               device_del
>                device_remove_attrs
>                 sysfs_remove_groups
>                  sysfs_remove_group
>                   remove_files
>                    kernfs_remove_by_name
>                     kernfs_remove_by_name_ns
>                      __kernfs_remove  kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122


Re: [v5 0/3] "Hotremove" persistent memory

2019-05-15 Thread Pavel Tatashin
> Hi Pavel,
>
> I am working on adding this sort of workflow to a new daxctl command
> (daxctl-reconfigure-device). This will allow changing the 'mode' of a
> dax device to kmem, onlining the resulting memory, and, with your
> patches, also attempting to offline the memory and change back to
> device-dax.
>
> While running with these patches and testing the offlining part, I ran
> into the lockdep report below.
>
> This is with just these three patches on top of -rc7.
>
>
> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
> [  +0.000929] ------------------------------------------------------
> [  +0.000708] daxctl/22950 is trying to acquire lock:
> [  +0.000548] f4d397f7 (kn->count#424){}, at: kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000922]
>   but task is already holding lock:
> [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0

I have studied this issue and now have a clear understanding of why it
happens. I am not yet sure how to fix it, so suggestions are welcome.
:)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
 vfs_write
  __vfs_write
   kernfs_fop_write
    kernfs_get_active
     lock_acquire                     kn->count#122 (lock for "memory8/state" kn)
    sysfs_kf_write
     dev_attr_store
      state_store
       device_offline
        memory_subsys_offline
         memory_block_action
          offline_pages
           __offline_pages
            percpu_down_write
             down_write
              lock_acquire            mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following call stack:
# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
 unbind_store
  device_driver_detach
   device_release_driver_internal
    dev_dax_kmem_remove
     remove_memory                    device_hotplug_lock
      try_remove_memory               mem_hotplug_lock.rw_sem
       arch_remove_memory
        __remove_pages
         __remove_section
          unregister_memory_section
           remove_memory_section      mem_sysfs_mutex
            unregister_memory
             device_unregister
              device_del
               device_remove_attrs
                sysfs_remove_groups
                 sysfs_remove_group
                  remove_files
                   kernfs_remove_by_name
                    kernfs_remove_by_name_ns
                     __kernfs_remove  kn->count#122

So, lockdep found the ordering issue with the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122
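
This is the classic ABBA inversion. A self-contained userspace analogy
(pthread mutexes standing in for kn->count and mem_hotplug_lock.rw_sem;
this is not kernel code) that can deadlock for exactly the same reason:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t kn_count = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mem_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* Path 1 (offline via sysfs): kn->count, then mem_hotplug_lock. */
static void *offline_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&kn_count);          /* kernfs_get_active() */
        sleep(1);                               /* widen the race window */
        pthread_mutex_lock(&mem_hotplug_lock);  /* __offline_pages() */
        puts("offline path done");
        pthread_mutex_unlock(&mem_hotplug_lock);
        pthread_mutex_unlock(&kn_count);
        return NULL;
}

/* Path 2 (unbind/hotremove): mem_hotplug_lock, then kn->count. */
static void *remove_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&mem_hotplug_lock);  /* try_remove_memory() */
        sleep(1);
        pthread_mutex_lock(&kn_count);          /* __kernfs_remove() */
        puts("remove path done");
        pthread_mutex_unlock(&kn_count);
        pthread_mutex_unlock(&mem_hotplug_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, offline_path, NULL);
        pthread_create(&b, NULL, remove_path, NULL);
        pthread_join(a, NULL);  /* once both threads hit their sleeps, this never returns */
        pthread_join(b, NULL);
        return 0;
}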


Re: [v5 0/3] "Hotremove" persistent memory

2019-05-02 Thread Verma, Vishal L
On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations looks like this:
>
> 1. After boot, restore /dev/pmem0 to a ramdisk to be consumed by apps,
>    and free the ramdisk.
> 2. Convert raw pmem0 to devdax:
>    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM:
>    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>    echo online_movable > /sys/devices/system/memory/memoryXXX/state
> 4. Before reboot, hotremove device-dax memory from System RAM:
>    echo offline > /sys/devices/system/memory/memoryXXX/state
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of workflow to a new daxctl command
(daxctl-reconfigure-device). This will allow changing the 'mode' of a
dax device to kmem, onlining the resulting memory, and, with your
patches, also attempting to offline the memory and change back to
device-dax.

While running with these patches and testing the offlining part, I ran
into the lockdep report below.

This is with just these three patches on top of -rc7.


[  +0.004886] ======================================================
[  +0.001576] WARNING: possible circular locking dependency detected
[  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
[  +0.000929] ------------------------------------------------------
[  +0.000708] daxctl/22950 is trying to acquire lock:
[  +0.000548] f4d397f7 (kn->count#424){}, at: kernfs_remove_by_name_ns+0x40/0x80
[  +0.000922]
  but task is already holding lock:
[  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.000960] 
  which lock already depends on the new lock.

[  +0.001001] 
  the existing dependency chain (in reverse order) is:
[  +0.000837] 
  -> #3 (mem_sysfs_mutex){+.+.}:
[  +0.000631]        __mutex_lock+0x82/0x9a0
[  +0.000477]        unregister_memory_section+0x22/0xa0
[  +0.000582]        __remove_pages+0xe9/0x520
[  +0.000489]        arch_remove_memory+0x81/0xc0
[  +0.000510]        devm_memremap_pages_release+0x180/0x270
[  +0.000633]        release_nodes+0x234/0x280
[  +0.000483]        device_release_driver_internal+0xf4/0x1d0
[  +0.000701]        bus_remove_device+0xfc/0x170
[  +0.000529]        device_del+0x16a/0x380
[  +0.000459]        unregister_dev_dax+0x23/0x50
[  +0.000526]        release_nodes+0x234/0x280
[  +0.000487]        device_release_driver_internal+0xf4/0x1d0
[  +0.000646]        unbind_store+0x9b/0x130
[  +0.000467]        kernfs_fop_write+0xf0/0x1a0
[  +0.000510]        vfs_write+0xba/0x1c0
[  +0.000438]        ksys_write+0x5a/0xe0
[  +0.000521]        do_syscall_64+0x60/0x210
[  +0.000489]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000637]
  -> #2 (mem_hotplug_lock.rw_sem){}:
[  +0.000717]        get_online_mems+0x3e/0x80
[  +0.000491]        kmem_cache_create_usercopy+0x2e/0x270
[  +0.000609]        kmem_cache_create+0x12/0x20
[  +0.000507]        ptlock_cache_init+0x20/0x28
[  +0.000506]        start_kernel+0x240/0x4d0
[  +0.000480]        secondary_startup_64+0xa4/0xb0
[  +0.000539]
  -> #1 (cpu_hotplug_lock.rw_sem){}:
[  +0.000784]        cpus_read_lock+0x3e/0x80
[  +0.000511]        online_pages+0x37/0x310
[  +0.000469]        memory_subsys_online+0x34/0x60
[  +0.000611]        device_online+0x60/0x80
[  +0.000611]        state_store+0x66/0xd0
[  +0.000552]        kernfs_fop_write+0xf0/0x1a0
[  +0.000649]        vfs_write+0xba/0x1c0
[  +0.000487]        ksys_write+0x5a/0xe0
[  +0.000459]        do_syscall_64+0x60/0x210
[  +0.000482]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000646]
  -> #0 (kn->count#424){}:
[  +0.000669]        lock_acquire+0x9e/0x180
[  +0.000471]        __kernfs_remove+0x26a/0x310
[  +0.000518]        kernfs_remove_by_name_ns+0x40/0x80
[  +0.000583]        remove_files.isra.1+0x30/0x70
[  +0.000555]        sysfs_remove_group+0x3d/0x80
[  +0.000524]        sysfs_remove_groups+0x29/0x40
[  +0.000532]        device_remove_attrs+0x42/0x80
[  +0.000522]        device_del+0x162/0x380
[  +0.000464]        device_unregister+0x16/0x60
[  +0.000505]        unregister_memory_section+0x6e/0xa0
[  +0.000591]        __remove_pages+0xe9/0x520
[  +0.000492]        arch_remove_memory+0x81/0xc0
[  +0.000568]        try_remove_memory+0xba/0xd0
[  +0.000510]        remove_memory+0x23/0x40
[  +0.000483]        dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000608]        device_release_driver_internal+0xe4/0x1d0
[  +0.000637]        unbind_store+0x9b/0x130
[  +0.000464]        kernfs_fop_write+0xf0/0x1a0
[  +0.000685]        vfs_write+0xba/0x1c0
[  +0.000594]        ksys_write+0x5a/0xe0
[  +0.000449]        do_syscall_64+0x60/0x210
[  +0.000481]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000619] 
  other info that might help us debug this:

[  +0.000889] Chain exists of: