This is way more complex than I thought, and it is not so easy to address. Let me try to summarize the issue here. While developing the regression tests for ndctl, I kept hitting the same backtrace, over and over, when running the tests:
----
[ 271.705646] memory add fail, invalid altmap
[ 271.705677] WARNING: CPU: 5 PID: 886 at arch/x86/mm/init_64.c:852 add_pages+0x5d/0x70
[ 271.705679] Modules linked in: nls_iso8859_1 edac_mce_amd dax_pmem_compat nd_pmem device_dax nd_btt dax_pmem_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev aesni_intel aes_x86_64 crypto_simd input_leds cryptd glue_helper serio_raw mac_hid qemu_fw_cfg nfit sch_fq_codel ip_tables x_tables autofs4 virtio_net psmouse net_failover virtio_blk i2c_piix4 failover pata_acpi floppy
[ 271.705707] CPU: 5 PID: 886 Comm: ndctl Not tainted 5.3.0-24-generic #26-Ubuntu
[ 271.705709] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 271.705720] RIP: 0010:add_pages+0x5d/0x70
[ 271.705721] Code: 33 c2 01 76 20 48 89 15 99 33 c2 01 48 89 15 a2 33 c2 01 48 c1 e2 0c 48 03 15 97 96 39 01 48 89 15 48 0e c2 01 5b 41 5c 5d c3 <0f> 0b eb ba 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
[ 271.705722] RSP: 0018:ffffba02c0d2bbf0 EFLAGS: 00010282
[ 271.705723] RAX: 00000000ffffffea RBX: 000000000017ffc0 RCX: 0000000000000000
[ 271.705723] RDX: 0000000000000000 RSI: ffff9aaa3da97448 RDI: ffff9aaa3da97448
[ 271.705724] RBP: ffffba02c0d2bc00 R08: ffff9aaa3da97448 R09: 0000000000000004
[ 271.705724] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000003fe40
[ 271.705725] R13: 0000000000000001 R14: ffffba02c0d2bc48 R15: ffff9aa975efaaf8
[ 271.705727] FS:  00007f70a62d4bc0(0000) GS:ffff9aaa3da80000(0000) knlGS:0000000000000000
[ 271.705728] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 271.705729] CR2: 00005594a0aaa158 CR3: 0000000138110000 CR4: 00000000000406e0
[ 271.705731] Call Trace:
[ 271.705734]  arch_add_memory+0x41/0x50
[ 271.705737]  devm_memremap_pages+0x47c/0x640
[ 271.705740]  pmem_attach_disk+0x173/0x610 [nd_pmem]
[ 271.705741]  ? devm_memremap+0x67/0xa0
[ 271.705743]  nd_pmem_probe+0x7f/0xa0 [nd_pmem]
[ 271.705745]  nvdimm_bus_probe+0x6b/0x170
[ 271.705747]  really_probe+0xfb/0x3a0
[ 271.705749]  driver_probe_device+0x5f/0xe0
[ 271.705750]  device_driver_attach+0x5d/0x70
[ 271.705751]  bind_store+0xd3/0x110
[ 271.705753]  drv_attr_store+0x24/0x30
[ 271.705754]  sysfs_kf_write+0x3e/0x50
[ 271.705755]  kernfs_fop_write+0x11e/0x1a0
[ 271.705757]  __vfs_write+0x1b/0x40
[ 271.705758]  vfs_write+0xb9/0x1a0
[ 271.705759]  ksys_write+0x67/0xe0
[ 271.705760]  __x64_sys_write+0x1a/0x20
[ 271.705762]  do_syscall_64+0x5a/0x130
[ 271.705764]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 271.705765] RIP: 0033:0x7f70a6189327
[ 271.705767] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 271.705767] RSP: 002b:00007ffc616998b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 271.705768] RAX: ffffffffffffffda RBX: 00007f70a62d4ae8 RCX: 00007f70a6189327
[ 271.705769] RDX: 0000000000000007 RSI: 00005594a0aa01a0 RDI: 0000000000000006
[ 271.705769] RBP: 0000000000000006 R08: 0000000000000006 R09: 7375622f7379732f
[ 271.705770] R10: 0000000000000000 R11: 0000000000000246 R12: 00005594a0aa01a0
[ 271.705770] R13: 0000000000000001 R14: 0000000000000007 R15: 00007ffc61699908
[ 271.705772] ---[ end trace 7ee621e68332018c ]---
----

And I realized that I could NOT re-create the SECOND namespace (the first one always worked). First I had to read about how QEMU emulates nvdimms and check why namespaces were not persistent with QEMU's nvdimm emulation; then I had to discover why it looked like virtual nvdimms had no labels (since RAW namespaces are always created by default); and then I had to understand why the mapping was failing, to get to the real issue. First things first.
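For context, attaching a vNVDIMM with a label area to a guest looks roughly like this. This is only a sketch based on the QEMU docs: the backing-file path, device ids and sizes are placeholders, not values from this bug, and remaining options (disk, network, etc.) are omitted.

```shell
# Sketch: boot a guest with one labeled vNVDIMM.
# /var/tmp/nvdimm0.img, mem1/nvdimm1 ids and the sizes are placeholders.
qemu-system-x86_64 \
    -machine pc,nvdimm=on \
    -m 2G,slots=2,maxmem=8G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/var/tmp/nvdimm0.img,size=1G \
    -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
```

Without the label-size option the guest sees a label-less DIMM, which matches the "no labels" behaviour I describe above.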
### QEMU emulated nvdimms:

https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Whenever the backing filesystem is not DAX capable (as it would be on real NVDIMM hardware, for example), all nvdimm data (written to the backing files) is gone after the instance is shut down.

### QEMU virtual nvdimms lack of labels:

From the QEMU docs:

Label
-----

QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
To enable label on vNVDIMM devices, users can simply add "label-size=$SZ"
option to "-device nvdimm", e.g.

    -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:
1. The minimal label size is 128KB.
2. QEMU v2.7.0 and later store labels at the end of backend storage. If
   a memory backend file, which was previously used as the backend of a
   vNVDIMM device without labels, is now used for a vNVDIMM device with
   label, the data in the label area at the end of file will be
   inaccessible to the guest. If any useful data (e.g. the meta-data of
   the file system) was stored there, the latter usage may result guest
   data corruption (e.g. breakage of guest file system).

### namespace1.0 always failing (with the given backtrace)

This is related to: https://github.com/pmem/ndctl/issues/76

Specifically this comment:
https://github.com/pmem/ndctl/issues/76#issuecomment-440840503

"""
Linux needs 128MB alignment for each adjacent namespace. There isn't a fix because BIOS has no visibility or responsibility for Linux alignment constraints. Going forward Linux will eventually gain the capability to support fsdax mode with namespaces that collide within a section (128MB) until then the only workarounds are "raw" mode (not useful), or requiring fsdax namespaces to be created with "--align=1GB".

We faced something similar with section collisions with System RAM, but in that case we could interrogate the collision ahead of time. As it stands we don't find out about this collision until its too late.
I'll try to think of something more clever, but the solution may devolve to just teaching the tooling to require large alignments.
"""

As we can see here:

rafaeldtinoco@ndctltest:~$ sudo cat /proc/iomem
...
100000000-13fffffff : System RAM
140000000-17ffbffff : Persistent Memory
  140000000-17ffbffff : namespace0.0
17ffc0000-1bff7ffff : Persistent Memory
  17ffc0000-1bff7ffff : namespace1.0
340000000-3bfffffff : PCI Bus 0000:00

When using 2 nvdimms in QEMU, both regions (and thus namespaces) share a boundary, and there is a 128MB alignment requirement. You can make a RAW namespace work, but no other mode:

----
rafaeldtinoco@ndctltest:~$ sudo ndctl disable-region all
disabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl zero-labels all
zeroed 2 nmems
rafaeldtinoco@ndctltest:~$ sudo ndctl enable-region all
enabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl list -N
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region0 -m raw
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":"1023.75 MiB (1073.48 MB)",
  "uuid":"54921448-1043-4779-bd77-bb77f70b11eb",
  "sector_size":512,
  "blockdev":"pmem0"
}
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region1 -m raw
{
  "dev":"namespace1.0",
  "mode":"raw",
  "size":"1023.75 MiB (1073.48 MB)",
  "uuid":"c5d32b36-c4b4-4c37-a401-0209e2b2e58a",
  "sector_size":512,
  "blockdev":"pmem1"
}
----

but if I try another namespace mode:

----
rafaeldtinoco@ndctltest:~$ sudo ndctl disable-region all
disabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl zero-labels all
zeroed 2 nmems
rafaeldtinoco@ndctltest:~$ sudo ndctl enable-region all
enabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl list -N
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region0 -m fsdax
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"1004.00 MiB (1052.77 MB)",
  "uuid":"5c8e1059-2714-4e9a-b47f-33bb617d4489",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region1 -m fsdax
libndctl: ndctl_pfn_enable: pfn1.0: failed to enable
  Error: namespace1.0: failed to enable

failed to create namespace: No such device or address
----

I hit the boundary problem. SeaBIOS could be fixed to guarantee correct alignment, as suggested in:
https://github.com/pmem/ndctl/issues/76#issuecomment-440848371

Since the kernel already takes care of the issue:
https://github.com/0day-ci/linux/commit/e50ad2650daecc1135bb28befd278fa291b6afe9

it looks like QEMU would have to address this alignment itself. For now, the ndctl tests being written for:
https://bugs.launchpad.net/ubuntu/+source/ndctl/+bug/1853506

will have to deal with a single virtual nvdimm.

** Bug watch added: github.com/pmem/ndctl/issues #76
   https://github.com/pmem/ndctl/issues/76

** Summary changed:

- qemu nvdimm virtualization + linux 5.3.0-24-generic kernel PROBE ERROR
+ QEMU emulated nvdimm regions alignment need (128MB) or ndctl create-namespace namespace1.0 might fail

** Changed in: linux (Ubuntu)
       Status: Confirmed => Fix Released

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
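The collision described in my comment above can be verified by hand: a namespace start address from /proc/iomem must be a multiple of the 128MB section size. A small bash sketch, using the two start addresses shown earlier (namespace0.0 at 140000000 and namespace1.0 at 17ffc0000):

```shell
#!/bin/bash
# 128MB memory hotplug section size (0x8000000).
SECTION=$((128 * 1024 * 1024))

aligned() {
    # $1: physical address in hex, as printed by /proc/iomem (no 0x prefix)
    local addr=$((0x$1))
    if [ $((addr % SECTION)) -eq 0 ]; then
        echo "$1: aligned"
    else
        echo "$1: NOT aligned"
    fi
}

aligned 140000000   # namespace0.0 start -> prints "140000000: aligned"
aligned 17ffc0000   # namespace1.0 start -> prints "17ffc0000: NOT aligned"
```

namespace1.0 starting at a non-section-aligned address is exactly what makes add_pages() reject the mapping. Per the upstream comment quoted earlier, creating fsdax namespaces with "--align 1G" (instead of the default 2M) is the available workaround until the kernel handles sub-section collisions.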
https://bugs.launchpad.net/bugs/1855177

Title:
  QEMU emulated nvdimm regions alignment need (128MB) or ndctl
  create-namespace namespace1.0 might fail

Status in linux package in Ubuntu:
  Fix Released
Status in ndctl package in Ubuntu:
  Confirmed
Status in qemu package in Ubuntu:
  Confirmed
Status in linux source package in Focal:
  Fix Released
Status in ndctl source package in Focal:
  Confirmed
Status in qemu source package in Focal:
  Confirmed

Bug description:
  I got a probe error for pfn1.0 (from both pfn0.0 and pfn1.0) when
  dealing with ndctl:

----
[11257.765457] memory add fail, invalid altmap
[11257.765489] WARNING: CPU: 6 PID: 5680 at arch/x86/mm/init_64.c:852 add_pages+0x5d/0x70
[11257.765489] Modules linked in: nls_iso8859_1 edac_mce_amd crct10dif_pclmul crc32_pclmul dax_pmem_compat device_dax dax_pmem_core nd_pmem nd_btt ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper input_leds joydev mac_hid nfit serio_raw qemu_fw_cfg sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover psmouse failover pata_acpi virtio_blk i2c_piix4 floppy
[11257.765505] CPU: 6 PID: 5680 Comm: ndctl Not tainted 5.3.0-24-generic #26-Ubuntu
[11257.765505] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[11257.765507] RIP: 0010:add_pages+0x5d/0x70
[11257.765509] Code: 33 c2 01 76 20 48 89 15 99 33 c2 01 48 89 15 a2 33 c2 01 48 c1 e2 0c 48 03 15 97 96 39 01 48 89 15 48 0e c2 01 5b 41 5c 5d c3 <0f> 0b eb ba 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
[11257.765509] RSP: 0018:ffffa360c09dfbf0 EFLAGS: 00010282
[11257.765510] RAX: 00000000ffffffea RBX: 000000000017ffe0 RCX: 0000000000000000
[11257.765511] RDX: 0000000000000000 RSI: ffff8acb7db17448 RDI: ffff8acb7db17448
[11257.765512] RBP: ffffa360c09dfc00 R08: ffff8acb7db17448 R09: 0000000000000004
[11257.765512] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000003fe20
[11257.765513] R13: 0000000000000001 R14: ffffa360c09dfc48 R15: ffff8acb7a7226f8
[11257.765515] FS:  00007febc9fd6bc0(0000) GS:ffff8acb7db00000(0000) knlGS:0000000000000000
[11257.765516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11257.765517] CR2: 000055eec8aab398 CR3: 000000013a8fa000 CR4: 00000000000406e0
[11257.765519] Call Trace:
[11257.765523]  arch_add_memory+0x41/0x50
[11257.765525]  devm_memremap_pages+0x47c/0x640
[11257.765529]  pmem_attach_disk+0x173/0x610 [nd_pmem]
[11257.765531]  ? devm_memremap+0x67/0xa0
[11257.765532]  nd_pmem_probe+0x7f/0xa0 [nd_pmem]
[11257.765542]  nvdimm_bus_probe+0x6b/0x170
[11257.765547]  really_probe+0xfb/0x3a0
[11257.765549]  driver_probe_device+0x5f/0xe0
[11257.765550]  device_driver_attach+0x5d/0x70
[11257.765551]  bind_store+0xd3/0x110
[11257.765553]  drv_attr_store+0x24/0x30
[11257.765554]  sysfs_kf_write+0x3e/0x50
[11257.765555]  kernfs_fop_write+0x11e/0x1a0
[11257.765557]  __vfs_write+0x1b/0x40
[11257.765558]  vfs_write+0xb9/0x1a0
[11257.765559]  ksys_write+0x67/0xe0
[11257.765561]  __x64_sys_write+0x1a/0x20
[11257.765567]  do_syscall_64+0x5a/0x130
[11257.765693]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[11257.765696] RIP: 0033:0x7febc9e81327
[11257.765698] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[11257.765698] RSP: 002b:00007ffd599433f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[11257.765699] RAX: ffffffffffffffda RBX: 00007febc9fd6ae8 RCX: 00007febc9e81327
[11257.765700] RDX: 0000000000000007 RSI: 000055eec8a9bfa0 RDI: 0000000000000004
[11257.765701] RBP: 0000000000000004 R08: 0000000000000006 R09: 7375622f7379732f
[11257.765701] R10: 0000000000000000 R11: 0000000000000246 R12: 000055eec8a9bfa0
[11257.765702] R13: 0000000000000001 R14: 0000000000000007 R15: 00007ffd59943448
[11257.765703] ---[ end trace 442db04e33790cb5 ]---
[11257.782659] nd_pmem: probe of pfn1.0 failed with error -22
----

It seems that after this point I can't play with my second virtual nvdimm
device (pfn1.0). A namespace destroy works, but a namespace creation does not:

rafaeldtinoco@ndctltest:~$ sudo ndctl list -B
[
  {
    "provider":"ACPI.NFIT",
    "dev":"ndbus0"
  }
]
rafaeldtinoco@ndctltest:~$ sudo ndctl list -D
[
  {
    "dev":"nmem1",
    "id":"8680-57341200",
    "handle":2,
    "phys_id":0
  },
  {
    "dev":"nmem0",
    "id":"8680-56341200",
    "handle":1,
    "phys_id":0
  }
]
rafaeldtinoco@ndctltest:~$ sudo ndctl list -R
[
  {
    "dev":"region1",
    "size":1073610752,
    "available_size":1073610752,
    "max_available_extent":1073610752,
    "type":"pmem",
    "iset_id":52512795602891997,
    "persistence_domain":"unknown"
  },
  {
    "dev":"region0",
    "size":1073610752,
    "available_size":0,
    "max_available_extent":0,
    "type":"pmem",
    "iset_id":52512752653219036,
    "persistence_domain":"unknown"
  }
]

Now, whenever trying to access namespace1.0 (from region1/nmem1/ndbus) I get:

[11257.782659] nd_pmem: probe of pfn1.0 failed with error -22
[11332.001388] pfn0.0 initialised, 257024 pages in 8ms
[11332.001818] pmem0: detected capacity change from 0 to 1052770304
[11359.739280] pfn0.1 initialised, 257024 pages in 0ms
[11362.643212] pfn0.0 initialised, 257024 pages in 0ms
[11362.644225] pmem0: detected capacity change from 0 to 1052770304
[11406.230365] pfn0.1 initialised, 257024 pages in 0ms
[11406.231281] pmem0: detected capacity change from 0 to 1052770304
[11517.785147] pfn0.0 initialised, 257024 pages in 4ms
[11517.785593] pmem0: detected capacity change from 0 to 1052770304
[11537.431697] pfn0.1 initialised, 257024 pages in 0ms
[11537.432256] pmem0: detected capacity change from 0 to 1052770304
[11627.965947] pfn0.0 initialised, 257024 pages in 0ms
[11627.966415] pmem0: detected capacity change from 0 to 1052770304
[11653.277667] pfn0.1 initialised, 257024 pages in 4ms
[11653.278086] pmem0: detected capacity change from 0 to 1052770304
[11708.696361] pfn0.0 initialised, 257024 pages in 0ms
[11708.697617] pmem0: detected capacity change from 0 to 1052770304
[11753.621295] nd_pmem btt0.0: No existing arenas
[11753.623118] pmem0s: detected capacity change from 0 to 1071484928
[11767.087424] pfn0.1 initialised, 257024 pages in 4ms
[11767.088272] pmem0: detected capacity change from 0 to 1052770304
[11775.815396] dax0.0 initialised, 257024 pages in 4ms
[12848.341346] pfn0.0 initialised, 257024 pages in 0ms
[12848.341785] pmem0: detected capacity change from 0 to 1052770304
[12851.897716] nd_pmem: probe of pfn1.0 failed with error -22
[13023.693246] pfn0.1 initialised, 257024 pages in 0ms
[13023.693662] pmem0: detected capacity change from 0 to 1052770304
[13026.517467] nd_pmem: probe of pfn1.0 failed with error -22
[13067.380701] pmem0: detected capacity change from 0 to 1073610752
[13117.568499] nd_pmem: probe of pfn1.0 failed with error -22
[13946.604199] pfn0.0 initialised, 257024 pages in 0ms
[13946.604777] pmem0: detected capacity change from 0 to 1052770304
[13957.948381] nd_pmem: probe of pfn1.0 failed with error -22

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1855177/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp