Hi Eneko,

Consumer RAM is always a tricky starting point. When under heavy load, the fault rates tend to be quite surprising (hence why ECC is preferable in enterprise/etc settings).

Your Gigabyte B450 AORUS M motherboard has several newer BIOS releases available. They're definitely worth reading into, and applying after your own research. https://www.gigabyte.com/Motherboard/B450-AORUS-M-rev-1x/support#support-dl-bios (F60/F61c in 2021 compared to F50 in 2019)

Where there's been a hardware fault downstream, we tend to see kernel taint flags/states similar to what you're seeing with P O / P D O / etc. https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html (We've had hardware like OOB cards trigger module flags)
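
If it helps to read those flags off a running node, here is a minimal Python sketch that decodes the taint bitmask. It assumes the bit positions given in the kernel's tainted-kernels documentation linked above, and it only covers the three letters seen in your logs (P, D, O); the function names are my own and purely illustrative.

```python
# Subset of the taint bits described in the kernel's tainted-kernels
# documentation. Only the letters that appear in the logs are listed here;
# the full table has more entries.
TAINT_BITS = {
    0: ("P", "proprietary module was loaded"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    12: ("O", "externally-built (out-of-tree) module was loaded"),
}


def decode_taint(value):
    """Return (letter, meaning) pairs for each known bit set in value."""
    return [
        (letter, meaning)
        for bit, (letter, meaning) in sorted(TAINT_BITS.items())
        if value & (1 << bit)
    ]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask of the running kernel."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    for letter, meaning in decode_taint(read_taint()):
        print(letter, "-", meaning)
```

On a freshly booted node you would expect the D flag to be absent; if it shows up again, the kernel has hit another oops since boot.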

Have you run the system through extended memory testing outside of PVE? (The PVE ISO includes memtest as a boot option, at least.)

Your BIOS version is dated 1 month after this article, which would imply that a BIOS update may be beneficial to avoiding the RNG bug.
https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend

With the kernel taint lines, do you have the procs/calls that marry up to the PIDs listed? What were they doing at the time?
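
To line the PIDs up against process names across the full 437 lines, something like the following Python sketch might save some scrolling. It assumes each kernel message sits on a single line in the real syslog (the excerpt in your mail is wrapped by the mail client), and the regular expression and function names are my own, written against the "CPU: ... PID: ... Comm: ... Tainted: ..." format shown in your logs.

```python
import re
from collections import Counter

# Matches the oops header line, e.g.
# "... CPU: 11 PID: 2433 Comm: tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1"
OOPS_RE = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"Tainted: (?P<flags>[A-Z ]+?)\s+(?P<kernel>\d\S*)"
)


def parse_faults(lines):
    """Extract CPU, PID, process name and taint letters from oops headers."""
    faults = []
    for line in lines:
        match = OOPS_RE.search(line)
        if match:
            faults.append({
                "cpu": int(match.group("cpu")),
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),
                "flags": match.group("flags").replace(" ", ""),
                "kernel": match.group("kernel"),
            })
    return faults


def tally(faults, key):
    """Count how often each value of key (e.g. 'cpu' or 'comm') occurs."""
    return Counter(fault[key] for fault in faults)


if __name__ == "__main__":
    # The path is an assumption; point it at wherever the syslog copy lives.
    with open("/var/log/syslog", errors="replace") as handle:
        found = parse_faults(handle)
    for fault in found:
        print(fault)
    print("by CPU:", dict(tally(found, "cpu")))
    print("by process:", dict(tally(found, "comm")))
```

The per-CPU and per-process tallies give a quick view of whether the faults are spread across cores and processes or cluster on one of them.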

From the brief logs you've included, the faults look varied, implying that the problem is likely hardware-based. I'd guess RAM.

As it's only faulted once, I'd say a decent course of action would be to test the memory extensively, and go from there.

If the testing identifies a faulting module, you can remove that DIMM and swap it for a known-good one.

The testing can take a while, and in our experience it can be worth leaving it to cycle through, esp. with non-ECC.

A tainted kernel never "loses its taint" while it is running, so the flags stay set until the next reboot. If you remove the underlying cause, though, the D flag shouldn't come back on subsequent boots.

It'll be good to hear about how you get on with it all. Best of luck with it!

Cheers,

Luke Thompson
Operations Manager

[email protected]
PO Box 111, West Wallsend

On 15/7/21 6:40 pm, Eneko Lacunza via pve-user wrote:
Hi all,

Tonight a node of our 5-node Proxmox 6.4+Ceph cluster froze at
~6:45. A reset brought it back online later in the morning and it has been
working well for 2 hours now.

HA worked like a charm and Ceph has recovered in some minutes.

A fantastic success story really, thanks for your excellent work, Proxmox
developers and contributors!

Now for the "post-mortem": I see the node's 8 cores "general protection
fault"ing one after another within a minute, with different processes.

I suspect a memory module or main board fault (Ryzen 3700X 8-core,
4x32GB non-ECC RAM and gigabyte mainboard, all "consumer" parts, it has
been working well since dec 2019). What do you think?

Here is a shortened syslog (I can provide all 437 lines if necessary):

---

Jul 15 06:45:00 sanmarko systemd[1]: Starting Proxmox VE replication
runner...

Jul 15 06:45:00 sanmarko systemd[1]: pvesr.service: Succeeded.

Jul 15 06:45:00 sanmarko systemd[1]: Started Proxmox VE replication runner.

Jul 15 06:45:01 sanmarko CRON[1913457]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)

Jul 15 06:45:01 sanmarko CRON[1913458]: (root) CMD (if [ -x
/etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update
7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then
/etc/munin/plugins/apt update 7200 12 >/dev/null; fi)

Jul 15 06:45:10 sanmarko kernel: [145747.429110] general protection
fault: 0000 [#1] SMP NOPTI

Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237
Comm: ceph Tainted: P           O      5.4.124-1-pve #1

Jul 15 06:45:10 sanmarko kernel: [145747.429245] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:10 sanmarko kernel: [145747.429322] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:10 sanmarko kernel: [145747.429382] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:10 sanmarko kernel: [145747.430245] Call Trace:

Jul 15 06:45:10 sanmarko kernel: [145747.430314]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:10 sanmarko kernel: [145747.431244]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:12 sanmarko kernel: [145749.037616] general protection
fault: 0000 [#2] SMP NOPTI

Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 Comm:
tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:12 sanmarko kernel: [145749.037793] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:12 sanmarko kernel: [145749.037898] RIP:
0010:apparmor_file_free_security+0x22/0x40

Jul 15 06:45:12 sanmarko kernel: [145749.037975] Code: 2c ff ff eb a2 0f
1f 00 0f 1f 44 00 00 48 63 05 28 fb fc 00 48 03 87 c0 00 00 00 74 1a 48
8b 78 08 48 85 ff 74 11 55 48 89 e5 <f0> ff 0f 0f 88 dc 96 62 00 74 03
5d c3 c3 e8 db 55 00 00 5d c3 66

[...]

Jul 15 06:45:12 sanmarko kernel: [145749.038942] Call Trace:

Jul 15 06:45:12 sanmarko kernel: [145749.039015]
security_file_free+0x27/0x60

[...]

Jul 15 06:45:12 sanmarko kernel: [145749.039441]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:29 sanmarko kernel: [145765.573841] general protection
fault: 0000 [#3] SMP NOPTI

Jul 15 06:45:29 sanmarko kernel: [145765.573922] CPU: 11 PID: 1733 Comm:
pve-firewall Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:29 sanmarko kernel: [145765.574021] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:29 sanmarko kernel: [145765.574127] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:29 sanmarko kernel: [145765.574201] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:29 sanmarko kernel: [145765.576321] Call Trace:

Jul 15 06:45:29 sanmarko kernel: [145765.576391]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:29 sanmarko kernel: [145765.577258]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Main process
exited, code=killed, status=11/SEGV

Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Failed with
result 'signal'.

Jul 15 06:45:35 sanmarko kernel: [145772.194438] general protection
fault: 0000 [#4] SMP NOPTI

Jul 15 06:45:35 sanmarko kernel: [145772.194516] CPU: 11 PID: 1776 Comm:
ms_dispatch Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:35 sanmarko kernel: [145772.194614] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:35 sanmarko kernel: [145772.194718] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:35 sanmarko kernel: [145772.194792] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:35 sanmarko kernel: [145772.195750] Call Trace:

Jul 15 06:45:35 sanmarko kernel: [145772.195819]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:35 sanmarko kernel: [145772.197176]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:37 sanmarko kernel: [145774.137506] general protection
fault: 0000 [#5] SMP NOPTI

Jul 15 06:45:37 sanmarko kernel: [145774.137586] CPU: 11 PID: 2466 Comm:
tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:37 sanmarko kernel: [145774.137687] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:37 sanmarko kernel: [145774.137791] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:37 sanmarko kernel: [145774.137865] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:37 sanmarko kernel: [145774.139990] Call Trace:

Jul 15 06:45:37 sanmarko kernel: [145774.140059]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:37 sanmarko kernel: [145774.140991]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:40 sanmarko kernel: [145776.830930] general protection
fault: 0000 [#6] SMP NOPTI

Jul 15 06:45:40 sanmarko kernel: [145776.831010] CPU: 11 PID: 7234 Comm:
kvm Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:40 sanmarko kernel: [145776.831109] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:40 sanmarko kernel: [145776.831217] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:40 sanmarko kernel: [145776.831294] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:40 sanmarko kernel: [145776.832264] Call Trace:

Jul 15 06:45:40 sanmarko kernel: [145776.832334]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:40 sanmarko kernel: [145776.833336]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: starting service vm:149

Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: <root@pam> starting task
UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:

Jul 15 06:45:40 sanmarko pve-ha-lrm[1914441]: start VM 149:
UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:

Jul 15 06:45:40 sanmarko systemd[1]: 149.scope: Succeeded.

Jul 15 06:45:40 sanmarko systemd[1]: Stopped 149.scope.

Jul 15 06:45:43 sanmarko kernel: [145779.840863] general protection
fault: 0000 [#7] SMP NOPTI

Jul 15 06:45:43 sanmarko kernel: [145779.840942] CPU: 11 PID: 1740 Comm:
pvestatd Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:43 sanmarko kernel: [145779.842207] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:43 sanmarko kernel: [145779.842310] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:43 sanmarko kernel: [145779.842383] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:43 sanmarko kernel: [145779.843337] Call Trace:

Jul 15 06:45:43 sanmarko kernel: [145779.843406]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:43 sanmarko kernel: [145779.844545]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Main process
exited, code=killed, status=11/SEGV

Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Failed with
result 'signal'.

Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: Task
'UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:' still
active, waiting

Jul 15 06:45:45 sanmarko pve-ha-lrm[1914441]: timeout waiting on systemd

Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: <root@pam> end task
UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam: timeout
waiting on systemd

Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: unable to start service vm:149

Jul 15 06:45:50 sanmarko pve-ha-lrm[1804]: restart policy: retry number
1 for service 'vm:149'

Jul 15 06:45:56 sanmarko kernel: [145792.695695] general protection
fault: 0000 [#8] SMP NOPTI

Jul 15 06:45:56 sanmarko kernel: [145792.695777] CPU: 11 PID: 1783 Comm:
pve-ha-crm Tainted: P      D    O      5.4.124-1-pve #1

Jul 15 06:45:56 sanmarko kernel: [145792.695876] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019

Jul 15 06:45:56 sanmarko kernel: [145792.695980] RIP:
0010:kmem_cache_alloc+0x89/0x240

Jul 15 06:45:56 sanmarko kernel: [145792.696054] Code: 08 65 4c 03 05 30
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
00 00 4c 89 e0 48 0f c9 48 31 cb

[...]

Jul 15 06:45:56 sanmarko kernel: [145792.697012] Call Trace:

Jul 15 06:45:56 sanmarko kernel: [145792.697081]  ?
security_file_alloc+0x29/0x90

[...]

Jul 15 06:45:56 sanmarko kernel: [145792.698225]
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

Jul 15 06:45:56 sanmarko watchdog-mux[895]: client did not stop watchdog
- disable watchdog updates

Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Main process
exited, code=killed, status=11/SEGV

Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Failed with
result 'signal'.

Jul 15 06:45:56 sanmarko kernel: [145792.701730] FS:
00007fae7b4141c0(0000) GS:ffff96b69eac0000(0000) knlGS:0000000000000000

Jul 15 06:45:56 sanmarko kernel: [145792.701826] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033

Jul 15 06:45:56 sanmarko kernel: [145792.701902] CR2: 00007fa6bbf73008
CR3: 0000001f29900000 CR4: 0000000000340ee0

[... no more logs until reset ...]


# pveversion -v

proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 15.2.13-pve1~bpo10
ceph-fuse: 15.2.13-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1


Thanks a lot


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
