Hi,
In case someone is interested: the problem is now solved, and the system seems to be rock solid after about two months of testing.

I replaced the AMD Ryzen 3 3200G with an AMD Ryzen 5 3600 on one node and with an AMD Ryzen 3 3100 on the other two nodes, and now the problem is gone.

I don't really know why, but I can think of two reasons:

1) The 3200G does not support ECC, but I use ECC RAM. Maybe this led to errors (although intensive memory testing with memtest86 did not report anything).
2) The new CPUs do not have integrated graphics. I noticed that the two onboard 10GBit Ethernet adapters now have different PCI addresses with the new CPUs, and with the old CPUs these 10G adapters sometimes malfunctioned.
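To compare the PCI addresses of the network adapters before and after such a CPU swap, something like the following could be used. It is only a minimal sketch: the function name `nic_pci_addresses` is made up for illustration, and it assumes a Linux system where each physical interface has a `device` symlink under `/sys/class/net`.

```python
import os
from pathlib import Path


def nic_pci_addresses(sys_class_net="/sys/class/net"):
    """Map each network interface name to the bus address of its device.

    Interfaces without a backing device (e.g. lo, bridges) are skipped.
    """
    result = {}
    root = Path(sys_class_net)
    if not root.is_dir():
        return result
    for iface in sorted(root.iterdir()):
        device = iface / "device"
        if device.exists():
            # For PCI NICs the symlink target ends in e.g. "0000:2b:00.0".
            result[iface.name] = os.path.basename(os.path.realpath(device))
    return result


if __name__ == "__main__":
    for name, address in nic_pci_addresses().items():
        print(f"{name}: {address}")
```

Running this once with the old CPU and once with the new one would show whether the addresses of the onboard adapters moved.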

Many thanks for your input and your help.

The ASRock Rack X470D4U2-2T is definitely stable now.

Best Regards,
Hermann

On 04.09.20 at 16:45, Hermann Himmelbauer wrote:
Dear Proxmox users,

I'm trying to install a 3-node cluster (latest proxmox/ceph) and
experience random freezes. The node can either be completely frozen (no
blinking cursor on console, no ping) or can get somewhat blocked / slow etc.

This happens most often on node 2 (approx. 3-4 times / day), node 3
never got stuck within 14 days runtime, node 1 once.

Unfortunately I did not find any way to trigger this behaviour. However,
I *think* that it happens most often when I stress the machine in some
way (performance test within a virtual machine) and then let it idle.

When the machine freezes completely, nothing is logged. However, if it
is only partially frozen, some info can be acquired via dmesg (see attached
file). "device=2b:00.0" is an Intel 10GBit Ethernet adapter (X550T), so
perhaps there is some driver issue regarding this Ethernet adapter?
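To pull only the kernel messages that mention one PCI device out of a longer dmesg dump, a small filter like this could help. It is a rough sketch: the helper names are hypothetical, the match is a plain substring search, and calling `dmesg` may require root privileges on some systems.

```python
import subprocess


def lines_for_device(log_text, pci_address="2b:00.0"):
    """Return the log lines that contain the given PCI address."""
    return [line for line in log_text.splitlines() if pci_address in line]


def dmesg_lines_for_device(pci_address="2b:00.0"):
    """Run dmesg and keep only the lines mentioning the PCI address."""
    output = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    return lines_for_device(output, pci_address)
```

`lines_for_device` also works on the text of a saved log file, which is handy when the messages were captured from a partially frozen node.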

The system consists of the following components:

- AMD Ryzen 3 3200G, 4x 3.60GHz, boxed (YD3200C5FHBOX)
- ASRock Rack X470D4U2-2T (Mainboard)
- Samsung SSD 970 EVO Plus 250GB, M.2 (MZ-V7S250BW) (builtin SSD for OS)
- 2 * Kingston Server Premier DIMM 16GB, DDR4-2666, CL19-19-19, ECC (BOM
Number: 9965745-002.A00G, Part Number: KSM26ED8/16ME)
- be quiet! Pure Power 11 CM 400W ATX 2.4 (BN296) (Power supply)
- 2 * Micron 5300 PRO - Read Intensive 960GB, SATA
(MTFDDAK960TDS-1AW1Z6) (SSD for Ceph)
- LogiLink PC0075, 2x RJ-45, PCIe 2.0 x1 (second NIC with two ports)

The system is Linux Debian 10.4 (Proxmox 6.2-4) with kernel 5.4.34-1-pve
#1 SMP PVE 5.4.34-2 (Thu, 07 May 2020 10:02:02 +0200) x86_64 GNU/Linux.

What I did so far (without success):

- Disabled C6 as I read that this CPU-state can lead to unstable systems
(via "python zenstates.py --c6-disable" -> still errors).
- Updated my BIOS to the latest version (3.30)
- Checked that the CPU + RAM are compatible with the mainboard (they are
listed as compatible on the ASRock website)
- Checked logs in IPMI (undervoltage, temperature etc., nothing is logged)
- Memory test (memtest86, no errors)

Do you have any clue what could be the reason for these freezes? Should
I think of some hardware error? Or is this some known Linux bug that can
be fixed?

Best Regards,
Hermann


_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

