Hey everyone,
I’m running into a strange problem after provisioning a node with xCAT, and
I’m trying to figure out if it’s something related to how I set up RAID1.
Setup:
- Hardware: Supermicro server with AMD EPYC 7763 (64 cores)
- OS: Oracle Linux 8.8 (kernel 4.18.0-477.21.1.el8_8)
- Provisioning: xCAT
- Storage:
  - 2x 480GB SATA SSDs in RAID1 for system partitions
  - 1x 1.8TB NVMe drive for /scratch
  - Filesystem: XFS
- Network: Infiniband (ConnectX-5, switch: Mellanox SB8700/SB8790)
What's happening:
After provisioning, the node (node01) looks fine — it boots, mounts
storage, RAID syncs, networking is working, etc.
But if I run simple commands like:
cat /proc/cpuinfo
cat /etc/fstab
cat /proc/mounts
vim /root/file_test
the *SSH session freezes*.
(Other sessions are still fine and I can reconnect, so it's not a full
system crash.)
Other commands like:
cat /proc/mdstat
xfs_info /dev/md2
dmesg | grep error
dd if=/dev/sda of=/dev/null
work normally without any issues.
What I already checked:
- RAID1 is synced (/proc/mdstat shows [UU]).
- XFS filesystems mount cleanly (xfs_info looks good).
- No obvious errors in dmesg or journalctl.
- Disk performance (dd) looks normal.
- CPU microcode seems fine (0xa0011d5 for all cores).
- Unloading Infiniband drivers (mlx5_ib, mlx5_core) had no effect.
- strace shows the freeze while reading through /proc/cpuinfo.
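In case it helps anyone reproduce this, here is a rough sketch of how the freezing and non-freezing commands could be compared in one pass. The helper name `probe` and the 10-second limit are my own assumptions, not anything xCAT provides. Each command runs under `timeout` with its output discarded, so a command that blocks is reported as HANG instead of freezing the session.

```shell
#!/bin/sh
# Hypothetical helper (name and limits are assumptions): run a command with a
# time limit, discard its output, and report whether it finished or hung.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit is reached
        echo "HANG $*"
    elif [ "$rc" -eq 0 ]; then
        echo "OK $*"
    else
        echo "FAIL($rc) $*"
    fi
}

# Files that freeze the session when read interactively, plus one that does not.
for f in /proc/cpuinfo /etc/fstab /proc/mounts /proc/mdstat; do
    probe 10 cat "$f"
done
```

Since nothing is printed to the terminal except the one-line result, this should show whether the read itself blocks or whether the stall only appears when the output is sent back over the SSH session.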
Also important:
Other nodes (Dell servers with ConnectX-6) provisioned via the same xCAT
environment do not have this problem.
Could this be the cause?
*Could it be that something went wrong during the RAID1 creation with the
partitionfile script during provisioning?*
I created the RAID arrays (mdadm) during provisioning, plus a standalone
/scratch partition on the NVMe.
Thanks a lot if you have any ideas.
I’m happy to share more info if needed — just trying to understand if I
missed anything obvious.
_______________________________________________
xCAT-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/xcat-user