Hi folks,

I'm trying the devel list to see if anyone here has hit this issue when testing, as I suspect it's not something many users will have access to yet.

We have an issue where codes compiled with Open-MPI kill nodes that have ConnectX-4 and ConnectX-5 cards connected to Mellanox Ethernet switches, using the mlx5 driver from the latest Mellanox OFED. The kernel hangs with no oops (or any other error) and we have to power cycle the node to get it back.

This happens even with a singleton (no srun or mpirun), and from what I can see in strace before the node hangs, Open-MPI is starting to probe for which fabrics are available.

The folks I'm helping have engaged Mellanox support, but I was wondering if anyone else has run across this?

Distro: RHEL 7.4 (x86-64)
Kernel: 4.12.9 (needed for the CephFS filesystem they use)
OFED: 4.1-1.0.2.0
Open-MPI: 1.10.x, 2.0.2, 3.0.0

All the best,
Chris

--
Christopher Samuel
Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au
Phone: +61 (0)3 903 55545

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel