Hi Arun,
Interesting. For problem b) I would suggest one of two things:
- if you want to dig deeper yourself, and it's possible on your system, I'd look
at the output of dmesg -H -w on the node where the job is hitting this failure
(you'll need to rerun the job)
- ping the UCX group mail list (se
Hi Howard,
Thank you very much for the reply.
UCX is trying to set up the FIFO for shared memory communication using both SysV
and POSIX.
By default, these allocations fail when attempted with hugetlbfs.
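One quick thing that may be worth checking alongside the strace output is whether
the node has any huge pages preallocated at all. Below is a minimal sketch, assuming
a Linux node where /proc/meminfo exposes the usual HugePages_* and Hugepagesize
fields. The function names are hypothetical helpers for illustration, not part of
UCX or any MPI library.

```python
def parse_hugepage_info(meminfo_text):
    """Return the hugepage-related fields of /proc/meminfo text as a dict of ints."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key = key.strip()
        if key.startswith("HugePages_") or key == "Hugepagesize":
            parts = rest.split()
            if parts:
                # Values look like "0" or "2048 kB"; keep only the number.
                fields[key] = int(parts[0])
    return fields


def read_hugepage_info(path="/proc/meminfo"):
    """Read and parse the hugepage fields from the given meminfo file."""
    with open(path) as f:
        return parse_hugepage_info(f.read())
```

Calling read_hugepage_info() on the failing node and finding HugePages_Total or
HugePages_Free at 0 would be consistent with hugetlbfs-backed allocations failing,
though it does not by itself prove that this is the cause.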
a) Failure log from strace (pasting only for rank 0):
[pid 3541286] shmget(IPC_PR
Hi Arun,
It's going to be chatty, but you may want to see if strace helps in diagnosing:
mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1
Huge pages often help reduce pressure on a NIC's I/O MMU widget and speed up
resolving virtual (VA) to physical (PA) memory addresses.
On 7/19/23, 9:24 PM