Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-24 Thread Pritchard Jr., Howard via users
Hi Arun, Interesting. For problem b) I would suggest one of two things - if you want to dig deeper yourself, and it's possible on your system, I'd look at the output of dmesg -H -w on the node where the job is hitting this failure (you'll need to rerun the job) - ping the UCX group mail list (se

Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-21 Thread Chandran, Arun via users
Hi Howard, Thank you very much for the reply. UCX is trying to set up the FIFO for shared memory communication using both sysv and posix. By default, these allocations fail when attempted with hugetlbfs. a) Failure log from strace (pasting only for rank 0): [pid 3541286] shmget(IPC_PR
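
Independent of the strace log above, a quick sanity check when huge-page shared memory allocations fail is whether the node has any huge pages reserved at all. The following is a minimal sketch, assuming a Linux node with the standard HugePages_* and Hugepagesize fields in /proc/meminfo; the function names are hypothetical and not part of Open MPI or UCX, and this does not diagnose the specific failure reported in this thread.

```python
def parse_hugepage_counters(meminfo_text):
    """Extract the HugePages_* and Hugepagesize fields from /proc/meminfo text.

    Returns a dict mapping field name to its integer value
    (page counts for HugePages_*, kB for Hugepagesize).
    """
    counters = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not key.startswith(("HugePages_", "Hugepagesize")):
            continue
        fields = rest.split()
        if fields:
            # Values look like "8" or "2048 kB"; keep only the number.
            counters[key] = int(fields[0])
    return counters


def hugepages_available(counters):
    """True if at least one huge page is currently free in the default pool."""
    return counters.get("HugePages_Free", 0) > 0


def report(path="/proc/meminfo"):
    """Print the huge page counters of the local node (Linux only)."""
    with open(path) as f:
        counters = parse_hugepage_counters(f.read())
    for name, value in sorted(counters.items()):
        print(f"{name}: {value}")
    if not hugepages_available(counters):
        print("No free huge pages in the default pool.")
```

Calling report() on the node where the job fails prints the counters. If HugePages_Total is 0, no huge pages have been reserved on that node, which is one possible (but not the only) reason for such allocations to fail.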

Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-20 Thread Pritchard Jr., Howard via users
Hi Arun, It's going to be chatty, but you may want to see if strace helps in diagnosing: mpirun -np 2 (all your favorite MPI args) strace -f send_recv 1000 1 Using huge pages often helps reduce pressure on a NIC's I/O MMU widget and speeds up resolving va to pa memory addresses. On 7/19/23, 9:24 PM