Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-24 Thread Pritchard Jr., Howard via users
Hi Arun,

Interesting.  For problem b) I would suggest one of two things:
- if you want to dig deeper yourself, and it's possible on your system, I'd look
at the output of dmesg -H -w on the node where the job is hitting this failure
(you'll need to rerun the job)
- ping the UCX group mailing list (see
https://elist.ornl.gov/mailman/listinfo/ucx-group).

As for your more general question, I would suggest keeping it simple and
letting the applications use large pages via the usual libhugetlbfs mechanism
(LD_PRELOAD libhugetlbfs and set libhugetlbfs env variables to specify what
type of process memory to try to map to large pages). But I'm no expert in
the ways UCX may be able to take advantage of internally allocated large pages,
nor the extent to which such use of large pages has led to demonstrable
application speedups.
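
For illustration, a minimal sketch of that approach (assuming libhugetlbfs is
installed and that HUGETLB_MORECORE / HUGETLB_VERBOSE are the variables you
want; see the libhugetlbfs HOWTO for the full list) might look like:

$ mpirun -np 2 -x LD_PRELOAD=libhugetlbfs.so \
         -x HUGETLB_MORECORE=yes -x HUGETLB_VERBOSE=3 ./send_recv 1000 1
  # HUGETLB_MORECORE=yes asks libhugetlbfs to back malloc with huge pages;
  # HUGETLB_VERBOSE raises the library's logging so you can see what got mapped.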

Howard

On 7/21/23, 8:37 AM, "Chandran, Arun" <arun.chand...@amd.com> wrote:


Hi Howard,


Thank you very much for the reply.


UCX is trying to set up the FIFO for shared-memory communication using both
sysv and posix.
By default, these allocations are failing when tried with hugetlbfs:


a) Failure log from strace (pasting only for rank 0):
[pid 3541286] shmget(IPC_PRIVATE, 6291456, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) 
= -1 EPERM (Operation not permitted)
[pid 3541286] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, 
29, 0) = -1 EINVAL (Invalid argument)


b) I was able to overcome the failure for shmget allocation with hugetlbfs by 
adding my gid to "/proc/sys/vm/hugetlb_shm_group"
[pid 3541465] shmget(IPC_PRIVATE, 6291456, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660)
= 2916410 --> success
[pid 3541465] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB,
29, 0) = -1 EINVAL (Invalid argument) --> still fails
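
For reference, the workaround in b) amounts to something like the following
(writing the sysctl needs root/sudo; the group is whatever group the MPI
processes run under):

$ id -g                  # group id of the user running the MPI job
$ sudo sh -c "echo $(id -g) > /proc/sys/vm/hugetlb_shm_group"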


But mmap with "MAP_SHARED|MAP_HUGETLB" is still failing. Any clues?


I am aware of the advantages of huge pages; I am asking from the Open MPI
library perspective:
Should I use them for Open MPI's internal buffers and data structures, or leave
them for the applications to use?
What are the community recommendations in this regard?


--Arun


-Original Message-
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, July 20, 2023 9:36 PM
To: Open MPI Users <users@lists.open-mpi.org>; Florent GERMAIN <florent.germ...@eviden.com>
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: Re: [EXTERNAL] Re: [OMPI users] How to use hugetlbfs with openmpi and 
ucx


Hi Arun,


It's going to be chatty, but you may want to see if strace helps in diagnosing:


mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1


Huge pages often help reduce pressure on a NIC's I/O MMU and speed up
resolving VA-to-PA memory addresses.


On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users"
<users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:


Good luck,


Howard


Hi,




I am trying to use static huge pages, not transparent huge pages.




UCX is allowed to allocate via hugetlbfs.




$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try ---> it is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try ---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap
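
(Side note: if it helps with debugging, the *_HUGETLB_MODE settings above can
presumably be forced from "try" to "yes" so the run fails loudly instead of
silently falling back; the variable names are taken from the ucx_info output,
but the accepted values may differ between UCX versions.)

$ mpirun -np 2 -x UCX_SYSV_HUGETLB_MODE=yes -x UCX_POSIX_HUGETLB_MODE=yes \
         ./send_recv 1000 1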




It is failing even though I have static hugepages available in my system.




$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20




THP is also enabled:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
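
(As a quick cross-check, whether a running rank actually got THP-backed memory
can be read from its smaps_rollup, assuming a reasonably recent kernel; the pid
below is a placeholder.)

$ grep AnonHugePages /proc/<pid>/smaps_rollup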




--Arun




-Original Message-
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx




Hi,
You can check if there are dedicated huge pages on your system or if 
transparent huge pages are allowed.




Transparent huge pages on RHEL systems:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations
   with good alignment)
-> never = never try to aggregate pages on thp




Dedicated huge pages on RHEL systems:
$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 0
-> no dedicated huge pages here





Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-21 Thread Chandran, Arun via users
Hi Howard,

Thank you very much for the reply.

UCX is trying to set up the FIFO for shared-memory communication using both
sysv and posix.
By default, these allocations are failing when tried with hugetlbfs:

a) Failure log from strace (pasting only for rank 0):
   [pid 3541286] shmget(IPC_PRIVATE, 6291456, 
IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) = -1 EPERM (Operation not permitted)
   [pid 3541286] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, 
MAP_SHARED|MAP_HUGETLB, 29, 0) = -1 EINVAL (Invalid argument)

b) I was able to overcome the failure for shmget allocation with hugetlbfs by 
adding my gid to "/proc/sys/vm/hugetlb_shm_group"
[pid 3541465] shmget(IPC_PRIVATE, 6291456,
IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) = 2916410 --> success
[pid 3541465] mmap(NULL, 6291456, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_HUGETLB, 29, 0) = -1 EINVAL (Invalid argument) --> still fails

But mmap with "MAP_SHARED|MAP_HUGETLB" is still failing. Any clues?

I am aware of the advantages of huge pages; I am asking from the Open MPI
library perspective:
Should I use them for Open MPI's internal buffers and data structures, or leave
them for the applications to use?
What are the community recommendations in this regard?

--Arun

-Original Message-
From: Pritchard Jr., Howard  
Sent: Thursday, July 20, 2023 9:36 PM
To: Open MPI Users ; Florent GERMAIN 

Cc: Chandran, Arun 
Subject: Re: [EXTERNAL] Re: [OMPI users] How to use hugetlbfs with openmpi and 
ucx

Hi Arun,

It's going to be chatty, but you may want to see if strace helps in diagnosing:

mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1

Huge pages often help reduce pressure on a NIC's I/O MMU and speed up
resolving VA-to-PA memory addresses.

On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users"
<users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

Good luck,

Howard

Hi,


I am trying to use static huge pages, not transparent huge pages.


UCX is allowed to allocate via hugetlbfs.


$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try ---> it is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try ---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap


It is failing even though I have static hugepages available in my system.


$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20


THP is also enabled:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never


--Arun


-Original Message-
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx


Hi,
You can check if there are dedicated huge pages on your system or if 
transparent huge pages are allowed.


Transparent huge pages on RHEL systems:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations
   with good alignment)
-> never = never try to aggregate pages on thp


Dedicated huge pages on RHEL systems:
$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 0
-> no dedicated huge pages here


It seems that UCX tries to use dedicated huge pages (mmap(addr=(nil),
length=6291456, flags= HUGETLB, fd=29)).
If there are no dedicated huge pages available, mmap fails.
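
If dedicated huge pages are needed, one way to reserve them is shown below
(requires root; the count of 128 is only an example, size it to your buffers):

$ sudo sysctl -w vm.nr_hugepages=128
$ grep HugePages_ /proc/meminfo    # HugePages_Total / HugePages_Free should go up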


Huge pages can accelerate virtual address to physical address translation and 
reduce TLB consumption.
It may be useful for large and frequently used buffers.


Regards,
Florent


-Original Message-
From: users <users-boun...@lists.open-mpi.org> on behalf of Chandran, Arun via users
Sent: Wednesday, July 19, 2023 3:44 PM
To: users@lists.open-mpi.org
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: [OMPI users] How to use hugetlbfs with openmpi and ucx


Hi All,


I am trying to see whether hugetlbfs improves the latency of communication
with a small send-receive program.


mpirun -np 2 --map-by core --bind-to core --mca pml ucx --mca
opal_common_ucx_tls any --mca opal_common_ucx_devices any -mca pml_base_verbose
10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x
UCX_LOG_LEVEL=debug -x UCX_PROTO_INFO=y send_recv 1000 1




But the internal buffer allocation in UCX is unable to select hugetlbfs.


[1688297246.205092] [lib-ssp-04:4022755:0] ucp_context.c:1979 UCX DEBUG 
allocation method[2] is 'huge'
[1688297246.208660] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to 
allocate 8447 bytes with hugetlb-> I checked the 

Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx

2023-07-20 Thread Pritchard Jr., Howard via users
Hi Arun,

It's going to be chatty, but you may want to see if strace helps in diagnosing:

mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1

Huge pages often help reduce pressure on a NIC's I/O MMU and speed up
resolving VA-to-PA memory addresses.

On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users"
<users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

Good luck,

Howard

Hi,


I am trying to use static huge pages, not transparent huge pages.


UCX is allowed to allocate via hugetlbfs.


$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try --->It is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap


It is failing even though I have static hugepages available in my system.


$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20


THP is also enabled:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never


--Arun


-Original Message-
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx


Hi,
You can check if there are dedicated huge pages on your system or if 
transparent huge pages are allowed.


Transparent huge pages on RHEL systems:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations
   with good alignment)
-> never = never try to aggregate pages on thp


Dedicated huge pages on RHEL systems:
$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 0
-> no dedicated huge pages here


It seems that UCX tries to use dedicated huge pages (mmap(addr=(nil),
length=6291456, flags= HUGETLB, fd=29)).
If there are no dedicated huge pages available, mmap fails.


Huge pages can accelerate virtual address to physical address translation and 
reduce TLB consumption.
It may be useful for large and frequently used buffers.
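
(A rough way to check whether TLB pressure actually matters for a given run,
assuming perf is installed and the dTLB events are exposed on your CPU:)

$ perf stat -e dTLB-loads,dTLB-load-misses mpirun -np 2 ./send_recv 1000 1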


Regards,
Florent


-Original Message-
From: users <users-boun...@lists.open-mpi.org> on behalf of Chandran, Arun via users
Sent: Wednesday, July 19, 2023 3:44 PM
To: users@lists.open-mpi.org
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: [OMPI users] How to use hugetlbfs with openmpi and ucx


Hi All,


I am trying to see whether hugetlbfs improves the latency of communication
with a small send-receive program.


mpirun -np 2 --map-by core --bind-to core --mca pml ucx --mca
opal_common_ucx_tls any --mca opal_common_ucx_devices any -mca pml_base_verbose
10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x
UCX_LOG_LEVEL=debug -x UCX_PROTO_INFO=y send_recv 1000 1




But the internal buffer allocation in UCX is unable to select hugetlbfs.


[1688297246.205092] [lib-ssp-04:4022755:0] ucp_context.c:1979 UCX DEBUG 
allocation method[2] is 'huge'
[1688297246.208660] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to
allocate 8447 bytes with hugetlb --> I checked the code; this is a valid
failure as the size is small compared to the huge page size of 2 MB
[1688297246.208704] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to 
allocate 4292720 bytes with hugetlb
[1688297246.210048] [lib-ssp-04:4022755:0] mm_posix.c:332 UCX DEBUG shared 
memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=29) failed: Invalid 
argument
[1688297246.211451] [lib-ssp-04:4022754:0] ucp_context.c:1979 UCX DEBUG 
allocation method[2] is 'huge'
[1688297246.214849] [lib-ssp-04:4022754:0] mm_sysv.c:97 UCX DEBUG mm failed to 
allocate 8447 bytes with hugetlb
[1688297246.214888] [lib-ssp-04:4022754:0] mm_sysv.c:97 UCX DEBUG mm failed to 
allocate 4292720 bytes with hugetlb
[1688297246.216235] [lib-ssp-04:4022754:0] mm_posix.c:332 UCX DEBUG shared 
memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=29) failed: Invalid 
argument


Can someone suggest what steps are needed to enable hugetlbfs [I cannot run my
application as root]? Is using hugetlbfs for the internal buffers recommended?


--Arun