What happens if you specify -mca mtl ofi?

-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Patrick Begou via 
users
Sent: Monday, January 25, 2021 12:54 PM
To: users@lists.open-mpi.org
Cc: Patrick Begou <patrick.be...@univ-grenoble-alpes.fr>
Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

Hi Howard and Michael,

thanks for your feedback. I did not want to write an overly long mail full of 
irrelevant information, so I only showed how the two different builds give 
different results. I'm using a small test case based on my large code, the same 
one used to show the memory leak with mpi_Alltoallv calls, but running just 2 
iterations. It is a 2D case in which data storage is moved from a distribution 
"along the X axis" to one "along the Y axis" with mpi_Alltoallv and subarray 
types. Data initialization is based on the location in the array, which allows 
checking for correct exchanges.
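
To make the check concrete, here is a minimal sketch in plain Python (no MPI) 
of a location-based initialization and the verification after the X-to-Y 
redistribution. It is not the attached test case. The encoding 
1000*ix + iy is only an assumption inferred from messages such as "found 1007 
but expect 3007", and the names encode, block_range, init_along_x, 
redistribute and check are hypothetical. The all-to-all exchange is simulated 
with list slicing.

```python
# Sketch only: simulates the X-slab to Y-slab redistribution without MPI.
# ASSUMPTION: each cell holds 1000*ix + iy (1-based indices); this encoding
# is guessed from the error messages, not taken from the real test case.
NX, NY, NPROCS = 5, 7, 4


def encode(ix, iy):
    """Value stored at global position (ix, iy), both 1-based."""
    return 1000 * ix + iy


def block_range(n, nprocs, rank):
    """1-based half-open range [lo, hi) of the indices owned by rank."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo + 1, hi + 1


def init_along_x(rank):
    """Local array of one rank: its own ix rows, every iy column."""
    x_lo, x_hi = block_range(NX, NPROCS, rank)
    return [[encode(ix, iy) for iy in range(1, NY + 1)]
            for ix in range(x_lo, x_hi)]


def redistribute(x_slabs):
    """Simulated all-to-all: each rank ends with every ix row, its own iy."""
    y_slabs = []
    for dest in range(NPROCS):
        y_lo, y_hi = block_range(NY, NPROCS, dest)
        rows = []
        for src in range(NPROCS):          # sources visited in rank order,
            for row in x_slabs[src]:       # so rows end up ordered by ix
                rows.append(row[y_lo - 1:y_hi - 1])
        y_slabs.append(rows)
    return y_slabs


def check(y_slabs):
    """Compare every received value with the one its position implies."""
    errors = []
    for rank, rows in enumerate(y_slabs):
        y_lo, _ = block_range(NY, NPROCS, rank)
        for i, row in enumerate(rows):
            for j, found in enumerate(row):
                expected = encode(i + 1, y_lo + j)
                if found != expected:
                    errors.append(
                        f"On {rank} found {found} but expect {expected}")
    return errors


if __name__ == "__main__":
    slabs = redistribute([init_along_x(r) for r in range(NPROCS)])
    print(check(slabs) or "all exchanges correct")
```

Because every value encodes its own global position, a misplaced block shows 
up immediately as a mismatch between the value found and the one expected at 
that location.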

When the program runs correctly (on 4 processes in my test) it should only show 
the max RSS of the processes. When it fails it shows the invalid locations. 
I've drastically reduced the size of the problem, to nx=5 and ny=7.

Launching the non-working setup with more verbose output shows:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:115761] mca: base: components_register: registering framework mtl components
[dahu138:115763] mca: base: components_register: registering framework mtl components
[dahu138:115763] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_register: component psm2 register function successful
[dahu138:115763] mca: base: components_open: opening mtl components
[dahu138:115763] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_open: component psm2 open function successful
[dahu138:115761] mca: base: components_register: component psm2 register function successful
[dahu138:115761] mca: base: components_open: opening mtl components
[dahu138:115761] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_open: component psm2 open function successful
[dahu138:115760] mca: base: components_register: registering framework mtl components
[dahu138:115760] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_register: component psm2 register function successful
[dahu138:115760] mca: base: components_open: opening mtl components
[dahu138:115760] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_register: registering framework mtl components
[dahu138:115762] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_open: component psm2 open function successful
[dahu138:115762] mca: base: components_register: component psm2 register function successful
[dahu138:115762] mca: base: components_open: opening mtl components
[dahu138:115762] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_open: component psm2 open function successful
[dahu138:115760] mca:base:select: Auto-selecting mtl components
[dahu138:115760] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set priority to 40
[dahu138:115761] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set priority to 40
[dahu138:115762] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115762] select: initializing mtl component psm2
[dahu138:115761] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set priority to 40
[dahu138:115761] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115761] select: initializing mtl component psm2
[dahu138:115760] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115760] select: initializing mtl component psm2
[dahu138:115763] mca:base:select: Auto-selecting mtl components
[dahu138:115763] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115763] mca:base:select:(  mtl) Query of component [psm2] set priority to 40
[dahu138:115763] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115763] select: initializing mtl component psm2
[dahu138:115761] select: init returned success
[dahu138:115761] select: component psm2 selected
[dahu138:115762] select: init returned success
[dahu138:115762] select: component psm2 selected
[dahu138:115763] select: init returned success
[dahu138:115763] select: component psm2 selected
[dahu138:115760] select: init returned success
[dahu138:115760] select: component psm2 selected
On 1 found 1007 but expect 3007
On 2 found 1007 but expect 4007

and with this setup the code freezes at this problem size.


Below is the same code with my no-ib setup of openMPI on the same node:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:116723] mca: base: components_register: registering framework mtl components
[dahu138:116723] mca: base: components_open: opening mtl components
[dahu138:116724] mca: base: components_register: registering framework mtl components
[dahu138:116724] mca: base: components_open: opening mtl components
[dahu138:116726] mca: base: components_register: registering framework mtl components
[dahu138:116726] mca: base: components_open: opening mtl components
[dahu138:116725] mca: base: components_register: registering framework mtl components
[dahu138:116725] mca: base: components_open: opening mtl components
[INFO MEMORY] : processor 0 uses  9948 kb max of resident memory
[INFO MEMORY] : processor 0 uses  9948 kb max of resident memory
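
To compare the two runs at a glance, a small script can list which MTL 
component each process ended up selecting. This is only a sketch: the regular 
expression assumes the "select: component <name> selected" message format seen 
in the psm2 log above, which may differ in other Open MPI versions, and the 
function name selected_components is hypothetical.

```python
import re

# ASSUMPTION: message format taken from the mtl_base_verbose output quoted in
# this mail: "[host:pid] select: component <name> selected".
SELECTED = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\] select: component (?P<name>\S+) selected"
)


def selected_components(log_text):
    """Map each PID to the MTL component it reports as selected.

    Whitespace is normalized first so that log entries wrapped across
    several lines (as happens in e-mail) are still matched.
    """
    flat = " ".join(log_text.split())
    return {int(m["pid"]): m["name"] for m in SELECTED.finditer(flat)}


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python mtl_summary.py < run.log
    for pid, name in sorted(selected_components(sys.stdin.read()).items()):
        print(pid, name)
```

Applied to the first log this would report psm2 for all four PIDs. Applied to 
the no-ib log it reports nothing, since no MTL component is selected there.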

The test case used is provided as an attachment. Since it runs on many 
OS/OpenMPI/hardware combinations, I do not think the problem lies in the 
test case itself, even though that remains a possibility.

Patrick
