Hi Howard and Michael,

thanks for your feedback. I did not want to write an overly long mail
with irrelevant information, so I will just show how the two different
builds give different results. I'm using a small test case based on my
large code, the same one used to show the memory leak with mpi_Alltoallv
calls, but running only 2 iterations. It is a 2D case, and the data
storage is moved from a distribution "along X axis" to one "along Y axis"
with mpi_Alltoallv and subarray types. Data initialization is based on
the location in the array, which allows checking for correct exchanges.
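
To make the idea concrete, here is a minimal pure-Python sketch of such a
check. It is not the attached test case and uses no MPI: the exchange is
simulated with plain lists. The value encoding 1000*i + j (1-based), the
block decomposition, and all function names are assumptions chosen for
illustration only.

```python
# Simulated X-split -> Y-split redistribution with location-based values.
# Everything here (encoding, decomposition, names) is hypothetical.
NX, NY, NPROCS = 5, 7, 4


def block_range(n, nprocs, rank):
    """Half-open index range [lo, hi) owned by `rank` in a block split."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def expected(i, j):
    """Value encoding the global location (assumed: 1000*i + j, 1-based)."""
    return 1000 * (i + 1) + (j + 1)


def init_split_x(rank):
    """Initial layout: the rank owns a block of X indices and every Y."""
    lo, hi = block_range(NX, NPROCS, rank)
    return {(i, j): expected(i, j) for i in range(lo, hi) for j in range(NY)}


def redistribute(slabs):
    """Stand-in for the all-to-all: each source sends every destination
    the cells that fall inside the destination's block of Y indices."""
    out = [dict() for _ in range(NPROCS)]
    for src in range(NPROCS):
        for dst in range(NPROCS):
            jlo, jhi = block_range(NY, NPROCS, dst)
            for (i, j), value in slabs[src].items():
                if jlo <= j < jhi:
                    out[dst][(i, j)] = value
    return out


def check(rank, local):
    """Return (i, j, found, expected) for every wrong or missing cell."""
    jlo, jhi = block_range(NY, NPROCS, rank)
    errors = []
    for i in range(NX):
        for j in range(jlo, jhi):
            found = local.get((i, j))
            if found != expected(i, j):
                errors.append((i, j, found, expected(i, j)))
    return errors


slabs = [init_split_x(r) for r in range(NPROCS)]
result = redistribute(slabs)
for r in range(NPROCS):
    for i, j, found, exp in check(r, result[r]):
        print(f"On {r} found {found} but expect {exp}")
```

Because every value encodes its own global position, a misplaced element
can be identified from the value alone, without comparing against a
reference copy of the whole array.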

When the program runs correctly (on 4 processes in my test) it should
only show the max RSS size of the processes. When it fails, it shows the
invalid locations. I've drastically reduced the size of the problem, to
nx=5 and ny=7.
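
For readers unfamiliar with the max RSS figure, the following is a small
sketch of how a process can query its own peak resident set size. It is
not how the attached test does it (that code is not shown here); it
assumes a Unix system and uses Python's standard resource module, and
the helper name is hypothetical.

```python
import resource
import sys


def max_rss_kb():
    """Peak resident set size of the current process, in kilobytes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS,
    so the macOS value is converted.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024
    return peak


print(f"max resident memory: {max_rss_kb()} kb")
```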

Launching the non-working setup with more verbose output shows:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:115761] mca: base: components_register: registering framework
mtl components
[dahu138:115763] mca: base: components_register: registering framework
mtl components
[dahu138:115763] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_register: component psm2 register
function successful
[dahu138:115763] mca: base: components_open: opening mtl components
[dahu138:115763] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_open: component psm2 open
function successful
[dahu138:115761] mca: base: components_register: component psm2 register
function successful
[dahu138:115761] mca: base: components_open: opening mtl components
[dahu138:115761] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_open: component psm2 open
function successful
[dahu138:115760] mca: base: components_register: registering framework
mtl components
[dahu138:115760] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_register: component psm2 register
function successful
[dahu138:115760] mca: base: components_open: opening mtl components
[dahu138:115760] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_register: registering framework
mtl components
[dahu138:115762] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_open: component psm2 open
function successful
[dahu138:115762] mca: base: components_register: component psm2 register
function successful
[dahu138:115762] mca: base: components_open: opening mtl components
[dahu138:115762] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_open: component psm2 open
function successful
[dahu138:115760] mca:base:select: Auto-selecting mtl components
[dahu138:115760] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115761] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115762] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115762] select: initializing mtl component psm2
[dahu138:115761] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115761] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115761] select: initializing mtl component psm2
[dahu138:115760] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115760] select: initializing mtl component psm2
[dahu138:115763] mca:base:select: Auto-selecting mtl components
[dahu138:115763] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115763] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115763] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115763] select: initializing mtl component psm2
[dahu138:115761] select: init returned success
[dahu138:115761] select: component psm2 selected
[dahu138:115762] select: init returned success
[dahu138:115762] select: component psm2 selected
[dahu138:115763] select: init returned success
[dahu138:115763] select: component psm2 selected
[dahu138:115760] select: init returned success
[dahu138:115760] select: component psm2 selected
On 1 found 1007 but expect 3007
On 2 found 1007 but expect 4007

and with this setup the code freezes with this problem size.


Below is the same code with my no-ib setup of OpenMPI on the same node:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:116723] mca: base: components_register: registering framework
mtl components
[dahu138:116723] mca: base: components_open: opening mtl components
[dahu138:116724] mca: base: components_register: registering framework
mtl components
[dahu138:116724] mca: base: components_open: opening mtl components
[dahu138:116726] mca: base: components_register: registering framework
mtl components
[dahu138:116726] mca: base: components_open: opening mtl components
[dahu138:116725] mca: base: components_register: registering framework
mtl components
[dahu138:116725] mca: base: components_open: opening mtl components
[INFO MEMORY] : processor 0 uses  9948 kb max of resident memory
[INFO MEMORY] : processor 0 uses  9948 kb max of resident memory

The test case used is provided as an attachment. Since it runs on many
OS/OpenMPI/hardware combinations, I do not think the problem is in the
test case itself, although that is also a possibility.

Patrick

Attachment: test_layout_array.tar.gz
Description: application/gzip
