Re: [OMPI users] Quality and details of implementation for Neighborhood collective operations

2022-06-08 Thread Michael Thomadakis via users
I see, thanks

Is there any plan to apply any optimizations on the Neighbor collectives at
some point?

regards
Michael

On Wed, Jun 8, 2022 at 1:29 PM George Bosilca  wrote:

> Michael,
>
> As far as I know none of the implementations of the
> neighborhood collectives in OMPI are architecture-aware. The only 2
> components that provide support for neighborhood collectives are basic (for
> the blocking version) and libnbc (for the non-blocking versions).
>
>   George.
>
>
> On Wed, Jun 8, 2022 at 1:27 PM Michael Thomadakis via users <
> users@lists.open-mpi.org> wrote:
>
>> Hello OpenMPI
>>
>> I was wondering if the MPI_Neighbor_x calls have received any
>> special design and optimizations in OpenMPI 4.1.x+ for these patterns of
>> communication.
>>
>> For instance, these could benefit from proximity awareness and intra- vs
>> inter-node communications. However, even single node communications have
hierarchical structure due to the increased number of NUMA domains, larger
>> L3 caches and so on.
>>
>> Is OpenMPI 4.1.x+ leveraging any special logic to optimize these calls?
>> Is UCX or UCC/HCOLL doing anything special or is OpenMPI using these lower
>> layers in a more "intelligent" way to provide
>> optimized neighborhood collectives?
>>
>> Thank you very much
>> Michael
>>
>
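
For reference, a quick way to see which collective components a given Open MPI build actually provides (a minimal sketch; component names and ompi_info output vary by version and build options):

```
# List the coll components compiled into this installation; per the reply
# above, the neighborhood collectives come from "basic" and "libnbc".
ompi_info | grep " coll:"

# Optionally dump the collective-framework MCA parameters for closer inspection.
ompi_info --param coll all --level 9 | head -40
```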


[OMPI users] Quality and details of implementation for Neighborhood collective operations

2022-06-08 Thread Michael Thomadakis via users
Hello OpenMPI

I was wondering if the MPI_Neighbor_x calls have received any special
design and optimizations in OpenMPI 4.1.x+ for these patterns of
communication.

For instance, these could benefit from proximity awareness and intra- vs
inter-node communications. However, even single node communications have
hierarchical structure due to the increased number of NUMA domains, larger
L3 caches and so on.

Is OpenMPI 4.1.x+ leveraging any special logic to optimize these calls? Is
UCX or UCC/HCOLL doing anything special or is OpenMPI using these lower
layers in a more "intelligent" way to provide
optimized neighborhood collectives?

Thank you very much
Michael


Re: [OMPI users] [EXTERNAL] strange pml error

2021-11-03 Thread Michael Di Domenico via users
this seemed to help me as well, so far at least.  still have a lot
more testing to do

On Tue, Nov 2, 2021 at 4:15 PM Shrader, David Lee  wrote:
>
> As a workaround for now, I have found that setting OMPI_MCA_pml=ucx seems to 
> get around this issue. I'm not sure why this works, but perhaps there is 
> different initialization that happens such that the offending device search 
> problem doesn't occur?
>
>
> Thanks,
>
> David
>
>
>
> 
> From: Shrader, David Lee
> Sent: Tuesday, November 2, 2021 2:09 PM
> To: Open MPI Users
> Cc: Michael Di Domenico
> Subject: Re: [EXTERNAL] [OMPI users] strange pml error
>
>
> I too have been getting this using 4.1.1, but not with the master nightly 
> tarballs from mid-October. I still have it on my to-do list to open a github 
> issue. The problem seems to come from device detection in the ucx pml: on 
> some ranks, it fails to find a device and thus the ucx pml disqualifies 
> itself. Which then just leaves the ob1 pml.
>
>
> Thanks,
>
> David
>
>
>
> 
> From: users  on behalf of Michael Di 
> Domenico via users 
> Sent: Tuesday, November 2, 2021 1:35 PM
> To: Open MPI Users
> Cc: Michael Di Domenico
> Subject: [EXTERNAL] [OMPI users] strange pml error
>
> fairly frequently, but not every time when trying to run xhpl on a new
> machine i'm bumping into this.  it happens with a single node or
> multiple nodes
>
> node1 selected pml ob1, but peer on node1 selected pml ucx
>
> if i rerun the exact same command a few minutes later, it works fine.
> the machine is new and i'm the only one using it so there are no user
> conflicts
>
> the software stack is
>
> slurm 21.8.2.1
> ompi 4.1.1
> pmix 3.2.3
> ucx 1.9.0
>
> the hardware is HPE w/ mellanox edr cards (but i doubt that matters)
>
> any thoughts?
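
A minimal sketch of the workaround mentioned above, plus a verbose run to see why the UCX PML disqualifies itself on some ranks (MCA names assumed from the 4.1.x series; the xhpl invocation and rank count are placeholders):

```
# Force the UCX PML for every rank (the workaround described above).
export OMPI_MCA_pml=ucx
mpirun -np 64 ./xhpl

# Alternatively, turn up PML selection verbosity to see why ob1/ucx get
# picked inconsistently across ranks.
mpirun --mca pml ucx --mca pml_base_verbose 10 -np 64 ./xhpl
```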


[OMPI users] strange pml error

2021-11-02 Thread Michael Di Domenico via users
fairly frequently, but not every time when trying to run xhpl on a new
machine i'm bumping into this.  it happens with a single node or
multiple nodes

node1 selected pml ob1, but peer on node1 selected pml ucx

if i rerun the exact same command a few minutes later, it works fine.
the machine is new and i'm the only one using it so there are no user
conflicts

the software stack is

slurm 21.8.2.1
ompi 4.1.1
pmix 3.2.3
ucx 1.9.0

the hardware is HPE w/ mellanox edr cards (but i doubt that matters)

any thoughts?


Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-27 Thread Heinz, Michael William via users
Pavel,

Did you ever resolve this? A co-worker pointed out that setting that variable 
is the recommended way to use OMPI, PSM2 and SLURM. You can download the user 
manual here:

https://www.intel.com/content/www/us/en/design/products-and-solutions/networking-and-io/fabric-products/omni-path/product-releases-library.html?grouping=EMT_Content%20Type&sort=title:asc&filter=rdctopics:releaseversion%2Flatestrelease

From: users  On Behalf Of Pavel Mezentsev via 
users
Sent: Friday, May 21, 2021 7:57 AM
To: Open MPI Users 
Cc: Pavel Mezentsev 
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

Thank you very much for all the suggestions.

1) Sadly setting 
`OMPI_MCA_orte_precondition_transports="0123456789ABCDEF-0123456789ABCDEF"` did 
not help, still got the same error about not getting this piece of info from 
ORTE.

2) I rebuilt OpenMPI without slurm. Don't remember the exact message but 
`configure` told me that libevent is necessary for PSM2 so I had to add it, 
ended up with the following options:
Configure command line: '--prefix=${BUILD_PATH}'
  '--build=x86_64-pc-linux-gnu'
  '--host=x86_64-pc-linux-gnu' '--enable-shared'
  '--with-hwloc=${HWLOC_PATH}'
  '--with-psm2' '--disable-oshmem' '--with-gpfs'
  '--with-libevent=${LIBEVENT_PATH}'

With this MPI I was able to achieve expected results:
mpirun -np 2 --hostfile hostfile --map-by node ./osu_bw
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node01
  Local device: hfi1_0
--
# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   1.12
2   2.24
4   4.59
8   8.92
16 16.54
32 33.78
64 71.79
128   138.35
256   249.74
512   421.12
1024  719.52
2048 1258.79
4096 2034.16
8192 2021.40
16384 2269.95
32768 2573.85
65536 2749.05
131072   3178.84
262144   7150.88
524288   9027.82
1048576 10586.48
2097152 11828.10
4194304 11910.87
[jrlogin01.jureca:11482] 1 more process has sent help message 
help-mpi-btl-openib.txt / error in device init
[jrlogin01.jureca:11482] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages

Latency also improved from 29.69 us to 2.63 us

3) It would be nice to get it working with a build that supports both Slurm and 
PSM2 but for the time being doing it without slurm support is also an option 
for me.

4) This one is slightly off the original topic. Now I need to run an 
application across two partitions: one has an Infiniband fabric and the other 
has OmniPath, they are connected via gateways that have HCAs of both types so 
they can route IP traffic from one network to another. What would be the proper 
way to launch a job across the nodes that are connected to different types of 
networks?
In the old days I would just specify `OMPI_MCA_btl=tcp,sm,self`. However these 
days it's a bit more complicated. For each partition I have a different build 
of OpenMPI: for IB I have one with UCX, for OmniPath I have the one that I 
built recently. For UCX I could set `USX_TLS="tcbp,sm"` and something to 
accomplish the same on the OmniPath side. Would the processes launched by these 
two builds of OpenMPI be able to communicate with each other?
Another idea that came to mind was to get an OpenMPI build that would not have 
any high performance fabric support and would only work via TCP. So any advice 
on how to accomplish my goal would be appreciated.

I realize that performance-wise that is going to be quite... sad. But currently 
that's not the main concern.

Regards, Pavel Mezentsev.
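
One way to sketch the TCP-only fallback described in point 4 above (component names are assumptions for the 4.1.x series and should be checked with ompi_info; the hostfile and benchmark are placeholders):

```
# Restrict both builds to TCP between nodes, plus shared memory and self
# within a node. In Open MPI 4.x the shared-memory BTL is "vader"; confirm
# what your build provides with: ompi_info | grep btl
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,tcp
mpirun --hostfile hostfile --map-by node ./osu_bw
```

Whether two differently configured installations can actually interoperate this way still depends on them being the same (wire-compatible) Open MPI version, so treat this only as a starting point.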

On Wed, May 19, 2021 at 5:40 PM Heinz, Michael William via users 
<users@lists.open-mpi.org> wrote:
Right. there was a reference counting issue in OMPI that required a change to 
PSM2 to properly fix. There's a configuration option to disable the reference 
count check at build time, although  I don't recall what the option is off the 
top of my head.

From: Carlson, Timothy S 
Sent: Wednesday, May 19, 2021 11:31 AM
To: Open MPI Users 
Cc: Heinz, Michael William 
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

Just some more data fr

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Heinz, Michael William via users
Right. there was a reference counting issue in OMPI that required a change to 
PSM2 to properly fix. There's a configuration option to disable the reference 
count check at build time, although  I don't recall what the option is off the 
top of my head.

From: Carlson, Timothy S 
Sent: Wednesday, May 19, 2021 11:31 AM
To: Open MPI Users 
Cc: Heinz, Michael William 
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

Just some more data from my OmniPath-based cluster.

There certainly was a change from 4.0.x to 4.1.x

With 4.0.1 I would build openmpi with


./configure --with-psm2 --with-slurm --with-pmi=/usr



And while srun would spit out a warning, the performance was as expected.



srun -N 2 --ntasks-per-node=1 -A ops -p short mpi/pt2pt/osu_latency

-cut-

WARNING: There was an error initializing an OpenFabrics device.



  Local host:   n0005

  Local device: hfi1_0

--

# OSU MPI Latency Test v5.5

# Size  Latency (us)

0   1.13

1   1.13

2   1.13

4   1.13

8   1.13

16  1.49

32  1.49

64  1.39



-cut-



Similarly for bandwidth



32768   6730.96

65536   9801.56

131072  11887.62

262144  11959.18

524288  12062.57

1048576 12038.13

2097152 12048.90

4194304 12112.04



With 4.1.x it appears I need to upgrade my psm2 installation from what I have 
now



# rpm -qa | grep psm2

libpsm2-11.2.89-1.x86_64

libpsm2-devel-11.2.89-1.x86_64

libpsm2-compat-11.2.89-1.x86_64

libfabric-psm2-1.7.1-10.x86_64



because configure spit out this warning



WARNING: PSM2 needs to be version 11.2.173 or later. Disabling MTL



The above cluster is running IntelOPA 10.9.2



Tim
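
A sketch of acting on that configure warning: check the installed PSM2 version against the 11.2.173 requirement, then point the build at a newer install (the install prefix below is a placeholder):

```
# Check what the distro currently provides.
rpm -q libpsm2 libpsm2-devel

# After installing a newer libpsm2 (e.g. built from the opa-psm2 sources on
# GitHub) into its own prefix, point Open MPI's configure at it.
./configure --with-psm2=/opt/psm2 --with-slurm --with-pmi=/usr
make -j && make install
```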

From: users  on behalf of "Heinz, Michael William via users" 
Reply-To: Open MPI Users 
Date: Wednesday, May 19, 2021 at 7:57 AM
To: Open MPI Users 
Cc: "Heinz, Michael William" 
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath


After thinking about this for a few more minutes, it occurred to me that you 
might be able to "fake" the required UUID support by passing it as a shell 
variable. For example:

export OMPI_MCA_orte_precondition_transports="0123456789ABCDEF-0123456789ABCDEF"

would probably do it. However, note that the format of the string must be 16 
hex digits, a hyphen, then 16 more hex digits. anything else will be rejected. 
Also, I have never tried doing this, YMMV.

From: Heinz, Michael William
Sent: Wednesday, May 19, 2021 10:35 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: RE: [OMPI users] unable to launch a job on a system with OmniPath

So, the bad news is that the PSM2 MTL requires ORTE - ORTE generates a UUID to 
identify the job across all nodes in the fabric, allowing processes to find 
each other over OPA at init time.

I believe the reason this works when you use OFI/libfabric is that libfabric 
generates its own UUIDs.

From: users  On Behalf Of Ralph Castain via users
Sent: Wednesday, May 19, 2021 10:19 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

The original configure line is correct ("--without-orte") - just a typo in the 
later text.

You may be running into some issues with Slurm's built-in support for OMPI. Try 
running it with OMPI's "mpirun" instead and see if you get better performance. 
You'll have to reconfigure to remove the "--without-orte" and 
"--with-ompi-pmix-rte" options. I would also recommend removing the 
"--with-pmix=external --with-libevent=external --with-hwloc=xxx 
--with-libevent=xxx" entries.

In other words, get down to a vanilla installation so we know what we are 
dealing with - otherwise, it gets very hard to help you.


On May 19, 2021, at 7:09 AM, Jorge D'Elia via users 
<users@lists.open-mpi.org> wrote:

- Original Message -
From: "Pavel Mezentsev via users" 
To: users@lists.open-mpi.org
CC: "Pavel Mezentsev" 
Sent: Wednesday, May 19, 2021 10:53:

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Heinz, Michael William via users
After thinking about this for a few more minutes, it occurred to me that you 
might be able to "fake" the required UUID support by passing it as a shell 
variable. For example:

export OMPI_MCA_orte_precondition_transports="0123456789ABCDEF-0123456789ABCDEF"

would probably do it. However, note that the format of the string must be 16 
hex digits, a hyphen, then 16 more hex digits. anything else will be rejected. 
Also, I have never tried doing this, YMMV.
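
A minimal sketch of generating a key in the required 16-hex/hyphen/16-hex format rather than hard-coding one (assumes openssl is available; as noted above, this whole approach is untested):

```
# Two independent 8-byte random values, each printed as 16 hex digits,
# joined by a hyphen as described above.
export OMPI_MCA_orte_precondition_transports="$(openssl rand -hex 8)-$(openssl rand -hex 8)"
echo "$OMPI_MCA_orte_precondition_transports"
```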

From: Heinz, Michael William
Sent: Wednesday, May 19, 2021 10:35 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: RE: [OMPI users] unable to launch a job on a system with OmniPath

So, the bad news is that the PSM2 MTL requires ORTE - ORTE generates a UUID to 
identify the job across all nodes in the fabric, allowing processes to find 
each other over OPA at init time.

I believe the reason this works when you use OFI/libfabric is that libfabric 
generates its own UUIDs.

From: users  On Behalf Of Ralph Castain via users
Sent: Wednesday, May 19, 2021 10:19 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

The original configure line is correct ("--without-orte") - just a typo in the 
later text.

You may be running into some issues with Slurm's built-in support for OMPI. Try 
running it with OMPI's "mpirun" instead and see if you get better performance. 
You'll have to reconfigure to remove the "--without-orte" and 
"--with-ompi-pmix-rte" options. I would also recommend removing the 
"--with-pmix=external --with-libevent=external --with-hwloc=xxx 
--with-libevent=xxx" entries.

In other words, get down to a vanilla installation so we know what we are 
dealing with - otherwise, it gets very hard to help you.


On May 19, 2021, at 7:09 AM, Jorge D'Elia via users 
<users@lists.open-mpi.org> wrote:

- Original Message -
From: "Pavel Mezentsev via users" 
To: users@lists.open-mpi.org
CC: "Pavel Mezentsev" 
Sent: Wednesday, May 19, 2021 10:53:50
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

It took some time but my colleague was able to build OpenMPI and get it
working with OmniPath, however the performance is quite disappointing.
The configuration line used was the following: ./configure
--prefix=$INSTALL_PATH  --build=x86_64-pc-linux-gnu
--host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC
--with-psm2 --with-ofi=$EBROOTLIBFABRIC --with-libevent=$EBROOTLIBEVENT
--without-orte --disable-oshmem --with-gpfs --with-slurm
--with-pmix=external --with-libevent=external --with-ompi-pmix-rte

/usr/bin/srun --cpu-bind=none --mpi=pspmix --ntasks-per-node 1 -n 2 xenv -L
Architecture/KNL -L GCC -L OpenMPI env OMPI_MCA_btl_base_verbose="99"
OMPI_MCA_mtl_base_verbose="99" numactl --physcpubind=1 ./osu_bw
...
[node:18318] select: init of component ofi returned success
[node:18318] mca: base: components_register: registering framework mtl
components
[node:18318] mca: base: components_register: found loaded component ofi

[node:18318] mca: base: components_register: component ofi register
function successful
[node:18318] mca: base: components_open: opening mtl components

[node:18318] mca: base: components_open: found loaded component ofi

[node:18318] mca: base: components_open: component ofi open function
successful
[node:18318] mca:base:select: Auto-selecting mtl components
[node:18318] mca:base:select:(  mtl) Querying component [ofi]

[node:18318] mca:base:select:(  mtl) Query of component [ofi] set priority
to 25
[node:18318] mca:base:select:(  mtl) Selected component [ofi]

[node:18318] select: initializing mtl component ofi
[node:18318] mtl_ofi_component.c:378: mtl:ofi:provider: hfi1_0
...
# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   0.05
2   0.10
4   0.20
8   0.41
16  0.77
32  1.54
64  3.10
128 6.09
256    12.39
512    24.23
1024   46.85
2048   87.99
4096  100.72
8192  139.91
16384 173.67
32768 197.82
65536 210.15
131072    215.76
262144    214.39
524288    219.23
1048576   223.53
2097152   226.93
4194304   227.62

If I test directly with `ib_write_bw` I get
#bytes #iterationsBW peak[MB/sec]BW average[MB/sec]
MsgRate[Mpps]
Conflicting CPU frequency values detecte

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Heinz, Michael William via users
Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
Paraje El Pozo, 3000, Santa Fe, ARGENTINA.
Tel +54-342-4511594/95 ext 7062, fax: +54-342-4511169



What am I missing and how can I improve the performance?

Regards, Pavel Mezentsev.

On Mon, May 10, 2021 at 6:20 PM Heinz, Michael William <
michael.william.he...@cornelisnetworks.com> wrote:


That warning is an annoying bit of cruft from the openib / verbs provider
that can be ignored. (Actually, I recommend using "-btl ^openib" to
suppress the warning.)

That said, there is a known issue with selecting PSM2 and OMPI 4.1.0. I'm
not sure that that's the problem you're hitting, though, because you really
haven't provided a lot of information.

I would suggest trying the following to see what happens:

${PATH_TO_OMPI}/mpirun -mca mtl psm2 -mca btl ^openib -mca
mtl_base_verbose 99 -mca btl_base_verbose 99 -n ${N} -H ${HOSTS}
my_application

This should give you detailed information on what transports were
selected and what happened next.

Oh - and make sure your fabric is up with an opainfo or opareport
command, just to make sure.



From: users  On Behalf Of Pavel Mezentsev via users
Sent: Monday, May 10, 2021 8:41 AM
To: users@lists.open-mpi.org
Cc: Pavel Mezentsev 
Subject: [OMPI users] unable to launch a job on a system with OmniPath



Hi!

I'm working on a system with KNL and OmniPath and I'm trying to launch a
job but it fails. Could someone please advise what parameters I need to add
to make it work properly? At first I need to make it work within one node,
however later I need to use multiple nodes and eventually I may need to
switch to TCP to run a hybrid job where some nodes are connected via
Infiniband and some nodes are connected via OmniPath.



So far without any extra parameters I get:
```
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter
to true.

 Local host:  XX
 Local adapter:   hfi1_0
 Local port:  1
```

If I add `OMPI_MCA_btl_openib_allow_ib="true"` then I get:
```
Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in
the environment).

 Local host: XX

```
Then I tried adding OMPI_MCA_mtl="psm2" or OMPI_MCA_mtl="ofi" to make it
use omnipath or OMPI_MCA_btl="sm,self" to make it use only shared memory.
But these parameters did not make any difference.
There does not seem to be much omni-path related documentation, at least I
was not able to find anything that would help me but perhaps I missed
something:
https://www.open-mpi.org/faq/?category=running#opa-support
https://www.open-mpi.org/faq/?category=opa

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-10 Thread Heinz, Michael William via users
That warning is an annoying bit of cruft from the openib / verbs provider that 
can be ignored. (Actually, I recommend using "-btl ^openib" to suppress the 
warning.)

That said, there is a known issue with selecting PSM2 and OMPI 4.1.0. I'm not 
sure that that's the problem you're hitting, though, because you really haven't 
provided a lot of information.

I would suggest trying the following to see what happens:

${PATH_TO_OMPI}/mpirun -mca mtl psm2 -mca btl ^openib -mca mtl_base_verbose 99 
-mca btl_base_verbose 99 -n ${N} -H ${HOSTS} my_application

This should give you detailed information on what transports were selected and 
what happened next.

Oh - and make sure your fabric is up with an opainfo or opareport command, just 
to make sure.

From: users  On Behalf Of Pavel Mezentsev via 
users
Sent: Monday, May 10, 2021 8:41 AM
To: users@lists.open-mpi.org
Cc: Pavel Mezentsev 
Subject: [OMPI users] unable to launch a job on a system with OmniPath

Hi!
I'm working on a system with KNL and OmniPath and I'm trying to launch a job 
but it fails. Could someone please advise what parameters I need to add to make 
it work properly? At first I need to make it work within one node, however 
later I need to use multiple nodes and eventually I may need to switch to TCP 
to run a hybrid job where some nodes are connected via Infiniband and some 
nodes are connected via OmniPath.

So far without any extra parameters I get:
```
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:  XX
  Local adapter:   hfi1_0
  Local port:  1
```

If I add `OMPI_MCA_btl_openib_allow_ib="true"` then I get:
```
Error obtaining unique transport key from ORTE (orte_precondition_transports 
not present in
the environment).

  Local host: XX

```
Then I tried adding OMPI_MCA_mtl="psm2" or OMPI_MCA_mtl="ofi" to make it use 
omnipath or OMPI_MCA_btl="sm,self" to make it use only shared memory. But these 
parameters did not make any difference.
There does not seem to be much omni-path related documentation, at least I was 
not able to find anything that would help me but perhaps I missed something:
https://www.open-mpi.org/faq/?category=running#opa-support
https://www.open-mpi.org/faq/?category=opa

This is the `configure` line:
```
./configure --prefix=X --build=x86_64-pc-linux-gnu  
--host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC 
--with-psm2 --with-libevent=$EBROOTLIBEVENT --without-orte --disable-oshmem 
--with-cuda=$EBROOTCUDA --with-gpfs --with-slurm --with-pmix=external 
--with-libevent=external --with-ompi-pmix-rte
```
Which also raises another question: if it was built with `--without-orte` then 
why do I get an error about failing to get something from ORTE.
The OpenMPI version is `4.1.0rc1` built with `gcc-9.3.0`.

Thank you in advance!
Regards, Pavel Mezentsev.


Re: [OMPI users] Building Open-MPI with Intel C

2021-04-07 Thread Heinz, Michael William via users
Sorry – I did actually send a thank you to Gilles and John @ 8:48 local time 
but it looks like at some point in my conversation with Gilles we stopped 
CC’ing the list – which means John never saw my thank you.

So, “Thanks for the help, John!”

From: users  On Behalf Of Jeff Squyres 
(jsquyres) via users
Sent: Wednesday, April 7, 2021 10:28 AM
To: John Hearns 
Cc: Jeff Squyres (jsquyres) ; Open MPI User's List 

Subject: Re: [OMPI users] Building Open-MPI with Intel C

:-)

For the web archives: Mike confirmed to me off-list that the non-interactive 
login setup was, indeed, the issue, and he's now good to go.



On Apr 7, 2021, at 10:09 AM, John Hearns 
<hear...@gmail.com> wrote:

Jeff, you know as well as I do that EVERYTHING is in the path at Cornelis 
Networks.

On Wed, 7 Apr 2021 at 14:59, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com> wrote:
Check the output from ldd in a non-interactive login: your LD_LIBRARY_PATH 
probably doesn't include the location of the Intel runtime.

E.g.

ssh othernode ldd /path/to/orted

Your shell startup files may well differentiate between interactive and 
non-interactive logins (i.e., it may set PATH / LD_LIBRARY_PATH / etc. 
differently).



On Apr 7, 2021, at 7:21 AM, John Hearns via users 
<users@lists.open-mpi.org> wrote:

Manually log into one of your nodes. Load the modules you use in a batch job. 
Run 'ldd' on your executable.
Start at the bottom and work upwards...

By the way, have you looked at using Easybuild? Would be good to have your 
input there maybe.


On Wed, 7 Apr 2021 at 01:01, Heinz, Michael William via users 
<users@lists.open-mpi.org> wrote:
I’m having a heck of a time building OMPI with Intel C. Compilation goes fine, 
installation goes fine, compiling test apps (the OSU benchmarks) goes fine…

but when I go to actually run an MPI app I get:

[awbp025:~/work/osu-icc](N/A)$ /usr/mpi/icc/openmpi-icc/bin/mpirun -np 2 -H 
awbp025,awbp026,awbp027,awbp028 -x FI_PROVIDER=opa1x -x 
LD_LIBRARY_PATH=/usr/mpi/icc/openmpi-icc/lib64:/lib hostname
/usr/mpi/icc/openmpi-icc/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
/usr/mpi/icc/openmpi-icc/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory

Looking at orted, it does seem like the binary is linking correctly:

[awbp025:~/work/osu-icc](N/A)$ /usr/mpi/icc/openmpi-icc/bin/orted
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
ess_env_module.c at line 135
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
util/session_dir.c at line 107
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
util/session_dir.c at line 346
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
base/ess_base_std_orted.c at line 264
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--

and…

[awbp025:~/work/osu-icc](N/A)$ ldd /usr/mpi/icc/openmpi-icc/bin/orted
linux-vdso.so.1 (0x7fffc2ebf000)
libopen-rte.so.40 => /usr/mpi/icc/openmpi-icc/lib/libopen-rte.so.40 
(0x7fdaa6404000)
libopen-pal.so.40 => /usr/mpi/icc/openmpi-icc/lib/libopen-pal.so.40 
(0x7fdaa60bd000)
libopen-orted-mpir.so => 
/usr/mpi/icc/openmpi-icc/lib/libopen-orted-mpir.so (0x7fdaa5ebb000)
libm.so.6 => /lib64/libm.so.6 (0x7fdaa5b39000)
librt.so.1 => /lib64/librt.so.1 (0x7fdaa5931000)
libutil.so.1 => /lib64/libutil.so.1 (0x7fdaa572d000)
libz.so.1 => /lib64/libz.so.1 (0x7fdaa5516000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fdaa52fe000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7fdaa50de000)
libc.so.6 => /lib64/libc.so.6 (0x7fdaa4d1b000)
libdl.so.2 => /lib64/libdl.so.2 (0x7fdaa4b17000)
libimf.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libimf.so
 (0x7fdaa4494000)
libsvml.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libsvml.so
 (0x7fdaa29c4000)
libirng.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libirng.so
 (0x7fdaa2659000)
libintlc.so.5 => 
/opt/inte

Re: [OMPI users] Building Open-MPI with Intel C

2021-04-07 Thread Heinz, Michael William via users
Giles,

I’ll double check - but the intel runtime is installed on all machines in the 
fabric.

-
Michael Heinz
michael.william.he...@cornelisnetworks.com

On Apr 7, 2021, at 2:42 AM, Gilles Gouaillardet via users 
<users@lists.open-mpi.org> wrote:

Michael,

orted is able to find its dependencies to the Intel runtime on the
host where you sourced the environment.
However, it is unlikely able to do it on a remote host
For example
ssh ... ldd `which opted`
will likely fail.

An option is to use -rpath (and add the path to the Intel runtime).
IIRC, there is also an option in the Intel compiler to statically link
to the runtime.

Cheers,

Gilles
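
A sketch of the two options Gilles describes, using the 2020.4 runtime path from the ldd output above (compiler and flag choices are assumptions for the classic icc/icpc/ifort compilers):

```
# Option 1: bake an rpath to the Intel runtime into the Open MPI binaries.
INTEL_LIB=/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin
./configure CC=icc CXX=icpc FC=ifort \
    LDFLAGS="-Wl,-rpath,${INTEL_LIB}" --prefix=/usr/mpi/icc/openmpi-icc

# Option 2: statically link the Intel runtime instead.
./configure CC=icc CXX=icpc FC=ifort \
    LDFLAGS="-static-intel" --prefix=/usr/mpi/icc/openmpi-icc
```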

On Wed, Apr 7, 2021 at 9:00 AM Heinz, Michael William via users
<users@lists.open-mpi.org> wrote:

I’m having a heck of a time building OMPI with Intel C. Compilation goes fine, 
installation goes fine, compiling test apps (the OSU benchmarks) goes fine…



but when I go to actually run an MPI app I get:



[awbp025:~/work/osu-icc](N/A)$ /usr/mpi/icc/openmpi-icc/bin/mpirun -np 2 -H 
awbp025,awbp026,awbp027,awbp028 -x FI_PROVIDER=opa1x -x 
LD_LIBRARY_PATH=/usr/mpi/icc/openmpi-icc/lib64:/lib hostname

/usr/mpi/icc/openmpi-icc/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory

/usr/mpi/icc/openmpi-icc/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory



Looking at orted, it does seem like the binary is linking correctly:



[awbp025:~/work/osu-icc](N/A)$ /usr/mpi/icc/openmpi-icc/bin/orted

[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
ess_env_module.c at line 135

[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
util/session_dir.c at line 107

[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
util/session_dir.c at line 346

[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
base/ess_base_std_orted.c at line 264

--

It looks like orte_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during orte_init; some of which are due to configuration or

environment problems.  This failure appears to be an internal failure;

here's some additional information (which may only be relevant to an

Open MPI developer):



 orte_session_dir failed

 --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

--



and…



[awbp025:~/work/osu-icc](N/A)$ ldd /usr/mpi/icc/openmpi-icc/bin/orted

   linux-vdso.so.1 (0x7fffc2ebf000)

   libopen-rte.so.40 => /usr/mpi/icc/openmpi-icc/lib/libopen-rte.so.40 
(0x7fdaa6404000)

   libopen-pal.so.40 => /usr/mpi/icc/openmpi-icc/lib/libopen-pal.so.40 
(0x7fdaa60bd000)

   libopen-orted-mpir.so => 
/usr/mpi/icc/openmpi-icc/lib/libopen-orted-mpir.so (0x7fdaa5ebb000)

   libm.so.6 => /lib64/libm.so.6 (0x7fdaa5b39000)

   librt.so.1 => /lib64/librt.so.1 (0x7fdaa5931000)

   libutil.so.1 => /lib64/libutil.so.1 (0x7fdaa572d000)

   libz.so.1 => /lib64/libz.so.1 (0x7fdaa5516000)

   libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fdaa52fe000)

   libpthread.so.0 => /lib64/libpthread.so.0 (0x7fdaa50de000)

   libc.so.6 => /lib64/libc.so.6 (0x7fdaa4d1b000)

   libdl.so.2 => /lib64/libdl.so.2 (0x7fdaa4b17000)

   libimf.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libimf.so
 (0x7fdaa4494000)

   libsvml.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libsvml.so
 (0x7fdaa29c4000)

   libirng.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libirng.so
 (0x7fdaa2659000)

   libintlc.so.5 => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libintlc.so.5
 (0x7fdaa23e1000)

   /lib64/ld-linux-x86-64.so.2 (0x7fdaa66d6000)



Can anyone suggest what I’m forgetting to do?



---

Michael Heinz
Fabric Software Engineer, Cornelis Networks





[OMPI users] Building Open-MPI with Intel C

2021-04-06 Thread Heinz, Michael William via users
I'm having a heck of a time building OMPI with Intel C. Compilation goes fine, 
installation goes fine, compiling test apps (the OSU benchmarks) goes fine...

but when I go to actually run an MPI app I get:

[awbp025:~/work/osu-icc](N/A)$ /usr/mpi/icc/openmpi-icc/bin/mpirun -np 2 -H 
awbp025,awbp026,awbp027,awbp028 -x FI_PROVIDER=opa1x -x 
LD_LIBRARY_PATH=/usr/mpi/icc/openmpi-icc/lib64:/lib hostname
/usr/mpi/icc/openmpi-icc/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
/usr/mpi/icc/openmpi-icc/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory

Looking at orted, it does seem like the binary is linking correctly:

[awbp025:~/work/osu-icc](N/A)$ /usr/mpi/icc/openmpi-icc/bin/orted
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
ess_env_module.c at line 135
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
util/session_dir.c at line 107
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
util/session_dir.c at line 346
[awbp025:620372] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
base/ess_base_std_orted.c at line 264
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--

and...

[awbp025:~/work/osu-icc](N/A)$ ldd /usr/mpi/icc/openmpi-icc/bin/orted
linux-vdso.so.1 (0x7fffc2ebf000)
libopen-rte.so.40 => /usr/mpi/icc/openmpi-icc/lib/libopen-rte.so.40 
(0x7fdaa6404000)
libopen-pal.so.40 => /usr/mpi/icc/openmpi-icc/lib/libopen-pal.so.40 
(0x7fdaa60bd000)
libopen-orted-mpir.so => 
/usr/mpi/icc/openmpi-icc/lib/libopen-orted-mpir.so (0x7fdaa5ebb000)
libm.so.6 => /lib64/libm.so.6 (0x7fdaa5b39000)
librt.so.1 => /lib64/librt.so.1 (0x7fdaa5931000)
libutil.so.1 => /lib64/libutil.so.1 (0x7fdaa572d000)
libz.so.1 => /lib64/libz.so.1 (0x7fdaa5516000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fdaa52fe000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7fdaa50de000)
libc.so.6 => /lib64/libc.so.6 (0x7fdaa4d1b000)
libdl.so.2 => /lib64/libdl.so.2 (0x7fdaa4b17000)
libimf.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libimf.so
 (0x7fdaa4494000)
libsvml.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libsvml.so
 (0x7fdaa29c4000)
libirng.so => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libirng.so
 (0x7fdaa2659000)
libintlc.so.5 => 
/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libintlc.so.5
 (0x7fdaa23e1000)
/lib64/ld-linux-x86-64.so.2 (0x7fdaa66d6000)

Can anyone suggest what I'm forgetting to do?

---
Michael Heinz
Fabric Software Engineer, Cornelis Networks



Re: [OMPI users] Newbie With Issues

2021-03-30 Thread Michael Fuckner via users
Hi,

Intel ships both compilers if installing intel-hpckit:

[root@f33-vm ~]# icc -v
icc version 2021.2.0 (gcc version 10.2.1 compatibility)
[root@f33-vm ~]# icx -v
Intel(R) oneAPI DPC++ Compiler 2021.2.0 (2021.2.0.20210317)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.2.0/linux/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/10
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/10
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64

Regards,
 Michael!



> bend linux4ms.net via users  wrote on 30.03.2021 
> at 19:00:
> 
>  
> Thanks Mr Heinz for responding.
> 
> It may be the case with clang, but doing an Intel setvars.sh then issuing the 
> following
> compile gives me the message:
> 
> [root@jean-r8-sch24 openmpi-4.1.0]# icc
> icc: command line error: no files specified; for help type "icc -help"
> [root@jean-r8-sch24 openmpi-4.1.0]# icc -v
> icc version 2021.1 (gcc version 8.3.1 compatibility)
> [root@jean-r8-sch24 openmpi-4.1.0]# 
> 
> Would lead me to believe that icc is still available to use.
> 
> This is a government contract and they want the latest and greatest.
> 
> Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
> "Never attribute to malice, that which can be adequately explained by 
> stupidity"
> - Hanlon's Razor
> 
> 
> 
> 
> 
> From: Heinz, Michael  William 
> Sent: Tuesday, March 30, 2021 11:52 AM
> To: Open MPI Users
> Cc: bend linux4ms.net
> Subject: RE: Newbie With Issues
> 
> It looks like you're trying to build Open MPI with the Intel C compiler. TBH 
> - I think that icc isn't included with the latest release of oneAPI, I think 
> they've switched to including clang instead. I had a similar issue to yours 
> but I resolved it by installing a 2020 version of the Intel HPC software. 
> Unfortunately, those versions require purchasing a license.
> 
> -Original Message-
> From: users  On Behalf Of bend linux4ms.net 
> via users
> Sent: Tuesday, March 30, 2021 12:42 PM
> To: Open MPI Open MPI 
> Cc: bend linux4ms.net 
> Subject: [OMPI users] Newbie With Issues
> 
> Hello group, my name is Ben Duncan. I have been tasked with installing 
> openMPI and the Intel compiler on an HPC system. I am new to the whole HPC 
> and MPI environment, so be patient with me.
> 
> I have successfully gotten the Intel compiler (oneapi version from  
> l_HPCKit_p_2021.1.0.2684_offline.sh installed without any errors.
> 
> I am trying to install and configure the openMPI version 4.1.0 however trying 
> to run configuration for openmpi gives me the following error:
> 
> 
> == Configuring Open MPI
> 
> 
> *** Startup tests
> checking build system type... x86_64-unknown-linux-gnu
> checking host system type... x86_64-unknown-linux-gnu
> checking target system type... x86_64-unknown-linux-gnu
> checking for gcc... icc
> checking whether the C compiler works... no
> configure: error: in `/p/app/openmpi-4.1.0':
> configure: error: C compiler cannot create executables
> See `config.log' for more details
> 
> With the error in config.log being:
> 
> configure:6499: $? = 0
> configure:6488: icc -qversion >&5
> icc: command line warning #10006: ignoring unknown option '-qversion'
> icc: command line error: no files specified; for help type "icc -help"
> configure:6499: $? = 1
> configure:6519: checking whether the C compiler works
> configure:6541: icc -O2   conftest.c  >&5
> ld: cannot find -lstdc++
> configure:6545: $? = 1
> configure:6583: result: no
> configure: failed program was:
> | /* confdefs.h */
> | #define PACKAGE_NAME "Open MPI"
> | #define PACKAGE_TARNAME "openmpi"
> | #define PACKAGE_VERSION "4.1.0"
> | #define PACKAGE_STRING "Open MPI 4.1.0"
> | #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
> | #define PACKAGE_URL ""
> | #define OPAL_ARCH "x86_64-unknown-linux-gnu"
> | /* end confdefs.h.  */
> |
> | in

Re: [OMPI users] Newbie With Issues

2021-03-30 Thread Heinz, Michael William via users
It looks like you're trying to build Open MPI with the Intel C compiler. TBH - 
I think that icc isn't included with the latest release of oneAPI, I think 
they've switched to including clang instead. I had a similar issue to yours but 
I resolved it by installing a 2020 version of the Intel HPC software. 
Unfortunately, those versions require purchasing a license.

-Original Message-
From: users  On Behalf Of bend linux4ms.net 
via users
Sent: Tuesday, March 30, 2021 12:42 PM
To: Open MPI Open MPI 
Cc: bend linux4ms.net 
Subject: [OMPI users] Newbie With Issues

Hello group, my name is Ben Duncan. I have been tasked with installing openMPI 
and the Intel compiler on an HPC system. I am new to the whole HPC and MPI 
environment, so be patient with me.

I have successfully gotten the Intel compiler (oneapi version from  
l_HPCKit_p_2021.1.0.2684_offline.sh installed without any errors.

I am trying to install and configure the openMPI version 4.1.0 however trying 
to run configuration for openmpi gives me the following error:


== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #10006: ignoring unknown option '-qversion'
icc: command line error: no files specified; for help type "icc -help"
configure:6499: $? = 1
configure:6519: checking whether the C compiler works
configure:6541: icc -O2   conftest.c  >&5
ld: cannot find -lstdc++
configure:6545: $? = 1
configure:6583: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "Open MPI"
| #define PACKAGE_TARNAME "openmpi"
| #define PACKAGE_VERSION "4.1.0"
| #define PACKAGE_STRING "Open MPI 4.1.0"
| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
| #define PACKAGE_URL ""
| #define OPAL_ARCH "x86_64-unknown-linux-gnu"
| /* end confdefs.h.  */
|
| int
| main ()
| {
|
|   ;
|   return 0;
| }
configure:6588: error: in `/p/app/openmpi-4.1.0':
configure:6590: error: C compiler cannot create executables
See `config.log' for more details



My configure line looks like:

./configure --prefix=/p/app/compilers/openmpi-4.1.0/openmpi-4.1.0.intel  
--enable-wrapper-rpath   --disable-libompitrace  
--enable-mpirun-prefix-by-default --enable-mpi-fortran 

SO what am I doing wrong , or is it something else ?

Thanks


Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212 
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor
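
For what it's worth, the `ld: cannot find -lstdc++` line in the config.log excerpt above usually just means the C++ standard library development package is missing on the build host; a hedged sketch for a RHEL 8 style system (package names are assumptions):

```
# icc links against the GNU C++ runtime, so the GCC C++ development package
# (which pulls in libstdc++-devel) must be present on the build host.
sudo dnf install -y gcc-c++

# Then re-run configure; the icc "-qversion" warning earlier in config.log
# is harmless noise from the compiler probe.
./configure CC=icc CXX=icpc FC=ifort \
    --prefix=/p/app/compilers/openmpi-4.1.0/openmpi-4.1.0.intel
```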




Re: [OMPI users] [EXTERNAL] building openshem on opa

2021-03-22 Thread Michael Di Domenico via users
On Mon, Mar 22, 2021 at 11:13 AM Pritchard Jr., Howard  wrote:
> https://github.com/Sandia-OpenSHMEM/SOS
> if you want to use OpenSHMEM over OPA.
> If you have lots of cycles for development work, you could write an OFI SPML 
> for the  OSHMEM component of Open MPI.

thanks, i am aware of the sandia version.  the devs in my organization
don't really use shmem, but there was a call for it recently.  i
hadn't even noticed shmem didn't build on our opa cluster.  for now we
have a smaller mellanox cluster they can build against.

my ability to code an spml is nil.  but if we had more interest from
the internal devs i'd certainly be willing to fund someone to do it..
:)


[OMPI users] building openshem on opa

2021-03-22 Thread Michael Di Domenico via users
i can build and run openmpi on an opa network just fine, but it turns
out building openshmem fails.  the message is (no spml) found

looking at the config log it looks like it tries to build spml ikrit
and ucx which fail.  i turn ucx off because it doesn't support opa and
isn't needed.

so this message is really just a confirmation that openshmem and opa
are not capable of being built or did i do something wrong

and a curiosity if anyone knows what kind of effort would be involved
in getting it to work


Re: [OMPI users] Error initialising an OpenFabrics device.

2021-03-13 Thread Heinz, Michael William via users
I’ve begun getting this annoyingly generic warning, too. It appears to be 
coming from the openib provider. If you disable it with -btl ^openib the 
warning goes away.

Sent from my iPad
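
A minimal sketch of suppressing that warning by excluding the openib BTL, either per run or via the environment (the rank count, hostfile, and snappyHexMesh invocation below are placeholders):

```
# Exclude the openib BTL on the command line...
mpirun --mca btl '^openib' -np 16 --hostfile hosts snappyHexMesh -parallel

# ...or once in the environment for every run.
export OMPI_MCA_btl='^openib'
```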

> On Mar 13, 2021, at 3:28 PM, Bob Beattie via users  
> wrote:
> 
> Hi everyone,
> 
> To be honest, as an MPI / IB noob, I don't know if this falls under OpenMPI 
> or Mellanox
> 
> Am running a small cluster of HP DL380 G6/G7 machines.
> Each runs Ubuntu server 20.04 and has a Mellanox ConnectX-3 card, connected 
> by an IS dumb switch.
> When I begin my MPI program (snappyHexMesh for OpenFOAM) I get an error 
> reported.
> The error doesn't stop my programs or appear to cause any problems, so this 
> request for help is more about delving into the why.
> 
> OMPI is compiled from source using v4.0.3; which is the default version for 
> Ubuntu 20.04
> This compiles and works.  I did this because I wanted to understand the 
> compilation process whilst using a known working OMPI version.
> 
> The Infiniband part is the Mellanox MLNXOFED installer v4.9-0.1.7.0 and I 
> install that with --dkms --without-fw-update --hpc --with-nfsrdma
> 
> The actual error reported is:
> Warning: There was an error initialising an OpenFabrics device.
>   Local host: of1
>   Local device: mlx4_0
> 
> Then shortly after:
> [of1:1015399] 19 more processes have sent help message 
> help-mpi-btl-openib.txt / error in device init
> [of1:1015399] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> 
> Adding this MCA parameter to the mpirun line simply gives me 20 or so copies 
> of the first warning.
> 
> Any ideas anyone ?
> Cheers,
> Bob.


Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

2021-03-04 Thread Heinz, Michael William via users
What interconnect are you using at run time? That is, are you using Ethernet or 
InfiniBand or Omnipath?

Sent from my iPad

On Mar 4, 2021, at 5:05 AM, Raut, S Biplab via users  
wrote:




After downloading a particular openMPI version, let’s say v3.1.1 from 
https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.1.tar.gz
 , I follow the below steps.

./configure --prefix="$INSTALL_DIR" --enable-mpi-fortran --enable-mpi-cxx 
--enable-shared=yes --enable-static=yes --enable-mpi1-compatibility
  make -j
  make install
  export PATH=$INSTALL_DIR/bin:$PATH
  export LD_LIBRARY_PATH=$INSTALL_DIR/lib:$LD_LIBRARY_PATH
Additionally, I also install libnuma-dev on the machine.

For all the machines having Ubuntu 18.04 and 19.04, it works correctly and 
results in expected performance/GFLOPS.
But, when OS is changed to Ubuntu 20.04, then I start getting the issues as 
mentioned in my original/previous mail below.

With Regards,
S. Biplab Raut

From: users  On Behalf Of John Hearns via 
users
Sent: Thursday, March 4, 2021 1:53 PM
To: Open MPI Users 
Cc: John Hearns 
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 
?

How are you installing the OpenMPI versions? Are you using packages which are 
distributed by the OS?

It might be worth looking at using Easybuid or Spack
https://docs.easybuild.io/en/latest/Introduction.html
https://spack.readthedocs.io/en/latest/


On Thu, 4 Mar 2021 at 07:35, Raut, S Biplab via users 
<users@lists.open-mpi.org> wrote:


Dear Experts,
Until recently, I was using openMPI3.1.1 to run single 
node 128 ranks MPI application on Ubuntu18.04 and Ubuntu19.04.
But, now the OS on these machines are upgraded to Ubuntu20.04, and I have been 
observing program hangs with openMPI3.1.1 version.
So, I tried with openMPI4.0.5 version – The program ran properly without any 
issues but there is a performance regression in my application.

Can I know the stable openMPI version recommended for Ubuntu20.04 that has no 
known regression compared to v3.1.1.

With Regards,
S. Biplab Raut


[OMPI users] Unexpected issue with 4.1.x build

2021-03-02 Thread Heinz, Michael William via users
While testing the recent UCX PR I noticed I was getting this warning:

--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   cn-priv-01
  Local device: hfi1_0
--
[cn-priv-01:3767216] select: init of component openib returned failure

The problem is, the ipoib interface is working fine on the nodes in this run - 
and there's no more information about what the error might have been. Can 
anyone shed any light on why this might be happening? I do not see this with 
OMPI 4.0.3.

---
Michael Heinz
Fabric Software Engineer, Cornelis Networks



Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-28 Thread Heinz, Michael William via users
Patrick,

A few more questions for you:

1. What version of IFS are you running?
2. Are you using CUDA cards by any chance? If so, what version of CUDA?

-Original Message-
From: Heinz, Michael William 
Sent: Wednesday, January 27, 2021 3:45 PM
To: Open MPI Users 
Subject: RE: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

Patrick,

Do you have any PSM2_* or HFI_* environment variables defined in your run time 
environment that could be affecting things?


-Original Message-
From: users  On Behalf Of Heinz, Michael 
William via users
Sent: Wednesday, January 27, 2021 3:37 PM
To: Open MPI Users 
Cc: Heinz, Michael William 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by 
Cornelis Networks - but I should point out you can download the latest official 
source for PSM2 and the drivers from Github.

-Original Message-
From: users  On Behalf Of Michael Di Domenico 
via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users 
Cc: Michael Di Domenico 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

if you have OPA cards, for openmpi you only need --with-ofi, you don't need 
psm/psm2/verbs/ucx.  but this assumes you're running a rhel based distro and 
have installed the OPA fabric suite of software from Intel/CornelisNetworks.  
which is what i have.  perhaps there's something really odd in debian or 
there's an incompatibility with the older ofed drivers perhaps included with 
debian.  unfortunately i don't have access to a debian, so i can't be much more 
help

if i had to guess totally pulling junk from the air, there's probably something 
incompatible with PSM and OPA when running specifically on debian (likely due 
to library versioning).  i don't know how common that is, so it's not clear how 
fleshed out and tested it is




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
 wrote:
>
> Hi Howard and Michael
>
> first, many thanks for testing with my short application. Yes, when the 
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it prints a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various cluster (in DELL 
> data center at Austin) when I was testing some architectures some 
> months ago and nor on AMD/Mellanox_IB nor on Intel/Omni-path I got any 
> problem. The goal was running my tests with same software stacks and 
> be sure to be able to deploy my software stack on the selected solution.
> But as your clusters (and my small local clusters) they were all 
> running RedHat (or similar Linux flavors) and a modern Gnu compiler (9 or 10).
> The university's cluster I have access to is running Debian stretch and 
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local gcc10 and 
> build again OpenMPI.  UCX is not available on this cluster, should I 
> deploy a local UCX too ?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development 
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> Le 27/01/2021 à 18:09, Pritchard Jr., Howard via users a écrit :
> > Hi Folks,
> >
> > I'm also have problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Heinz, Michael William via users
Patrick,

Do you have any PSM2_* or HFI_* environment variables defined in your run time 
environment that could be affecting things?
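
A trivial sketch of checking for such variables on the nodes in question:

```
# Any PSM2_* or HFI_* variables in the job environment can change PSM2's
# behaviour; list whatever is currently set.
env | grep -E '^(PSM2_|HFI_)' || echo "none set"
```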


-Original Message-
From: users  On Behalf Of Heinz, Michael 
William via users
Sent: Wednesday, January 27, 2021 3:37 PM
To: Open MPI Users 
Cc: Heinz, Michael William 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by 
Cornelis Networks - but I should point out you can download the latest official 
source for PSM2 and the drivers from Github.

-Original Message-
From: users  On Behalf Of Michael Di Domenico 
via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users 
Cc: Michael Di Domenico 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

if you have OPA cards, for openmpi you only need --with-ofi; you don't need 
psm/psm2/verbs/ucx.  but this assumes you're running a rhel based distro and 
have installed the OPA fabric suite of software from Intel/CornelisNetworks, 
which is what i have.  perhaps there's something really odd in debian, or an 
incompatibility with the older ofed drivers included with debian.  unfortunately 
i don't have access to a debian system, so i can't be much more help

if i had to guess, totally pulling junk from the air, there's probably something 
incompatible with PSM and OPA when running specifically on debian (likely due 
to library versioning).  i don't know how common that setup is, so it's not 
clear how fleshed out and tested it is




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
 wrote:
>
> Hi Howard and Michael
>
> first many thanks for testing with my short application. Yes, when the 
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it puts a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various cluster (in DELL 
> data center at Austin) when I was testing some architectures some 
> months ago and nor on AMD/Mellanox_IB nor on Intel/Omni-path I got any 
> problem. The goal was running my tests with same software stacks and 
> be sure to be able to deploy my software stack on the selected solution.
> But as your clusters (and my small local clusters) they were all 
> running RedHat (or similar Linux flavors) and a modern Gnu compiler (9 or 10).
> The university's cluster I have access is running Debian stretch and 
> provides GCC6 as default compiler.
>
> I cannot ask for a different OS, but I can deploy a local gcc10 and 
> build again OpenMPI.  UCX is not available on this cluster, should I 
> deploy a local UCX too ?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development 
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11
> >   board_id:   Intel Omni-Path Host Fabric Interface 
> > Adapter 100 Series
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Heinz, Michael William via users
Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by 
Cornelis Networks - but I should point out you can download the latest official 
source for PSM2 and the drivers from Github.

-Original Message-
From: users  On Behalf Of Michael Di Domenico 
via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users 
Cc: Michael Di Domenico 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

if you have OPA cards, for openmpi you only need --with-ofi; you don't need 
psm/psm2/verbs/ucx.  but this assumes you're running a rhel based distro and 
have installed the OPA fabric suite of software from Intel/CornelisNetworks, 
which is what i have.  perhaps there's something really odd in debian, or an 
incompatibility with the older ofed drivers included with debian.  unfortunately 
i don't have access to a debian system, so i can't be much more help

if i had to guess, totally pulling junk from the air, there's probably something 
incompatible with PSM and OPA when running specifically on debian (likely due 
to library versioning).  i don't know how common that setup is, so it's not 
clear how fleshed out and tested it is




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
 wrote:
>
> Hi Howard and Michael
>
> first many thanks for testing with my short application. Yes, when the 
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it puts a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various cluster (in DELL 
> data center at Austin) when I was testing some architectures some 
> months ago and nor on AMD/Mellanox_IB nor on Intel/Omni-path I got any 
> problem. The goal was running my tests with same software stacks and 
> be sure to be able to deploy my software stack on the selected solution.
> But as your clusters (and my small local clusters) they were all 
> running RedHat (or similar Linux flavors) and a modern Gnu compiler (9 or 10).
> The university's cluster I have access is running Debian stretch and 
> provides GCC6 as default compiler.
>
> I cannot ask for a different OS, but I can deploy a local gcc10 and 
> build again OpenMPI.  UCX is not available on this cluster, should I 
> deploy a local UCX too ?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development 
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11
> >   board_id:   Intel Omni-Path Host Fabric Interface 
> > Adapter 100 Series
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >   active_mtu: 4096 (5)
> >   sm_lid:     1
> >   port_lid:   99
> >   port_lmc:   0x00
> >   link_layer: InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any specia

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
if you have OPA cards, for openmpi you only need --with-ofi; you don't
need psm/psm2/verbs/ucx.  but this assumes you're running a rhel based
distro and have installed the OPA fabric suite of software from
Intel/CornelisNetworks.  which is what i have.  perhaps there's
something really odd in debian, or an incompatibility with the older
ofed drivers included with debian.  unfortunately i don't have access
to a debian system, so i can't be much more help

if i had to guess, totally pulling junk from the air, there's probably
something incompatible with PSM and OPA when running specifically on
debian (likely due to library versioning).  i don't know how common
that setup is, so it's not clear how fleshed out and tested it is
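as a rough sketch of what i mean (prefix and paths are just placeholders,
adjust for your site), a build along those lines would be something like:

  ./configure --prefix=/opt/openmpi/4.0.x \
              --with-ofi=/usr \
              --without-psm --without-psm2 \
              --without-ucx --without-verbs
  make -j && make install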




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users
 wrote:
>
> Hi Howard and Michael
>
> first many thanks for testing with my short application. Yes, when the
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it puts a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various cluster (in DELL data
> center at Austin) when I was testing some architectures some months ago
> and nor on AMD/Mellanox_IB nor on Intel/Omni-path I got any problem. The
> goal was running my tests with same software stacks and be sure to be
> able to deploy my software stack on the selected solution.
> But as your clusters (and my small local clusters) they were all running
> RedHat (or similar Linux flavors) and a modern Gnu compiler (9 or 10).
> The university's cluster I have access is running Debian stretch and
> provides GCC6 as default compiler.
>
> I cannot ask for a different OS, but I can deploy a local gcc10 and
> build again OpenMPI.  UCX is not available on this cluster, should I
> deploy a local UCX too ?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11
> >   board_id:   Intel Omni-Path Host Fabric Interface 
> > Adapter 100 Series
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >   active_mtu: 4096 (5)
> >   sm_lid:     1
> >   port_lid:   99
> >   port_lmc:   0x00
> >   link_layer: InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
> > Howard
> >
> > On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
> >  
> > wrote:
> >
> > for whatever it's worth running the test program on my OPA cluster
> > seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
> > sure if it's supposed to stop at some point
> >
> > i'm running rhel7, gcc 10.1, 

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
for whatever it's worth running the test program on my OPA cluster
seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
sure if it's supposed to stop at some point

i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
 wrote:
>
> Hi Michael
>
> indeed I'm a little bit lost with all these parameters in OpenMPI, mainly 
> because for years it has worked just fine out of the box in all my deployments 
> on various architectures, interconnects and linux flavors. Some weeks ago I 
> deployed OpenMPI 4.0.5 on Centos8 with gcc10, slurm and UCX on an AMD epyc2 
> cluster with connectX6, and it just works fine.  It is the first time I've had 
> such trouble deploying this library.
>
> If you have my mail posted on 25/01/2021 in this discussion at 18h54 (maybe 
> Paris TZ), there is a small test case attached that shows the problem. Did 
> you get it or did the list strip these attachments? I can provide it again.
>
> Many thanks
>
> Patrick
>
> On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>
> Patrick how are you using original PSM if you’re using Omni-Path hardware? 
> The original PSM was written for QLogic DDR and QDR Infiniband adapters.
>
> As far as needing openib - the issue is that the PSM2 MTL doesn’t support a 
> subset of MPI operations that we previously used the pt2pt BTL for. For 
> recent version of OMPI, the preferred BTL to use with PSM2 is OFI.
>
> Is there any chance you can give us a sample MPI app that reproduces the 
> problem? I can’t think of another way I can give you more help without being 
> able to see what’s going on. It’s always possible there’s a bug in the PSM2 
> MTL but it would be surprising at this point.
>
> Sent from my iPad
>
> On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
>  wrote:
>
> 
> Hi all,
>
> I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged 
> with Nix was running using openib. So I added the --with-verbs option to set 
> up this module.
>
> What I can see now is that:
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib true 
> 
>
> - the testcase test_layout_array is running without error
>
> - the bandwidth measured with osu_bw is half of what it should be:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   0.54
> 2   1.13
> 4   2.26
> 8   4.51
> 16  9.06
> 32 17.93
> 64 33.87
> 12869.29
> 256   161.24
> 512   333.82
> 1024  682.66
> 2048 1188.63
> 4096 1760.14
> 8192 2166.08
> 163842036.95
> 327683466.63
> 655366296.73
> 131072   7509.43
> 262144   9104.78
> 524288   6908.55
> 1048576  5530.37
> 2097152  4489.16
> 4194304  3498.14
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib true 
> ...
>
> - the testcase test_layout_array is not giving correct results
>
> - the bandwidth measured with osu_bw is the right one:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   3.73
> 2   7.96
> 4  15.82
> 8  31.22
> 16 51.52
> 32107.61
> 64196.51
> 128   438.66
> 256   817.70
> 512  1593.90
> 1024 2786.09
> 2048 4459.77
> 4096 6658.70
> 8192 8092.95
> 163848664.43
> 327688495.96
> 65536   11458.77
> 131072  12094.64
> 262144  11781.84
> 524288  12297.58
> 1048576 12346.92
> 2097152 12206.53
> 4194304 12167.00
>
> But yes, I know openib is deprecated too in 4.0.5.
>
> Patrick
>
>


Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-26 Thread Heinz, Michael William via users
Patrick, how are you using the original PSM if you’re using Omni-Path hardware? The 
original PSM was written for QLogic DDR and QDR Infiniband adapters.

As far as needing openib - the issue is that the PSM2 MTL doesn’t support a 
subset of MPI operations that we previously used the pt2pt BTL for. For recent 
versions of OMPI, the preferred BTL to use with PSM2 is OFI.

Is there any chance you can give us a sample MPI app that reproduces the 
problem? I can’t think of another way I can give you more help without being 
able to see what’s going on. It’s always possible there’s a bug in the PSM2 MTL 
but it would be surprising at this point.

Sent from my iPad

On Jan 26, 2021, at 1:13 PM, Patrick Begou via users  
wrote:


Hi all,

I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged 
with Nix was running using openib. So I added the --with-verbs option to set up 
this module.

What I can see now is that:

mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib true 

- the testcase test_layout_array is running without error

- the bandwidth measured with osu_bw is half of what it should be:

# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   0.54
2   1.13
4   2.26
8   4.51
16  9.06
32 17.93
64 33.87
12869.29
256   161.24
512   333.82
1024  682.66
2048 1188.63
4096 1760.14
8192 2166.08
163842036.95
327683466.63
655366296.73
131072   7509.43
262144   9104.78
524288   6908.55
1048576  5530.37
2097152  4489.16
4194304  3498.14

mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib true ...

- the testcase test_layout_array is not giving correct results

- the bandwidth measured with osu_bw is the right one:

# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   3.73
2   7.96
4  15.82
8  31.22
16 51.52
32107.61
64196.51
128   438.66
256   817.70
512  1593.90
1024 2786.09
2048 4459.77
4096 6658.70
8192 8092.95
163848664.43
327688495.96
65536   11458.77
131072  12094.64
262144  11781.84
524288  12297.58
1048576 12346.92
2097152 12206.53
4194304 12167.00

But yes, I know openib is deprecated too in 4.0.5.

Patrick


Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Heinz, Michael William via users
Patrick, is your application multi-threaded? PSM2 was not originally designed 
for multiple threads per process.

I do know that the OSU alltoallV test does pass when I try it.

Sent from my iPad

> On Jan 25, 2021, at 12:57 PM, Patrick Begou via users 
>  wrote:
> 
> Hi Howard and Michael,
> 
> thanks for your feedback. I did not want to write a too long mail with
> non-pertinent information so I just showed how the two different builds
> give different results. I'm using a small test case based on my large
> code, the same used to show the memory leak with mpi_Alltoallv calls,
> but just running 2 iterations. It is a 2D case and data storage is moved
> from distributions "along X axis" to "along Y axis" with mpi_Alltoallv
> and subarray types. Data initialization is based on the location in
> the array to allow checking for correct exchanges.
> 
> When the program runs (on 4 processes in my test) it must only show the
> max rss size of the processes. When it fails it shows the invalid
> locations. I've drastically reduced the size of the problem with nx=5
> and ny=7.
> 
> Launching the non working setup with more details show:
> 
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
> [dahu138:115761] mca: base: components_register: registering framework
> mtl components
> [dahu138:115763] mca: base: components_register: registering framework
> mtl components
> [dahu138:115763] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115763] mca: base: components_open: opening mtl components
> [dahu138:115763] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115761] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115761] mca: base: components_open: opening mtl components
> [dahu138:115761] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115760] mca: base: components_register: registering framework
> mtl components
> [dahu138:115760] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115760] mca: base: components_open: opening mtl components
> [dahu138:115760] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_register: registering framework
> mtl components
> [dahu138:115762] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115762] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115762] mca: base: components_open: opening mtl components
> [dahu138:115762] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115760] mca:base:select: Auto-selecting mtl components
> [dahu138:115760] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115761] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115762] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115762] select: initializing mtl component psm2
> [dahu138:115761] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115761] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115761] select: initializing mtl component psm2
> [dahu138:115760] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115760] select: initializing mtl component psm2
> [dahu138:115763] mca:base:select: Auto-selecting mtl components
> [dahu138:115763] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115763] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115763] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115763] select: initializing mtl component psm2
> [dahu138:115761] select: init returned success
> [dahu138:115761] select: c

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Heinz, Michael William via users
What happens if you specify -mtl ofi ?

-Original Message-
From: users  On Behalf Of Patrick Begou via 
users
Sent: Monday, January 25, 2021 12:54 PM
To: users@lists.open-mpi.org
Cc: Patrick Begou 
Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

Hi Howard and Michael,

thanks for your feedback. I did not want to write a too long mail with 
non-pertinent information so I just showed how the two different builds give 
different results. I'm using a small test case based on my large code, the same 
used to show the memory leak with mpi_Alltoallv calls, but just running 2 
iterations. It is a 2D case and data storage is moved from distributions "along 
X axis" to "along Y axis" with mpi_Alltoallv and subarray types. Data 
initialization is based on the location in the array to allow checking for 
correct exchanges.

When the program runs (on 4 processes in my test) it must only show the max rss 
size of the processes. When it fails it shows the invalid locations. I've 
drastically reduced the size of the problem with nx=5 and ny=7.

Launching the non working setup with more details show:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array 
[dahu138:115761] mca: base: components_register: registering framework mtl 
components [dahu138:115763] mca: base: components_register: registering 
framework mtl components [dahu138:115763] mca: base: components_register: found 
loaded component psm2 [dahu138:115763] mca: base: components_register: 
component psm2 register function successful [dahu138:115763] mca: base: 
components_open: opening mtl components [dahu138:115763] mca: base: 
components_open: found loaded component psm2 [dahu138:115761] mca: base: 
components_register: found loaded component psm2 [dahu138:115763] mca: base: 
components_open: component psm2 open function successful [dahu138:115761] mca: 
base: components_register: component psm2 register function successful 
[dahu138:115761] mca: base: components_open: opening mtl components 
[dahu138:115761] mca: base: components_open: found loaded component psm2 
[dahu138:115761] mca: base: components_open: component psm2 open function 
successful [dahu138:115760] mca: base: components_register: registering 
framework mtl components [dahu138:115760] mca: base: components_register: found 
loaded component psm2 [dahu138:115760] mca: base: components_register: 
component psm2 register function successful [dahu138:115760] mca: base: 
components_open: opening mtl components [dahu138:115760] mca: base: 
components_open: found loaded component psm2 [dahu138:115762] mca: base: 
components_register: registering framework mtl components [dahu138:115762] mca: 
base: components_register: found loaded component psm2 [dahu138:115760] mca: 
base: components_open: component psm2 open function successful [dahu138:115762] 
mca: base: components_register: component psm2 register function successful 
[dahu138:115762] mca: base: components_open: opening mtl components 
[dahu138:115762] mca: base: components_open: found loaded component psm2 
[dahu138:115762] mca: base: components_open: component psm2 open function 
successful [dahu138:115760] mca:base:select: Auto-selecting mtl components 
[dahu138:115760] mca:base:select:(  mtl) Querying component [psm2] 
[dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set priority 
to 40 [dahu138:115761] mca:base:select: Auto-selecting mtl components 
[dahu138:115762] mca:base:select: Auto-selecting mtl components 
[dahu138:115762] mca:base:select:(  mtl) Querying component [psm2] 
[dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set priority 
to 40 [dahu138:115762] mca:base:select:(  mtl) Selected component [psm2] 
[dahu138:115762] select: initializing mtl component psm2 [dahu138:115761] 
mca:base:select:(  mtl) Querying component [psm2] [dahu138:115761] 
mca:base:select:(  mtl) Query of component [psm2] set priority to 40 
[dahu138:115761] mca:base:select:(  mtl) Selected component [psm2] 
[dahu138:115761] select: initializing mtl component psm2 [dahu138:115760] 
mca:base:select:(  mtl) Selected component [psm2] [dahu138:115760] select: 
initializing mtl component psm2 [dahu138:115763] mca:base:select: 
Auto-selecting mtl components [dahu138:115763] mca:base:select:(  mtl) Querying 
component [psm2] [dahu138:115763] mca:base:select:(  mtl) Query of component 
[psm2] set priority to 40 [dahu138:115763] mca:base:select:(  mtl) Selected 
component [psm2] [dahu138:115763] select: initializing mtl component psm2 
[dahu138:115761] select: init returned success [dahu138:115761] select: 
component psm2 selected [dahu138:115762] select: init returned success 
[dahu138:115762] select: component psm2 selected [dahu138:115763] select: init 
returned success [dahu138:115763] select: component psm2 selected 
[dahu138:115760] select: init returned success [dahu138:115760] select: 
component psm2 selected On 1 found 1007 but expect 3007 On 2 found 1007 b

[OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Heinz, Michael William via users
Patrick,

You really have to provide us some detailed information if you want assistance. 
At a minimum we need to know if you're using the PSM2 MTL or the OFI MTL and 
what the actual error is.

Please provide the actual command line you are having problems with, along with 
any errors. In addition, I recommend adding the following to your command line:

-mca mtl_base_verbose 99

If you have a way to reproduce the problem quickly you might also want to add:

-x PSM2_TRACEMASK=11

But that will add very detailed debug output to your command and you haven't 
mentioned that PSM2 is failing, so it may not be useful.
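Putting those options together, a diagnostic run would look roughly like this 
(the hostfile and program name are just placeholders):

  mpirun -np 4 -hostfile ./hosts \
      -mca mtl_base_verbose 99 \
      -x PSM2_TRACEMASK=11 \
      ./your_mpi_app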


Re: [OMPI users] Differences 4.0.3 -> 4.0.4 (Regression?)

2020-08-10 Thread Michael Fuckner via users

Hi,

just tried 4.0.5rc1 and it is working like 4.0.3 does (directly and via 
slurm). So it is just 4.0.4 that is not working. Diffed Config and build.sh, but 
couldn't find anything. I don't know why, but I'll accept it...


Regards,
 Michael!

On 08/08/2020 18:46, Howard Pritchard wrote:

Hello Michael,

Not sure what could be causing this in terms of delta between v4.0.3 and 
v4.0.4.

Two things to try

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line 
and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure 
option and reinstall v4.0.4


Howard



--
DELTA Computer Products GmbH
Röntgenstr. 4
D-21465 Reinbek bei Hamburg
T: +49 40 300672-30
F: +49 40 300672-11
E: fuck...@delta.de

Internet: https://www.delta.de
Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
Geschäftsführer: Hans-Peter Hellmann


Re: [OMPI users] Differences 4.0.3 -> 4.0.4 (Regression?)

2020-08-08 Thread Michael Fuckner via users

Hi Howard,

anything you can see in the logfile?

https://download.deltacomputer.com/slurm-job-parallel.30.out

--

Is this a problem: srun: cluster configuration lacks support for cpu binding


This is the batchfile I am submitting:

#!/bin/bash

# 2 nodes, 8 processes (MPI ranks) per node
# request exclusive nodes (not sharing nodes with other jobs)

#SBATCH --nodes=2-2
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH -o slurm-job-parallel.%j.out


echo -n "this script is running on: "
hostname -f
date

env | grep ^SLURM | sort

for OPENMPI in 3.0.6 3.1.6 4.0.3 4.0.4
do
  echo "### running ./OWnetbench/OWnetbench.openmpi-${OPENMPI} with 
/opt/openmpi/${OPENMPI}/gcc/bin/mpirun ###"


  # process bindings are used for repeatable benchmark results
  # use with care when sharing node(s) with other jobs!
  # we've requested exclusive nodes so we don't have to care about 
other jobs!

  case "${OPENMPI}" in
1.6.5)
  BIND_OPT="--bind-to-core --bycore --report-bindings"
  ;;
*)
  BIND_OPT="--bind-to core --map-by core --report-bindings"
  ;;
  esac

  # because openmpi is compiled with slurm support there is no need to
  # specify the number of processes or a hostfile to mpirun.

  /opt/openmpi/${OPENMPI}/gcc/bin/mpirun ${BIND_OPT} --mca 
pmix_base_verbose 100  --debug-daemons 
./OWnetbench/OWnetbench.openmpi-${OPENMPI}


done


On 08/08/2020 18:46, Howard Pritchard wrote:

Hello Michael,

Not sure what could be causing this in terms of delta between v4.0.3 and 
v4.0.4.

Two things to try

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line 
and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure 
option and reinstall v4.0.4


Howard


Am Do., 6. Aug. 2020 um 04:48 Uhr schrieb Michael Fuckner via users 
mailto:users@lists.open-mpi.org>>:


Hi,

I have a small setup with one headnode and two compute nodes connected
via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
(configure, compile, nothing configured in openmpi-mca-params.conf),
the
output from ompi-info and orte-info looks identical.

There is a small benchmark basically just doing MPI_Send() and
MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4)

/opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8
-nolocal
./OWnetbench.openmpi-4.0.3

when running this job from slurm, it works with 4.0.3, but there is an
error with 4.0.4. Any hint what to check?


### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with
/opt/openmpi/4.0.4/gcc/bin/mpirun ###
[node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]:
[../../../../../../../BB]
[node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 231
[node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at
line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node002.cluster:04963] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not able
to guarantee that all other processes were kil
led!
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

    Process name: [[15424,1],0]
    Exit code:    1
--

Any hint why 4.0.4 behaves not like the other versions?

-- 
DELTA Computer Products GmbH

Röntgenstr. 4
D-21465 Reinbek bei Hamburg
T: +49 40 300672-30
F: +49 40 300672-11
E: michael.fuck...@delta.de <mailto:michael.fuck...@delta.de>

Internet: https://www.delta.de
Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
Geschäftsführer: Hans-Peter Hellmann




--
DELTA Computer Products GmbH
Röntgenstr. 4
D-21465 Reinbek bei Hamburg
T: +49 40 300672-30
F: +49 40 300672-11
E: fuck...@delta.de

Internet: https://www.delta.de
Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
Geschäftsführer: Hans-Peter Hellmann


[OMPI users] Differences 4.0.3 -> 4.0.4 (Regression?)

2020-08-06 Thread Michael Fuckner via users

Hi,

I have a small setup with one headnode and two compute nodes connected 
via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed 
openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration 
(configure, compile, nothing configured in openmpi-mca-params.conf), the 
output from ompi-info and orte-info looks identical.


There is a small benchmark basically just doing MPI_Send() and 
MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4)


/opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 -nolocal 
./OWnetbench.openmpi-4.0.3


when running this job from slurm, it works with 4.0.3, but there is an 
error with 4.0.4. Any hint what to check?



### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with 
/opt/openmpi/4.0.4/gcc/bin/mpirun ###
[node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]: 
[../../../../../../../BB]
[node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file 
client/pmix_client.c at line 231
[node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at 
line 112

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[node002.cluster:04963] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able 
to guarantee that all other processes were kil

led!
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun detected that one or more processes exited with non-zero status, 
thus causing

the job to be terminated. The first process to do so was:

  Process name: [[15424,1],0]
  Exit code:1
--

Any hint why 4.0.4 behaves not like the other versions?

--
DELTA Computer Products GmbH
Röntgenstr. 4
D-21465 Reinbek bei Hamburg
T: +49 40 300672-30
F: +49 40 300672-11
E: michael.fuck...@delta.de

Internet: https://www.delta.de
Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
Geschäftsführer: Hans-Peter Hellmann


Re: [OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-09 Thread Heinz, Michael William via users
That's it! I was trying to remember what the setting was, but I haven't worked on 
those HCAs since around 2012, so my memory of it is faint.

That said, I found the Intel TrueScale manual online at 
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/OFED_Host_Software_UserGuide_G91902_06.pdf

TS is the same hardware as the old QLogic QDR HCAs so the manual might be 
helpful to you in the future.

Sent from my iPad

On May 9, 2020, at 9:52 AM, Patrick Bégou via users  
wrote:


On 08/05/2020 at 21:56, Prentice Bisbal via users wrote:

We often get the following errors when more than one job runs on the same 
compute node. We are using Slurm with OpenMPI. The IB cards are QLogic using 
PSM:

10698ipath_userinit: assign_context command failed: Network is down
node01.10698can't open /dev/ipath, network down (err=26)
node01.10703ipath_userinit: assign_context command failed: Network is down
node01.10703can't open /dev/ipath, network down (err=26)
node01.10701ipath_userinit: assign_context command failed: Network is down
node01.10701can't open /dev/ipath, network down (err=26)
node01.10700ipath_userinit: assign_context command failed: Network is down
node01.10700can't open /dev/ipath, network down (err=26)
node01.10697ipath_userinit: assign_context command failed: Network is down
node01.10697can't open /dev/ipath, network down (err=26)
--
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

Error: Could not detect network connectivity
--

Any Ideas how to fix this?

--
Prentice


Hi Prentice,

This is not openMPI related but merely due to your hardware. I don't have many 
details, but I think this occurs when several jobs share the same node and you 
have a large number of cores on these nodes (> 14). If this is the case:

On Qlogic (I'm using such hardware at this time) you have 16 channels for 
communication on each HBA and, if I remember what I had read many years ago, 2 
are dedicated to the system. When launching MPI applications, each process of a 
job requests its own dedicated channel if available, otherwise they share ALL 
the available channels. So if a second job starts on the same node, no channel 
remains available.

To avoid this situation I force the channels to be shared by 2 MPI processes 
(my nodes have 20 cores). You can set this with a simple environment variable. 
On all my cluster nodes I create the file:

/etc/profile.d/ibsetcontext.sh

And it contains:

# allow 4 processes to share an hardware MPI context
# in infiniband with PSM
export PSM_RANKS_PER_CONTEXT=2

Of course, if some people manage to oversubscribe the cores (more than one 
process per core) it could raise the problem again, but we do not oversubscribe.
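A quick way to check that the variable is really seen by the jobs (just a 
sketch, run from inside an allocation) is:

  srun -n1 sh -c 'echo PSM_RANKS_PER_CONTEXT=$PSM_RANKS_PER_CONTEXT'

With 16 hardware contexts per HBA and 2 reserved for the system, a value of 2 
lets up to 28 ranks per node obtain a (shared) context.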

Hope this can help you.

Patrick


[OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-09 Thread Heinz, Michael William via users
Prentice,

Avoiding the obvious question of whether your FM is running and the fabric is 
in an active state, it sounds like you're exhausting a resource on the cards. 
Ralph is correct that support for QLogic cards is long past, but I'll see 
what I can dig up in the archives on Monday to see if there's a parameter you 
can adjust.

My vague recollection is that you shouldn't try to have more compute processes 
than you have cores, as some resources are allocated on that basis. You might 
also look at the modinfo output for the device driver to see if there are any 
likely looking suspects.
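For example, assuming the driver module for those cards is ib_qib (use whatever 
lsmod actually shows on your nodes):

  lsmod | grep -E 'ib_qib|ipath'
  modinfo -p ib_qib     # lists the driver's tunable parameters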

Honestly, chances are better that you'll get a hint from modinfo than that I'll 
find a tuning guide lying around. Are these cards DDR or QDR?

Sent from my iPad

Re: [OMPI users] openmpi/pmix/ucx

2020-02-07 Thread Michael Di Domenico via users
did you happen to get 4.7.1 which comes with ucx-1.7.0-1.47100
compiled again openmpi 4.0.2?

i got snagged by this

https://github.com/open-mpi/ompi/issues/7128

which i thought would have had the fixes merged into the v4.0.2 tag,
but it doesn't seem so in my case


On Fri, Feb 7, 2020 at 11:34 AM Ray Muno via users
 wrote:
>
> Were using MLNX_OFED 4.7.3. It supplies UCX 1.7.0.
>
> We have OpenMPI 4.02 compiled against the Mellanox OFED 4.7.3 provided 
> versions of UCX, KNEM and
> HCOLL, along with HWLOC 2.1.0 from the OpenMPI site.
>
> I mirrored the build to be what Mellanox used to configure OpenMPI in HPC-X 
> 2.5.
>
> I have users using GCC, PGI, Intel and AOCC compilers with this config.  PGI 
> was the only one that
> was a challenge to build due to conflicts with HCOLL.
>
> -Ray Muno
>
> On 2/7/20 10:04 AM, Michael Di Domenico via users wrote:
> > i haven't compiled openmpi in a while, but i'm in the process of
> > upgrading our cluster.
> >
> > the last time i did this there were specific versions of mpi/pmix/ucx
> > that were all tested and supposed to work together.  my understanding
> > was that this was because pmix/ucx were under rapid development and the
> > APIs were changing
> >
> > is that still an issue or can i take the latest stable branches from
> > git for each and have a relatively good shot at it all working
> > together?
> >
> > the one semi-immovable i have right now is ucx which is at 1.7.0 as
> > installed by mellanox ofed.  if the above is true, is there a matrix
> > of versions i should be using for all the others?  nothing jumped out
> > at me on the openmpi website
> >
>
>
> --
>
>   Ray Muno
>   IT Manager
>   e-mail:   m...@aem.umn.edu
>   University of Minnesota
>   Aerospace Engineering and Mechanics Mechanical Engineering
>


[OMPI users] openmpi/pmix/ucx

2020-02-07 Thread Michael Di Domenico via users
i haven't compiled openmpi in a while, but i'm in the process of
upgrading our cluster.

the last time i did this there were specific versions of mpi/pmix/ucx
that were all tested and supposed to work together.  my understanding
was that this was because pmix/ucx were under rapid development and the
APIs were changing

is that still an issue or can i take the latest stable branches from
git for each and have a relatively good shot at it all working
together?

the one semi-immovable i have right now is ucx which is at 1.7.0 as
installed by mellanox ofed.  if the above is true, is there a matrix
of versions i should be using for all the others?  nothing jumped out
at me on the openmpi website
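for reference, a quick way to see what's actually in play on a node is
something like this (assuming the tools are on PATH):

  ucx_info -v                                   # ucx version from the mellanox ofed install
  ompi_info | grep -i -E 'open mpi:|ucx|pmix'   # openmpi version and its ucx/pmix components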


[OMPI users] Subject: need a tool and its use to verify use of infiniband network

2020-01-16 Thread Heinz, Michael William via users
btl_base_verbose may do what you need. Add it to your mpirun arguments. For 
example:

[LINUX hds1fna2271 20200116_1404 mpi_apps]# 
/usr/mpi/gcc/openmpi-3.1.6/bin/mpirun -np 2 -map-by node --allow-run-as-root 
-machinefile /usr/src/opa/mpi_apps/mpi_hosts -mca btl self,openib,vader -mca 
btl_base_verbose 9 -mca plm_rsh_no_tree_spawn 1 
/usr/src/opa/mpi_apps/imb/src/IMB-MPI1
[hds1fna2271:37308] Checking distance from this process to device=hfi1_0
[hds1fna2271:37308] Process is not bound: distance to device is 0.00
[hds1fna2271:37308] rdmacm CPC only supported when the first QP is a PP QP; 
skipped
[hds1fna2271:37308] openib BTL: rdmacm CPC unavailable for use on hfi1_0:1; 
skipped
[hds1fna2271:37308] [rank=0] openib: using port hfi1_0:1
[hds1fna2272:49507] Checking distance from this process to device=hfi1_0
[hds1fna2272:49507] Process is not bound: distance to device is 0.00
[hds1fna2272:49507] rdmacm CPC only supported when the first QP is a PP QP; 
skipped
[hds1fna2272:49507] openib BTL: rdmacm CPC unavailable for use on hfi1_0:1; 
skipped
[hds1fna2272:49507] [rank=1] openib: using port hfi1_0:1
.
.
.
.

> -Original Message-
> From: users  On Behalf Of users-
> requ...@lists.open-mpi.org
> Sent: Thursday, January 16, 2020 2:00 PM
> To: users@lists.open-mpi.org
> Subject: users Digest, Vol 4374, Issue 1
> 
> Send users mailing list submissions to
>   users@lists.open-mpi.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>   https://lists.open-mpi.org/mailman/listinfo/users
> or, via email, send a message with subject or body 'help' to
>   users-requ...@lists.open-mpi.org
> 
> You can reach the person managing the list at
>   users-ow...@lists.open-mpi.org
> 
> When replying, please edit your Subject line so it is more specific than "Re:
> Contents of users digest..."
> 
> 
> Today's Topics:
> 
>1. need a tool and its use to verify use of infiniband network
>   (SOPORTE MODEMAT)
> 
> 
> --
> 
> Message: 1
> Date: Wed, 15 Jan 2020 21:22:03 +
> From: SOPORTE MODEMAT 
> To: "users@lists.open-mpi.org" 
> Subject: [OMPI users] need a tool and its use to verify use of
>   infiniband network
> Message-ID:
>09.namprd17.prod.outlook.com>
> 
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Hello guys.
> 
> I would like you to help me with a tool or a method to verify the use of
> the infiniband network interface when I run the command:
> 
> /opt/mpi/openmpi_intel-2.1.1/bin/mpirun --mca btl self,openib,vader
> python mpi_hola.py
> 
> Is there a way to verify that the infiniband interface is
> being used? If so, how can I do it?
> 
> Thank you in advance for your help
> 
> Best regards.
> 
> Msc. Mercy Anchundia Ruiz.
> ICT Specialist
> Tlf. +59322976300  EXT 1537
> https://hpcmodemat.epn.edu.ec/
> Follow us on Twitter: @HPCModemat
> MODEMAT -EPN
> 


Re: [OMPI users] silent failure for large allgather

2019-09-25 Thread Heinz, Michael William via users
Emmanuel Thomé,

Thanks for bringing this to our attention. It turns out this issue affects all 
OFI providers in open-mpi. We've applied a fix to the 3.0.x and later branches 
of open-mpi/ompi on github. However, you should be aware that this fix simply 
adds the appropriate error message; it does not allow OFI to support message 
sizes larger than the OFI provider actually supports. That will require a more 
significant effort, which we are evaluating now.
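In the meantime, the message-size limit a given provider advertises can be 
inspected with libfabric's fi_info utility, for example:

  fi_info -v -p psm2 | grep max_msg_size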

---
Mike Heinz
Networking Fabric Software Engineer
Intel Corporation


Re: [OMPI users] local rank to rank comms

2019-03-20 Thread Michael Di Domenico
unfortunately it takes a while to export the data, but here's what i see

On Mon, Mar 11, 2019 at 11:02 PM Gilles Gouaillardet  wrote:
>
> Michael,
>
>
> this is odd, I will have a look.
>
> Can you confirm you are running on a single node ?
>
>
> At first, you need to understand which component is used by Open MPI for
> communications.
>
> There are several options here, and since I do not know how Open MPI was
> built, nor which dependencies are installed,
>
> I can only list a few
>
>
> - pml/cm uses mtl/psm2 => omnipath is used for both inter and intra node
> communications
>
> - pml/cm uses mtl/ofi => libfabric is used for both inter and intra node
> communications. it definitely uses libpsm2 for inter node
> communications, and I do not know enough about the internals to tell how
> inter communications are handled
>
> - pml/ob1 is used, I guess it uses btl/ofi for inter node communications
> and btl/vader for intra node communications (in that case the NIC device
> is not used for intra node communications
>
> there could be other I am missing (does UCX support OmniPath ? could
> btl/ofi also be used for intra node communications ?)
>
>
> mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca
> mtl_base_verbose 10 ...
>
> should tell you what is used (feel free to compress and post the full
> output if you have some hard time understanding the logs)
>
>
> Cheers,
>
>
> Gilles
>
> On 3/12/2019 1:41 AM, Michael Di Domenico wrote:
> > On Mon, Mar 11, 2019 at 12:09 PM Gilles Gouaillardet
> >  wrote:
> >> You can force
> >> mpirun --mca pml ob1 ...
> >> And btl/vader (shared memory) will be used for intra node communications 
> >> ... unless MPI tasks are from different jobs (read MPI_Comm_spawn())
> > if i run
> >
> > mpirun -n 16 IMB-MPI1 alltoallv
> > things run fine, 12us on average for all ranks
> >
> > if i run
> >
> > mpirun -n 16 --mca pml ob1 IMB-MPI1 alltoallv
> > the program runs, but then it hangs at "List of benchmarks to run:
> > #Alltoallv"  and no tests run


ompi.run.ob1
Description: Binary data


ompi.run.cm
Description: Binary data

Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
On Mon, Mar 11, 2019 at 12:09 PM Gilles Gouaillardet
 wrote:
> You can force
> mpirun --mca pml ob1 ...
> And btl/vader (shared memory) will be used for intra node communications ... 
> unless MPI tasks are from different jobs (read MPI_Comm_spawn())

if i run

mpirun -n 16 IMB-MPI1 alltoallv
things run fine, 12us on average for all ranks

if i run

mpirun -n 16 --mca pml ob1 IMB-MPI1 alltoallv
the program runs, but then it hangs at "List of benchmarks to run:
#Alltoallv"  and no tests run


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
On Mon, Mar 11, 2019 at 12:19 PM Ralph H Castain  wrote:
> OFI uses libpsm2 underneath it when omnipath detected
>
> > On Mar 11, 2019, at 9:06 AM, Gilles Gouaillardet 
> >  wrote:
> > It might show that pml/cm and mtl/psm2 are used. In that case, then yes, 
> > the OmniPath library is used even for intra node communications. If this 
> > library is optimized for intra node, then it will internally uses shared 
> > memory instead of the NIC.

would it be fair to assume that, if we assume the opa library is
optimized for intra-node using shared memory, there shouldn't be much
of a difference between the opa library and the ompi library for local
rank to rank comms?

is there a way or tool to measure that?  i'd like to run the tests
toggling the opa vs ompi libraries and see whether, and really how much
of, a difference there is
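e.g. something along these lines, pinning each path explicitly for two local
ranks and comparing (the benchmark binary is just a placeholder):

  # intra-node traffic through the opa/psm2 library
  mpirun -np 2 --mca pml cm --mca mtl psm2 ./osu_latency
  # intra-node traffic through openmpi's shared memory btl
  mpirun -np 2 --mca pml ob1 --mca btl vader,self ./osu_latency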


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
On Mon, Mar 11, 2019 at 11:51 AM Ralph H Castain  wrote:
> You are probably using the ofi mtl - could be psm2 uses loopback method?

according to ompi_info i do in fact have mtls ofi,psm,psm2.  i
haven't changed any of the defaults, so are you saying that in order to
change the behaviour i have to run mpirun --mca mtl psm2?  if true,
what's the recourse to not using the ofi mtl?
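for reference, i assume the knobs involved are something like this (sketch):

  ompi_info | grep ' mtl'      # list the mtl components that were built
  export OMPI_MCA_mtl=psm2     # same effect as --mca mtl psm2 on every mpirun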


[OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
i have a user that's claiming that when two ranks on the same node want to
talk with each other, they're using the NIC to talk rather than just
talking directly.

i've never had to test such a scenario.  is there a way for me to
prove one way or another whether two ranks are talking through say the
kernel (or however it actually works) or using the nic?

i didn't set any flags when i compiled openmpi to change this.

i'm running ompi 3.1, pmix 2.2.1, and slurm 18.05 running atop omnipath


Re: [OMPI users] pmix and srun

2019-01-18 Thread Michael Di Domenico
seems to be better now.  jobs are running
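as a quick sanity check (sketch):

  srun --mpi=list              # should now list pmix_v2 among the available types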

On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain  wrote:
>
> I have pushed a fix to the v2.2 branch - could you please confirm it?
>
>
> > On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
> >
> > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm 
> > plugin folks seem to be off somewhere for awhile and haven’t been testing 
> > it. Sigh.
> >
> > I’ll patch the branch and let you know - we’d appreciate the feedback.
> > Ralph
> >
> >
> >> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
> >> wrote:
> >>
> >> here's the branches i'm using.  i did a git clone on the repo's and
> >> then a git checkout
> >>
> >> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
> >> [ec2-user@labhead pmix]$ git branch
> >> master
> >> * v2.2
> >> [ec2-user@labhead pmix]$ cd ../slurm/
> >> [ec2-user@labhead slurm]$ git branch
> >> * (detached from origin/slurm-18.08)
> >> master
> >> [ec2-user@labhead slurm]$ cd ../ompi/
> >> [ec2-user@labhead ompi]$ git branch
> >> * (detached from origin/v3.1.x)
> >> master
> >>
> >>
> >> attached is the debug out from the run with the debugging turned on
> >>
> >> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
> >>>
> >>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> >>> notification system in the Slurm plugin, but you should only be trying to 
> >>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> >>> definitely doesn’t do.
> >>>
> >>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
> >>> that slipped through our automated testing. I replaced it today with 
> >>> v2.2.1 - you probably should update if that’s the case. However, that 
> >>> wouldn’t necessarily explain this behavior. I’m not that familiar with 
> >>> the Slurm plugin, but you might try adding
> >>>
> >>> PMIX_MCA_pmix_client_event_verbose=5
> >>> PMIX_MCA_pmix_server_event_verbose=5
> >>> OMPI_MCA_pmix_base_verbose=10
> >>>
> >>> to your environment and see if that provides anything useful.
> >>>
> >>>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico 
> >>>>  wrote:
> >>>>
> >>>> i compilied pmix slurm openmpi
> >>>>
> >>>> ---pmix
> >>>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
> >>>> --disable-debug
> >>>> ---slurm
> >>>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
> >>>> --with-pmix=/hpc/pmix/2.2
> >>>> ---openmpi
> >>>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
> >>>> --with-libevent=external --with-slurm=/hpc/slurm/18.08
> >>>> --with-pmix=/hpc/pmix/2.2
> >>>>
> >>>> everything seemed to compile fine, but when i do an srun i get the
> >>>> below errors, however, if i salloc and then mpirun it seems to work
> >>>> fine.  i'm not quite sure where the breakdown is or how to debug it
> >>>>
> >>>> ---
> >>>>
> >>>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> >>>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
> >>>> event/pmix_event_registration.c at line 101
> >>>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
> >>>> event/pmix_event_registration.c at line 101
> >>>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
> >>>> event/pmix_event_registration.c at line 101
> >>>> --
> >>>> It looks like MPI_INIT failed for some reason; your parallel process is
> >>>> likely to abort.  There are many reasons that a parallel process can
> >>>> fail during MPI_INIT; some of which are due to configuration or 
> >>>> environment
> >>>> problems.  This failure appears to be an internal failure; here's some
> >>>> additional information (which may only be relevant to an Open MPI
> >>>> developer):
> >>>>
> >>>> ompi_interlib_declare
> >>>> --> Returned "Would block" (-10) instead of "Success" (0)
> >

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Michael Di Domenico
here are the branches i'm using.  i did a git clone on the repos and
then a git checkout

[ec2-user@labhead bin]$ cd /hpc/src/pmix/
[ec2-user@labhead pmix]$ git branch
  master
* v2.2
[ec2-user@labhead pmix]$ cd ../slurm/
[ec2-user@labhead slurm]$ git branch
* (detached from origin/slurm-18.08)
  master
[ec2-user@labhead slurm]$ cd ../ompi/
[ec2-user@labhead ompi]$ git branch
* (detached from origin/v3.1.x)
  master


attached is the debug out from the run with the debugging turned on
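roughly speaking, the debugging here means exporting the variables ralph
suggests below before the launch, e.g.:

  export PMIX_MCA_pmix_client_event_verbose=5
  export PMIX_MCA_pmix_server_event_verbose=5
  export OMPI_MCA_pmix_base_verbose=10
  srun --mpi=pmix_v2 -n 16 xhpl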

On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
>
> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> notification system in the Slurm plugin, but you should only be trying to 
> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> definitely doesn’t do.
>
> If you are using PMIx v2.2.0, then please note that there is a bug in it that 
> slipped through our automated testing. I replaced it today with v2.2.1 - you 
> probably should update if that’s the case. However, that wouldn’t necessarily 
> explain this behavior. I’m not that familiar with the Slurm plugin, but you 
> might try adding
>
> PMIX_MCA_pmix_client_event_verbose=5
> PMIX_MCA_pmix_server_event_verbose=5
> OMPI_MCA_pmix_base_verbose=10
>
> to your environment and see if that provides anything useful.
>
> > On Jan 18, 2019, at 12:09 PM, Michael Di Domenico  
> > wrote:
> >
> > i compiled pmix, slurm, openmpi
> >
> > ---pmix
> > ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
> > --disable-debug
> > ---slurm
> > ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
> > --with-pmix=/hpc/pmix/2.2
> > ---openmpi
> > ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
> > --with-libevent=external --with-slurm=/hpc/slurm/18.08
> > --with-pmix=/hpc/pmix/2.2
> >
> > everything seemed to compile fine, but when i do an srun i get the
> > below errors, however, if i salloc and then mpirun it seems to work
> > fine.  i'm not quite sure where the breakdown is or how to debug it
> >
> > ---
> >
> > [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> > [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > --
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >  ompi_interlib_declare
> >  --> Returned "Would block" (-10) instead of "Success" (0)
> > ...snipped...
> > [labcmp6:18355] *** An error occurred in MPI_Init
> > [labcmp6:18355] *** reported by process [140726281390153,15]
> > [labcmp6:18355] *** on a NULL communicator
> > [labcmp6:18355] *** Unknown error
> > [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18355] ***and potentially your MPI job)
> > [labcmp6:18352] *** An error occurred in MPI_Init
> > [labcmp6:18352] *** reported by process [1677936713,12]
> > [labcmp6:18352] *** on a NULL communicator
> > [labcmp6:18352] *** Unknown error
> > [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18352] ***and potentially your MPI job)
> > [labcmp6:18354] *** An error occurred in MPI_Init
> > [labcmp6:18354] *** reported by process [140726281390153,14]
> > [labcmp6:18354] *** on a NULL communicator
> > [labcmp6:18354] *** Unknown error
> > [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18354] ***and potentially your MPI job)
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 
> > 2019-01-18T20:03:33 ***
> > [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > --
> > It looks like MPI_INIT failed for some reason; your parallel process is

[OMPI users] Fwd: pmix and srun

2019-01-18 Thread Michael Di Domenico
i compiled pmix, slurm, openmpi

---pmix
./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
--disable-debug
---slurm
./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
--with-pmix=/hpc/pmix/2.2
---openmpi
./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
--with-libevent=external --with-slurm=/hpc/slurm/18.08
--with-pmix=/hpc/pmix/2.2

everything seemed to compile fine, but when i do an srun i get the
below errors, however, if i salloc and then mpirun it seems to work
fine.  i'm not quite sure where the breakdown is or how to debug it

---

[ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
[labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
[labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
[labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
...snipped...
[labcmp6:18355] *** An error occurred in MPI_Init
[labcmp6:18355] *** reported by process [140726281390153,15]
[labcmp6:18355] *** on a NULL communicator
[labcmp6:18355] *** Unknown error
[labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[labcmp6:18355] ***and potentially your MPI job)
[labcmp6:18352] *** An error occurred in MPI_Init
[labcmp6:18352] *** reported by process [1677936713,12]
[labcmp6:18352] *** on a NULL communicator
[labcmp6:18352] *** Unknown error
[labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[labcmp6:18352] ***and potentially your MPI job)
[labcmp6:18354] *** An error occurred in MPI_Init
[labcmp6:18354] *** reported by process [140726281390153,14]
[labcmp6:18354] *** on a NULL communicator
[labcmp6:18354] *** Unknown error
[labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[labcmp6:18354] ***and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
[labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--
[labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
[labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
srun: error: labcmp6: tasks 12-15: Exited with exit code 1
srun: error: labcmp3: tasks 0-3: Killed
srun: error: labcmp4: tasks 4-7: Killed
srun: error: labcmp5: tasks 8-11: Killed
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] OpenFabrics warning

2018-11-12 Thread Michael Di Domenico
On Mon, Nov 12, 2018 at 8:08 AM Andrei Berceanu
 wrote:
>
> Running a CUDA+MPI application on a node with 2 K80 GPUs, I get the following 
> warnings:
>
> --
> WARNING: There is at least non-excluded one OpenFabrics device found,
> but there are no active ports detected (or Open MPI was unable to use
> them).  This is most certainly not what you wanted.  Check your
> cables, subnet manager configuration, etc.  The openib BTL will be
> ignored for this job.
>
>   Local host: gpu01
> --
> [gpu01:107262] 1 more process has sent help message help-mpi-btl-openib.txt / 
> no active ports found
> [gpu01:107262] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
>
> Any idea of what is going on and how I can fix this?
> I am using OpenMPI 3.1.2.

looks like openmpi found something like an infiniband card in the
compute node you're using, but it is not active/usable

as for a fix, it depends.

if you have an IB card should it be active?  if so, you'd have to
check the connections to see why it's disabled

if not, you can tell openmpi to disregard the IB ports, which will
clear the warning, but that might mean you're potentially using a
slower interface for message passing
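
e.g. something along these lines should quiet it if the card really isn't
meant to be used (untested on your setup; "your_app" is just a placeholder):

mpirun --mca btl ^openib -n 4 ./your_app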
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Problem running with UCX/oshmem on single node?

2018-05-14 Thread Michael Di Domenico
On Wed, May 9, 2018 at 9:45 PM, Howard Pritchard  wrote:
>
> You either need to go and buy a connectx4/5 HCA from mellanox (and maybe a
> switch), and install that
> on your system, or else install xpmem (https://github.com/hjelmn/xpmem).
> Note there is a bug right now
> in UCX that you may hit if you try to go thee xpmem only  route:

How stringent is the Connect-X 4/5 requirement?  i have Connect-X 3
cards; will they work?  during the configure step it seems to yell at
me that mlx5 won't compile because i don't have Mellanox OFED v3.1
installed; is that also a requirement? (i'm using the RHEL7.4 bundled
version of ofed, not the vendor versions)
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] shmem

2018-05-09 Thread Michael Di Domenico
before i debug ucx further (cause it's totally not working for me), i
figured i'd check to see if it's *really* required to use shmem inside
of openmpi.  i'm pretty sure the answer is yes, but i wanted to double
check.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] openmpi/slurm/pmix

2018-04-25 Thread Michael Di Domenico
On Mon, Apr 23, 2018 at 6:07 PM, r...@open-mpi.org  wrote:
> Looks like the problem is that you didn’t wind up with the external PMIx. The 
> component listed in your error is the internal PMIx one which shouldn’t have 
> built given that configure line.
>
> Check your config.out and see what happened. Also, ensure that your 
> LD_LIBRARY_PATH is properly pointing to the installation, and that you built 
> into a “clean” prefix.

the "clean prefix" part seemed to fix my issue.  i'm not exactly sure
i understand why/how though.  i recompiled pmix and removed the old
installation before doing a make install

when i recompiled openmpi it seems to have figured itself out

i think things are still a little wonky, but at least that issue is gone
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] openmpi/slurm/pmix

2018-04-23 Thread Michael Di Domenico
i'm trying to get slurm 17.11.5 and openmpi 3.0.1 working with pmix.

everything compiled, but when i run something it get

: symbol lookup error: /openmpi/mca_pmix_pmix2x.so: undefined symbol:
opal_libevent2022_evthread_use_pthreads

i more then sure i did something wrong, but i'm not sure what, here's what i did

compile libevent 2.1.8

./configure --prefix=/libevent-2.1.8

compile pmix 2.1.0

./configure --prefix=/pmix-2.1.0 --with-psm2
--with-munge=/munge-0.5.13 --with-libevent=/libevent-2.1.8

compile openmpi

./configure --prefix=/openmpi-3.0.1 --with-slurm=/slurm-17.11.5
--with-hwloc=external --with-mxm=/opt/mellanox/mxm
--with-cuda=/usr/local/cuda --with-pmix=/pmix-2.1.0
--with-libevent=/libevent-2.1.8

when i look at the symbols in the mca_pmix_pmix2x.so library the
function is indeed undefined (U) in the output, but checking ldd
against the library doesn't show anything missing
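
for reference, the checks i ran were roughly (paths per my prefix above):

nm -D /openmpi-3.0.1/lib/openmpi/mca_pmix_pmix2x.so | grep evthread_use_pthreads
ldd /openmpi-3.0.1/lib/openmpi/mca_pmix_pmix2x.so | grep "not found"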

any thoughts?
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] disabling libraries?

2018-04-10 Thread Michael Di Domenico
On Sat, Apr 7, 2018 at 3:50 PM, Jeff Squyres (jsquyres)
 wrote:
> On Apr 6, 2018, at 8:12 AM, Michael Di Domenico  
> wrote:
>> it would be nice if openmpi had (or may already have) a simple switch
>> that lets me disable entire portions of the library chain, ie this
>> host doesn't have a particular interconnect, so don't load any of the
>> libraries.  this might run counter to how openmpi discovers and load
>> libs though.
>
> We've actually been arguing about exactly how to do this for quite a while.  
> It's complicated (I can explain further, if you care).  :-\

i have no doubt it's complicated.  i'm not overly interested in the
detail, but others i'm sure might be.  in reality you're correct, i
don't care that openmpi failed to load the libs given the fact that
the job continues to run without issue.  and in fact i don't even care
about the warnings, but my users will complain and ask questions.

achieving a single build binary where i can disable the
interconnects/libraries at runtime would be HIGHLY beneficial to me
(perhaps others as well).  it cuts my build version combinations from
like 12 to 4 (or less), that's a huge reduction in labour/maintenance.
which also means i can upgrade openmpi quicker and stay more up to
date.

i would gather this is probably not a high priority for the team
working on openmpi, but if there's something my organization or I can
do to push this higher, let me know.

> That being said, I think we *do* have a workaround that might be good enough 
> for you: disable those warnings about plugins not being able to be opened:
> mpirun --mca mca_component_show_load_errors 0 ...

disabled this: mca_base_component_repository_open: unable to open
mca_oob_ud: libibverbs.so.1
but not this: pmix_mca_base_component_repository_open: unable to open
mca_pnet_opa: libpsm2.so.2
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] disabling libraries?

2018-04-06 Thread Michael Di Domenico
On Thu, Apr 5, 2018 at 7:59 PM, Gilles Gouaillardet
 wrote:
> That being said, the error suggest mca_oob_ud.so is a module from a
> previous install,
> Open MPI was not built on the system it is running, or libibverbs.so.1
> has been removed after
> Open MPI was built.

yes, understood, i compiled openmpi on a node that has all the
libraries installed for our various interconnects, opa/psm/mxm/ib, but
i ran mpirun on a node that has none of them

so the resulting warnings i get

mca_btl_openib: lbrdmacm.so.1
mca_btl_usnic: libfabric.so.1
mca_oob_ud: libibverbs.so.1
mca_mtl_mxm: libmxm.so.2
mca_mtl_ofi: libfabric.so.1
mca_mtl_psm: libpsm_infinipath.so.1
mca_mtl_psm2: libpsm2.so.2
mca_pml_yalla: libmxm.so.2

you referenced them as "errors" above, but mpi actually runs just fine
for me even with these msgs, so i would consider them more warnings.

> So I do encourage you to take a step back, and think if you can find a
> better solution for your site.

there are two alternatives

1 i can compile a specific version of openmpi for each of our clusters
with each specific interconnect libraries

2 i can install all the libraries on all the machines regardless of
whether the interconnect is present

both are certainly plausible, but my effort here is to see if i can
reduce the size of our software stack and/or reduce the number of
compiled versions of openmpi

it would be nice if openmpi had (or may already have) a simple switch
that lets me disable entire portions of the library chain, ie this
host doesn't have a particular interconnect, so don't load any of the
libraries.  this might run counter to how openmpi discovers and load
libs though.
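
to be clear about what i mean, today i can already steer a job away from the
missing interconnects at runtime with something like the below (component
names taken from the warnings above, exact selection untested):

mpirun --mca pml ob1 --mca btl self,vader,tcp --mca mtl ^psm,psm2,mxm,ofi ...

but that doesn't stop the load warnings, since the plugins still get
dlopen'ed before the selection logic kicks in.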
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] disabling libraries?

2018-04-05 Thread Michael Di Domenico
i'm trying to compile openmpi to support all of our interconnects,
psm/openib/mxm/etc

this works fine, openmpi finds all the libs, compiles and runs on each
of the respective machines

however, we don't install the libraries for everything everywhere

so when i run things like ompi_info and mpirun i get

mca_base_component_reposity_open: unable to open mca_oob_ud:
libibverbs.so.1: cannot open shared object file: no such file or
directory (ignored)

and so on, for a bunch of other libs.

i understand how the lib linking works so this isn't unexpected and
doesn't stop the mpi programs from running.

here's the part i don't understand: how can i trace the above warning
and others like it back to the required --mca parameters i need to add
into the configuration to make the warnings go away?

as an aside, i believe i can set most of them via environment
variables as well as on the command line, but what i'd really like to do
is set them from a file.  i know i can create a default param file, but is
there a way to feed a param file at invocation depending on where mpirun
is being run?
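
what i'm imagining is something along the lines of the below, assuming the
mca_base_param_files variable can be pointed at a per-host file like any
other mca parameter (i haven't verified that):

OMPI_MCA_mca_base_param_files=/site/etc/$(hostname -s)-mca.conf mpirun -n 16 ./a.out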
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Using OMPI Standalone in a Windows/Cygwin Environment

2018-02-26 Thread Michael A. Saverino
OK,

Thanks for your help.

Mike...

On 02/26/2018 05:07 PM, Marco Atzeri wrote:
> On 26/02/2018 22:57, Michael A. Saverino wrote:
>>
>> Marco,
>>
>> If you disable the loopback as well as the other adapters via Device
>> Manager, you should be able to reproduce the error.
>>
>> Mike...
>
> It worked with also the loopback disabled.
> Probably the installation of the loopback just enabled some
> network basic functionality
>
> Regards
> Marco
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Michael A.Saverino
Contractor 
Senior Engineer, Information Technology Division
Code 5522
Naval Research Laboratory 
W (202)767-5652
C (814)242-0217
https://www.nrl.navy.mil/itd/ncs/

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Using OMPI Standalone in a Windows/Cygwin Environment

2018-02-26 Thread Michael A. Saverino

Marco,

If you disable the loopback as well as the other adapters via Device
Manager, you should be able to reproduce the error.

Mike...

On 02/26/2018 04:51 PM, Marco Atzeri wrote:
> On 26/02/2018 22:10, Michael A. Saverino wrote:
>>
>> Marco,
>>
>> I think oob still has a problem, at least on my machine, even though we
>> specify --mca oob ^tcp.   The workaround I found is to install the
>> Microsoft loopback adapter.   That satisfies OMPI at startup even though
>> the ethernet or WiFi is either disabled or disconnected.  You still have
>> to answer Windows firewall questions (if enabled) permitting/not
>> permitting orterun and my application.  Do you have the Microsoft
>> Loopback adapter installed on your system?
>>
>> Many Thanks,
>>
>> Mike...
>>
>
> Yes it is installed.
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>

-- 
Michael A.Saverino
Contractor 
Senior Engineer, Information Technology Division
Code 5522
Naval Research Laboratory 
W (202)767-5652
C (814)242-0217
https://www.nrl.navy.mil/itd/ncs/

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Using OMPI Standalone in a Windows/Cygwin Environment

2018-02-26 Thread Michael A. Saverino

Marco,

I think oob still has a problem, at least on my machine, even though we
specify --mca oob ^tcp.   The workaround I found is to install the
Microsoft loopback adapter.   That satisfies OMPI at startup even though
the ethernet or WiFi is either disabled or disconnected.  You still have
to answer Windows firewall questions (if enabled) permitting/not
permitting orterun and my application.  Do you have the Microsoft
Loopback adapter installed on your system?

Many Thanks,

Mike...

On 02/26/2018 02:11 PM, Marco Atzeri wrote:
> On 26/02/2018 18:14, Michael A. Saverino wrote:
>>
>> I am running the v-1.10.7 OMPI package that is available via the Cygwin
>> package manager.  I have a requirement to run my OMPI application
>> standalone on a Windows/Cygwin system without any network connectivity.
>> If my OMPI system is not connected to the network, I get the following
>> errors when I try to run my OMPI application:
>>   
>
> Michael,
> do you mean without a network connected or without any network
> services running ?
>
> On my W7 it works with both the Wireless and Cable connection disabled
> or disconnected.
>
> $ mpirun -n 2 ./hello_c.exe
> Hello, world, I am 1 of 2, (Open MPI v1.10.7, package: Open MPI
> marco@GE-MATZERI-EU Distribution, ident: 1.10.7, repo rev:
> v1.10.6-48-g5e373bf, May 16, 2017, 129)
> Hello, world, I am 0 of 2, (Open MPI v1.10.7, package: Open MPI
> marco@GE-MATZERI-EU Distribution, ident: 1.10.7, repo rev:
> v1.10.6-48-g5e373bf, May 16, 2017, 129)
>
>
> Is it possible that you have some type of virtual network driver
> active , like VPN , active ?
>
> Regards
> Marco
>
>
>
>
> _______
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>

-- 
Michael A.Saverino
Contractor 
Senior Engineer, Information Technology Division
Code 5522
Naval Research Laboratory 
W (202)767-5652
C (814)242-0217
https://www.nrl.navy.mil/itd/ncs/


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Using OMPI Standalone in a Windows/Cygwin Environment

2018-02-26 Thread Michael A. Saverino

Thank you for the quick response.  Your suggested commands did not work
with the network interface disabled or unplugged.  I still get:

[SAXM4WIN:02124] [[20996,1],0] tcp_peer_send_blocking: send() to socket
12 failed: Transport endpoint is not connected (128)



So, in spite of including --mca oob ^tcp, OMPI still wants to see a
connected port somewhere on the system. Do you have any other suggestions?

The whole command is as follows:

.\orterun.exe --mca oob ^tcp --mca btl self, sm -n 2 ./program

Many Thanks,

Mike...



On 02/26/2018 12:45 PM, r...@open-mpi.org wrote:
> There are a couple of problems here. First the “^tcp,self,sm” is telling OMPI 
> to turn off all three of those transports, which probably leaves you with 
> nothing. What you really want is to restrict to shared memory, so your param 
> should be “-mca btl self,sm”. This will disable all transports other than 
> shared memory - note that you always must enable the “self” btl.
>
> Second, you likely also need to ensure that the OOB isn’t trying to use tcp, 
> so add “-mca oob ^tcp” to your cmd line. It shouldn’t be active anyway, but 
> better safe.
>
>
>> On Feb 26, 2018, at 9:14 AM, Michael A. Saverino 
>>  wrote:
>>
>>
>> I am running the v-1.10.7 OMPI package that is available via the Cygwin
>> package manager.  I have a requirement to run my OMPI application
>> standalone on a Windows/Cygwin system without any network connectivity. 
>> If my OMPI system is not connected to the network, I get the following
>> errors when I try to run my OMPI application:
>>  
>> [SAXM4WIN:02124] [[20996,1],0] tcp_peer_send_blocking: send() to socket
>> 12 failed: Transport endpoint is not connected (128)
>> [SAXM4WIN:02124] [[20996,1],0] tcp_peer_send_blocking: send() to socket
>> 12 failed: Transport endpoint is not connected (128)
>>
>> I have tried the following qualifiers in my OMPI command to no avail:
>>
>> --mca btl ^tcp,self,sm
>>
>> So the question is, am I able to disable TCP networking, either via
>> command line or code, if I only plan to use cores on a single machine
>> for OMPI execution?
>>
>> Many Thanks,
>>
>> Mike...
>>
>>
>>
>> -- 
>> Michael A.Saverino
>> Contractor 
>> Senior Engineer, Information Technology Division
>> Code 5522
>> Naval Research Laboratory 
>> W (202)767-5652
>> C (814)242-0217
>> https://www.nrl.navy.mil/itd/ncs/
>>
>>
>> _______
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>

-- 
Michael A.Saverino
Contractor 
Senior Engineer, Information Technology Division
Code 5522
Naval Research Laboratory 
W (202)767-5652
C (814)242-0217
https://www.nrl.navy.mil/itd/ncs/


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Using OMPI Standalone in a Windows/Cygwin Environment

2018-02-26 Thread Michael A. Saverino

I am running the v-1.10.7 OMPI package that is available via the Cygwin
package manager.  I have a requirement to run my OMPI application
standalone on a Windows/Cygwin system without any network connectivity. 
If my OMPI system is not connected to the network, I get the following
errors when I try to run my OMPI application:
 
[SAXM4WIN:02124] [[20996,1],0] tcp_peer_send_blocking: send() to socket
12 failed: Transport endpoint is not connected (128)
[SAXM4WIN:02124] [[20996,1],0] tcp_peer_send_blocking: send() to socket
12 failed: Transport endpoint is not connected (128)

I have tried the following qualifiers in my OMPI command to no avail:

--mca btl ^tcp,self,sm

So the question is, am I able to disable TCP networking, either via
command line or code, if I only plan to use cores on a single machine
for OMPI execution?

Many Thanks,

Mike...



-- 
Michael A.Saverino
Contractor 
Senior Engineer, Information Technology Division
Code 5522
Naval Research Laboratory 
W (202)767-5652
C (814)242-0217
https://www.nrl.navy.mil/itd/ncs/


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] openmpi hang on IB disconnect

2018-01-17 Thread Michael Di Domenico
openmpi-2.0.2 running on rhel 7.4 with qlogic QDR infiniband
switches/adapters, also using slurm

i have a user that's running a job over multiple days.  unfortunately
after a few days at random the job will seemingly hang.  the latest
instance was caused by an infiniband adapter that went offline and
online several times.

the card is in a semi-working state at the moment, it's passing
traffic, but i suspect some of the IB messages during the job run got
lost and now the job is seemingly hung.

is there some mechanism i can put in place to detect this condition
either in the code or on the system.  it's causing two problems at the
moment.  first and foremost the user has no idea the job hung and for
what reason.  second it's wasting system time.

i'm sure other people have come across wonky IB cards, i'm curious how
everyone else is detecting this condition and dealing with it.
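
the only thing i've come up with so far is a crude cron check of the port
state on each node, roughly like the below (assuming the standard
infiniband-diags tools are installed), but that only catches the card, not
the hung job itself:

# flag any IB port that isn't Active/LinkUp
ibstat | egrep "State:|Physical state:" | grep -v -e Active -e LinkUp \
  && echo "IB port problem on $(hostname)"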
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Vague error message while executing MPI-Fortran program

2017-11-05 Thread Michael Mauersberger
Hi,

thank you for your help. Unfortunately I don't have access to the source of
the calling program. Maybe there is a subtle problem with some MPI commands.
But I have solved the problem in another way.

There is a module in the basic library using PRIVATE variables to call
predefined procedures according to several cases of calculation. That means
the private variables are changed so that they can adapt a general routine
to special calculations.

So I deleted the private variables and passed them to the procedures as
arguments instead. Now there is no problem with the MPI calls any more.

Maybe you have an idea why it didn't work with those private variables? But
- well, if not, there is no problem any more anyway (although I don't know
why). ;)

Best regards

Michael



______
Dipl.-Ing. Michael Mauersberger
michael.mauersber...@tu-dresden.de
Tel +49 351 463-38099 | Fax +49 351 463-37263
Technische Universität Dresden
Institut für Luft- und Raumfahrttechnik / Institute of Aerospace Engineering
Professur für Luftfahrzeugtechnik / Chair of Aircraft Engineering
Prof. Dr. K. Wolf | 01062 Dresden | tu-dresden.de/ilr/lft

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Reuti
Sent: Tuesday, 24 October 2017 13:09
To: Open MPI Users 
Subject: Re: [OMPI users] Vague error message while executing MPI-Fortran
program

Hi,

> Am 24.10.2017 um 09:33 schrieb Michael Mauersberger
:
> 
>  
>  
> Dear all,
>  
> When compiling and running a Fortran program on Linux (OpenSUSE Leap 42.3)
I get an undefinable error message stating that some "Boundary Run-Time
Check Failure" occurred for variable "ARGBLOCK_0.0.2". But this variable I
don't know or use in my code and the compiler is tracing me back to the line
of a "CONTAINS" statement in a module.

A `strings * | grep ARGBLOCK` in
/opt/intel/compilers_and_libraries_2017.4.196/linux/bin/intel64 reveals:

ARGBLOCK_%d
ARGBLOCK_REC_%d

So it looks like the output is generated on-the-fly and doesn't point to any
existing variable. But to which argument of which routine is still unclear.
Does the Intel Compile have the feature to output a cross-refernce of all
used variables? Maybe it's listed there.

-- Reuti


> I am using the Intel Fortran Compiler from Intel Composer XE 2013 with the
following Options:
> ifort -fPIC -g -traceback -O2 -check all,noarg_temp_created -warn all
>  
> Furthermore, the program uses Intel MKL with the functions DGETRF, 
> DGETRS, DSYGV, DGEMM, DGGEV and the C-Library NLopt.
>  
> The complete error message looks like:
>  
> Boundary Run-Time Check Failure for variable 'ARGBLOCK_0.0.2'
>  
> forrtl: error (76): Abort trap signal
> Image  PCRoutineLineSource

> libc.so.6  7F2BF06CC8D7  Unknown   Unknown
Unknown
> libc.so.6  7F2BF06CDCAA  Unknown   Unknown
Unknown
> geops  006A863F  Unknown   Unknown
Unknown
> libmodell.so   7F2BF119E54D  strukturtest_mod_ 223
strukturtest_mod.f90
> libmodell.so   7F2BF1184056  modell_start_ 169
modell_start.f90
> geops  0045D1A3  Unknown   Unknown
Unknown
> geops  0042C2C6  Unknown   Unknown
Unknown
> geops  0040A14C  Unknown   Unknown
Unknown
> libc.so.6  7F2BF06B86E5  Unknown   Unknown
Unknown
> geops  0040A049  Unknown   Unknown
Unknown
>

===
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 134
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ==
> = YOUR APPLICATION TERMINATED WITH THE EXIT STRING: 
> Aborted (signal 6) This typically refers to a problem with your 
> application.
> Please see the FAQ page for debugging suggestions
>  
>  
> The program has the following structure:
> - basic functions linked into static library (*.a), containing only 
> modules --> using MKL routines
> - main program linked into a dynamic library, containing 1 bare 
> subroutine, otherwise modules
> - calling program (executed with mpiexec), calls mentioned subroutine 
> in main program
>  
> Without the calling program (in Open MPI) the subroutine runs without
problems. But when invoking it with the MPI program I get the error message
above.
>  
> So maybe some of you encountered a similar problem and is able to help me.
I would be really grateful.
>  
> Thanks,

[OMPI users] Vague error message while executing MPI-Fortran program

2017-10-24 Thread Michael Mauersberger


Dear all,

When compiling and running a Fortran program on Linux (OpenSUSE Leap 42.3) I 
get an undefinable error message stating that some "Boundary Run-Time Check 
Failure" occurred for variable "ARGBLOCK_0.0.2". But this variable I don't know 
or use in my code and the compiler is tracing me back to the line of a 
"CONTAINS" statement in a module.

I am using the Intel Fortran Compiler from Intel Composer XE 2013 with the 
following Options:
ifort -fPIC -g -traceback -O2 -check all,noarg_temp_created -warn all

Furthermore, the program uses Intel MKL with the functions
DGETRF, DGETRS, DSYGV, DGEMM, DGGEV
and the C-Library NLopt.

The complete error message looks like:

Boundary Run-Time Check Failure for variable 'ARGBLOCK_0.0.2'

forrtl: error (76): Abort trap signal
Image  PCRoutineLineSource
libc.so.6  7F2BF06CC8D7  Unknown   Unknown  Unknown
libc.so.6  7F2BF06CDCAA  Unknown   Unknown  Unknown
geops  006A863F  Unknown   Unknown  Unknown
libmodell.so   7F2BF119E54D  strukturtest_mod_ 223  
strukturtest_mod.f90
libmodell.so   7F2BF1184056  modell_start_ 169  
modell_start.f90
geops  0045D1A3  Unknown   Unknown  Unknown
geops  0042C2C6  Unknown   Unknown  Unknown
geops  0040A14C  Unknown   Unknown  Unknown
libc.so.6  7F2BF06B86E5  Unknown   Unknown  Unknown
geops  0040A049  Unknown   Unknown  Unknown
===
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions


The program has the following structure:
- basic functions linked into static library (*.a), containing only modules --> 
using MKL routines
- main program linked into a dynamic library, containing 1 bare subroutine, 
otherwise modules
- calling program (executed with mpiexec), calls mentioned subroutine in main 
program

Without the calling program (in Open MPI) the subroutine runs without problems. 
But when invoking it with the MPI program I get the error message above.

So maybe some of you encountered a similar problem and is able to help me. I 
would be really grateful.

Thanks,

Michael

___

Dipl.-Ing. Michael Mauersberger<mailto:michael.mauersber...@tu-dresden.de>
Tel. +49 351 463 38099 | Fax +49 351 463 37263
Marschnerstraße 30, 01307 Dresden
Professur für Luftfahrzeugtechnik | Prof. Dr. Klaus 
Wolf<mailto:luftfahrzeugtechnik@tu-dresden.de>
Institut für Luft- und Raumfahrttechnik | Fakultät 
Maschinenwesen
Technische Universität Dresden

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] openmpi mgmt traffic

2017-10-11 Thread Michael Di Domenico
my cluster nodes are connected on 1g ethernet eth0/eth1 and via
infiniband rdma and ib0

my understanding is that openmpi will detect all these interfaces.
using eth0/eth1 for connection setup and use rdma for msg passing

what would be the appropriate command line parameters to tell
openmpi to use ipoib for connection setup and rdma for message passing?

i effectively want to ignore the 1g ethernet connections for anything
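
i was guessing at something along these lines, if i have the parameter names
right (untested):

mpirun --mca oob_tcp_if_include ib0 --mca btl_tcp_if_include ib0 ...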
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] alltoallv

2017-10-10 Thread Michael Di Domenico
i'm getting stuck trying to run some fairly large IMB-MPI alltoall
tests under openmpi 2.0.2 on rhel 7.4

i have two different clusters, one running mellanox fdr10 and one
running qlogic qdr

if i issue

mpirun -n 1024 ./IMB-MPI1 -npmin 1024 -iter 1 -mem 2.001 alltoallv

the job just stalls after the "List of Benchmarks to run: Alltoallv"
line outputs from IMB-MPI

if i switch it to alltoall the test does progress

often when running various size alltoall's i'll get

"too many retries sending message to <>:<>, giving up

i'm able to use infiniband just fine (our lustre filesystem mounts
over it) and i have other mpi programs running

it only seems to stem when i run alltoall type primitives

any thoughts on debugging where the failures are, i might just need to
turn up the debugging, but i'm not sure where
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-20 Thread Michael Thomadakis
This discussion started getting into an interesting question: ABI
standardization for portability by language. It makes sense to have ABI
standardization for portability of objects across environments. At the same
time it does mean that everyone follows the exact same recipe for low-level
implementation details, which may be unnecessarily restrictive at times.

On Wed, Sep 20, 2017 at 4:45 PM, Jeff Hammond 
wrote:

>
>
> On Wed, Sep 20, 2017 at 5:55 AM, Dave Love 
> wrote:
>
>> Jeff Hammond  writes:
>>
>> > Please separate C and C++ here. C has a standard ABI.  C++ doesn't.
>> >
>> > Jeff
>>
>> [For some value of "standard".]  I've said the same about C++, but the
>> current GCC manual says its C++ ABI is "industry standard", and at least
>> Intel document compatibility with recent GCC on GNU/Linux.  It's
>> standard enough to have changed for C++11 (?), with resulting grief in
>> package repos, for instance.
>>
>
> I may have used imprecise language.  As a matter of practice, I switch C
> compilers all the time without recompiling MPI and life is good.  Switching
> between Clang with libc++ and GCC with libstd++ does not produce happiness.
>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-18 Thread Michael Thomadakis
OMP is yet another source of incompatibility between GNU and Intel
environments. So compiling, say, Fortran OMP code into a library and trying
to link it with Intel Fortran code just aggravates the problem.
Michael

On Mon, Sep 18, 2017 at 7:35 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Even if i do not fully understand the question, keep in mind Open MPI
> does not use OpenMP, so from that point of view, Open MPI is
> independant of the OpenMP runtime.
>
> Let me emphasize on what Jeff already wrote : use different installs
> of Open MPI (and you can use modules or lmod in order to choose
> between them easily) and always use the compilers that were used to
> build Open MPI. This is mandatory if you use Fortran bindings (use mpi
> and use mpi_f08), and you'd better keep yourself out of trouble with
> C/C++ and mpif.h
>
> Cheers,
>
> Gilles
>
> On Tue, Sep 19, 2017 at 5:57 AM, Michael Thomadakis
>  wrote:
> > Thanks for the note. How about OMP runtimes though?
> >
> > Michael
> >
> > On Mon, Sep 18, 2017 at 3:21 PM, n8tm via users <
> users@lists.open-mpi.org>
> > wrote:
> >>
> >> On Linux and Mac, Intel c and c++ are sufficiently compatible with gcc
> and
> >> g++ that this should be possible.  This is not so for Fortran libraries
> or
> >> Windows.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Sent via the Samsung Galaxy S8 active, an AT&T 4G LTE smartphone
> >>
> >>  Original message 
> >> From: Michael Thomadakis 
> >> Date: 9/18/17 3:51 PM (GMT-05:00)
> >> To: users@lists.open-mpi.org
> >> Subject: [OMPI users] Question concerning compatibility of languages
> used
> >> with building OpenMPI and languages OpenMPI uses to build MPI binaries.
> >>
> >> Dear OpenMPI list,
> >>
> >> As far as I know, when we build OpenMPI itself with GNU or Intel
> compilers
> >> we expect that the subsequent MPI application binary will use the same
> >> compiler set and run-times.
> >>
> >> Would it be possible to build OpenMPI with the GNU tool chain but then
> >> subsequently instruct the OpenMPI compiler wrappers to use the Intel
> >> compiler set? Would there be any issues with compiling C++ / Fortran or
> >> corresponding OMP codes ?
> >>
> >> In general, what is clean way to build OpenMPI with a GNU compiler set
> but
> >> then instruct the wrappers to use Intel compiler set?
> >>
> >> Thanks!
> >> Michael
> >>
> >> ___
> >> users mailing list
> >> users@lists.open-mpi.org
> >> https://lists.open-mpi.org/mailman/listinfo/users
> >
> >
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-18 Thread Michael Thomadakis
Hello OpenMPI team,

Thank you for the insightful feedback. I am not claiming in any way that it
is a meaningful practice to build the OpenMPI stack with one compiler and
then just try to convince / force it to use another compilation environment
to build MPI applications. There are occasions, though, where one *may only
have an OpenMPI* stack built, say, by GNU compilers but, for efficiency of
the resulting MPI applications, would like to use Intel / PGI
compilers with that same OpenMPI stack to compile MPI applications.

It is too much unnecessary trouble to use the same MPI stack with different
compilation environments.
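
(concretely, what i had in mind was just redirecting the wrappers via the
environment, along these lines -- assuming i have the variable names right:

export OMPI_CC=icc
export OMPI_CXX=icpc
export OMPI_FC=ifort
mpicc -c foo.c

but i take the point that this is asking for trouble with the Fortran
bindings.)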

Thank you,
Michael

On Mon, Sep 18, 2017 at 7:35 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Even if i do not fully understand the question, keep in mind Open MPI
> does not use OpenMP, so from that point of view, Open MPI is
> independant of the OpenMP runtime.
>
> Let me emphasize on what Jeff already wrote : use different installs
> of Open MPI (and you can use modules or lmod in order to choose
> between them easily) and always use the compilers that were used to
> build Open MPI. This is mandatory if you use Fortran bindings (use mpi
> and use mpi_f08), and you'd better keep yourself out of trouble with
> C/C++ and mpif.h
>
> Cheers,
>
> Gilles
>
> On Tue, Sep 19, 2017 at 5:57 AM, Michael Thomadakis
>  wrote:
> > Thanks for the note. How about OMP runtimes though?
> >
> > Michael
> >
> > On Mon, Sep 18, 2017 at 3:21 PM, n8tm via users <
> users@lists.open-mpi.org>
> > wrote:
> >>
> >> On Linux and Mac, Intel c and c++ are sufficiently compatible with gcc
> and
> >> g++ that this should be possible.  This is not so for Fortran libraries
> or
> >> Windows.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Sent via the Samsung Galaxy S8 active, an AT&T 4G LTE smartphone
> >>
> >>  Original message 
> >> From: Michael Thomadakis 
> >> Date: 9/18/17 3:51 PM (GMT-05:00)
> >> To: users@lists.open-mpi.org
> >> Subject: [OMPI users] Question concerning compatibility of languages
> used
> >> with building OpenMPI and languages OpenMPI uses to build MPI binaries.
> >>
> >> Dear OpenMPI list,
> >>
> >> As far as I know, when we build OpenMPI itself with GNU or Intel
> compilers
> >> we expect that the subsequent MPI application binary will use the same
> >> compiler set and run-times.
> >>
> >> Would it be possible to build OpenMPI with the GNU tool chain but then
> >> subsequently instruct the OpenMPI compiler wrappers to use the Intel
> >> compiler set? Would there be any issues with compiling C++ / Fortran or
> >> corresponding OMP codes ?
> >>
> >> In general, what is clean way to build OpenMPI with a GNU compiler set
> but
> >> then instruct the wrappers to use Intel compiler set?
> >>
> >> Thanks!
> >> Michael
> >>
> >> ___
> >> users mailing list
> >> users@lists.open-mpi.org
> >> https://lists.open-mpi.org/mailman/listinfo/users
> >
> >
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-18 Thread Michael Thomadakis
Thanks for the note. How about OMP runtimes though?

Michael

On Mon, Sep 18, 2017 at 3:21 PM, n8tm via users 
wrote:

> On Linux and Mac, Intel c and c++ are sufficiently compatible with gcc and
> g++ that this should be possible.  This is not so for Fortran libraries or
> Windows.
>
>
>
>
>
>
> Sent via the Samsung Galaxy S8 active, an AT&T 4G LTE smartphone
>
> ---- Original message 
> From: Michael Thomadakis 
> Date: 9/18/17 3:51 PM (GMT-05:00)
> To: users@lists.open-mpi.org
> Subject: [OMPI users] Question concerning compatibility of languages used
> with building OpenMPI and languages OpenMPI uses to build MPI binaries.
>
> Dear OpenMPI list,
>
> As far as I know, when we build OpenMPI itself with GNU or Intel compilers
> we expect that the subsequent MPI application binary will use the same
> compiler set and run-times.
>
> Would it be possible to build OpenMPI with the GNU tool chain but then
> subsequently instruct the OpenMPI compiler wrappers to use the Intel
> compiler set? Would there be any issues with compiling C++ / Fortran or
> corresponding OMP codes ?
>
> In general, what is clean way to build OpenMPI with a GNU compiler set but
> then instruct the wrappers to use Intel compiler set?
>
> Thanks!
> Michael
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-18 Thread Michael Thomadakis
Dear OpenMPI list,

As far as I know, when we build OpenMPI itself with GNU or Intel compilers
we expect that the subsequent MPI application binary will use the same
compiler set and run-times.

Would it be possible to build OpenMPI with the GNU tool chain but then
subsequently instruct the OpenMPI compiler wrappers to use the Intel
compiler set? Would there be any issues with compiling C++ / Fortran or
corresponding OMP codes ?

In general, what is a clean way to build OpenMPI with a GNU compiler set but
then instruct the wrappers to use Intel compiler set?

Thanks!
Michael
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] disable slurm/munge from mpirun

2017-06-23 Thread Michael Di Domenico
On Thu, Jun 22, 2017 at 12:41 PM, r...@open-mpi.org  wrote:
> I gather you are using OMPI 2.x, yes? And you configured it 
> --with-pmi=, then moved the executables/libs to your 
> workstation?

correct

> I suppose I could state the obvious and say “don’t do that - just rebuild it”

correct...  but bummer...  so much for being lazy...

> and I fear that (after checking the 2.x code) you really have no choice. OMPI 
> v3.0 will have a way around the problem, but not the 2.x series.
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] disable slurm/munge from mpirun

2017-06-22 Thread Michael Di Domenico
On Thu, Jun 22, 2017 at 10:43 AM, John Hearns via users
 wrote:
> Having had some problems with ssh launching (a few minutes ago) I can
> confirm that this works:
>
> --mca plm_rsh_agent "ssh -v"

this doesn't do anything for me

if i set OMPI_MCA_sec=^munge

i can clear the mca_sec_munge error

but the mca_pmix_pmix112 and opal_pmix_base_select errors still
exists.  the plm_rsh_agent switch/env var doesn't seem to affect that
error

down the road, i may still need the rsh_agent flag, but i think we're
still before that in the sequence of events
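
for completeness, what i have exported at the moment (per the earlier
suggestions) is:

export OMPI_MCA_plm=rsh
export OMPI_MCA_plm_rsh_agent="ssh -v"
export OMPI_MCA_sec=^munge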
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] disable slurm/munge from mpirun

2017-06-22 Thread Michael Di Domenico
that took care of one of the errors, but i missed a re-type on the second error

mca_base_component_repository_open: unable to open mca_pmix_pmix112:
libmunge missing

and the opal_pmix_base_select error is still there (which is what's
actually halting my job)



On Thu, Jun 22, 2017 at 10:35 AM, r...@open-mpi.org  wrote:
> You can add "OMPI_MCA_plm=rsh OMPI_MCA_sec=^munge” to your environment
>
>
> On Jun 22, 2017, at 7:28 AM, John Hearns via users
>  wrote:
>
> Michael,  try
>  --mca plm_rsh_agent ssh
>
> I've been fooling with this myself recently, in the contect of a PBS cluster
>
> On 22 June 2017 at 16:16, Michael Di Domenico 
> wrote:
>>
>> is it possible to disable slurm/munge/psm/pmi(x) from the mpirun
>> command line or (better) using environment variables?
>>
>> i'd like to use the installed version of openmpi i have on a
>> workstation, but it's linked with slurm from one of my clusters.
>>
>> mpi/slurm work just fine on the cluster, but when i run it on a
>> workstation i get the below errors
>>
>> mca_base_component_repositoy_open: unable to open mca_sec_munge:
>> libmunge missing
>> ORTE_ERROR_LOG Not found in file ess_hnp_module.c at line 648
>> opal_pmix_base_select failed
>> returned value not found (-13) instead of orte_success
>>
>> there's probably a magical incantation of mca parameters, but i'm not
>> adept enough at determining what they are
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] disable slurm/munge from mpirun

2017-06-22 Thread Michael Di Domenico
is it possible to disable slurm/munge/psm/pmi(x) from the mpirun
command line or (better) using environment variables?

i'd like to use the installed version of openmpi i have on a
workstation, but it's linked with slurm from one of my clusters.

mpi/slurm work just fine on the cluster, but when i run it on a
workstation i get the below errors

mca_base_component_repositoy_open: unable to open mca_sec_munge:
libmunge missing
ORTE_ERROR_LOG Not found in file ess_hnp_module.c at line 648
opal_pmix_base_select failed
returned value not found (-13) instead of orte_success

there's probably a magical incantation of mca parameters, but i'm not
adept enough at determining what they are
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-25 Thread Michael Di Domenico
On Mon, Jul 25, 2016 at 4:53 AM, Gilles Gouaillardet  wrote:
>
> as a workaround, you can configure without -noswitcherror.
>
> after you ran configure, you have to manually patch the generated 'libtool'
> file and add the line with pgcc*) and the next line like this :
>
> /* if pgcc is used, libtool does *not* pass -pthread to pgcc any more */
>
>
># Convert "-framework foo" to "foo.ltframework"
> # and "-pthread" to "-Wl,-pthread" if NAG compiler
> if test -n "$inherited_linker_flags"; then
>   case "$CC" in
> nagfor*)
>   tmp_inherited_linker_flags=`$ECHO "$inherited_linker_flags" |
> $SED 's/-framework \([^ $]*\)/\1.ltframework/g' | $SED
> 's/-pthread/-Wl,-pthread/g'`;;
> pgcc*)
>   tmp_inherited_linker_flags=`$ECHO "$inherited_linker_flags" |
> $SED 's/-framework \([^ $]*\)/\1.ltframework/g' | $SED 's/-pthread//g'`;;
> *)
>   tmp_inherited_linker_flags=`$ECHO "$inherited_linker_flags" |
> $SED 's/-framework \([^ $]*\)/\1.ltframework/g'`;;
>   esac
>
>
> i guess the right way is to patch libtool so it passes -noswitcherror to $CC
> and/or $LD, but i was not able to achieve that yet.


Thanks.  I managed to work around the issue by hand-compiling the
single module that failed during the build process.  but something is
definitely amiss in the openmpi build system when it comes to pgi


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-22 Thread Michael Di Domenico
So, the -noswitcherror is partially working.  I added the switch into
my configure line LDFLAGS param.  I can see the parameter being passed
to libtool, but for some reason libtool is refusing to pass it
along at compile time.

if i sh -x the libtool command line, i can see it set in a few
variables, but at the end when it evals the compile line for pgcc the
option is missing.

if i cut and paste the eval line and put it back in by hand, the library
compiles with a pgcc warning instead of an error, which i believe is what
i want, but i'm not sure why libtool is dropping the switch
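
for reference, the relevant bit of my configure invocation is just something
like the below (other options trimmed):

./configure CC=pgcc LDFLAGS="-noswitcherror" ...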



On Tue, Jul 19, 2016 at 5:27 AM, Sylvain Jeaugey  wrote:
> As a workaround, you can also try adding -noswitcherror to PGCC flags.
>
> On 07/11/2016 03:52 PM, Åke Sandgren wrote:
>>
>> Looks like you are compiling with slurm support.
>>
>> If so, you need to remove the "-pthread" from libslurm.la and libpmi.la
>>
>> On 07/11/2016 02:54 PM, Michael Di Domenico wrote:
>>>
>>> I'm trying to get openmpi compiled using the PGI compiler.
>>>
>>> the configure goes through and the code starts to compile, but then
>>> gets hung up with
>>>
>>> entering: openmpi-1.10.2/opal/mca/common/pmi
>>> CC common_pmi.lo
>>> CCLD libmca_common_pmi.la
>>> pgcc-Error-Unknown switch: - pthread
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/07/29635.php
>>>
>
> ---
> This email message is for the sole use of the intended recipient(s) and may
> contain
> confidential information.  Any unauthorized review, use, disclosure or
> distribution
> is prohibited.  If you are not the intended recipient, please contact the
> sender by
> reply email and destroy all copies of the original message.
> ---
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/07/29692.php


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-14 Thread Michael Di Domenico
On Mon, Jul 11, 2016 at 9:52 AM, Åke Sandgren  wrote:
> Looks like you are compiling with slurm support.
>
> If so, you need to remove the "-pthread" from libslurm.la and libpmi.la

i don't see a configure option in slurm to disable pthreads, so i'm
not sure this is possible.
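
if i understand the suggestion right, it means editing the generated .la
files rather than reconfiguring slurm, i.e. something like the below
(untested on my side; paths are placeholders):

sed -i.bak 's/ -pthread//g' /path/to/slurm/lib/libslurm.la /path/to/slurm/lib/libpmi.la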


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-14 Thread Michael Di Domenico
On Thu, Jul 14, 2016 at 9:47 AM, Michael Di Domenico
 wrote:
> Have 1.10.3 unpacked, ran through the configure using the same command
> line options as 1.10.2
>
> but it fails even earlier in the make process at
>
> Entering openmpi-1.10.3/opal/asm
> CPPAS atomic-asm.lo
> This licensed Software was made available from Nvidia Corportation
> under a time-limited beta license the beta license expires on jun 1 2015
> any attempt to use this product after jun 1 2015 is a violation of the terms
> of the PGI end user license agreement.

sorry, i take this back, i accidentally used PGI 15.3 compiler instead of 15.9

using 15.9 i get the same -pthread error from the slurm_pmi library.


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-14 Thread Michael Di Domenico
Have 1.10.3 unpacked, ran through the configure using the same command
line options as 1.10.2

but it fails even earlier in the make process at

Entering openmpi-1.10.3/opal/asm
CPPAS atomic-asm.lo
This licensed Software was made available from Nvidia Corportation
under a time-limited beta license the beta license expires on jun 1 2015
any attempt to use this product after jun 1 2015 is a violation of the terms
of the PGI end user license agreement.





On Mon, Jul 11, 2016 at 9:11 AM, Gilles Gouaillardet
 wrote:
> Can you try the latest 1.10.3 instead ?
>
> btw, do you have a license for the pgCC C++ compiler ?
> fwiw, FreePGI on OSX has no C++ license and PGI C and gnu g++ does not work
> together out of the box, hopefully I will have a fix ready sometimes this
> week
>
> Cheers,
>
> Gilles
>
>
> On Monday, July 11, 2016, Michael Di Domenico 
> wrote:
>>
>> I'm trying to get openmpi compiled using the PGI compiler.
>>
>> the configure goes through and the code starts to compile, but then
>> gets hung up with
>>
>> entering: openmpi-1.10.2/opal/mca/common/pmi
>> CC common_pmi.lo
>> CCLD libmca_common_pmi.la
>> pgcc-Error-Unknown switch: - pthread
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/07/29635.php
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/07/29636.php


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-11 Thread Michael Di Domenico
On Mon, Jul 11, 2016 at 9:11 AM, Gilles Gouaillardet
 wrote:
> Can you try the latest 1.10.3 instead ?

i can but it'll take a few days to pull the software inside.

> btw, do you have a license for the pgCC C++ compiler ?
> fwiw, FreePGI on OSX has no C++ license and PGI C and gnu g++ does not work
> together out of the box, hopefully I will have a fix ready sometimes this
> week

we should, but i'm not positive.  we're running PGI on linux x64, we
typically buy the full suite, but i'll double check.


[OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-11 Thread Michael Di Domenico
I'm trying to get openmpi compiled using the PGI compiler.

the configure goes through and the code starts to compile, but then
gets hung up with

entering: openmpi-1.10.2/opal/mca/common/pmi
CC common_pmi.lo
CCLD libmca_common_pmi.la
pgcc-Error-Unknown switch: - pthread
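
one workaround sometimes used for this is to tell pgcc to ignore switches it
does not recognize.  a sketch, assuming your PGI release supports the
-noswitcherror flag (it downgrades unknown switches such as -pthread to a
warning); the other configure arguments are placeholders:

./configure CC="pgcc -noswitcherror" CXX=pgCC FC=pgfortran --with-slurm ...
make && make install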


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Thu, Mar 17, 2016 at 12:15 PM, Cabral, Matias A
 wrote:
> I was looking for lines like" [nodexyz:17085] selected cm best priority 40" 
> and  " [nodexyz:17099] select: component psm selected"

this may have turned up more than i expected.  i recompiled openmpi
v1.8.4 as a test and reran the tests, which seemed to run just fine.
looking at the debug output, i can clearly see a difference in the psm
calls.  i performed the same test using 1.10.2 and it works as well.

i've sent a msg off to the user to have him rerun and see where we're at.

i suspect my system level compile of openmpi might be all screwed up
with concern for psm.  i didn't see anything off in the configure
output, but i must have missed something.  i'll report back


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Thu, Mar 17, 2016 at 12:52 PM, Jeff Squyres (jsquyres)
 wrote:
> Can you send all the information listed here?
>
> https://www.open-mpi.org/community/help/
>
> (including the full output from the run with the PML/BTL/MTL/etc. verbosity)
>
> This will allow Matias to look through all the relevant info, potentially 
> with fewer back-n-forth emails.

Understood, but unfortunately i cannot pull large dumps from the
system; it's isolated.


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Thu, Mar 17, 2016 at 12:15 PM, Cabral, Matias A
 wrote:
> I was looking for lines like" [nodexyz:17085] selected cm best priority 40" 
> and  " [nodexyz:17099] select: component psm selected"

i see cm best priority 20, which seems to relate to ob1 being
selected.  i don't see a mention of psm anywhere (i am NOT doing --mca
mtl ^psm), but i did compile openmpi with psm support
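
one way to make the failure explicit, rather than silently falling back to
ob1/openib, is to force the cm pml and psm mtl that were suggested earlier in
the thread.  a sketch, reusing the rank count from the failing test:

# if psm cannot initialize, this should now fail loudly instead of running over openib
mpirun -n 512 --mca pml cm --mca mtl psm ./IMB-MPI1 alltoallv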


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 4:49 PM, Cabral, Matias A
 wrote:
> I didn't go into the code to see who is actually calling this error message, 
> but I suspect this may be a generic error for "out of memory" kind of thing 
> and not specific to the que pair. To confirm please add  -mca 
> pml_base_verbose 100 and add  -mca mtl_base_verbose 100  to see what is being 
> selected.

this didn't spit out anything overly useful, just lots of lines

[node001:00909] mca: base: components_register: registering pml components
[node001:00909] mca: base: components_register: found loaded component v
[node001:00909] mca: base: components_register: component v register
function successful
[node001:00909] mca: base: components_register: found loaded component bfo
[node001:00909] mca: base: components_register: component bfo register
function successful
[node001:00909] mca: base: components_register: found loaded component cm
[node001:00909] mca: base: components_register: component cm register
function successful
[node001:00909] mca: base: components_register: found loaded component ob1
[node001:00909] mca: base: components_register: component ob1 register
function successful

> I'm trying to remember some details of IMB  and alltoallv to see if it is 
> indeed requiring more resources that the other micro benchmarks.

i'm using IMB for my tests, but this issue came up because a
researcher isn't able to run large alltoall codes, so i don't believe
it's specific to IMB

> BTW, did you confirm the limits setup? Also do the nodes have all the same 
> amount of mem?

yes, all nodes have the limits set to unlimited and each node has
256GB of memory
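
for reference, the full command line with the verbosity Matias suggested, piped
through grep so only the selection lines survive (the grep pattern is just
illustrative):

mpirun -n 512 --mca pml_base_verbose 100 --mca mtl_base_verbose 100 \
    ./IMB-MPI1 alltoallv 2>&1 | grep -Ei "select|priority"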


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A
 wrote:
> Hi Michael,
>
> I may be missing some context, if you are using the qlogic cards you will 
> always want to use the psm mtl (-mca pml cm -mca mtl psm) and not openib btl. 
> As Tom suggest, confirm the limits are setup on every node: could it be the 
> alltoall is reaching a node that "others" are not? Please share the command 
> line and the error message.



Yes, under normal circumstances, I use PSM.  i only disabled it to see
whether doing so made any difference.

the test i'm running is

mpirun -n 512 ./IMB-MPI1 alltoallv

when the system gets to 128 ranks, it freezes and errors out with

---

A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node001
Local device:   qib0
Queue pair type:Reliable connected (RC)

---

i've also tried various nodes across the cluster (200+).  i think i
ruled out errant switch (qlogic single 12800-120) problems, bad
cables, and bad nodes.  that's not to say they may not be present,
i've just not been able to find them


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 12:12 PM, Elken, Tom  wrote:
> Hi Mike,
>
> In this file,
> $ cat /etc/security/limits.conf
> ...
> < do you see at the end ... >
>
> * hard memlock unlimited
> * soft memlock unlimited
> # -- All InfiniBand Settings End here --
> ?

Yes.  I double checked that it's set on all compute nodes in the
actual file and through the ulimit command
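
one extra check that sometimes catches limit problems: read the limit from
inside processes started by the launcher itself, since a daemon started outside
a login shell can miss limits.conf.  a sketch (rank count and mapping are just
illustrative):

# every line should print "unlimited"; a finite number on some node points at
# how the daemon on that node was started
mpirun -n 16 --map-by node bash -c 'echo "$(hostname): $(ulimit -l)"'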


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico
 wrote:
> when i try to run an openmpi job with >128 ranks (16 ranks per node)
> using alltoall or alltoallv, i'm getting an error that the process was
> unable to get a queue pair.
>
> i've checked the max locked memory settings across my machines;
>
> using ulimit -l in and outside of mpirun and they're all set to unlimited
> pam modules to ensure pam_limits.so is loaded and working
> the /etc/security/limits.conf is set for soft/hard mem to unlimited
>
> i tried a couple of quick mpi config settings i could think of;
>
> -mca mtl ^psm no effect
> -mca btl_openib_flags 1 no effect
>
> the openmpi faq says to tweak some mtt values in /sys, but since i'm
> not on mellanox that doesn't apply to me
>
> the machines are rhel 6.7, kernel 2.6.32-573.12.1(with bundled ofed),
> running on qlogic single-port infiniband cards, psm is enabled
>
> other collectives seem to run okay, it seems to only be alltoall comms
> that fail and only at scale
>
> i believe (but can't prove) that this worked at one point, but i can't
> recall when i last tested it.  so it's reasonable to assume that some
> change to the system is preventing this.
>
> the question is, where should i start poking to find it?

bump?


[OMPI users] locked memory and queue pairs

2016-03-10 Thread Michael Di Domenico
when i try to run an openmpi job with >128 ranks (16 ranks per node)
using alltoall or alltoallv, i'm getting an error that the process was
unable to get a queue pair.

i've checked the max locked memory settings across my machines;

using ulimit -l in and outside of mpirun and they're all set to unlimited
pam modules to ensure pam_limits.so is loaded and working
the /etc/security/limits.conf is set for soft/hard mem to unlimited

i tried a couple of quick mpi config settings i could think of;

-mca mtl ^psm no effect
-mca btl_openib_flags 1 no effect

the openmpi faq says to tweak some mtt values in /sys, but since i'm
not on mellanox that doesn't apply to me

the machines are rhel 6.7, kernel 2.6.32-573.12.1(with bundled ofed),
running on qlogic single-port infiniband cards, psm is enabled

other collectives seem to run okay, it seems to only be alltoall comms
that fail and only at scale

i believe (but can't prove) that this worked at one point, but i can't
recall when i last tested it.  so it's reasonable to assume that some
change to the system is preventing this.

the question is, where should i start poking to find it?


Re: [OMPI users] Invalid read of size 4 (Valgrind error) with OpenMPI 1.8.7

2015-09-28 Thread Schlottke-Lakemper, Michael
Sorry for the long delay.

Unfortunately, I am no longer able to reproduce the Valgrind errors I reported 
earlier with either the debug version or the normally-compiled version of  OMPI 
1.8.7. I don’t know what happened - probably some change to our cluster 
infrastructure that I am not aware of and that I am not able to track down. 
Sorry for having wasted your collective time on this; if this error should 
arise again, I will try to get a proper Valgrind report with -enable-debug and 
report it here.

Michael

> On 30 Jul 2015, at 22:10 , Nathan Hjelm  wrote:
> 
> 
> I agree with Ralph. Please run again with --enable-debug. That will give
> more information (line number) on where the error is occuring.
> 
> Looking at the function in question the only place I see that could be
> causing this warning is the call to strlen. Some implementations of
> strlen use operate on larger chunks (4 or 8 bytes). This will make
> valgrind unhappy but does not make the implementation invalid as no read
> will cross a page boundary (so no SEGV). One example of such a strlen
> implementation is the one used by icc which uses vector operations on
> 8-byte chunks of the string.
> 
> -Nathan
> 
> On Wed, Jul 29, 2015 at 07:58:09AM -0700, Ralph Castain wrote:
>>   If you have the time, it would be helpful. You might also configure
>>   -enable-debug.
>>   Meantime, I can take another gander to see how it could happen - looking
>>   at the code, it sure seems impossible, but maybe there is some strange
>>   path that would break it.
>> 
>> On Jul 29, 2015, at 6:29 AM, Schlottke-Lakemper, Michael
>>  wrote:
>> If it is helpful, I can try to compile OpenMPI with debug information
>> and get more details on the reported error. However, it would be good if
>> someone could tell me the necessary compile flags (on top of -O0 -g) and
>> it would take me probably 1-2 weeks to do it.
>> Michael
>> 
>>  Original message 
>> From: Gilles Gouaillardet 
>> Date: 29/07/2015 14:17 (GMT+01:00)
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] Invalid read of size 4 (Valgrind error) with
>> OpenMPI 1.8.7
>> 
>> Thomas,
>> can you please elaborate ?
>> I checked the code of opal_os_dirpath_create and could not find where
>> such a thing can happen
>> Thanks,
>> Gilles
>> On Wednesday, July 29, 2015, Thomas Jahns  wrote:
>> 
>>   Hello,
>> 
>>   On 07/28/15 17:34, Schlottke-Lakemper, Michael wrote:
>> 
>> That's what I suspected. Thank you for your confirmation.
>> 
>>   you are mistaken, the allocation is 51 bytes long, i.e. valid bytes
>>   are at offsets 0 to 50. But since the read of 4 bytes starts at offset
>>   48, the bytes at offsets 48, 49, 50 and 51 get read, the last of which
>>   is illegal. It probably does no harm at the moment in practice,
>>   because virtually all allocators always add some padding to the next
>>   multiple of some power of 2. But still this means the program is
>>   incorrect in terms of any programming language definition involved
>>   (might be C, C++ or Fortran).
>> 
>>   Regards, Thomas
>> 
>>   On 25 Jul 2015, at 16:10 , Ralph Castain >   <mailto:r...@open-mpi.org>> wrote:
>> 
>>   Looks to me like a false positive - we do malloc some space, and
>>   do access
>>   different parts of it. However, it looks like we are inside the
>>   space at all
>>   times.
>> 
>>   I'd suppress it
>> 
>> On Jul 23, 2015, at 12:47 AM, Schlottke-Lakemper, Michael
>> > <mailto:m.schlottke-lakem...@aia.rwth-aachen.de>> wrote:
>> 
>> Hi folks,
>> 
>> recently we've been getting a Valgrind error in PMPI_Init for
>> our suite of
>> regression tests:
>> 
>> ==5922== Invalid read of size 4
>> ==5922==at 0x61CC5C0: opal_os_dirpath_create (in
>> /aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
>> ==5922==by 0x5F207E5: orte_session_dir (in
>> /aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
>> ==5922==by 0x5F34F04: orte_ess_base_app_setup (in
>> /aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
>> ==5922==by 0x7E96679: rte_init (in
>> /aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so

Re: [OMPI users] Oversubscription disabled by default in OpenMPI 1.8.7

2015-08-14 Thread Schlottke-Lakemper, Michael
Hi Ralph,

Thanks a lot for the fast reply and the clarification. We’ve re-added the 
parameter to our MCA site configuration file.

Michael

On 14 Aug 2015, at 15:00 , Ralph Castain 
mailto:r...@open-mpi.org>> wrote:

During the 1.7 series, we changed things at the request of system admins so 
that we don't oversubscribe allocations given to us by resource managers unless 
specifically told to do so.


On Fri, Aug 14, 2015 at 12:52 AM, Schlottke-Lakemper, Michael 
mailto:m.schlottke-lakem...@aia.rwth-aachen.de>>
 wrote:
Hi folks,

It seems like oversubscription is disabled by default in OpenMPI 1.8.7, at 
least when running on a PBS/torque-managed node. When I run a program in 
parallel on a node with 8 cores, I am not able to use more than 8 ranks:

> mic@aia272:~> mpirun --display-allocation -n 9 hostname
>
> ==   ALLOCATED NODES   ==
>   aia272: slots=8 max_slots=0 slots_inuse=0 state=UP
> =
> --
> There are not enough slots available in the system to satisfy the 9 slots
> that were requested by the application:
>  hostname
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --


However, if I specifically enable oversubscription through the 
rmaps_base_oversubscribe setting, it works again:

> mic@aia272:~> mpirun --display-allocation --mca rmaps_base_oversubscribe 1 -n 
> 9 hostname
>
> ==   ALLOCATED NODES   ==
>   aia272: slots=8 max_slots=0 slots_inuse=0 state=UP
> =
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272

Now I am wondering, is this a bug or a feature? We recently upgraded from 1.6.x 
to 1.8.7, and as far as I remember, in 1.6.x oversubscription was enabled by 
default.

Regards,

Michael

P.S.: In ompi_info, both rmaps_base_no_oversubscribe and 
rmaps_base_oversubscribe are reported as “false”. Our 
prefix/etc/openmpi-mca-params.conf file is empty.
___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27466.php

___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27467.php
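
for anyone else bitten by this after an upgrade, the site-wide way to restore
the old behaviour is a one-line addition to the MCA parameter file mentioned in
the original post.  a sketch, where $PREFIX is your Open MPI install prefix:

# make oversubscription the default for every user of this install
echo "rmaps_base_oversubscribe = 1" >> $PREFIX/etc/openmpi-mca-params.conf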



[OMPI users] Oversubscription disabled by default in OpenMPI 1.8.7

2015-08-14 Thread Schlottke-Lakemper, Michael
Hi folks,

It seems like oversubscription is disabled by default in OpenMPI 1.8.7, at 
least when running on a PBS/torque-managed node. When I run a program in 
parallel on a node with 8 cores, I am not able to use more than 8 ranks:

> mic@aia272:~> mpirun --display-allocation -n 9 hostname
> 
> ==   ALLOCATED NODES   ==
>   aia272: slots=8 max_slots=0 slots_inuse=0 state=UP
> =
> --
> There are not enough slots available in the system to satisfy the 9 slots
> that were requested by the application:
>  hostname
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --


However, if I specifically enable oversubscription through the 
rmaps_base_oversubscribe setting, it works again:

> mic@aia272:~> mpirun --display-allocation --mca rmaps_base_oversubscribe 1 -n 
> 9 hostname
> 
> ==   ALLOCATED NODES   ==
>   aia272: slots=8 max_slots=0 slots_inuse=0 state=UP
> =
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272
> aia272

Now I am wondering, is this a bug or a feature? We recently upgraded from 1.6.x 
to 1.8.7, and as far as I remember, in 1.6.x oversubscription was enabled by 
default.

Regards,

Michael

P.S.: In ompi_info, both rmaps_base_no_oversubscribe and 
rmaps_base_oversubscribe are reported as “false”. Our 
prefix/etc/openmpi-mca-params.conf file is empty.

Re: [OMPI users] Invalid read of size 4 (Valgrind error) with OpenMPI 1.8.7

2015-07-29 Thread Schlottke-Lakemper, Michael
If it is helpful, I can try to compile OpenMPI with debug information and get 
more details on the reported error. However, it would be good if someone could 
tell me the necessary compile flags (on top of -O0 -g) and it would take me 
probably 1-2 weeks to do it.

Michael


 Original message 
From: Gilles Gouaillardet 
List-Post: users@lists.open-mpi.org
Date: 29/07/2015 14:17 (GMT+01:00)
To: Open MPI Users 
Subject: Re: [OMPI users] Invalid read of size 4 (Valgrind error) with OpenMPI 
1.8.7

Thomas,

can you please elaborate ?
I checked the code of opal_os_dirpath_create and could not find where such a 
thing can happen

Thanks,

Gilles

On Wednesday, July 29, 2015, Thomas Jahns mailto:ja...@dkrz.de>> 
wrote:
Hello,

On 07/28/15 17:34, Schlottke-Lakemper, Michael wrote:
That’s what I suspected. Thank you for your confirmation.

you are mistaken, the allocation is 51 bytes long, i.e. valid bytes are at 
offsets 0 to 50. But since the read of 4 bytes starts at offset 48, the bytes 
at offsets 48, 49, 50 and 51 get read, the last of which is illegal. It 
probably does no harm at the moment in practice, because virtually all 
allocators always add some padding to the next multiple of some power of 2. But 
still this means the program is incorrect in terms of any programming language 
definition involved (might be C, C++ or Fortran).

Regards, Thomas

On 25 Jul 2015, at 16:10 , Ralph Castain mailto:r...@open-mpi.org>> wrote:

Looks to me like a false positive - we do malloc some space, and do access
different parts of it. However, it looks like we are inside the space at all
times.

I’d suppress it


On Jul 23, 2015, at 12:47 AM, Schlottke-Lakemper, Michael
mailto:m.schlottke-lakem...@aia.rwth-aachen.de>> wrote:

Hi folks,

recently we’ve been getting a Valgrind error in PMPI_Init for our suite of
regression tests:

==5922== Invalid read of size 4
==5922==at 0x61CC5C0: opal_os_dirpath_create (in
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==  Address 0x710f670 is 48 bytes inside a block of size 51 alloc'd
==5922==at 0x4C29110: malloc (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==5922==by 0x61CC572: opal_os_dirpath_create (in
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==

What is weird is that it seems to depend on the pbs/torque session we’re in:
sometimes the error does not occur at all and all tests run fine (this is in
fact the only Valgrind error we’re having at the moment). Other times every
single test we’re running has this error.

Has anyone seen this or might be able to offer an explanation? If it is a
false-positive, I’d be happy to suppress it :)

Thanks a lot in advance

Michael

P.S.: This error is not covered/suppressed by the default ompi suppression
file in $PREFIX/share/openmpi.


--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: m.schlottke-lakem...@aia.rwth-aachen.de
<mailto:m.schlottke-lakem...@aia.rwth-aachen.de>
Web: http://www.jara.org/jara-hpc

___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Lin

Re: [OMPI users] Invalid read of size 4 (Valgrind error) with OpenMPI 1.8.7

2015-07-28 Thread Schlottke-Lakemper, Michael
Hi Ralph,

That’s what I suspected. Thank you for your confirmation.

Michael

On 25 Jul 2015, at 16:10 , Ralph Castain 
mailto:r...@open-mpi.org>> wrote:

Looks to me like a false positive - we do malloc some space, and do access 
different parts of it. However, it looks like we are inside the space at all 
times.

I’d suppress it


On Jul 23, 2015, at 12:47 AM, Schlottke-Lakemper, Michael 
mailto:m.schlottke-lakem...@aia.rwth-aachen.de>>
 wrote:

Hi folks,

recently we’ve been getting a Valgrind error in PMPI_Init for our suite of 
regression tests:

==5922== Invalid read of size 4
==5922==at 0x61CC5C0: opal_os_dirpath_create (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in 
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==  Address 0x710f670 is 48 bytes inside a block of size 51 alloc'd
==5922==at 0x4C29110: malloc (in 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==5922==by 0x61CC572: opal_os_dirpath_create (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in 
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==

What is weird is that it seems to depend on the pbs/torque session we’re in: 
sometimes the error does not occur at all and all tests run fine (this is in 
fact the only Valgrind error we’re having at the moment). Other times every 
single test we’re running has this error.

Has anyone seen this or might be able to offer an explanation? If it is a 
false-positive, I’d be happy to suppress it :)

Thanks a lot in advance

Michael

P.S.: This error is not covered/suppressed by the default ompi suppression file 
in $PREFIX/share/openmpi.


--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: 
m.schlottke-lakem...@aia.rwth-aachen.de<mailto:m.schlottke-lakem...@aia.rwth-aachen.de>
Web: http://www.jara.org/jara-hpc

___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27303.php

___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27328.php
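
for completeness, suppressing it per Ralph's suggestion only needs a small
suppression file keyed on the two innermost frames from the report above.  a
sketch -- the file name is arbitrary, the binary name is taken from the trace,
and the frames should be adjusted if your report differs:

cat > ompi-dirpath.supp <<'EOF'
{
   opal_os_dirpath_create_addr4
   Memcheck:Addr4
   fun:opal_os_dirpath_create
   fun:orte_session_dir
}
EOF
# pass it on top of the default suppressions shipped in $PREFIX/share/openmpi
valgrind --suppressions=ompi-dirpath.supp ./zfs_gnu_production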



Re: [OMPI users] File coherence issues with OpenMPI/torque/NFS (?)

2015-07-23 Thread Schlottke-Lakemper, Michael
Hi Gilles,

Thanks, and yes, you are right:
ompi_info --all | grep "MCA io" | grep priority
  MCA io: parameter "io_romio_priority" (current value: "20", 
data source: default, level: 9 dev/all, type: int)
  MCA io: parameter "io_romio_delete_priority" (current value: 
"20", data source: default, level: 9 dev/all, type: int)
  MCA io: parameter "io_ompio_priority" (current value: "10", 
data source: default, level: 9 dev/all, type: int)
  MCA io: parameter "io_ompio_delete_priority" (current value: 
"10", data source: default, level: 9 dev/all, type: int)

So it seems we are indeed using ROMIO. Any suggestions as to what that means 
with respect to our file coherence issue?

Regards,

Michael

On 23 Jul 2015, at 14:07 , Gilles Gouaillardet 
mailto:gilles.gouaillar...@gmail.com>> wrote:

Michael,

ROMIO is the default in the 1.8 series
you can run
ompi_info --all | grep io | grep priority
ROMIO priority should be 20 and ompio priority should be 10.

Cheers,

Gilles

On Thursday, July 23, 2015, Schlottke-Lakemper, Michael 
mailto:m.schlottke-lakem...@aia.rwth-aachen.de>>
 wrote:
Hi Gilles,

> are you running 1.8.7 or master ?
1.8.7. We recently upgraded our cluster installation from OpenSUSE 11.3/OpenMPI 
1.6.5 to OpenSUSE 12.3/OpenMPI 1.8.7. Before the upgrade, we did not encounter 
this problem.

> if not default, which io module are you running ?
> (default is ROMIO with 1.8 but ompio with master)
We did not specify anything at configure time, so I guess we are using the 
default. But if you tell me how, I can check.

> by any chance, could you post a simple program that evidences this issue ?
As of this time, unfortunately no. We only experience this issue 
intermittently, and only when running our suite of regression tests. It *seems* 
to occur only with a handful of the ~40 tests, but if we run only a subset of 
the tests (instead of all of them), it may not occur at all, depending on the 
subset. I tried using a MWE program but could not reproduce the issue with it.

Sorry for not being more helpful, but we are also scratching our heads trying 
to understand what is going on and I just thought that maybe someone here has 
had a similar experience in the past (or might give us some pointers at what to 
look at).

Regards,

Michael
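
if it helps to rule the io layer in or out, the component can also be pinned
per run instead of relying on the priorities shown in the ompi_info output
above.  a sketch -- the program and machinefile names are placeholders:

# run once with each io component and compare the resulting files
mpirun --mca io romio -machinefile nodes ./solver
mpirun --mca io ompio -machinefile nodes ./solver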




Re: [OMPI users] File coherence issues with OpenMPI/torque/NFS (?)

2015-07-23 Thread Schlottke-Lakemper, Michael
Hi Dave,

> That's probably not a good idea.  Have you read about NFS in the romio
> README?  It's old, but as far as I know, it's still relevant for NFS3.
> Maybe Rob Latham will see this and be able to comment on NFS4.
No, are you referring to the file openmpi-1.8.7/ompi/mca/io/romio/romio/README? 
The only hint it gives is the suggestion to use the “noac” mount option, which 
according to the man pages affects file-attribute caching rather than file 
contents (if I understand it correctly). Or do you think it would still be 
worth trying? By the way, we are using NFSv3.

> It seems to me that building for NFS by default is a mistake.
Can you tell me the correct flags I need to provide at configure-time? Or where 
I can find more information about that (there does not seem to be anything 
about configure flags in the above-mentioned README). Also, as Gilles (see 
other mail in thread) suggested, I am not sure whether we use romio or ompio, 
but I do not know how to find out.

Michael
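
if you do want to try the README's advice, note that noac is a client-side
mount option, so it is a change on every compute node rather than a configure
flag.  a sketch of what a test mount might look like -- server, export and
mount point are placeholders, and the usual caveat is that noac can hurt
metadata performance noticeably:

# remount the scratch filesystem with attribute caching disabled, NFSv3 as above
mount -t nfs -o vers=3,noac fileserver:/export/scratch /scratch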

Re: [OMPI users] File coherence issues with OpenMPI/torque/NFS (?)

2015-07-23 Thread Schlottke-Lakemper, Michael
Hi Gilles,

> are you running 1.8.7 or master ?
1.8.7. We recently upgraded our cluster installation from OpenSUSE 11.3/OpenMPI 
1.6.5 to OpenSUSE 12.3/OpenMPI 1.8.7. Before the upgrade, we did not encounter 
this problem.

> if not default, which io module are you running ?
> (default is ROMIO with 1.8 but ompio with master)
We did not specify anything at configure time, so I guess we are using the 
default. But if you tell me how, I can check.

> by any chance, could you post a simple program that evidences this issue ?
As of this time, unfortunately no. We only experience this issue 
intermittently, and only when running our suite of regression tests. It *seems* 
to occur only with a handful of the ~40 tests, but if we run only a subset of 
the tests (instead of all of them), it may not occur at all, depending on the 
subset. I tried using a MWE program but could not reproduce the issue with it.

Sorry for not being more helpful, but we are also scratching our heads trying 
to understand what is going on and I just thought that maybe someone here has 
had a similar experience in the past (or might give us some pointers at what to 
look at).

Regards,

Michael



[OMPI users] File coherence issues with OpenMPI/torque/NFS (?)

2015-07-23 Thread Schlottke-Lakemper, Michael
Hi folks,

We are currently encountering a weird file coherence issue when running 
parallel jobs with OpenMPI (1.8.7) and writing files in parallel to an 
NFS-mounted file system using Parallel netCDF 1.6.1 (which internally uses 
MPI-I/O). Sometimes (~30-40% of our samples) we get a file whose contents are 
not consistent across different hosts.

Specifically, one of the hosts where the file was created will (persistently) 
show a different file than any other host (confirmed using md5sum/sha256sum and 
manually). From our observations it seems like the bad host keeps an older 
state of the file, i.e. one where not all write processes had finished. The 
error seems to occur only if the ranks are distributed to at least two nodes, 
and it only occurs if there are multiple programs running within the same 
pbs/torque job at the same time (MPMD; each mpirun gets a different subset of 
the job nodes using the -machinefile flag).

Has anyone encountered something similar or do you have an idea what I could do 
to track down the problem?

Regards,

Michael


--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: 
m.schlottke-lakem...@aia.rwth-aachen.de<mailto:m.schlottke-lakem...@aia.rwth-aachen.de>
Web: http://www.jara.org/jara-hpc



[OMPI users] Invalid read of size 4 (Valgrind error) with OpenMPI 1.8.7

2015-07-23 Thread Schlottke-Lakemper, Michael
Hi folks,

recently we’ve been getting a Valgrind error in PMPI_Init for our suite of 
regression tests:

==5922== Invalid read of size 4
==5922==at 0x61CC5C0: opal_os_dirpath_create (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in 
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==  Address 0x710f670 is 48 bytes inside a block of size 51 alloc'd
==5922==at 0x4C29110: malloc (in 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==5922==by 0x61CC572: opal_os_dirpath_create (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-pal.so.6.2.2)
==5922==by 0x5F207E5: orte_session_dir (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x5F34F04: orte_ess_base_app_setup (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x7E96679: rte_init (in 
/aia/opt/openmpi-1.8.7/lib64/openmpi/mca_ess_env.so)
==5922==by 0x5F12A77: orte_init (in 
/aia/opt/openmpi-1.8.7/lib64/libopen-rte.so.7.0.6)
==5922==by 0x509883C: ompi_mpi_init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0x50B843A: PMPI_Init (in 
/aia/opt/openmpi-1.8.7/lib64/libmpi.so.1.6.2)
==5922==by 0xEBA79C: ZFS::run() (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==by 0x4DC243: main (in 
/aia/r018/scratch/mic/.zfstester/.zacc_cron/zacc_cron_r9063/zfs_gnu_production)
==5922==

What is weird is that it seems to depend on the pbs/torque session we’re in: 
sometimes the error does not occur at all and all tests run fine (this is in 
fact the only Valgrind error we’re having at the moment). Other times every 
single test we’re running has this error.

Has anyone seen this or might be able to offer an explanation? If it is a 
false-positive, I’d be happy to suppress it :)

Thanks a lot in advance

Michael

P.S.: This error is not covered/suppressed by the default ompi suppression file 
in $PREFIX/share/openmpi.


--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: 
m.schlottke-lakem...@aia.rwth-aachen.de<mailto:m.schlottke-lakem...@aia.rwth-aachen.de>
Web: http://www.jara.org/jara-hpc



[OMPI users] slurm openmpi 1.8.3 core bindings

2015-01-30 Thread Michael Di Domenico
I'm trying to get slurm and openmpi to cooperate when running multi
thread jobs.  i'm sure i'm doing something wrong, but i can't figure
out what

my node configuration is

2 nodes
2 sockets
6 cores per socket

i want to run

sbatch -N2 -n 8 --ntasks-per-node=4 --cpus-per-task=3 -w node1,node2
program.sbatch

inside the program.sbatch i'm calling openmpi

mpirun -n $SLURM_NTASKS --report-bindings program

when the binds report comes out i get

node1 rank 0 socket 0 core 0
node1 rank 1 socket 1 core 6
node1 rank 2 socket 0 core 1
node1 rank 3 socket 1 core 7
node2 rank 4 socket 0 core 0
node2 rank 5 socket 1 core 6
node2 rank 6 socket 0 core 1
node2 rank 7 socket 1 core 7

which is semi-fine, but when the job runs the resulting threads from
the program are locked (according to top) to those eight cores rather
than spreading themselves over the 24 cores available

i tried a few incantations of the map-by, bind-to, etc, but openmpi
basically complained about everything i tried for one reason or
another

my understanding is that slurm should be passing the requested config to
openmpi (or openmpi is pulling from the environment somehow) and it
should magically work

if i skip slurm and run

mpirun -n 8 --map-by node:pe=3 -bind-to core -host node1,node2
--report-bindings program

node1 rank 0 socket 0 core 0
node2 rank 1 socket 0 core 0
node1 rank 2 socket 0 core 3
node2 rank 3 socket 0 core 3
node1 rank 4 socket 1 core 6
node2 rank 5 socket 1 core 6
node1 rank 6 socket 1 core 9
node2 rank 7 socket 1 core 9

i do get the behavior i want (though i would prefer a -npernode switch
in there, but openmpi complains).  the bindings look better and the
threads are not locked to the particular cores

therefore i'm pretty sure this is a problem between openmpi and slurm
and not necessarily with either individually

i did compile openmpi with the slurm support switch and we're using
the cgroups taskplugin within slurm

i guess ancillary to this, is there a way to turn off core
binding/placement routines and control the placement manually?


Re: [OMPI users] ipath_userinit errors

2014-11-06 Thread Michael Di Domenico
Andrew,

Thanks.  We're using the RHEL version because it was less complicated
for our environment in the past, but it sounds like we might want to
reconsider that decision.

Do you know why we don't see the message with lower node count
allocations?  It only seems to happen when the node count gets over a
certain point.

thanks

On Wed, Nov 5, 2014 at 5:51 PM, Friedley, Andrew
 wrote:
> Hi Michael,
>
> From what I understand, this is an issue with the qib driver and PSM from 
> RHEL 6.5 and 6.6, and will be fixed for 6.7.  There is no functional change 
> between qib->PSM API versions 11 and 12, so the message is harmless.  I 
> presume you're using the RHEL sourced package for a reason, but using an IFS 
> release would fix the problem until RHEL 6.7 is ready.
>
> Andrew
>
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di
>> Domenico
>> Sent: Tuesday, November 4, 2014 8:35 AM
>> To: Open MPI Users
>> Subject: [OMPI users] ipath_userinit errors
>>
>> I'm getting the below message on my cluster(s).  It seems to only happen
>> when I try to use more then 64 nodes (16-cores each).  The clusters are
>> running RHEL 6.5 with Slurm and Openmpi-1.6.5 with PSM.
>> I'm using the OFED versions included with RHEL for infiniband support.
>>
>> ipath_userinit: Mismatched user minor version (12) and driver minor version
>> (11) while context sharing. Ensure that driver and library are from the same
>> release
>>
>> I already realize this is a warning message and the jobs complete.
>> Another user a little over a year ago had a similar issue that was tracked to
>> mismatched ofed versions.  Since i have a diskless cluster all my nodes are
>> identical.
>>
>> I'm not adverse to thinking there might not be something unique about my
>> machine, but since i have two separate machines doing it, I'm not really sure
>> where to look to triage the issue and see what might be set incorrectly.
>>
>> Any thoughts on where to start checking would be helpful, thanks...
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-
>> mpi.org/community/lists/users/2014/11/25667.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25694.php


[OMPI users] ipath_userinit errors

2014-11-04 Thread Michael Di Domenico
I'm getting the below message on my cluster(s).  It seems to only
happen when I try to use more then 64 nodes (16-cores each).  The
clusters are running RHEL 6.5 with Slurm and Openmpi-1.6.5 with PSM.
I'm using the OFED versions included with RHEL for infiniband support.

ipath_userinit: Mismatched user minor version (12) and driver minor
version (11) while context sharing. Ensure that driver and library are
from the same release

I already realize this is a warning message and the jobs complete.
Another user a little over a year ago had a similar issue that was
tracked to mismatched ofed versions.  Since i have a diskless cluster
all my nodes are identical.

I'm not averse to thinking there might be something unique about
my setup, but since i have two separate machines doing it, I'm not
really sure where to look to triage the issue and see what might be
set incorrectly.

Any thoughts on where to start checking would be helpful, thanks...
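
a concrete starting point is simply comparing the kernel driver and the
user-space PSM library versions on a node.  a sketch -- the module and package
names are what RHEL's in-box stack usually uses, adjust if yours differ:

# kernel side: the qib driver that reports minor version 11
modinfo ib_qib | grep -i version
# user side: the PSM library that reports minor version 12
rpm -qa | grep -i -E "infinipath|psm"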

