Howard,

I have one more question. Is it possible to use MPI_Comm_spawn when
launching an Open MPI job with aprun? I'm getting this error when I try:

nradclif@kay:/lus/scratch/nradclif> aprun -n 1 -N 1 ./manager
[nid00036:21772] [[14952,0],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1190
[36:21772] *** An error occurred in MPI_Comm_spawn
[36:21772] *** reported by process [979894272,0]
[36:21772] *** on communicator MPI_COMM_SELF
[36:21772] *** MPI_ERR_UNKNOWN: unknown error
[36:21772] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[36:21772] ***    and potentially your MPI job)
aborting job:
N/A
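
For reference, a minimal call of the kind that hits this error looks like the
sketch below. This isn't my actual manager source; the "./worker" binary name
and the worker count of 2 are just placeholders.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int errcodes[2];

    MPI_Init(&argc, &argv);

    /* Ask the runtime to launch 2 copies of "./worker" (placeholder name)
     * and return an intercommunicator connecting the manager to them.
     * A spawn call like this, on MPI_COMM_SELF, is what aborts under aprun
     * with the error shown above. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    printf("spawn succeeded\n");

    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}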


Nick Radcliffe
Software Engineer
Cray, Inc.
________________________________
From: users [users-boun...@open-mpi.org] on behalf of Howard Pritchard 
[hpprit...@gmail.com]
Sent: Thursday, June 25, 2015 11:00 PM
To: Open MPI Users
Subject: Re: [OMPI users] Running with native ugni on a Cray XC

Hi Nick,

I will endeavor to put together a wiki for the master/v2.x series, specific to
Cray systems, that explains best practices for using Open MPI on Cray XE/XK/XC
systems (excluding those sites that neither 1) use the Cray-supported eslogin
setup nor 2) permit users to log in to service nodes directly and build
applications there).

A significant amount of work has gone into master, and now into the v2.x
release stream, to rationalize support for Open MPI on Cray XE/XK/XC systems
using either aprun or native Slurm launch.

General advice for everyone on this mailing list: do not use the Open MPI 1.8.X
release series with direct ugni access enabled on Cray XE/XK/XC. Instead, use
master or, as soon as a release is available, v2.x. Note that if you are using
CCM, the performance of Open MPI 1.8.X over the Cray IAA (simulated ibverbs) is
pretty good; I suggest that as the preferred route for using the 1.8.X release
stream on Cray XE/XK/XC.

Howard


2015-06-25 19:35 GMT-06:00 Nick Radcliffe <nradc...@cray.com>:
Thanks Howard, using master worked for me.

Nick Radcliffe
Software Engineer
Cray, Inc.
________________________________
From: users [users-boun...@open-mpi.org] on behalf of Howard Pritchard [hpprit...@gmail.com]
Sent: Thursday, June 25, 2015 5:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Running with native ugni on a Cray XC


Hi Nick

Use master, not 1.8.x, for Cray XC. Also, for configuration, don't use the
cray/lanl platform files; just run a plain configure. And if you're using
nativized Slurm, launch with srun, not mpirun.

howard

----------

sent from my smart phone, so please excuse the typos.

Howard

On Jun 25, 2015 2:56 PM, "Nick Radcliffe" <nradc...@cray.com> wrote:
Hi,

I'm trying to build and run Open MPI 1.8.5 with native ugni on a Cray XC. The 
build works, but I'm getting this error when I run:

nradclif@kay:/lus/scratch/nradclif> aprun -n 2 -N 1 ./osu_latency
[nid00014:28784] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation failed
[nid00014:28784] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
[nid00012:12788] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation failed
[nid00012:12788] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
# OSU MPI Latency Test
# Size            Latency (us)
osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start: Assertion 
`0' failed.
[nid00012:12788] *** Process received signal ***
[nid00012:12788] Signal: Aborted (6)
[nid00012:12788] Signal code:  (-6)
[nid00012:12788] [ 0] /lib64/libpthread.so.0(+0xf850)[0x2aaaab42b850]
[nid00012:12788] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2aaaab66b885]
[nid00012:12788] [ 2] /lib64/libc.so.6(abort+0x181)[0x2aaaab66ce61]
[nid00012:12788] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x2aaaab664740]
[nid00012:12788] [ 4] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_btl_ugni_ep_connect_progress+0x6c9)[0x2aaaaaff9869]
[nid00012:12788] [ 5] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(+0x5ae32)[0x2aaaaaf46e32]
[nid00012:12788] [ 6] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_btl_ugni_sendi+0x8bd)[0x2aaaaaffaf7d]
[nid00012:12788] [ 7] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(+0x1f0c17)[0x2aaaab0dcc17]
[nid00012:12788] [ 8] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_pml_ob1_isend+0xa8)[0x2aaaab0dd488]
[nid00012:12788] [ 9] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(ompi_coll_tuned_barrier_intra_two_procs+0x11b)[0x2aaaab07e84b]
[nid00012:12788] [10] 
/lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(PMPI_Barrier+0xb6)[0x2aaaaaf8a7c6]
[nid00012:12788] [11] ./osu_latency[0x401114]
[nid00012:12788] [12] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2aaaab657c36]
[nid00012:12788] [13] ./osu_latency[0x400dd9]
[nid00012:12788] *** End of error message ***
osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start: Assertion 
`0' failed.


Here's how I build:

export FC=ftn    (I'm not using Fortran, but the configure fails if it can't find a Fortran compiler)
./configure --prefix=/lus/scratch/nradclif/openmpi_install \
    --enable-mpi-fortran=none \
    --with-platform=contrib/platform/lanl/cray_xe6/debug-lustre
make install

I didn't modify the debug-lustre file, but I did change cray-common to remove 
the hard-coding, e.g., rather than using the gemini-specific path 
"with_pmi=/opt/cray/pmi/2.1.4-1.0000.8596.8.9.gem", I used 
"with_pmi=/opt/cray/pmi/default".

I've tried running different executables with different numbers of ranks/nodes, 
but they all seem to run into problems with PMI_KVS_Put.

Any ideas what could be going wrong?

Thanks for any help,
Nick
