Siegmar,

Thanks for the report.

About the issue with the Sun compiler and hello world: the root cause is
incorrect packaging, and a fix is available at
https://github.com/open-mpi/ompi/pull/1285

(note the issue only occurs when building from a tarball)

I will have a look at the other issues.

Cheers,

Gilles

On 1/6/2016 9:57 PM, Siegmar Gross wrote:
Hi,

I've successfully built openmpi-v2.x-dev-950-g995993b on my machines
(Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1
x86_64) with gcc-5.1.0 and Sun C 5.13. Unfortunately I get errors
running some small test programs. All programs work as expected
using my gcc or cc version of openmpi-v1.10.1-138-g0e3b111. I get
similar errors for the master openmpi-dev-3329-ge4bdad0.
I used the following commands to build the package for gcc.


mkdir openmpi-v2.x-dev-950-g995993b-${SYSTEM_ENV}.${MACHINE_ENV}.64_gcc
cd openmpi-v2.x-dev-950-g995993b-${SYSTEM_ENV}.${MACHINE_ENV}.64_gcc

../openmpi-v2.x-dev-950-g995993b/configure \
  --prefix=/usr/local/openmpi-2.0.0_64_gcc \
  --libdir=/usr/local/openmpi-2.0.0_64_gcc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0/include \
  JAVA_HOME=/usr/local/jdk1.8.0 \
  LDFLAGS="-m64" CC="gcc" CXX="g++" FC="gfortran" \
  CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
  CPP="cpp" CXXCPP="cpp" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --enable-heterogeneous \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-std=c11 -m64" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
rm -r /usr/local/openmpi-2.0.0_64_gcc.old
mv /usr/local/openmpi-2.0.0_64_gcc /usr/local/openmpi-2.0.0_64_gcc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_gcc



A simple "hello world" or "matrix multiplication" program works with
my gcc version but breaks with my cc version as you can see at the
bottom. Spawning processes breaks with both versions.
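
(For reference, spawn_multiple_master presumably does something along
the lines of the minimal sketch below; the actual source is not
included in this report, and the worker binary name "spawn_slave" is
only a placeholder.)

/* minimal MPI_Comm_spawn_multiple test, roughly matching the
 * "I create 3 slave processes." output below
 * compile with: mpicc spawn_test.c -o spawn_test */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;

    /* one command, three copies; "spawn_slave" is a hypothetical
     * worker binary */
    char *commands[1] = { "spawn_slave" };
    int maxprocs[1]   = { 3 };
    MPI_Info infos[1] = { MPI_INFO_NULL };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(hostname, &len);
    if (rank == 0) {
        printf("Parent process %d running on %s\n", rank, hostname);
        printf("  I create %d slave processes.\n", maxprocs[0]);
    }

    /* the spawn call that fails with MPI_ERR_SPAWN below */
    MPI_Comm_spawn_multiple(1, commands, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &intercomm,
                            MPI_ERRCODES_IGNORE);

    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}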


tyr spawn 128 mpiexec -np 1 --hetero-nodes --host tyr,sunpc1,linpc1,tyr spawn_multiple_master

Parent process 0 running on tyr.informatik.hs-fulda.de
  I create 3 slave processes.

[tyr.informatik.hs-fulda.de:22370] PMIX ERROR: UNPACK-PAST-END in file ../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c at line 829
[tyr.informatik.hs-fulda.de:22370] PMIX ERROR: UNPACK-PAST-END in file ../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c at line 2176
[tyr:22378] *** An error occurred in MPI_Comm_spawn_multiple
[tyr:22378] *** reported by process [4047765505,0]
[tyr:22378] *** on communicator MPI_COMM_WORLD
[tyr:22378] *** MPI_ERR_SPAWN: could not spawn processes
[tyr:22378] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tyr:22378] ***    and potentially your MPI job)
tyr spawn 129





tyr spawn 151 mpiexec -np 1 --hetero-nodes --host sunpc1,linpc1,linpc1 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on sunpc1
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        0

Child process 1 running on linpc1
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        2

Child process 0 running on linpc1
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        1
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 16203 on node sunpc1 exited
on signal 13 (Broken Pipe).
--------------------------------------------------------------------------
tyr spawn 152



I don't see a broken pipe if I change the order of sunpc1 and
linpc1.

tyr spawn 146 mpiexec -np 1 --hetero-nodes --host linpc1,sunpc1,sunpc1 spawn_intra_comm
Parent process 0: I create 2 slave processes

Child process 1 running on sunpc1
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        2

Child process 0 running on sunpc1
    MPI_COMM_WORLD ntasks:              2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        1

Parent process 0 running on linpc1
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 2
    COMM_ALL_PROCESSES ntasks:          3
    mytid in COMM_ALL_PROCESSES:        0



The process doesn't return and uses about 50% CPU time (one of two
processors) if I combine an x86_64 processor (sunpc1, linpc1) with
a Sparc processor (tyr).

tyr spawn 147 mpiexec -np 1 --hetero-nodes --host linpc1,tyr,tyr spawn_intra_comm
Parent process 0: I create 2 slave processes
^CKilled by signal 2.

tyr spawn 148 mpiexec -np 1 --hetero-nodes --host sunpc1,tyr,tyr spawn_intra_comm
Parent process 0: I create 2 slave processes
^CKilled by signal 2.
tyr spawn 149
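
(For context, spawn_intra_comm presumably uses the standard
spawn-and-merge pattern: the parent spawns the children and both sides
call MPI_Intercomm_merge to build COMM_ALL_PROCESSES. Again, the actual
source is not included here, so the sketch below is an assumption based
on the output above.)

/* sketch of the spawn-and-merge pattern suggested by the
 * COMM_CHILD_PROCESSES / COMM_ALL_PROCESSES output above */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, child_comm, all_comm;
    int ntasks, mytid, nlocal, nremote;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* parent: spawn 2 copies of this binary */
        printf("Parent process 0: I create 2 slave processes\n");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &child_comm, MPI_ERRCODES_IGNORE);
        MPI_Comm_size(child_comm, &nlocal);         /* local group size  */
        MPI_Comm_remote_size(child_comm, &nremote); /* remote group size */
        printf("COMM_CHILD_PROCESSES ntasks_local: %d, ntasks_remote: %d\n",
               nlocal, nremote);
        /* merge; parent group ordered first (high = 0) */
        MPI_Intercomm_merge(child_comm, 0, &all_comm);
    } else {
        /* child: merge so the parent group comes first (high = 1) */
        MPI_Intercomm_merge(parent, 1, &all_comm);
    }

    MPI_Comm_size(all_comm, &ntasks);
    MPI_Comm_rank(all_comm, &mytid);
    printf("COMM_ALL_PROCESSES ntasks: %d, mytid: %d\n", ntasks, mytid);

    MPI_Comm_free(&all_comm);
    MPI_Finalize();
    return 0;
}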







The following programs break only with my Sun C 5.13 version.
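
(hello_1_mpi is presumably a plain hello-world along the lines of the
sketch below; this is an assumption, since the source is not included.
Note the failure happens in orte_init, i.e. before any user code runs.)

/* minimal MPI hello world */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostname, &len);
    printf("Hello from process %d of %d on %s\n", rank, size, hostname);
    MPI_Finalize();
    return 0;
}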


tyr hello_1 114 mpiexec -np 4 --hetero-nodes --host tyr,sunpc1,linpc1 hello_1_mpi
[tyr.informatik.hs-fulda.de:21472] [[62918,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-v2.x-dev-950-g995993b/orte/mca/ess/hnp/ess_hnp_module.c at line 638
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
tyr hello_1 115





tyr java 118 mpiexec -np 4 --hetero-nodes --host tyr,sunpc1,linpc1 java MatMultWithAnyProc2DarrayIn1DarrayMain
[tyr.informatik.hs-fulda.de:21508] [[61986,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-v2.x-dev-950-g995993b/orte/mca/ess/hnp/ess_hnp_module.c at line 638
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
tyr java 119



I would be grateful if somebody could fix these problems. Please let
me know if you need anything else. Thank you very much in advance for
any help.


Best regards

Siegmar
