Ralph,

so it seems the root cause is an incompatibility between the --host and --slot-list options.


on a single node with two six-core sockets,

this works:

mpirun -np 1 ./spawn_master
mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master


this does not work:

mpirun -np 1 --host motomachi ./spawn_master # not enough slots available; aborts with a user-friendly error message
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master # various errors: sm_segment_attach() fails and a task crashes

and the latter ends up with the following error message:

At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[15519,2],0]) is on host: motomachi
  Process 2 ([[15519,2],1]) is on host: unknown!
  BTLs attempted: self tcp

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master # same error as above
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master # same error as above


for the record, the following command surprisingly works:

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master
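
fwiw, here is roughly what spawn_master does (i do not have Siegmar's exact source at hand, so this is only a minimal sketch, and "spawn_slave" is a placeholder name for the child binary):

/* minimal sketch of a spawn_master-style test (not Siegmar's actual source):
 * the parent spawns 4 children of a placeholder "spawn_slave" binary
 * and then disconnects from them */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("Parent process %d running on %s\n", rank, host);
    printf("  I create 4 slave processes\n");

    /* the spawn is where the errors above show up */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}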



bottom line, my guess is that when the user specifies both the --slot-list and the --host options *and* no slot count is given for the host, we should default to using the number of slots from the slot list.
(e.g. in this case, default to --host motomachi:12 instead of, i guess, --host motomachi:1)
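
in pseudo-C terms (illustrative only, none of these are actual ORTE symbols), the defaulting rule i have in mind would look something like this:

/* illustrative sketch only -- not real ORTE code or symbol names.
 * idea: when a host is given without an explicit slot count and a
 * --slot-list was provided, derive the host's slot count from the
 * slot list instead of falling back to 1 */
#include <stdlib.h>
#include <string.h>

static int default_slots_for_host(const char *host_spec, int slot_list_count)
{
    /* host_spec is e.g. "motomachi" or "motomachi:3";
     * slot_list_count is e.g. 12 for --slot-list 0:0-5,1:0-5 */
    const char *colon = strchr(host_spec, ':');
    if (NULL != colon) {
        return atoi(colon + 1);   /* explicit count given by the user: honor it */
    }
    if (slot_list_count > 0) {
        return slot_list_count;   /* proposed: inherit the slot-list size */
    }
    return 1;                     /* current fallback (i guess) */
}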


/* fwiw, i made

https://github.com/open-mpi/ompi/pull/2715

but these are not the root cause */


Cheers,


Gilles



-------- Forwarded Message --------
Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Date:   Wed, 11 Jan 2017 20:39:02 +0900
From:   Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Reply-To:       Open MPI Users <us...@lists.open-mpi.org>
To:     Open MPI Users <us...@lists.open-mpi.org>



Siegmar,

Your slot list is correct.
An invalid slot list for your node would be 0:1-7,1:0-7

/* and since the test requires only 5 tasks, that could even work with such an invalid list. My VM has a single socket with 4 cores, so a 0:0-4 slot list results in an unfriendly PMIx error */

Bottom line, your test is correct, and there is a bug in v2.0.x that I will start investigating tomorrow

Cheers,

Gilles

On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

   Hi Gilles,

   thank you very much for your help. What does incorrect slot list
   mean? My machine has two 6-core processors so that I specified
   "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
   allowed to specify more slots than available, to specify fewer
   slots than available, or to specify more slots than needed for
   the processes?


   Kind regards

   Siegmar

   On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:

       Siegmar,

       I was able to reproduce the issue on my vm
       (No need for a real heterogeneous cluster here)

       I will keep digging tomorrow.
       Note that if you specify an incorrect slot list, MPI_Comm_spawn
       fails with a very unfriendly error message.
        Right now, the 4th spawned task crashes, so this is a different
       issue

       Cheers,

       Gilles

       r...@open-mpi.org wrote:
       I think there is some relevant discussion here:
        https://github.com/open-mpi/ompi/issues/1569

        It looks like Gilles had (at least at one point) a fix for
        master when built with --enable-heterogeneous, but I don’t know if
        that was committed.

           On Jan 9, 2017, at 8:23 AM, Howard Pritchard
            <hpprit...@gmail.com> wrote:

           HI Siegmar,

            You have some configure parameters I wasn't trying that may
            have some impact.
            I'll give it a try with these parameters.

           This should be enough info for now,

           Thanks,

           Howard


           2017-01-09 0:59 GMT-07:00 Siegmar Gross
            <siegmar.gr...@informatik.hs-fulda.de>:

                Hi Howard,

                I use the following commands to build and install the
           package.
                ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64"
           for my
                Linux machine.

                mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
                cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

                ../openmpi-2.0.2rc3/configure \
                  --prefix=/usr/local/openmpi-2.0.2_64_cc \
                  --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
                  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
                  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
                  JAVA_HOME=/usr/local/jdk1.8.0_66 \
                   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
                  CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
                  CPP="cpp" CXXCPP="cpp" \
                  --enable-mpi-cxx \
                  --enable-mpi-cxx-bindings \
                  --enable-cxx-exceptions \
                  --enable-mpi-java \
                  --enable-heterogeneous \
                  --enable-mpi-thread-multiple \
                  --with-hwloc=internal \
                  --without-verbs \
                  --with-wrapper-cflags="-m64 -mt" \
                  --with-wrapper-cxxflags="-m64" \
                  --with-wrapper-fcflags="-m64" \
                  --with-wrapper-ldflags="-mt" \
                  --enable-debug \
                  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

                 make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
                 rm -r /usr/local/openmpi-2.0.2_64_cc.old
                 mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
                 make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
                 make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


                I get a different error if I run the program with gdb.

                loki spawn 118 gdb
           /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
                GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
                Copyright (C) 2016 Free Software Foundation, Inc.
                License GPLv3+: GNU GPL version 3 or later
            <http://gnu.org/licenses/gpl.html>
                This is free software: you are free to change and
           redistribute it.
                There is NO WARRANTY, to the extent permitted by law.
           Type "show copying"
                and "show warranty" for details.
                This GDB was configured as "x86_64-suse-linux".
                Type "show configuration" for configuration details.
                For bug reporting instructions, please see:
                <http://bugs.opensuse.org/>.
                Find the GDB manual and other documentation resources
           online at:
                 <http://www.gnu.org/software/gdb/documentation/>.
                For help, type "help".
                Type "apropos word" to search for commands related to
           "word"...
                Reading symbols from
           /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
                (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5
           spawn_master
                Starting program:
           /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki
           --slot-list 0:0-5,1:0-5 spawn_master
                Missing separate debuginfos, use: zypper install
           glibc-debuginfo-2.24-2.3.x86_64
                [Thread debugging using libthread_db enabled]
                Using host libthread_db library "/lib64/libthread_db.so.1".
                [New Thread 0x7ffff3b97700 (LWP 13582)]
                [New Thread 0x7ffff18a4700 (LWP 13583)]
                [New Thread 0x7ffff10a3700 (LWP 13584)]
                [New Thread 0x7fffebbba700 (LWP 13585)]
                Detaching after fork from child process 13586.

                Parent process 0 running on loki
                  I create 4 slave processes

                Detaching after fork from child process 13589.
                Detaching after fork from child process 13590.
                Detaching after fork from child process 13591.
                [loki:13586] OPAL ERROR: Timeout in file
           ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c
           at line 193
                [loki:13586] *** An error occurred in MPI_Comm_spawn
                [loki:13586] *** reported by process [2873294849,0]
                [loki:13586] *** on communicator MPI_COMM_WORLD
                [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
                [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in
           this communicator will now abort,
                [loki:13586] ***    and potentially your MPI job)
                [Thread 0x7fffebbba700 (LWP 13585) exited]
                [Thread 0x7ffff10a3700 (LWP 13584) exited]
                [Thread 0x7ffff18a4700 (LWP 13583) exited]
                [Thread 0x7ffff3b97700 (LWP 13582) exited]
                [Inferior 1 (process 13567) exited with code 016]
                Missing separate debuginfos, use: zypper install
           libpciaccess0-debuginfo-0.13.2-5.1.x86_64
           libudev1-debuginfo-210-116.3.3.x86_64
                (gdb) bt
                No stack.
                (gdb)

                Do you need anything else?


                Kind regards

                Siegmar

                 On 08.01.2017 at 17:02, Howard Pritchard wrote:

                    HI Siegmar,

                    Could you post the configury options you use when
           building the 2.0.2rc3?
                    Maybe that will help in trying to reproduce the
           segfault you are observing.

                    Howard


                    2017-01-07 2:30 GMT-07:00 Siegmar Gross
            <siegmar.gr...@informatik.hs-fulda.de>:

                        Hi,

                        I have installed openmpi-2.0.2rc3 on my "SUSE
           Linux Enterprise
                        Server 12 (x86_64)" with Sun C 5.14 and
           gcc-6.3.0. Unfortunately,
                        I still get the same error that I reported for rc2.

                        I would be grateful, if somebody can fix the
           problem before
                        releasing the final version. Thank you very
           much for any help
                        in advance.


                        Kind regards

                        Siegmar