Siegmar,

Your slot list is correct.
An invalid slot list for your node would be 0:1-7,1:0-7, since your 6-core
sockets only have cores 0-5.

/* and since the test requires only 5 tasks, it could even work with such
an invalid list.
My VM is a single socket with 4 cores, so a 0:0-4 slot list (which references
the non-existent core 4) results in an unfriendly PMIx error */
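
For example (a sketch, reusing your spawn_master test and the two 6-core
sockets on your node):

  # valid: each socket has cores 0-5
  mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

  # invalid: cores 6 and 7 do not exist on either socket
  mpiexec -np 1 --host loki --slot-list 0:1-7,1:0-7 spawn_master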

Bottom line: your test is correct, and there is a bug in v2.0.x that I will
start investigating tomorrow.
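
For reference, a minimal sketch of such a spawn test (my rough guess at what
spawn_master does: one parent that spawns 4 workers, i.e. 5 tasks in total;
the worker binary name "spawn_slave" is an assumption):

/* spawn_master sketch: the parent reports its host, then spawns 4 workers.
 * A failure in the spawn itself (e.g. the timeout reported from
 * pmix_base_fns.c) surfaces in the MPI_Comm_spawn call below. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, len;
  char host[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm intercomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(host, &len);
  printf("Parent process %d running on %s\n", rank, host);
  printf("  I create 4 slave processes\n");

  /* assumed worker binary name; adjust to the real one */
  MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                 0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

  MPI_Comm_disconnect(&intercomm);
  MPI_Finalize();
  return 0;
}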

Cheers,

Gilles

On Wednesday, January 11, 2017, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi Gilles,
>
> thank you very much for your help. What does an incorrect slot list
> mean? My machine has two 6-core processors, so I specified
> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
> allowed to specify more slots than available, to specify fewer
> slots than available, or to specify more slots than needed for
> the processes?
>
>
> Kind regards
>
> Siegmar
>
> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
>
>> Siegmar,
>>
>> I was able to reproduce the issue on my VM
>> (no need for a real heterogeneous cluster here).
>>
>> I will keep digging tomorrow.
>> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
>> with a very unfriendly error message.
>> Right now, the 4th spawned task crashes, so this is a different issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> r...@open-mpi.org wrote:
>> I think there is some relevant discussion here:
>> https://github.com/open-mpi/ompi/issues/1569
>>
>> It looks like Gilles had (at least at one point) a fix for master when
>> --enable-heterogeneous is used, but I don't know whether that was committed.
>>
>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>
>>> HI Siegmar,
>>>
>>> You have some config parameters I wasn't using that may have some
>>> impact. I'll give it a try with these parameters.
>>>
>>> This should be enough info for now,
>>>
>>> Thanks,
>>>
>>> Howard
>>>
>>>
>>> 2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>>
>>>     Hi Howard,
>>>
>>>     I use the following commands to build and install the package.
>>>     ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
>>>     Linux machine.
>>>
>>>     mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>>     cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>>
>>>     ../openmpi-2.0.2rc3/configure \
>>>       --prefix=/usr/local/openmpi-2.0.2_64_cc \
>>>       --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>>>       --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>>       --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>>       JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>>       LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>>>       CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>       CPP="cpp" CXXCPP="cpp" \
>>>       --enable-mpi-cxx \
>>>       --enable-mpi-cxx-bindings \
>>>       --enable-cxx-exceptions \
>>>       --enable-mpi-java \
>>>       --enable-heterogeneous \
>>>       --enable-mpi-thread-multiple \
>>>       --with-hwloc=internal \
>>>       --without-verbs \
>>>       --with-wrapper-cflags="-m64 -mt" \
>>>       --with-wrapper-cxxflags="-m64" \
>>>       --with-wrapper-fcflags="-m64" \
>>>       --with-wrapper-ldflags="-mt" \
>>>       --enable-debug \
>>>       |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>
>>>     make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>     rm -r /usr/local/openmpi-2.0.2_64_cc.old
>>>     mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
>>>     make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>     make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>
>>>
>>>     I get a different error if I run the program with gdb.
>>>
>>>     loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
>>>     GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
>>>     Copyright (C) 2016 Free Software Foundation, Inc.
>>>     License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>     This is free software: you are free to change and redistribute it.
>>>     There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>>     and "show warranty" for details.
>>>     This GDB was configured as "x86_64-suse-linux".
>>>     Type "show configuration" for configuration details.
>>>     For bug reporting instructions, please see:
>>>     <http://bugs.opensuse.org/>.
>>>     Find the GDB manual and other documentation resources online at:
>>>     <http://www.gnu.org/software/gdb/documentation/>.
>>>     For help, type "help".
>>>     Type "apropos word" to search for commands related to "word"...
>>>     Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
>>>     (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>>     Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>>     Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
>>>     [Thread debugging using libthread_db enabled]
>>>     Using host libthread_db library "/lib64/libthread_db.so.1".
>>>     [New Thread 0x7ffff3b97700 (LWP 13582)]
>>>     [New Thread 0x7ffff18a4700 (LWP 13583)]
>>>     [New Thread 0x7ffff10a3700 (LWP 13584)]
>>>     [New Thread 0x7fffebbba700 (LWP 13585)]
>>>     Detaching after fork from child process 13586.
>>>
>>>     Parent process 0 running on loki
>>>       I create 4 slave processes
>>>
>>>     Detaching after fork from child process 13589.
>>>     Detaching after fork from child process 13590.
>>>     Detaching after fork from child process 13591.
>>>     [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
>>>     [loki:13586] *** An error occurred in MPI_Comm_spawn
>>>     [loki:13586] *** reported by process [2873294849,0]
>>>     [loki:13586] *** on communicator MPI_COMM_WORLD
>>>     [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
>>>     [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>     [loki:13586] ***    and potentially your MPI job)
>>>     [Thread 0x7fffebbba700 (LWP 13585) exited]
>>>     [Thread 0x7ffff10a3700 (LWP 13584) exited]
>>>     [Thread 0x7ffff18a4700 (LWP 13583) exited]
>>>     [Thread 0x7ffff3b97700 (LWP 13582) exited]
>>>     [Inferior 1 (process 13567) exited with code 016]
>>>     Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
>>>     (gdb) bt
>>>     No stack.
>>>     (gdb)
>>>
>>>     Do you need anything else?
>>>
>>>
>>>     Kind regards
>>>
>>>     Siegmar
>>>
>>>     On 08.01.2017 at 17:02, Howard Pritchard wrote:
>>>
>>>         HI Siegmar,
>>>
>>>         Could you post the configure options you use when building 2.0.2rc3?
>>>         Maybe that will help in trying to reproduce the segfault you are observing.
>>>
>>>         Howard
>>>
>>>
>>>         2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>>
>>>             Hi,
>>>
>>>             I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
>>>             Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
>>>             I still get the same error that I reported for rc2.
>>>
>>>             I would be grateful if somebody could fix the problem before
>>>             releasing the final version. Thank you very much in advance for
>>>             any help.
>>>
>>>
>>>             Kind regards
>>>
>>>             Siegmar
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
