Siegmar,

Your slot list is correct. An invalid slot list for your node would be
0:1-7,1:0-7 (cores 6 and 7 do not exist on your 6-core sockets).
(And since the test requires only 5 tasks, it could even work with such an
invalid list. My VM is a single socket with 4 cores, so a 0:0-4 slot list
results in an unfriendly PMIx error.)

Bottom line: your test is correct, and there is a bug in v2.0.x that I will
investigate starting tomorrow.

Cheers,

Gilles
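[For reference, a slot list has the form socket:core-range, so on Siegmar's
node (two 6-core sockets) the valid and invalid lists discussed above
correspond to invocations like the following; the host name loki and the
spawn_master binary are taken from the gdb session quoted further down:

    mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master   # valid: both sockets have cores 0-5
    mpiexec -np 1 --host loki --slot-list 0:1-7,1:0-7 spawn_master   # invalid: cores 6 and 7 do not exist
]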
On Wednesday, January 11, 2017, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi Gilles,
>
> thank you very much for your help. What does an incorrect slot list
> mean? My machine has two 6-core processors, so I specified
> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it is not allowed
> to specify more slots than available, to specify fewer slots than
> available, or to specify more slots than needed for the processes?
>
> Kind regards
>
> Siegmar
>
> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
>
>> Siegmar,
>>
>> I was able to reproduce the issue on my VM
>> (no need for a real heterogeneous cluster here).
>>
>> I will keep digging tomorrow.
>> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
>> with a very unfriendly error message.
>> Right now, the 4th spawned task crashes, so this is a different issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> r...@open-mpi.org wrote:
>> I think there is some relevant discussion here:
>> https://github.com/open-mpi/ompi/issues/1569
>>
>> It looks like Gilles had (at least at one point) a fix for master when
>> enable-heterogeneous, but I don't know if that was committed.
>>
>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>
>>> Hi Siegmar,
>>>
>>> You have some config parameters I wasn't trying that may have some
>>> impact. I'll give it a try with these parameters.
>>>
>>> This should be enough info for now.
>>>
>>> Thanks,
>>>
>>> Howard
>>>
>>> 2017-01-09 0:59 GMT-07:00 Siegmar Gross
>>> <siegmar.gr...@informatik.hs-fulda.de>:
>>>
>>> Hi Howard,
>>>
>>> I use the following commands to build and install the package.
>>> ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
>>> Linux machine.
>>>
>>> mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>> cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>>>
>>> ../openmpi-2.0.2rc3/configure \
>>>   --prefix=/usr/local/openmpi-2.0.2_64_cc \
>>>   --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>>>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>>>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>>>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>>>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>>>   CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>>>   CPP="cpp" CXXCPP="cpp" \
>>>   --enable-mpi-cxx \
>>>   --enable-mpi-cxx-bindings \
>>>   --enable-cxx-exceptions \
>>>   --enable-mpi-java \
>>>   --enable-heterogeneous \
>>>   --enable-mpi-thread-multiple \
>>>   --with-hwloc=internal \
>>>   --without-verbs \
>>>   --with-wrapper-cflags="-m64 -mt" \
>>>   --with-wrapper-cxxflags="-m64" \
>>>   --with-wrapper-fcflags="-m64" \
>>>   --with-wrapper-ldflags="-mt" \
>>>   --enable-debug \
>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>
>>> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>> rm -r /usr/local/openmpi-2.0.2_64_cc.old
>>> mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
>>> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
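[The configure line above includes --enable-heterogeneous, the option that
the pointer to https://github.com/open-mpi/ompi/issues/1569 earlier in the
thread concerns. As a side note, the options an installed build was actually
configured with can be inspected with ompi_info; the exact label strings it
prints may differ between Open MPI versions, so the grep patterns below are
only a sketch:

    ompi_info | grep -i hetero
    ompi_info --all | grep -i "configure command"
]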
>>> I get a different error if I run the program with gdb.
>>>
>>> loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
>>> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-suse-linux".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://bugs.opensuse.org/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
>>> (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>> Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>> Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>> [New Thread 0x7ffff3b97700 (LWP 13582)]
>>> [New Thread 0x7ffff18a4700 (LWP 13583)]
>>> [New Thread 0x7ffff10a3700 (LWP 13584)]
>>> [New Thread 0x7fffebbba700 (LWP 13585)]
>>> Detaching after fork from child process 13586.
>>>
>>> Parent process 0 running on loki
>>> I create 4 slave processes
>>>
>>> Detaching after fork from child process 13589.
>>> Detaching after fork from child process 13590.
>>> Detaching after fork from child process 13591.
>>> [loki:13586] OPAL ERROR: Timeout in file
>>> ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
>>> [loki:13586] *** An error occurred in MPI_Comm_spawn
>>> [loki:13586] *** reported by process [2873294849,0]
>>> [loki:13586] *** on communicator MPI_COMM_WORLD
>>> [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
>>> [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [loki:13586] ***    and potentially your MPI job)
>>> [Thread 0x7fffebbba700 (LWP 13585) exited]
>>> [Thread 0x7ffff10a3700 (LWP 13584) exited]
>>> [Thread 0x7ffff18a4700 (LWP 13583) exited]
>>> [Thread 0x7ffff3b97700 (LWP 13582) exited]
>>> [Inferior 1 (process 13567) exited with code 016]
>>> Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
>>> (gdb) bt
>>> No stack.
>>> (gdb)
>>>
>>> Do you need anything else?
>>>
>>> Kind regards
>>>
>>> Siegmar
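[The spawn_master source itself is not included in the thread. For context,
the call that times out above is MPI_Comm_spawn; a minimal sketch of a master
that spawns 4 workers, matching the "I create 4 slave processes" output, might
look like the following. The worker binary name spawn_slave and everything
beyond the MPI calls are assumptions, not Siegmar's actual test program:

    /* spawn_sketch.c - minimal MPI_Comm_spawn example (illustrative only).
       Build: mpicc spawn_sketch.c -o spawn_sketch
       Run:   mpiexec -np 1 ./spawn_sketch                                  */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, len;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);
        printf("Parent process %d running on %s\n", rank, name);
        printf("I create 4 slave processes\n");

        /* In the failing runs reported above, this call aborts with
           "OPAL ERROR: Timeout ... pmix_base_fns.c" after a timeout.
           "spawn_slave" is a hypothetical worker binary.                   */
        MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }
]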
>>> On 08.01.2017 at 17:02, Howard Pritchard wrote:
>>>
>>> Hi Siegmar,
>>>
>>> Could you post the configure options you use when building 2.0.2rc3?
>>> Maybe that will help in trying to reproduce the segfault you are
>>> observing.
>>>
>>> Howard
>>>
>>> 2017-01-07 2:30 GMT-07:00 Siegmar Gross
>>> <siegmar.gr...@informatik.hs-fulda.de>:
>>>
>>> Hi,
>>>
>>> I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
>>> Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
>>> I still get the same error that I reported for rc2.
>>>
>>> I would be grateful if somebody could fix the problem before
>>> releasing the final version. Thank you very much for any help
>>> in advance.
>>>
>>> Kind regards
>>>
>>> Siegmar
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users