Hi Siegmar,

Could you please rerun the spawn_slave program with 4 processes? Your original traceback indicates a failure in the barrier in the slave program. I'd like to see whether the same barrier failure shows up when you run the slave program standalone with 4 processes.
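In case it helps, here is a minimal stand-in for that standalone check. This is not the actual spawn_slave source (the file name barrier_check.c and the code itself are hypothetical); it only exercises the path the traceback points at: MPI_Init followed by MPI_Barrier. Note that with more than one process on the node, a real shared-memory or TCP BTL has to be selected, whereas the 1-process run below gets by with the self BTL alone.

    /* barrier_check.c -- hypothetical stand-in for spawn_slave, not the
     * original 2013 source. It exercises MPI_Init followed by a barrier
     * on MPI_COMM_WORLD, which is where the traceback points.
     *
     * Build and run standalone with 4 processes, e.g.:
     *   mpicc barrier_check.c -o barrier_check
     *   mpiexec -np 4 --host loki --slot-list 0:0-5,1:0-5 barrier_check
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("Slave process %d of %d running on %s\n", rank, size, host);

        /* The reported failure happens in the barrier once more than one
         * process takes part, i.e. once an sm/vader/tcp BTL is needed. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }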
Thanks,

Howard


2017-01-03 0:32 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

> Hi Howard,
>
> thank you very much for trying to solve my problem. I haven't
> changed the programs since 2013, so you are using the correct
> version. The program works as expected with the master trunk, as
> you can see at the bottom of this email from my last mail. The
> slave program works when I launch it directly.
>
> loki spawn 122 mpicc --showme
> cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags -L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi
> loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
> Open MPI: 2.0.2rc2
> C compiler absolute: /opt/solstudio12.5b/bin/cc
> loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
> [loki:05572] mca: base: components_register: registering framework btl components
> [loki:05572] mca: base: components_register: found loaded component self
> [loki:05572] mca: base: components_register: component self register function successful
> [loki:05572] mca: base: components_register: found loaded component sm
> [loki:05572] mca: base: components_register: component sm register function successful
> [loki:05572] mca: base: components_register: found loaded component tcp
> [loki:05572] mca: base: components_register: component tcp register function successful
> [loki:05572] mca: base: components_register: found loaded component vader
> [loki:05572] mca: base: components_register: component vader register function successful
> [loki:05572] mca: base: components_open: opening btl components
> [loki:05572] mca: base: components_open: found loaded component self
> [loki:05572] mca: base: components_open: component self open function successful
> [loki:05572] mca: base: components_open: found loaded component sm
> [loki:05572] mca: base: components_open: component sm open function successful
> [loki:05572] mca: base: components_open: found loaded component tcp
> [loki:05572] mca: base: components_open: component tcp open function successful
> [loki:05572] mca: base: components_open: found loaded component vader
> [loki:05572] mca: base: components_open: component vader open function successful
> [loki:05572] select: initializing btl component self
> [loki:05572] select: init of component self returned success
> [loki:05572] select: initializing btl component sm
> [loki:05572] select: init of component sm returned failure
> [loki:05572] mca: base: close: component sm closed
> [loki:05572] mca: base: close: unloading component sm
> [loki:05572] select: initializing btl component tcp
> [loki:05572] select: init of component tcp returned success
> [loki:05572] select: initializing btl component vader
> [loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with. Disabling vader.
> [loki:05572] select: init of component vader returned failure
> [loki:05572] mca: base: close: component vader closed
> [loki:05572] mca: base: close: unloading component vader
> [loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node loki
> Slave process 0 of 1 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> [loki:05572] mca: base: close: component self closed
> [loki:05572] mca: base: close: unloading component self
> [loki:05572] mca: base: close: component tcp closed
> [loki:05572] mca: base: close: unloading component tcp
> loki spawn 125
>
>
> Kind regards, and thank you very much once more
>
> Siegmar
>
> On 03.01.2017 at 00:17, Howard Pritchard wrote:
>
>> Hi Siegmar,
>>
>> I've attempted to reproduce this using the gnu compilers and
>> the version of the test program(s) you posted earlier in 2016,
>> but am unable to reproduce the problem.
>>
>> Could you double-check that the slave program can be
>> successfully run when launched directly by mpirun/mpiexec?
>> It might also help to use --mca btl_base_verbose 10 when
>> running the slave program standalone.
>>
>> Thanks,
>>
>> Howard
>>
>>
>> 2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>
>> Hi,
>>
>> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
>> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
>> I get an error when I run one of my programs. Everything works as
>> expected with openmpi-master-201612232109-67a08e8. The program
>> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>>
>> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>> Open MPI: 2.0.2rc2
>> C compiler absolute: /opt/solstudio12.5b/bin/cc
>>
>> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>> Parent process 0 running on loki
>> I create 4 slave processes
>>
>> --------------------------------------------------------------------------
>> A system call failed during shared memory initialization that should
>> not have. It is likely that your MPI job will now either abort or
>> experience performance degradation.
>>
>> Local host: loki
>> System call: open(2)
>> Error: No such file or directory (errno 2)
>> --------------------------------------------------------------------------
>> [loki:17855] *** Process received signal ***
>> [loki:17855] Signal: Segmentation fault (11)
>> [loki:17855] Signal code: Address not mapped (1)
>> [loki:17855] Failing at address: 0x8
>> [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
>> [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
>> [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
>> [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
>> [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
>> [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
>> [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
>> [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
>> [loki:17855] [ 8] [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
>> /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
>> [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
>> [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
>> [loki:17855] [11] spawn_slave[0x4009cf]
>> [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
>> [loki:17855] [13] spawn_slave[0x400892]
>> [loki:17855] *** End of error message ***
>> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[55817,2],0]) is on host: loki
>> Process 2 ([[55817,2],1]) is on host: unknown!
>> BTLs attempted: self sm tcp vader
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment problems.
>> This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> ompi_dpm_dyn_init() failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> loki spawn 146
>>
>>
>> loki spawn 120 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>> Open MPI: 2.0.2a1
>> C compiler absolute: /opt/solstudio12.5b/bin/cc
>> loki spawn 121 which mpiexec
>> /usr/local/openmpi-2.1.0_64_cc/bin/mpiexec
>> loki spawn 122 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>> Parent process 0 running on loki
>> I create 4 slave processes
>>
>> [loki:21301] OPAL ERROR: Timeout in file ../../../../openmpi-v2.x-201612232156-5ce66b0/opal/mca/pmix/base/pmix_base_fns.c at line 195
>> [loki:21301] *** An error occurred in MPI_Comm_spawn
>> [loki:21301] *** reported by process [3431727105,0]
>> [loki:21301] *** on communicator MPI_COMM_WORLD
>> [loki:21301] *** MPI_ERR_UNKNOWN: unknown error
>> [loki:21301] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [loki:21301] *** and potentially your MPI job)
>> loki spawn 123
>>
>>
>> loki spawn 111 ompi_info | grep -e "Open MPI:" -e "C compiler"
>> Open MPI: 3.0.0a1
>> C compiler: cc
>> C compiler absolute: /opt/solstudio12.5b/bin/cc
>> C compiler family name: SUN
>> C compiler version: 0x5140
>> loki spawn 111 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>> Parent process 0 running on loki
>> I create 4 slave processes
>>
>> Parent process 0: tasks in MPI_COMM_WORLD: 1
>> tasks in COMM_CHILD_PROCESSES local group: 1
>> tasks in COMM_CHILD_PROCESSES remote group: 4
>>
>> Slave process 1 of 4 running on loki
>> Slave process 3 of 4 running on loki
>> Slave process 0 of 4 running on loki
>> Slave process 2 of 4 running on loki
>> spawn_slave 2: argv[0]: spawn_slave
>> spawn_slave 3: argv[0]: spawn_slave
>> spawn_slave 0: argv[0]: spawn_slave
>> spawn_slave 1: argv[0]: spawn_slave
>> loki spawn 112
>>
>>
>> I would be grateful if somebody could fix these problems. Thank you
>> very much in advance for any help.
>>
>>
>> Kind regards
>>
>> Siegmar
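[Editor's note: the COMM_CHILD_PROCESSES group sizes in the successful 3.0.0a1 run above match the usual MPI_Comm_spawn pattern. For readers without the original sources, here is a minimal sketch of what the master side presumably does; spawn_check.c is hypothetical, not Siegmar's actual spawn_master, and it assumes the slave binary is named spawn_slave and is findable by the runtime.]

    /* spawn_check.c -- hypothetical sketch of the spawn_master pattern,
     * not the original source. Spawns 4 slave processes and reports the
     * local and remote group sizes of the resulting intercommunicator.
     *
     *   mpicc spawn_check.c -o spawn_check
     *   mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_check
     */
    #include <stdio.h>
    #include <mpi.h>

    #define NUM_SLAVES 4

    int main(int argc, char *argv[])
    {
        MPI_Comm child_comm;
        int world_size, local_size, remote_size;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Spawn 4 copies of the slave; this is the call that fails with
         * 2.0.2rc2 (segfault in a broadcast during connect/accept) and
         * times out with the v2.x snapshot. */
        MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                       MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                       &child_comm, MPI_ERRCODES_IGNORE);

        MPI_Comm_size(child_comm, &local_size);         /* parent group  */
        MPI_Comm_remote_size(child_comm, &remote_size); /* spawned group */
        printf("tasks in MPI_COMM_WORLD: %d\n"
               "tasks in child intercomm local group: %d\n"
               "tasks in child intercomm remote group: %d\n",
               world_size, local_size, remote_size);

        MPI_Comm_disconnect(&child_comm);
        MPI_Finalize();
        return 0;
    }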