Hi Siegmar,

I've attempted to reproduce this using the GNU compilers and the
version of the test program(s) you posted earlier in 2016, but I am
unable to reproduce the problem.
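For anyone following along, the test is essentially a minimal
MPI_Comm_spawn master/slave pair along these lines. This is only a
sketch reconstructed from the output quoted below, not Siegmar's
actual source, so details may differ:

  /* spawn_master.c - minimal sketch of the parent (reconstructed, not the original) */
  #include <stdio.h>
  #include <mpi.h>

  #define NUM_SLAVES 4

  int main(int argc, char *argv[])
  {
      MPI_Comm child;
      int rank, world, local, remote, len;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);
      printf("Parent process %d running on %s\n", rank, host);
      printf("  I create %d slave processes\n", NUM_SLAVES);

      /* spawn NUM_SLAVES copies of the slave binary as a new job */
      MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES, MPI_INFO_NULL,
                     0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

      MPI_Comm_size(MPI_COMM_WORLD, &world);
      MPI_Comm_size(child, &local);          /* local group of the intercomm  */
      MPI_Comm_remote_size(child, &remote);  /* remote group = spawned slaves */
      printf("Parent process %d: tasks in MPI_COMM_WORLD:                    %d\n"
             "                  tasks in COMM_CHILD_PROCESSES local group:  %d\n"
             "                  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
             rank, world, local, remote);

      MPI_Comm_disconnect(&child);
      MPI_Finalize();
      return 0;
  }

  /* spawn_slave.c - minimal sketch of the child (reconstructed, not the original) */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      MPI_Comm parent;
      int rank, size, len, i;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_get_parent(&parent);   /* MPI_COMM_NULL when run standalone */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(host, &len);
      printf("Slave process %d of %d running on %s\n", rank, size, host);
      for (i = 0; i < argc; ++i)
          printf("spawn_slave %d: argv[%d]: %s\n", rank, i, argv[i]);

      if (parent != MPI_COMM_NULL)
          MPI_Comm_disconnect(&parent);
      MPI_Finalize();
      return 0;
  }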

Could you double-check that the slave program runs successfully when
launched directly by mpirun/mpiexec? It might also help to add
--mca btl_base_verbose 10 when running the slave program standalone.
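For example, something like this (a sketch, not the exact command I
used; adjust -np and any host/slot options to match your setup):

  mpiexec -np 4 --mca btl_base_verbose 10 spawn_slave

If the sm/vader BTLs come up cleanly there, that would suggest the
problem is in the dynamic-process (spawn) path rather than in the
shared-memory setup itself.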

Thanks,

Howard



2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
> I get an error when I run one of my programs. Everything works as
> expected with openmpi-master-201612232109-67a08e8. The program
> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>
> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>                 Open MPI: 2.0.2rc2
>      C compiler absolute: /opt/solstudio12.5b/bin/cc
>
>
> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> --------------------------------------------------------------------------
> A system call failed during shared memory initialization that should
> not have.  It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  loki
>   System call: open(2)
>   Error:       No such file or directory (errno 2)
> --------------------------------------------------------------------------
> [loki:17855] *** Process received signal ***
> [loki:17855] Signal: Segmentation fault (11)
> [loki:17855] Signal code: Address not mapped (1)
> [loki:17855] Failing at address: 0x8
> [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
> [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
> [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
> [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
> [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
> [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
> [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
> [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
> [loki:17855] [ 8] [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
> [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
> [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
> [loki:17855] [11] spawn_slave[0x4009cf]
> [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
> [loki:17855] [13] spawn_slave[0x400892]
> [loki:17855] *** End of error message ***
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[55817,2],0]) is on host: loki
>   Process 2 ([[55817,2],1]) is on host: unknown!
>   BTLs attempted: self sm tcp vader
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> loki spawn 146
>
>
>
>
>
>
>
> loki spawn 120 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>                 Open MPI: 2.0.2a1
>      C compiler absolute: /opt/solstudio12.5b/bin/cc
> loki spawn 121 which mpiexec
> /usr/local/openmpi-2.1.0_64_cc/bin/mpiexec
> loki spawn 122 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> [loki:21301] OPAL ERROR: Timeout in file ../../../../openmpi-v2.x-201612232156-5ce66b0/opal/mca/pmix/base/pmix_base_fns.c at line 195
> [loki:21301] *** An error occurred in MPI_Comm_spawn
> [loki:21301] *** reported by process [3431727105,0]
> [loki:21301] *** on communicator MPI_COMM_WORLD
> [loki:21301] *** MPI_ERR_UNKNOWN: unknown error
> [loki:21301] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [loki:21301] ***    and potentially your MPI job)
> loki spawn 123
>
>
>
>
>
>
> loki spawn 111 ompi_info | grep -e "Open MPI:" -e "C compiler"
>                 Open MPI: 3.0.0a1
>               C compiler: cc
>      C compiler absolute: /opt/solstudio12.5b/bin/cc
>   C compiler family name: SUN
>       C compiler version: 0x5140
> loki spawn 111 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD:                    1
>                   tasks in COMM_CHILD_PROCESSES local group:  1
>                   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> Slave process 1 of 4 running on loki
> Slave process 3 of 4 running on loki
> Slave process 0 of 4 running on loki
> Slave process 2 of 4 running on loki
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 3: argv[0]: spawn_slave
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> loki spawn 112
>
>
> I would be grateful if somebody could fix the problems. Thank you
> very much in advance for any help.
>
>
> Kind regards
>
> Siegmar
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
