Hi Siegmar,

Could you please rerun the spawn_slave program with 4 processes?
Your original traceback indicates a failure in the barrier in the slave
program.  I'd like to see whether the barrier failure also shows up when
you run the slave program standalone with 4 processes.
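
For example, something along these lines should do (adjust the host and
slot-list options to your setup; add --oversubscribe if mpiexec complains
about available slots):

  mpiexec -np 4 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave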

Thanks,

Howard


2017-01-03 0:32 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

> Hi Howard,
>
> thank you very much for trying to solve my problem. I haven't
> changed the programs since 2013, so you are using the correct
> version. The program works as expected with the master trunk, as
> you can see at the bottom of this email (copied from my last mail).
> The slave program works when I launch it directly.
>
> loki spawn 122 mpicc --showme
> cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath
> -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags
> -L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi
> loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>                 Open MPI: 2.0.2rc2
>      C compiler absolute: /opt/solstudio12.5b/bin/cc
> loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
> [loki:05572] mca: base: components_register: registering framework btl
> components
> [loki:05572] mca: base: components_register: found loaded component self
> [loki:05572] mca: base: components_register: component self register
> function successful
> [loki:05572] mca: base: components_register: found loaded component sm
> [loki:05572] mca: base: components_register: component sm register
> function successful
> [loki:05572] mca: base: components_register: found loaded component tcp
> [loki:05572] mca: base: components_register: component tcp register
> function successful
> [loki:05572] mca: base: components_register: found loaded component vader
> [loki:05572] mca: base: components_register: component vader register
> function successful
> [loki:05572] mca: base: components_open: opening btl components
> [loki:05572] mca: base: components_open: found loaded component self
> [loki:05572] mca: base: components_open: component self open function
> successful
> [loki:05572] mca: base: components_open: found loaded component sm
> [loki:05572] mca: base: components_open: component sm open function
> successful
> [loki:05572] mca: base: components_open: found loaded component tcp
> [loki:05572] mca: base: components_open: component tcp open function
> successful
> [loki:05572] mca: base: components_open: found loaded component vader
> [loki:05572] mca: base: components_open: component vader open function
> successful
> [loki:05572] select: initializing btl component self
> [loki:05572] select: init of component self returned success
> [loki:05572] select: initializing btl component sm
> [loki:05572] select: init of component sm returned failure
> [loki:05572] mca: base: close: component sm closed
> [loki:05572] mca: base: close: unloading component sm
> [loki:05572] select: initializing btl component tcp
> [loki:05572] select: init of component tcp returned success
> [loki:05572] select: initializing btl component vader
> [loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with. Disabling vader.
> [loki:05572] select: init of component vader returned failure
> [loki:05572] mca: base: close: component vader closed
> [loki:05572] mca: base: close: unloading component vader
> [loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node
> loki
> Slave process 0 of 1 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> [loki:05572] mca: base: close: component self closed
> [loki:05572] mca: base: close: unloading component self
> [loki:05572] mca: base: close: component tcp closed
> [loki:05572] mca: base: close: unloading component tcp
> loki spawn 125
>
>
> Kind regards and thank you very much once more
>
> Siegmar
>
> On 03.01.2017 at 00:17, Howard Pritchard wrote:
>
>> Hi Siegmar,
>>
>> I've attempted to reproduce this using the GNU compilers and the
>> version of the test programs you posted earlier in 2016, but I am
>> unable to reproduce the problem.
>>
>> Could you double check that the slave program can be
>> successfully run when launched directly by mpirun/mpiexec?
>> It might also help to use --mca btl_base_verbose 10 when
>> running the slave program standalone.
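>>
>> For reference, the slave program I'm testing against is essentially the
>> usual pattern below (just a sketch from memory; your posted version may
>> differ in the details):
>>
>> #include <stdio.h>
>> #include "mpi.h"
>>
>> int main (int argc, char *argv[])
>> {
>>   int  ntasks, mytask, namelen, i;
>>   char processor_name[MPI_MAX_PROCESSOR_NAME];
>>
>>   MPI_Init (&argc, &argv);
>>   MPI_Comm_rank (MPI_COMM_WORLD, &mytask);
>>   MPI_Comm_size (MPI_COMM_WORLD, &ntasks);
>>   MPI_Get_processor_name (processor_name, &namelen);
>>   /* report rank, size and host, then print the command line */
>>   printf ("Slave process %d of %d running on %s\n",
>>           mytask, ntasks, processor_name);
>>   for (i = 0; i < argc; ++i)
>>     printf ("spawn_slave %d: argv[%d]: %s\n", mytask, i, argv[i]);
>>   /* slaves synchronize before finalizing */
>>   MPI_Barrier (MPI_COMM_WORLD);
>>   MPI_Finalize ();
>>   return 0;
>> }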
>>
>> Thanks,
>>
>> Howard
>>
>>
>>
>> 2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>>
>>
>>     Hi,
>>
>>     I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
>>     Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
>>     I get an error when I run one of my programs. Everything works as
>>     expected with openmpi-master-201612232109-67a08e8. The program
>>     gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
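>>
>>     For reference, spawn_master essentially just spawns the slaves and
>>     reports the group sizes, roughly like the simplified sketch below
>>     (the real program prints a bit more):
>>
>>     #include <stdio.h>
>>     #include "mpi.h"
>>
>>     #define NUM_SLAVES 4
>>
>>     int main (int argc, char *argv[])
>>     {
>>       int      ntasks, local_size, remote_size;
>>       MPI_Comm COMM_CHILD_PROCESSES;
>>
>>       MPI_Init (&argc, &argv);
>>       MPI_Comm_size (MPI_COMM_WORLD, &ntasks);
>>       printf ("  I create %d slave processes\n", NUM_SLAVES);
>>       /* spawn NUM_SLAVES copies of the slave executable */
>>       MPI_Comm_spawn ("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
>>                       MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>>                       &COMM_CHILD_PROCESSES, MPI_ERRCODES_IGNORE);
>>       /* report local and remote group sizes of the intercommunicator */
>>       MPI_Comm_size (COMM_CHILD_PROCESSES, &local_size);
>>       MPI_Comm_remote_size (COMM_CHILD_PROCESSES, &remote_size);
>>       printf ("Parent process 0: tasks in MPI_COMM_WORLD:                    %d\n"
>>               "                  tasks in COMM_CHILD_PROCESSES local group:  %d\n"
>>               "                  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
>>               ntasks, local_size, remote_size);
>>       MPI_Finalize ();
>>       return 0;
>>     }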
>>
>>     loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>>                     Open MPI: 2.0.2rc2
>>          C compiler absolute: /opt/solstudio12.5b/bin/cc
>>
>>
>>     loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>>     Parent process 0 running on loki
>>       I create 4 slave processes
>>
>>     --------------------------------------------------------------------------
>>     A system call failed during shared memory initialization that should
>>     not have.  It is likely that your MPI job will now either abort or
>>     experience performance degradation.
>>
>>       Local host:  loki
>>       System call: open(2)
>>       Error:       No such file or directory (errno 2)
>>     --------------------------------------------------------------------------
>>     [loki:17855] *** Process received signal ***
>>     [loki:17855] Signal: Segmentation fault (11)
>>     [loki:17855] Signal code: Address not mapped (1)
>>     [loki:17855] Failing at address: 0x8
>>     [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
>>     [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
>>     [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
>>     [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
>>     [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
>>     [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
>>     [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
>>     [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
>>     [loki:17855] [ 8] [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
>>     /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
>>     [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
>>     [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
>>     [loki:17855] [11] spawn_slave[0x4009cf]
>>     [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
>>     [loki:17855] [13] spawn_slave[0x400892]
>>     [loki:17855] *** End of error message ***
>>     [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
>>     --------------------------------------------------------------------------
>>     At least one pair of MPI processes are unable to reach each other for
>>     MPI communications.  This means that no Open MPI device has indicated
>>     that it can be used to communicate between these processes.  This is
>>     an error; Open MPI requires that all MPI processes be able to reach
>>     each other.  This error can sometimes be the result of forgetting to
>>     specify the "self" BTL.
>>
>>       Process 1 ([[55817,2],0]) is on host: loki
>>       Process 2 ([[55817,2],1]) is on host: unknown!
>>       BTLs attempted: self sm tcp vader
>>
>>     Your MPI job is now going to abort; sorry.
>>     --------------------------------------------------------------------------
>>     *** An error occurred in MPI_Init
>>     *** on a NULL communicator
>>     *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>     ***    and potentially your MPI job)
>>     --------------------------------------------------------------------------
>>     It looks like MPI_INIT failed for some reason; your parallel process is
>>     likely to abort.  There are many reasons that a parallel process can
>>     fail during MPI_INIT; some of which are due to configuration or environment
>>     problems.  This failure appears to be an internal failure; here's some
>>     additional information (which may only be relevant to an Open MPI
>>     developer):
>>
>>       ompi_dpm_dyn_init() failed
>>       --> Returned "Unreachable" (-12) instead of "Success" (0)
>>     --------------------------------------------------------------------------
>>     loki spawn 146
>>
>>
>>
>>
>>
>>
>>
>>     loki spawn 120 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>>                     Open MPI: 2.0.2a1
>>          C compiler absolute: /opt/solstudio12.5b/bin/cc
>>     loki spawn 121 which mpiexec
>>     /usr/local/openmpi-2.1.0_64_cc/bin/mpiexec
>>     loki spawn 122 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>>     Parent process 0 running on loki
>>       I create 4 slave processes
>>
>>     [loki:21301] OPAL ERROR: Timeout in file ../../../../openmpi-v2.x-201612232156-5ce66b0/opal/mca/pmix/base/pmix_base_fns.c at line 195
>>     [loki:21301] *** An error occurred in MPI_Comm_spawn
>>     [loki:21301] *** reported by process [3431727105,0]
>>     [loki:21301] *** on communicator MPI_COMM_WORLD
>>     [loki:21301] *** MPI_ERR_UNKNOWN: unknown error
>>     [loki:21301] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>     [loki:21301] ***    and potentially your MPI job)
>>     loki spawn 123
>>
>>
>>
>>
>>
>>
>>     loki spawn 111 ompi_info | grep -e "Open MPI:" -e "C compiler"
>>                     Open MPI: 3.0.0a1
>>                   C compiler: cc
>>          C compiler absolute: /opt/solstudio12.5b/bin/cc
>>       C compiler family name: SUN
>>           C compiler version: 0x5140
>>     loki spawn 111 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>>     Parent process 0 running on loki
>>       I create 4 slave processes
>>
>>     Parent process 0: tasks in MPI_COMM_WORLD:                    1
>>                       tasks in COMM_CHILD_PROCESSES local group:  1
>>                       tasks in COMM_CHILD_PROCESSES remote group: 4
>>
>>     Slave process 1 of 4 running on loki
>>     Slave process 3 of 4 running on loki
>>     Slave process 0 of 4 running on loki
>>     Slave process 2 of 4 running on loki
>>     spawn_slave 2: argv[0]: spawn_slave
>>     spawn_slave 3: argv[0]: spawn_slave
>>     spawn_slave 0: argv[0]: spawn_slave
>>     spawn_slave 1: argv[0]: spawn_slave
>>     loki spawn 112
>>
>>
>>     I would be grateful if somebody could fix the problems. Thank you
>>     very much in advance for any help.
>>
>>
>>     Kind regards
>>
>>     Siegmar
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
