Hi Timur,

It seems that the yalla component was not found in your OMPI tree. Could it be that your mpirun is not the one from HPC-X? Please check that LD_LIBRARY_PATH, PATH, LD_PRELOAD, and OPAL_PREFIX all point into the HPC-X installation, so that the right mpirun is picked up.

Also, could you please check that yalla is present in the "ompi_info -l 9" output?
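For example, something along these lines (run on the login node) should show which mpirun is being picked up and whether the yalla pml is available; this is just a sketch, and the grep is only one way to spot the component line:

$ which mpirun
$ echo $OPAL_PREFIX
$ echo $PATH
$ echo $LD_LIBRARY_PATH
$ echo $LD_PRELOAD
$ ompi_info -l 9 | grep -i yalla

All of the paths should point into your hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 tree; if the grep comes back empty, that ompi_info (and most likely that mpirun) is not the HPC-X build.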
Thanks

On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <[email protected]> wrote:

> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ ssh node2
> Last login: Mon May 25 18:41:23
> node2$ ssh node3
> Last login: Mon May 25 16:25:01
> node3$ ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> In ompi-1.9 I do not have the no-tree-spawn problem.
>
> Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <[email protected]>:
>
> I can't speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don't have password-less ssh authorized between the compute nodes.
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <[email protected]> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> InfiniBand 4x FDR
>
> I have two problems:
>
> *1. I can not use mxm*:
>
> *1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello*
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node14
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
>
> *1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello*
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node5
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
>
> *2. I can not remove -mca plm_rsh_no_tree_spawn 1 from mpirun cmd line:*
> $mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile;
> OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ;
> export OPAL_PREFIX;
> PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ;
> export PATH ;
> LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ;
> DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ;
> export DYLD_LIBRARY_PATH ;
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted
> --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env"
> -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5"
> -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862"
> -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893"
> --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> Thank you for your comments.
>
> Best regards,
> Timur.
>
> _______________________________________________
> users mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>
> _______________________________________________
> users mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php

--
Kind Regards,
M.
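P.S. On the no-tree-spawn issue in (2): as Ralph wrote in the quoted message, tree spawn needs password-less ssh between the compute nodes themselves, not just down one chain from the login node. If it helps, here is a rough sketch for checking every pair non-interactively from the login node (it assumes bash and the same node names as in your mpirun line; BatchMode makes ssh fail instead of prompting for a password):

$ for src in node5 node14 node28 node29; do
    for dst in node5 node14 node28 node29; do
      [ "$src" = "$dst" ] && continue
      ssh -o BatchMode=yes "$src" "ssh -o BatchMode=yes $dst true" \
        && echo "$src -> $dst ok" || echo "$src -> $dst FAILED"
    done
  done

If Ralph's diagnosis is the cause, at least one of these pairs should report FAILED.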
