Hi Timur,

It seems that the yalla component was not found in your OMPI tree. Could it
be that your mpirun is not the one from HPC-X? Please check LD_LIBRARY_PATH,
PATH, LD_PRELOAD and OPAL_PREFIX to make sure they point to the right mpirun.

Also, could you please check that yalla is present in the ompi_info -l 9
output?
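
For example, something along these lines (just a sketch; the grep pattern
assumes the hpcx-v1.3.0 install path shown in your log):

$ which mpirun
$ echo $OPAL_PREFIX
$ echo $PATH | tr ':' '\n' | grep hpcx
$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep hpcx
$ ompi_info -l 9 | grep -i yalla
$ ompi_info -l 9 | grep -i mxm

If ompi_info does not list the yalla (pml) and mxm (mtl) components, the
mpirun/ompi_info being picked up is not the HPC-X build.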

Thanks

On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ssh node2
> Last login: Mon May 25 18:41:23
> node2$ssh node3
> Last login: Mon May 25 16:25:01
> node3$ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> With ompi-1.9 I do not have the no-tree-spawn problem.
>
>
> Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>
>   I can’t speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don’t have password-less ssh authorized between the compute nodes
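> (A quick non-interactive check between two of the compute nodes, using the
> node names from your mpirun line, would be something like:
>
> node5$ ssh -o BatchMode=yes node14 hostname
>
> If that prints the hostname without prompting for a password, the
> compute-to-compute ssh setup is fine.)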
>
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> InfiniBand 4x FDR
>
>
>
> I have two problems:
> *1. I cannot use mxm:*
> *1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello*
> --------------------------------------------------------------------------
>
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node14
> Framework: pml
> Component: yalla
>
> --------------------------------------------------------------------------
>
> *** An error occurred in MPI_Init
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
>
>
> *1.b $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello*
> --------------------------------------------------------------------------
>
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node5
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
>
> *2. I can not remove -mca plm_rsh_no_tree_spawn 1 from mpirun cmd line:*
> $mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile;
> OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ;
> export OPAL_PREFIX;
> PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ;
> export PATH ;
> LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ;
> DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ;
> export DYLD_LIBRARY_PATH ;
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted
> --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env"
> -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5"
> -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862"
> -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893"
> --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>
> --------------------------------------------------------------------------
>
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
>
> --------------------------------------------------------------------------
>
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
>
> Thank you for your comments.
>
> Best regards,
> Timur.
>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>
>
>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>



-- 

Kind Regards,

M.
