It is fine to recompile the OMPI shipped with HPCX in order to apply site
defaults (for example, the choice of job scheduler; the OMPI in HPCX is
compiled with ssh support only, etc.).
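For reference, here is a minimal rebuild sketch; the source path, the install
prefix and the --with-slurm flag are placeholders/assumptions, so adjust them
to whatever your site actually uses:

    # rebuild the OMPI shipped with HPCX against the bundled MXM and your
    # site's launcher/scheduler (paths below are only examples)
    cd /path/to/openmpi-1.8.5              # source tree matching the OMPI in your HPCX bundle
    ./configure --prefix=$HOME/ompi-hpcx \
                --with-mxm=$HPCX_MXM_DIR \
                --with-slurm               # or whatever scheduler support your site needs
    make -j8 && make install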

If the ssh launcher is working on your system, then the OMPI from HPCX should
work as well.
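A quick way to sanity-check the ssh launcher on its own, independent of
MXM/yalla (the hostnames here are just the ones from your earlier runs,
substitute your own):

    # run a plain command through the rsh/ssh launcher with verbose launch output
    $HPCX_MPI_DIR/bin/mpirun -np 2 -host node5,node153 \
        -mca plm rsh -mca plm_base_verbose 5 hostname

If that prints both hostnames, the launcher side is fine and the problem is in
the MXM/pml selection.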

Could you please send Alina (in CC) the command line and its output from the
hpcx/ompi failure?

Thanks


On Thu, May 28, 2015 at 7:33 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Is it normal to have to rebuild Open MPI from HPCX?
> Why don't the prebuilt binaries work?
>
>
>
>
> Thursday, May 28, 2015, 14:01 +03:00 from Alina Sklarevich <ali...@dev.mellanox.co.il>:
>
>   Thank you for this info.
>
> If 'yalla' now works for you, is there anything that is still wrong?
>
> Thanks,
> Alina.
>
> On Thu, May 28, 2015 at 10:21 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> I'm sorry for the delay.
>
> Here it is:
> (I used a 5-minute time limit)
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun
> -x LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm/debug/lib/libmxm.so
> -x MXM_LOG_LEVEL=data -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off
> --mca pml yalla --hostfile hostlist ./hello
> 1> hello_debugMXM_n-2_ppn-2.out 2> hello_debugMXM_n-2_ppn-2.err
>
> P.S.
> yalla works fine with the rebuilt ompi (configured with --with-mxm=$HPCX_MXM_DIR).
>
>
>
>
>
>
> Tuesday, May 26, 2015, 16:22 +03:00 from Alina Sklarevich <ali...@dev.mellanox.co.il>:
>
>   Hi Timur,
>
> HPCX has a debug version of MXM. Can you please add the following to your
> command line with pml yalla in order to use it and attach the output?
> "-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
>
> Also, could you please attach the entire output of
> "$HPCX_MPI_DIR/bin/ompi_info -a"
>
> Thank you,
> Alina.
>
> On Tue, May 26, 2015 at 3:39 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>
> Alina - could you please take a look?
> Thx
>
>
> ---------- Forwarded message ----------
> From: Timur Ismagilov <tismagi...@mail.ru>
> Date: Tue, May 26, 2015 at 12:40 PM
> Subject: Re[12]: [OMPI users] MXM problem
> To: Open MPI Users <us...@open-mpi.org>
> Cc: Mike Dubman <mi...@dev.mellanox.co.il>
>
>
> It does not work on a single node:
>
> 1) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm
> --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10
> -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out
>
>
> 2) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix
> $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca
> rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out
>
> I've attached yalla.out and cm_mxm.out to this email.
>
>
>
> Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   Does it work from a single node?
> Could you please run with the options below and attach the output?
>
>  -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose
> 10 --debug-daemons
>
> On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> 1. mxm_perf_test - OK.
> 2. no_tree_spawn  - OK.
> 3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the
> prebuilt ompi-1.8.5 from hpcx-v1.3.330).
> 3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm
> --prefix $HPCX_MPI_DIR ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node153
> Framework: mtl
> Component: mxm
> --------------------------------------------------------------------------
> [node5:113560] PML cm cannot be selected
> --------------------------------------------------------------------------
> No available pml components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort.  Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system.  You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --------------------------------------------------------------------------
> [node153:44440] PML cm cannot be selected
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[43917,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:110455] 1 more process has sent help message help-mca-base.txt /
> find-available:not-valid
> [login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [login:110455] 1 more process has sent help message help-mca-base.txt /
> find-available:none-found
>
> 3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix
> $HPCX_MPI_DIR ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node153
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node153:43979] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[44992,1],1]
>   Exit code:    1
> --------------------------------------------------------------------------
>
>
>
>
> host:$ echo $HPCX_MPI_DIR
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
>
> host:$ ompi_info | grep pml
>                  MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
>                  MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
>                  MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
>                  MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
>                  MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
>
> host: tests$ ompi_info | grep mtl
>                  MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
>
> P.S.
> Is there possibly an error in the FAQ?
> (http://www.open-mpi.org/faq/?category=openfabrics#mxm)
>
> 47. Does Open MPI support MXM?
> ............
> NOTE: Please note that the 'yalla' pml is available only from Open MPI
> v1.9 and above
> ...........
> But here we have (or seem to have...) yalla in ompi 1.8.5.
>
>
>
> Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   Hi Timur,
>
> Here it goes:
>
> wget
> ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>
> Please let me know if it works for you, and I will add the 1.5.4.1 MOFED
> build to the default distribution list.
>
> M
>
>
> On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Thanks a lot.
>
> Monday, May 25, 2015, 21:28 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   I will send you the link tomorrow.
>
> On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Where can I find MXM for OFED 1.5.4.1?
>
>
> Monday, May 25, 2015, 21:11 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   By the way, the OFED on your system is 1.5.4.1, while the HPCX in use is
> built for OFED 1.5.3.
>
> This looks like an ABI issue between the OFED versions.
>
> On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> I did as you said, but got an error:
>
>
> node1$ export MXM_IB_PORTS=mlx4_0:1
> node1$ ./mxm_perftest
> Waiting for connection...
> Accepted connection from 10.65.0.253
> [1432576262.370195] [node153:35388:0]         shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or directory.
> Won't use knem.
> Failed to create endpoint: No such device
>
> node2$ export MXM_IB_PORTS=mlx4_0:1
> node2$ ./mxm_perftest node1 -t send_lat
> [1432576262.367523] [node158:99366:0]         shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or directory.
> Won't use knem.
> Failed to create endpoint: No such device
>
>
>
>
> Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   scif is an OFA device from Intel.
> Can you please select the Mellanox device explicitly with
> 'export MXM_IB_PORTS=mlx4_0:1' and retry?
>
> On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hi Mike,
> this is what I have:
>
> $ echo $LD_LIBRARY_PATH | tr ":" "\n"
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>  +intel compiler paths
>
> $ echo $OPAL_PREFIX
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>
> I don't use LD_PRELOAD.
>
> In the attached file (ompi_info.out) you will find the output of the
> 'ompi_info -l 9' command.
>
> P.S.
> node1 $ ./mxm_perftest
> node2 $ ./mxm_perftest node1 -t send_lat
> [1432568685.067067] [node151:87372:0]         shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or directory.
> Won't use knem.         (I don't have knem)
> [1432568685.069699] [node151:87372:0]      ib_dev.c:531  MXM  WARN
> skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox
> device                               (???)
> Failed to create endpoint: No such device
>
> $  ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.10.600
>         node_guid:                      0002:c903:00a1:13b0
>         sys_image_guid:                 0002:c903:00a1:13b3
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       MT_1090120019
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               83
>                         port_lmc:               0x00
>
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>
> Best regards,
> Timur.
>
>
> Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   Hi Timur,
> It seems that the yalla component was not found in your OMPI tree.
> Could it be that your mpirun is not from HPCX? Can you please check
> LD_LIBRARY_PATH, PATH, LD_PRELOAD, and OPAL_PREFIX to verify that they
> point to the right mpirun?
>
> Also, could you please check that yalla is present in the 'ompi_info -l 9'
> output?
>
> Thanks
>
> On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ssh node2
> Last login: Mon May 25 18:41:23
> node2$ssh node3
> Last login: Mon May 25 16:25:01
> node3$ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> With ompi-1.9 I do not have the no-tree-spawn problem.
>
>
> Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>
>   I can’t speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don’t have password-less ssh authorized between the compute nodes
>
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> infiniband 4x FDR
>
>
>
> I have two problems:
> 1. I cannot use MXM:
> 1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29
> -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node14
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
>
>
> 1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca
> plm_rsh_no_tree_spawn 1 -np 4 ./hello
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node5
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all
> other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
>
> 2. I cannot remove '-mca plm_rsh_no_tree_spawn 1' from the mpirun command line:
> $mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile;
> OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ;
> export OPAL_PREFIX;
> PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ;
> export PATH ;
> LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ;
> DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ;
> export DYLD_LIBRARY_PATH ;
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted
> --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env"
> -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5"
> -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862"
> -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893"
> --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
>
>
> Thank you for your comments.
>
> Best regards,
> Timur.



-- 

Kind Regards,

M.
