I have already sent it.

On Thu, May 28, 2015 at 10:21 AM:
It is fine to recompile OMPI from HPCX to apply site defaults (choice of job scheduler, for example, or OMPI from HPCX compiled with ssh support only). If the ssh launcher is working on your system, then OMPI from HPCX should work as well. Could you please send Alina (in cc) the command line and its output from the hpcx/ompi failure?
Thanks

On Thu, May 28, 2015 at 7:33 PM, Timur Ismagilov <tismagilov@mail.ru> wrote:
Is it normal to rebuild Open MPI from HPCX?
Why don't the binaries work?
Thursday, May 28, 2015, 14:01 +03:00, from Alina Sklarevich <alinas@dev.mellanox.co.il>:
Thank you for this info. If 'yalla' now works for you, is there anything that is still wrong?
Thanks,
Alina.

On Thu, May 28, 2015 at 10:21 AM, Timur Ismagilov <tismagilov@mail.ru> wrote:
I'm sorry for the delay.
Here it is:
(I used 5 min time limit)
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun -x LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile hostlist ./hello 1> hello_debugMXM_n-2_ppn-2.out 2> hello_debugMXM_n-2_ppn-2.err

P.S.
yalla works fine with the rebuilt ompi: --with-mxm=$HPCX_MXM_DIR
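For reference, a minimal sketch of such a rebuild, assuming the usual autotools flow for the Open MPI source tree shipped with HPCX; the install prefix and -j value are illustrative, not taken from this thread, and $HPCX_MXM_DIR comes from the HPCX environment scripts:

```shell
# Hypothetical rebuild of the bundled Open MPI against the HPCX MXM tree
# (prefix and parallelism are illustrative).
./configure --prefix=$HOME/ompi-hpcx --with-mxm=$HPCX_MXM_DIR
make -j8 install
```

After installing, the rebuilt mpirun must come first on PATH (or be launched by absolute path) for the yalla pml to be picked up.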
Tuesday, May 26, 2015, 16:22 +03:00, from Alina Sklarevich <alinas@dev.mellanox.co.il>:
Hi Timur,

HPCX has a debug version of MXM. Can you please add the following to your command line with pml yalla in order to use it, and attach the output?
"-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
Also, could you please attach the entire output of "$HPCX_MPI_DIR/bin/ompi_info -a"?
Thank you,
Alina.

On Tue, May 26, 2015 at 3:39 PM, Mike Dubman <miked@dev.mellanox.co.il> wrote:
Alina - could you please take a look?
Thx

---------- Forwarded message ----------
From: Timur Ismagilov <tismagilov@mail.ru>
Date: Tue, May 26, 2015 at 12:40 PM
Subject: Re[12]: [OMPI users] MXM problem
To: Open MPI Users <users@open-mpi.org>
Cc: Mike Dubman <miked@dev.mellanox.co.il>
It does not work on a single node:

1) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out

2) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out
I've attached the yalla.out and cm_mxm.out to this email.
Tuesday, May 26, 2015, 11:54 +03:00, from Mike Dubman <miked@dev.mellanox.co.il>:
Does it work on a single node? Could you please run with the options below and attach the output?
-mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons

On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov <tismagilov@mail.ru> wrote:
1. mxm_perf_test - OK.
2. no_tree_spawn - OK.
3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the prebuilt ompi-1.8.5 from hpcx-v1.3.330).
3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: node153
Framework: mtl
Component: mxm
--------------------------------------------------------------------------
[node5:113560] PML cm cannot be selected
--------------------------------------------------------------------------
No available pml components were found!
This means that there are no components of this type installed on your
system or all the components reported that they could not be used.
This is a fatal error; your MPI process is likely to abort. Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system. You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------
[node153:44440] PML cm cannot be selected
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[43917,1],0]
Exit code: 1
--------------------------------------------------------------------------
[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:none-found
3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix $HPCX_MPI_DIR ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: node153
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node153:43979] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44992,1],1]
Exit code: 1
--------------------------------------------------------------------------
host:$ echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
host:$ ompi_info | grep pml
MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
host: tests$ ompi_info | grep mtl
MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
P.S.
possible error in the FAQ? (http://www.open-mpi.org/faq/?category=openfabrics#mxm)

47. Does Open MPI support MXM?
............
NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and above
...........
But here we have (or do we?) yalla in ompi 1.8.5.
Tuesday, May 26, 2015, 9:53 +03:00, from Mike Dubman <miked@dev.mellanox.co.il>:
Hi Timur,

Here it goes:

wget ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz

Please let me know if it works for you and I will add the 1.5.4.1 MOFED to the default distribution list.
M

On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov <tismagilov@mail.ru> wrote:
will send you the link tomorrow.

On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov <tismagilov@mail.ru> wrote:
Where can I find MXM for OFED 1.5.4.1?
Monday, May 25, 2015, 21:11 +03:00, from Mike Dubman <miked@dev.mellanox.co.il>:
By the way, the OFED on your system is 1.5.4.1, while the HPCX in use is built for OFED 1.5.3. This seems like an ABI issue between OFED versions.

On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov <tismagilov@mail.ru> wrote:
I did as you said, but got an error:
node1$ export MXM_IB_PORTS=mlx4_0:1
node1$ ./mxm_perftest
Waiting for connection...
Accepted connection from 10.65.0.253
[1432576262.370195] [node153:35388:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
Failed to create endpoint: No such device
node2$ export MXM_IB_PORTS=mlx4_0:1
node2$ ./mxm_perftest node1 -t send_lat
[1432576262.367523] [node158:99366:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
Failed to create endpoint: No such device
Monday, May 25, 2015, 20:31 +03:00, from Mike Dubman <miked@dev.mellanox.co.il>:
scif is an OFA device from Intel. Can you please select "export MXM_IB_PORTS=mlx4_0:1" explicitly and retry?

On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagilov@mail.ru> wrote:
Hi, Mike,
that is what I have:

$ echo $LD_LIBRARY_PATH | tr ":" "\n"
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
+intel compiler paths
$ echo $OPAL_PREFIX
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
I don't use LD_PRELOAD.
In the attached file (ompi_info.out) you will find the output of the "ompi_info -l 9" command.
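A tiny helper to automate the kind of check being discussed here: does the mpirun found on PATH actually live under OPAL_PREFIX? A mismatch is a common cause of "component not found" failures. The function name is hypothetical; this is a sketch of the mismatch test, not anything shipped with HPCX:

```shell
# Report whether a given mpirun path lies under a given install prefix.
# If they disagree, the mpirun being run is not the one from the HPCX tree.
check_prefix_match() {
    mpirun_path="$1"   # e.g. "$(command -v mpirun)"
    opal_prefix="$2"   # e.g. "$OPAL_PREFIX"
    case "$mpirun_path" in
        "$opal_prefix"/*) echo "match" ;;
        *)                echo "mismatch" ;;
    esac
}
```

On the system above one would run, for example, `check_prefix_match "$(command -v mpirun)" "$OPAL_PREFIX"` and expect "match".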
P.S.
node1 $ ./mxm_perftest
node2 $ ./mxm_perftest node1 -t send_lat
[1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
[1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
Failed to create endpoint: No such device
$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.10.600
node_guid: 0002:c903:00a1:13b0
sys_image_guid: 0002:c903:00a1:13b3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 83
port_lmc: 0x00
port: 2
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
Best regards,
Timur.
Monday, May 25, 2015, 19:39 +03:00, from Mike Dubman <miked@dev.mellanox.co.il>:
Hi Timur,

It seems that the yalla component was not found in your OMPI tree. Can it be that your mpirun is not from HPCX? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX to confirm they point to the right mpirun? Also, could you please check that yalla is present in the "ompi_info -l 9" output?
Thanks

On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagilov@mail.ru> wrote:
I can password-less ssh to all nodes:
base$ ssh node1
node1$ssh node2
Last login: Mon May 25 18:41:23
node2$ssh node3
Last login: Mon May 25 16:25:01
node3$ssh node4
Last login: Mon May 25 16:27:04
node4$
Is this correct?
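The hop-by-hop check above can be generated for any node list; a small sketch (the function name and node names are illustrative, and tree spawn also relies on direct compute-to-compute ssh, which these chained hops approximate):

```shell
# Print the pairwise commands needed to confirm password-less ssh
# along a chain of nodes; each printed command should run without
# a password prompt and exit with status 0.
ssh_chain_cmds() {
    prev=""
    for host in "$@"; do
        [ -n "$prev" ] && echo "ssh $prev ssh $host true"
        prev="$host"
    done
}
```

For example, `ssh_chain_cmds node1 node2 node3` prints the two checks "ssh node1 ssh node2 true" and "ssh node2 ssh node3 true".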
In ompi-1.9 I do not have the no-tree-spawn problem.
Monday, May 25, 2015, 9:04 -07:00, from Ralph Castain <rhc@open-mpi.org>:
I can't speak to the mxm problem, but the no-tree-spawn issue indicates that you don't have password-less ssh authorized between the compute nodes.

On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagilov@mail.ru> wrote:
Hello!
I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
OFED-1.5.4.1;
CentOS release 6.2;
infiniband 4x FDR
I have two problems:
1. I cannot use mxm:
1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: node14
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[9372,1],2]
Exit code: 1
--------------------------------------------------------------------------
[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failur
e
1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: node5
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[9619,1],0]
Exit code: 1
--------------------------------------------------------------------------
[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
$mpirun -host node5,node14,node28,node29 -np 4 ./hello
sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sourc
es/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/o
ivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PA
TH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi
-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nice
vt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; expor
t DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/o
mpi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_es
s_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tc
p://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,8
3.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-s
pawn'
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
Thank you for your comments.
Best regards,
Timur.
_______________________________________________
users mailing list
users@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php

--
Kind Regards,
M.
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26965.php

--
Kind Regards,
M.