Hi Mike, this is what I have:

$ echo $LD_LIBRARY_PATH | tr ":" "\n"
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
(plus the Intel compiler paths)
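As a quick sanity check I also verify that every entry in that list actually exists on disk (a generic sketch, not specific to HPC-X; the check_ld_path helper name is just for illustration):

```shell
# Print each LD_LIBRARY_PATH entry and flag any directory that is missing.
check_ld_path() {
    echo "$1" | tr ':' '\n' | while IFS= read -r dir; do
        [ -n "$dir" ] || continue          # skip empty entries from "::"
        if [ -d "$dir" ]; then
            echo "ok $dir"
        else
            echo "MISSING $dir"
        fi
    done
}

check_ld_path "$LD_LIBRARY_PATH"
```

All of the HPC-X directories above come back "ok" in my environment.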
$ echo $OPAL_PREFIX
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8

I don't use LD_PRELOAD. In the attached file (ompi_info.out) you will find the output of the "ompi_info -l 9" command.

P.S.
node1 $ ./mxm_perftest
node2 $ ./mxm_perftest node1 -t send_lat
[1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
[1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
Failed to create endpoint: No such device

$ ibv_devinfo
hca_id: mlx4_0
        transport:       InfiniBand (0)
        fw_ver:          2.10.600
        node_guid:       0002:c903:00a1:13b0
        sys_image_guid:  0002:c903:00a1:13b3
        vendor_id:       0x02c9
        vendor_part_id:  4099
        hw_ver:          0x0
        board_id:        MT_1090120019
        phys_port_cnt:   2
        port: 1
                state:      PORT_ACTIVE (4)
                max_mtu:    4096 (5)
                active_mtu: 4096 (5)
                sm_lid:     1
                port_lid:   83
                port_lmc:   0x00
        port: 2
                state:      PORT_DOWN (1)
                max_mtu:    4096 (5)
                active_mtu: 4096 (5)
                sm_lid:     0
                port_lid:   0
                port_lmc:   0x00

Best regards,
Timur.

Monday, 25 May 2015, 19:39 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>Hi Timur,
>It seems that the yalla component was not found in your OMPI tree.
>Can it be that your mpirun is not from HPC-X? Can you please check
>LD_LIBRARY_PATH, PATH, LD_PRELOAD, and OPAL_PREFIX to verify that they
>point to the right mpirun?
>
>Also, could you please check that yalla is present in the "ompi_info -l 9"
>output?
>
>Thanks
>
>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>I can ssh password-less to all nodes:
>>base$ ssh node1
>>node1$ ssh node2
>>Last login: Mon May 25 18:41:23
>>node2$ ssh node3
>>Last login: Mon May 25 16:25:01
>>node3$ ssh node4
>>Last login: Mon May 25 16:27:04
>>node4$
>>
>>Is this correct?
>>
>>With ompi-1.9 I do not have the no-tree-spawn problem.
>>
>>Monday, 25 May 2015, 9:04 -07:00, from Ralph Castain < r...@open-mpi.org >:
>>
>>>I can't speak to the mxm problem, but the no-tree-spawn issue indicates that
>>>you don't have password-less ssh authorized between the compute nodes
>>>
>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>
>>>>Hello!
>>>>
>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>OFED-1.5.4.1;
>>>>CentOS release 6.2;
>>>>InfiniBand 4x FDR
>>>>
>>>>I have two problems:
>>>>
>>>>1. I cannot use mxm:
>>>>
>>>>1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>--------------------------------------------------------------------------
>>>>A requested component was not found, or was unable to be opened. This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded). Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:      node14
>>>>Framework: pml
>>>>Component: yalla
>>>>--------------------------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>--------------------------------------------------------------------------
>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>likely to abort. There are many reasons that a parallel process can
>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>problems.
>>>>This failure appears to be an internal failure; here's some
>>>>additional information (which may only be relevant to an Open MPI
>>>>developer):
>>>>
>>>>  mca_pml_base_open() failed
>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>--------------------------------------------------------------------------
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>*** and potentially your MPI job)
>>>>*** An error occurred in MPI_Init
>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>*** and potentially your MPI job)
>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>*** and potentially your MPI job)
>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>*** and potentially your MPI job)
>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>-------------------------------------------------------
>>>>Primary job terminated normally, but 1 process returned
>>>>a non-zero exit code. Per user-direction, the job has been aborted.
>>>>-------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>mpirun detected that one or more processes exited with non-zero status,
>>>>thus causing the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[9372,1],2]
>>>>  Exit code:    1
>>>>--------------------------------------------------------------------------
>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>
>>>>1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>--------------------------------------------------------------------------
>>>>A requested component was not found, or was unable to be opened. This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded). Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:      node5
>>>>Framework: pml
>>>>Component: yalla
>>>>--------------------------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>*** and potentially your MPI job)
>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>--------------------------------------------------------------------------
>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>likely to abort. There are many reasons that a parallel process can
>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>problems. This failure appears to be an internal failure; here's some
>>>>additional information (which may only be relevant to an Open MPI
>>>>developer):
>>>>
>>>>  mca_pml_base_open() failed
>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>--------------------------------------------------------------------------
>>>>-------------------------------------------------------
>>>>Primary job terminated normally, but 1 process returned
>>>>a non-zero exit code. Per user-direction, the job has been aborted.
>>>>-------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>*** and potentially your MPI job)
>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>--------------------------------------------------------------------------
>>>>mpirun detected that one or more processes exited with non-zero status,
>>>>thus causing the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[9619,1],0]
>>>>  Exit code:    1
>>>>--------------------------------------------------------------------------
>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>
>>>>2. I cannot remove "-mca plm_rsh_no_tree_spawn 1" from the mpirun command line:
>>>>$ mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>--------------------------------------------------------------------------
>>>>ORTE was unable to reliably start one or more daemons.
>>>>This usually is caused by:
>>>>
>>>>* not finding the required libraries and/or binaries on
>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>
>>>>* lack of authority to execute on one or more specified nodes.
>>>>  Please verify your allocation and authorities.
>>>>
>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>  Please check with your sys admin to determine the correct location to use.
>>>>
>>>>* compilation of the orted with dynamic libraries when static are required
>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>  one of the contrib/platform definitions for your system type.
>>>>
>>>>* an inability to create a connection back to mpirun due to a
>>>>  lack of common network interfaces and/or no route found between
>>>>  them. Please check network connectivity (including firewalls
>>>>  and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>>>
>>>>Thank you for your comments.
>>>>
>>>>Best regards,
>>>>Timur.
>>>>
>>>>_______________________________________________
>>>>users mailing list
>>>>us...@open-mpi.org
>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>>
>>_______________________________________________
>>users mailing list
>>us...@open-mpi.org
>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>
>--
>Kind Regards,
>M.
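P.P.S. For the scif0 warning in my mxm_perftest run above, one thing I can try is pinning MXM to the active Mellanox port so it never considers the Xeon Phi (scif0) device. This is only a sketch: MXM_RDMA_PORTS is the device-selection variable as I understand it from the Mellanox MXM documentation, and mlx4_0:1 is the PORT_ACTIVE device:port from my ibv_devinfo output.

```shell
# Assumption: this MXM version honors MXM_RDMA_PORTS (per Mellanox MXM docs).
# Restrict MXM to the active port of the mlx4_0 HCA shown by ibv_devinfo.
export MXM_RDMA_PORTS=mlx4_0:1

# Then re-run the latency test between the two nodes, e.g.:
#   node1 $ ./mxm_perftest
#   node2 $ ./mxm_perftest node1 -t send_lat
# With mpirun, the variable would be forwarded via: -x MXM_RDMA_PORTS
```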
ompi_info.out
Description: Binary data