Is it normal to have to rebuild the Open MPI that ships with HPC-X? Why don't the prebuilt binaries work?
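For reference, the fix that eventually worked in the thread below was rebuilding the bundled Open MPI against the MXM tree inside HPC-X (`--with-mxm=$HPCX_MXM_DIR`). A minimal sketch of that rebuild; the prefix, compilers, and fallback path are assumptions, not part of the thread:

```shell
# Sketch: rebuild Open MPI 1.8.x against the MXM shipped in HPC-X.
# HPCX_MXM_DIR normally comes from sourcing the HPC-X environment scripts;
# the fallback path below is a placeholder, not a real install location.
HPCX_MXM_DIR=${HPCX_MXM_DIR:-/opt/hpcx/mxm}

# The key flag is --with-mxm; the prefix and compilers are site choices.
configure_cmd="./configure --prefix=$HOME/ompi-1.8.5-mxm --with-mxm=$HPCX_MXM_DIR CC=icc CXX=icpc"
echo "$configure_cmd"
# Then run the usual: make -j && make install,
# and point PATH/LD_LIBRARY_PATH at the new prefix.
```

After the rebuild, `ompi_info | grep -E 'yalla|mxm'` should show both components linked against the matching MXM.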
Thursday, May 28, 2015, 14:01 +03:00, from Alina Sklarevich <ali...@dev.mellanox.co.il>:
>Thank you for this info.
>
>If 'yalla' now works for you, is there anything that is still wrong?
>
>Thanks,
>Alina.
>
>On Thu, May 28, 2015 at 10:21 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>I'm sorry for the delay.
>>
>>Here it is (I used a 5 min time limit):
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun -x LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile hostlist ./hello 1> hello_debugMXM_n-2_ppn-2.out 2> hello_debugMXM_n-2_ppn-2.err
>>
>>P.S. yalla works fine with the rebuilt ompi: --with-mxm=$HPCX_MXM_DIR
>>
>>Tuesday, May 26, 2015, 16:22 +03:00, from Alina Sklarevich <ali...@dev.mellanox.co.il>:
>>>Hi Timur,
>>>
>>>HPCX has a debug version of MXM. Can you please add the following to your command line with pml yalla in order to use it, and attach the output?
>>>"-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
>>>
>>>Also, could you please attach the entire output of "$HPCX_MPI_DIR/bin/ompi_info -a"?
>>>
>>>Thank you,
>>>Alina.
>>>
>>>On Tue, May 26, 2015 at 3:39 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>>Alina - could you please take a look?
>>>>Thx
>>>>
>>>>---------- Forwarded message ----------
>>>>From: Timur Ismagilov <tismagi...@mail.ru>
>>>>Date: Tue, May 26, 2015 at 12:40 PM
>>>>Subject: Re[12]: [OMPI users] MXM problem
>>>>To: Open MPI Users <us...@open-mpi.org>
>>>>Cc: Mike Dubman <mi...@dev.mellanox.co.il>
>>>>
>>>>It does not work on a single node either:
>>>>
>>>>1) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out
>>>>
>>>>2) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out
>>>>
>>>>I've attached yalla.out and cm_mxm.out to this email.
>>>>
>>>>Tuesday, May 26, 2015, 11:54 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>Does it work from a single node?
>>>>>Could you please run with the options below and attach the output?
>>>>>
>>>>>-mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons
>>>>>
>>>>>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>1. mxm_perf_test - OK.
>>>>>>2. no_tree_spawn - OK.
>>>>>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the prebuilt ompi-1.8.5 from hpcx-v1.3.330).
>>>>>>3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR ./hello
>>>>>>--------------------------------------------------------------------------
>>>>>>A requested component was not found, or was unable to be opened.
This >>>>>> >>>>>>means that this component is either not installed or is unable to be >>>>>> >>>>>>used on your system (e.g., sometimes this means that shared libraries >>>>>> >>>>>>that the component requires are unable to be found/loaded). Note that >>>>>> >>>>>>Open MPI stopped checking at the first component that it did not find. >>>>>> >>>>>> >>>>>> >>>>>>Host: node153 >>>>>> >>>>>>Framework: mtl >>>>>> >>>>>>Component: mxm >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>[node5:113560] PML cm cannot be selected >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>No available pml components were found! >>>>>> >>>>>> >>>>>> >>>>>>This means that there are no components of this type installed on your >>>>>> >>>>>>system or all the components reported that they could not be used. >>>>>> >>>>>> >>>>>> >>>>>>This is a fatal error; your MPI process is likely to abort. Check the >>>>>> >>>>>>output of the "ompi_info" command and ensure that components of this >>>>>> >>>>>>type are available on your system. You may also wish to check the >>>>>> >>>>>>value of the "component_path" MCA parameter and ensure that it has at >>>>>> >>>>>>least one directory that contains valid MCA components. >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>[node153:44440] PML cm cannot be selected >>>>>> >>>>>>------------------------------------------------------- >>>>>> >>>>>>Primary job terminated normally, but 1 process returned >>>>>> >>>>>>a non-zero exit code.. Per user-direction, the job has been aborted. >>>>>> >>>>>>------------------------------------------------------- >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>mpirun detected that one or more processes exited with non-zero status, >>>>>>thus causing >>>>>>the job to be terminated. 
The first process to do so was: >>>>>> >>>>>> >>>>>> >>>>>> Process name: [[43917,1],0] >>>>>> >>>>>> Exit code: 1 >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>[login:110455] 1 more process has sent help message help-mca-base.txt / >>>>>>find-available:not-valid >>>>>>[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see >>>>>>all help / error messages >>>>>>[login:110455] 1 more process has sent help message help-mca-base.txt / >>>>>>find-available:none-found >>>>>> >>>>>>3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x >>>>>>MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix >>>>>>$HPCX_MPI_DIR ./hello >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>A requested component was not found, or was unable to be opened. This >>>>>> >>>>>>means that this component is either not installed or is unable to be >>>>>> >>>>>>used on your system (e.g., sometimes this means that shared libraries >>>>>> >>>>>>that the component requires are unable to be found/loaded). Note that >>>>>> >>>>>>Open MPI stopped checking at the first component that it did not find. >>>>>> >>>>>> >>>>>> >>>>>>Host: node153 >>>>>> >>>>>>Framework: pml >>>>>> >>>>>>Component: yalla >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>*** An error occurred in MPI_Init >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>It looks like MPI_INIT failed for some reason; your parallel process is >>>>>> >>>>>>likely to abort. There are many reasons that a parallel process can >>>>>> >>>>>>fail during MPI_INIT; some of which are due to configuration or >>>>>>environment >>>>>>problems. 
This failure appears to be an internal failure; here's some >>>>>> >>>>>>additional information (which may only be relevant to an Open MPI >>>>>> >>>>>>developer): >>>>>> >>>>>> >>>>>> >>>>>> mca_pml_base_open() failed >>>>>> >>>>>> --> Returned "Not found" (-13) instead of "Success" (0) >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>*** on a NULL communicator >>>>>> >>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, >>>>>> >>>>>>*** and potentially your MPI job) >>>>>> >>>>>>[node153:43979] Local abort before MPI_INIT completed successfully; not >>>>>>able to aggregate error messages, >>>>>> and not able to guarantee that all other processes were killed! >>>>>> >>>>>>------------------------------------------------------- >>>>>> >>>>>>Primary job terminated normally, but 1 process returned >>>>>> >>>>>>a non-zero exit code.. Per user-direction, the job has been aborted. >>>>>> >>>>>>------------------------------------------------------- >>>>>> >>>>>>-------------------------------------------------------------------------- >>>>>> >>>>>>mpirun detected that one or more processes exited with non-zero status, >>>>>>thus causing >>>>>>the job to be terminated. 
The first process to do so was:
>>>>>>
>>>>>>  Process name: [[44992,1],1]
>>>>>>  Exit code: 1
>>>>>>--------------------------------------------------------------------------
>>>>>>
>>>>>>host:$ echo $HPCX_MPI_DIR
>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
>>>>>>host:$ ompi_info | grep pml
>>>>>>  MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
>>>>>>  MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
>>>>>>  MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
>>>>>>  MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
>>>>>>  MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
>>>>>>host: tests$ ompi_info | grep mtl
>>>>>>  MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
>>>>>>
>>>>>>P.S. Is there possibly an error in the FAQ? (http://www.open-mpi.org/faq/?category=openfabrics#mxm)
>>>>>>47. Does Open MPI support MXM?
>>>>>>............
>>>>>>NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and above
>>>>>>...........
>>>>>>But here we have (or not...) yalla in ompi 1.8.5.
>>>>>>
>>>>>>Tuesday, May 26, 2015, 9:53 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>Hi Timur,
>>>>>>>
>>>>>>>Here it goes:
>>>>>>>
>>>>>>>wget ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>>>>>>>
>>>>>>>Please let me know if it works for you, and I will add the 1.5.4.1 MOFED to the default distribution list.
>>>>>>>
>>>>>>>M
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>Thanks a lot.
>>>>>>>>
>>>>>>>>Monday, May 25, 2015, 21:28 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>
>>>>>>>>>Will send you the link tomorrow.
>>>>>>>>>
>>>>>>>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>Where can I find MXM for OFED 1.5.4.1?
>>>>>>>>>>
>>>>>>>>>>Monday, May 25, 2015, 21:11 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>>
>>>>>>>>>>>By the way, the OFED on your system is 1.5.4.1, while the HPCX in use is built for OFED 1.5.3.
>>>>>>>>>>>
>>>>>>>>>>>It seems like an ABI issue between OFED versions.
>>>>>>>>>>>
>>>>>>>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>>>I did as you said, but got an error:
>>>>>>>>>>>>
>>>>>>>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>>>>>node1$ ./mxm_perftest
>>>>>>>>>>>>
>>>>>>>>>>>>Waiting for connection...
>>>>>>>>>>>>
>>>>>>>>>>>>Accepted connection from 10.65.0.253
>>>>>>>>>>>>
>>>>>>>>>>>>[1432576262.370195] [node153:35388:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>>>>>
>>>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>>>
>>>>>>>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>>>>>>>
>>>>>>>>>>>>[1432576262.367523] [node158:99366:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>>>
>>>>>>>>>>>>Monday, May 25, 2015, 20:31 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>>>>>scif is an OFA device from Intel.
>>>>>>>>>>>>>Can you please select export MXM_IB_PORTS=mlx4_0:1 explicitly and retry?
>>>>>>>>>>>>>
>>>>>>>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>>>>>Hi, Mike,
>>>>>>>>>>>>>>that is what I have:
>>>>>>>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>>>>>>>>+ Intel compiler paths
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>$ echo $OPAL_PREFIX
>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>In the attached file (ompi_info.out) you will find the output of the ompi_info -l 9 command.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>P.S.
>>>>>>>>>>>>>>node1 $ ./mxm_perftest
>>>>>>>>>>>>>>node2 $ ./mxm_perftest node1 -t send_lat
>>>>>>>>>>>>>>[1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
>>>>>>>>>>>>>>[1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
>>>>>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>$ ibv_devinfo
>>>>>>>>>>>>>>hca_id: mlx4_0
>>>>>>>>>>>>>>    transport:        InfiniBand (0)
>>>>>>>>>>>>>>    fw_ver:           2.10.600
>>>>>>>>>>>>>>    node_guid:        0002:c903:00a1:13b0
>>>>>>>>>>>>>>    sys_image_guid:   0002:c903:00a1:13b3
>>>>>>>>>>>>>>    vendor_id:        0x02c9
>>>>>>>>>>>>>>    vendor_part_id:   4099
>>>>>>>>>>>>>>    hw_ver:           0x0
>>>>>>>>>>>>>>    board_id:         MT_1090120019
>>>>>>>>>>>>>>    phys_port_cnt:    2
>>>>>>>>>>>>>>    port: 1
>>>>>>>>>>>>>>        state:        PORT_ACTIVE (4)
>>>>>>>>>>>>>>        max_mtu:      4096 (5)
>>>>>>>>>>>>>>        active_mtu:   4096 (5)
>>>>>>>>>>>>>>        sm_lid:       1
>>>>>>>>>>>>>>        port_lid:     83
>>>>>>>>>>>>>>        port_lmc:     0x00
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    port: 2
>>>>>>>>>>>>>>        state:        PORT_DOWN (1)
>>>>>>>>>>>>>>        max_mtu:      4096 (5)
>>>>>>>>>>>>>>        active_mtu:   4096 (5)
>>>>>>>>>>>>>>        sm_lid:       0
>>>>>>>>>>>>>>        port_lid:     0
>>>>>>>>>>>>>>        port_lmc:     0x00
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>>>Timur.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Monday, May 25, 2015, 19:39 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>>>>>>>Hi Timur,
>>>>>>>>>>>>>>>It seems that the yalla component was not found in your OMPI tree.
>>>>>>>>>>>>>>>Can it be that your mpirun is not from hpcx? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX, that they are pointing to the right mpirun?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Also, could you please check that yalla is present in the ompi_info -l 9 output?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>>>>>>>I can password-less ssh to all nodes:
>>>>>>>>>>>>>>>>base$ ssh node1
>>>>>>>>>>>>>>>>node1$ ssh node2
>>>>>>>>>>>>>>>>Last login: Mon May 25 18:41:23
>>>>>>>>>>>>>>>>node2$ ssh node3
>>>>>>>>>>>>>>>>Last login: Mon May 25 16:25:01
>>>>>>>>>>>>>>>>node3$ ssh node4
>>>>>>>>>>>>>>>>Last login: Mon May 25 16:27:04
>>>>>>>>>>>>>>>>node4$
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Is this correct?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>With ompi-1.9 I do not have the no-tree-spawn problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Monday, May 25, 2015, 9:04 -07:00, from Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>I can't speak to the mxm problem, but the no-tree-spawn issue indicates that you don't have password-less ssh authorized between the compute nodes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>>>>>>>>>Hello!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>>>>>>>>>>>>>OFED-1.5.4.1;
>>>>>>>>>>>>>>>>>>CentOS release 6.2;
>>>>>>>>>>>>>>>>>>InfiniBand 4x FDR
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>I have two problems:
>>>>>>>>>>>>>>>>>>1. I cannot use mxm:
>>>>>>>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened.
This >>>>>>>>>>>>>>>>>>means that this component is either not installed or is >>>>>>>>>>>>>>>>>>unable to be >>>>>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared >>>>>>>>>>>>>>>>>>libraries >>>>>>>>>>>>>>>>>>that the component requires are unable to be found/loaded). >>>>>>>>>>>>>>>>>>Note that >>>>>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did >>>>>>>>>>>>>>>>>>not find. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Host: node14 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Framework: pml >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Component: yalla >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel >>>>>>>>>>>>>>>>>>process is >>>>>>>>>>>>>>>>>>likely to abort. There are many reasons that a parallel >>>>>>>>>>>>>>>>>>process can >>>>>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration >>>>>>>>>>>>>>>>>>or environment >>>>>>>>>>>>>>>>>>problems. 
This failure appears to be an internal failure; >>>>>>>>>>>>>>>>>>here's some >>>>>>>>>>>>>>>>>>additional information (which may only be relevant to an Open >>>>>>>>>>>>>>>>>>MPI >>>>>>>>>>>>>>>>>>developer): >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> mca_pml_base_open() failed >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> --> Returned "Not found" (-13) instead of "Success" (0) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>>> and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>>>killed! >>>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>>> and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>>>killed! 
>>>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>>>and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>>>killed! >>>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>>>and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>>>killed! >>>>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Primary job terminated normally, but 1 process returned >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been >>>>>>>>>>>>>>>>>>aborted. >>>>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>mpirun detected that one or more processes exited with >>>>>>>>>>>>>>>>>>non-zero status, thus causing >>>>>>>>>>>>>>>>>>the job to be terminated. 
The first process to do so was: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Process name: [[9372,1],2] >>>>>>>>>>>>>>>>>> Exit code: 1 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message >>>>>>>>>>>>>>>>>>help-mca-base.txt / find-available:not-valid >>>>>>>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to >>>>>>>>>>>>>>>>>>0 to see all help / error messages >>>>>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message >>>>>>>>>>>>>>>>>>help-mpi-runtime / mpi_init:startup:internal-failur >>>>>>>>>>>>>>>>>>e >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>1.b $mpirun --mca pml yalla -host node5,node14,node28,node29 >>>>>>>>>>>>>>>>>>-mca plm_rsh_no_tree_spawn 1 -np 4 ./hello >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>A requested component was not found, or was unable to be >>>>>>>>>>>>>>>>>>opened. This >>>>>>>>>>>>>>>>>>means that this component is either not installed or is >>>>>>>>>>>>>>>>>>unable to be >>>>>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared >>>>>>>>>>>>>>>>>>libraries >>>>>>>>>>>>>>>>>>that the component requires are unable to be found/loaded). >>>>>>>>>>>>>>>>>>Note that >>>>>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did >>>>>>>>>>>>>>>>>>not find. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Host: node5 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Framework: pml >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Component: yalla >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>>>and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>>>killed! >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel >>>>>>>>>>>>>>>>>>process is >>>>>>>>>>>>>>>>>>likely to abort. There are many reasons that a parallel >>>>>>>>>>>>>>>>>>process can >>>>>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration >>>>>>>>>>>>>>>>>>or environment >>>>>>>>>>>>>>>>>>problems. 
This failure appears to be an internal failure; >>>>>>>>>>>>>>>>>>here's some >>>>>>>>>>>>>>>>>>additional information (which may only be relevant to an Open >>>>>>>>>>>>>>>>>>MPI >>>>>>>>>>>>>>>>>>developer): >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> mca_pml_base_open() failed >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> --> Returned "Not found" (-13) instead of "Success" (0) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Primary job terminated normally, but 1 process returned >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been >>>>>>>>>>>>>>>>>>aborted. >>>>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>>>and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>>>killed! >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>mpirun detected that one or more processes exited with >>>>>>>>>>>>>>>>>>non-zero status, thus causing >>>>>>>>>>>>>>>>>>the job to be terminated. 
The first process to do so was: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Process name: [[9619,1],0] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Exit code: 1 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>[login:08552] 1 more process has sent help message >>>>>>>>>>>>>>>>>>help-mca-base.txt / find-available:not-valid >>>>>>>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to >>>>>>>>>>>>>>>>>>0 to see all help / error messages >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>2. I can not remove -mca plm_rsh_no_tree_spawn 1 from mpirun >>>>>>>>>>>>>>>>>>cmd line : >>>>>>>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello >>>>>>>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token >>>>>>>>>>>>>>>>>>`--tree-spawn' >>>>>>>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; >>>>>>>>>>>>>>>>>>OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sourc >>>>>>>>>>>>>>>>>>es/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 >>>>>>>>>>>>>>>>>> ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/o >>>>>>>>>>>>>>>>>>ivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH >>>>>>>>>>>>>>>>>> ; export PA >>>>>>>>>>>>>>>>>>TH ; >>>>>>>>>>>>>>>>>>LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi >>>>>>>>>>>>>>>>>>-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH >>>>>>>>>>>>>>>>>>; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nice >>>>>>>>>>>>>>>>>>vt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH >>>>>>>>>>>>>>>>>> ; expor >>>>>>>>>>>>>>>>>>t DYLD_LIBRARY_PATH ; >>>>>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/o >>>>>>>>>>>>>>>>>>mpi-mellanox-v1.8/bin/orted --hnp-topo-sig >>>>>>>>>>>>>>>>>>2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca 
ess "env" -mca orte_es >>>>>>>>>>>>>>>>>>s_jobid "625606656" -mca orte_ess_vpid 3 -mca >>>>>>>>>>>>>>>>>>orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tc >>>>>>>>>>>>>>>>>>p://10.65.0.105,10.64.0.105,10.67.0.105:56862 " -mca >>>>>>>>>>>>>>>>>>orte_hnp_uri "625606656.0; tcp://10.65.0.2,10.67.0.2,8 >>>>>>>>>>>>>>>>>>3.149.214.101, 10.64.0.2:54893 " --mca pml "yalla" -mca >>>>>>>>>>>>>>>>>>plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-s >>>>>>>>>>>>>>>>>>pawn' >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>This usually is caused by: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>* not finding the required libraries and/or binaries on >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> one or more nodes. Please check your PATH and >>>>>>>>>>>>>>>>>>LD_LIBRARY_PATH >>>>>>>>>>>>>>>>>> settings, or configure OMPI with >>>>>>>>>>>>>>>>>>--enable-orterun-prefix-by-default >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>* lack of authority to execute on one or more specified >>>>>>>>>>>>>>>>>>nodes. >>>>>>>>>>>>>>>>>> Please verify your allocation and authorities. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>* the inability to write startup files into /tmp >>>>>>>>>>>>>>>>>>(--tmpdir/orte_tmpdir_base). >>>>>>>>>>>>>>>>>> Please check with your sys admin to determine the correct >>>>>>>>>>>>>>>>>>location to use. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>* compilation of the orted with dynamic libraries when >>>>>>>>>>>>>>>>>>static are required >>>>>>>>>>>>>>>>>> (e.g., on Cray). Please check your configure cmd line and >>>>>>>>>>>>>>>>>>consider using >>>>>>>>>>>>>>>>>> one of the contrib/platform definitions for your system >>>>>>>>>>>>>>>>>>type. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> lack of common network interfaces and/or no route found >>>>>>>>>>>>>>>>>>between >>>>>>>>>>>>>>>>>> them. Please check network connectivity (including >>>>>>>>>>>>>>>>>>firewalls >>>>>>>>>>>>>>>>>> and network routing requirements). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to >>>>>>>>>>>>>>>>>>forcibly terminate >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Thank you for your comments. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>Best regards, >>>>>>>>>>>>>>>>>>Timur. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>_______________________________________________ >>>>>>>>>>>>>>>>>>users mailing list >>>>>>>>>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>>>>>>>>Subscription: >>>>>>>>>>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>>>Link to this post: >>>>>>>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2015/05/26919.php >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>_______________________________________________ >>>>>>>>>>>>>>>>users mailing list >>>>>>>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>>>>>>Subscription: >>>>>>>>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>Link to this post: >>>>>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2015/05/26922.php >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>-- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>Kind Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>M. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>-- >>>>>>>>>>>>> >>>>>>>>>>>>>Kind Regards, >>>>>>>>>>>>> >>>>>>>>>>>>>M. 
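The recurring check in this thread is that mpirun, OPAL_PREFIX, PATH, and LD_LIBRARY_PATH must all point into the same HPC-X tree, or components such as pml yalla and mtl mxm fail to load at MPI_Init. A sketch of that environment setup; the HPCX_HOME path is a placeholder (real installs source the HPC-X environment scripts instead):

```shell
# Assumed install root; adjust to your site's HPC-X unpack directory.
HPCX_HOME=/opt/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64

export OPAL_PREFIX=$HPCX_HOME/ompi-mellanox-v1.8
export PATH=$OPAL_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$HPCX_HOME/mxm/lib:$OPAL_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# mpirun and OPAL_PREFIX must come from the same tree.
echo "$OPAL_PREFIX"
# On a working setup you would also confirm:
#   which mpirun                       -> $OPAL_PREFIX/bin/mpirun
#   ompi_info -l 9 | grep -E 'yalla|mxm'   -> both components listed
```

If `ompi_info` lists yalla but MPI_Init still returns "Not found" (-13), the remaining suspects from the thread are an OFED/MXM ABI mismatch or an Open MPI build that was not configured against the HPC-X MXM.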