I'm sorry for the delay. Here it is (I used a 5-minute time limit):

/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun -x LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile hostlist ./hello 1> hello_debugMXM_n-2_ppn-2.out 2> hello_debugMXM_n-2_ppn-2.err

P.S. yalla works fine with a rebuilt ompi: --with-mxm=$HPCX_MXM_DIR
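For readability, the same debug run can be written as a short script. This is a sketch, not a verified recipe: the HPCX root path is the one quoted in this thread, and the `hostlist` file and `./hello` binary are assumed to exist on your system. The script prints the composed command instead of executing it, so it can be inspected first.

```shell
#!/bin/sh
# Sketch of the debug-MXM run above. HPCX_ROOT is the installation
# path from this thread; substitute your own HPC-X root.
HPCX_ROOT=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64
MPIRUN=$HPCX_ROOT/ompi-mellanox-v1.8/bin/mpirun

# Preload the debug build of libmxm.so and raise the MXM log level,
# as suggested earlier in the thread, then select the yalla pml.
CMD="$MPIRUN \
 -x LD_PRELOAD=$HPCX_ROOT/mxm/debug/lib/libmxm.so \
 -x MXM_LOG_LEVEL=data \
 -x MXM_IB_PORTS=mlx4_0:1 \
 -x MXM_SHM_KCOPY_MODE=off \
 --mca pml yalla \
 --hostfile hostlist ./hello"

# Echo instead of exec so the sketch can be inspected on a machine
# without the cluster; to actually run it: eval "$CMD"
echo "$CMD"
```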
Tuesday, May 26, 2015, 16:22 +03:00 from Alina Sklarevich <ali...@dev.mellanox.co.il>: >Hi Timur, > >HPCX has a debug version of MXM. Can you please add the following to your >command line with pml yalla in order to use it and attach the output? >"-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data" > >Also, could you please attach the entire output of >"$HPCX_MPI_DIR/bin/ompi_info -a" > >Thank you, >Alina. > >On Tue, May 26, 2015 at 3:39 PM, Mike Dubman < mi...@dev.mellanox.co.il > >wrote: >>Alina - could you please take a look? >>Thx >> >> >>---------- Forwarded message ---------- >>From: Timur Ismagilov < tismagi...@mail.ru > >>Date: Tue, May 26, 2015 at 12:40 PM >>Subject: Re[12]: [OMPI users] MXM problem >>To: Open MPI Users < us...@open-mpi.org > >>Cc: Mike Dubman < mi...@dev.mellanox.co.il > >> >> >>It does not work on a single node: >> >>1) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x >>MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm >>--prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca >>rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out >> >>2) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x >>MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix >>$HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca >>rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out >> >>I've attached the yalla.out and cm_mxm.out to this email. >> >> >> >>Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >>>: >>>does it work from single node? >>>could you please run with opts below and attach output? >>> >>> -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 >>>--debug-daemons >>> >>>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov < tismagi...@mail.ru > >>>wrote: >>>>1. mxm_perf_test - OK. >>>>2. no_tree_spawn - OK. >>>>3. 
ompi yalla and "--mca pml cm --mca mtl mxm" still does not work (I use >>>>prebuild ompi-1.8.5 from hpcx-v1.3.330) >>>>3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x >>>>MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm >>>>--prefix $HPCX_MPI_DIR ./hello >>>>-------------------------------------------------------------------------- >>>> >>>>A requested component was not found, or was unable to be opened. This >>>> >>>>means that this component is either not installed or is unable to be >>>> >>>>used on your system (e.g., sometimes this means that shared libraries >>>> >>>>that the component requires are unable to be found/loaded). Note that >>>> >>>>Open MPI stopped checking at the first component that it did not find. >>>> >>>> >>>> >>>>Host: node153 >>>> >>>>Framework: mtl >>>> >>>>Component: mxm >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>[node5:113560] PML cm cannot be selected >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>No available pml components were found! >>>> >>>> >>>> >>>>This means that there are no components of this type installed on your >>>> >>>>system or all the components reported that they could not be used. >>>> >>>> >>>> >>>>This is a fatal error; your MPI process is likely to abort. Check the >>>> >>>>output of the "ompi_info" command and ensure that components of this >>>> >>>>type are available on your system. You may also wish to check the >>>> >>>>value of the "component_path" MCA parameter and ensure that it has at >>>> >>>>least one directory that contains valid MCA components. >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>[node153:44440] PML cm cannot be selected >>>> >>>>------------------------------------------------------- >>>> >>>>Primary job terminated normally, but 1 process returned >>>> >>>>a non-zero exit code.. 
Per user-direction, the job has been aborted. >>>> >>>>------------------------------------------------------- >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>mpirun detected that one or more processes exited with non-zero status, >>>>thus causing >>>>the job to be terminated. The first process to do so was: >>>> >>>> >>>> >>>> Process name: [[43917,1],0] >>>> >>>> Exit code: 1 >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>[login:110455] 1 more process has sent help message help-mca-base.txt / >>>>find-available:not-valid >>>>[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all >>>>help / error messages >>>>[login:110455] 1 more process has sent help message help-mca-base.txt / >>>>find-available:none-found >>>> >>>>3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x >>>>MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix >>>>$HPCX_MPI_DIR ./hello >>>>-------------------------------------------------------------------------- >>>> >>>>A requested component was not found, or was unable to be opened. This >>>> >>>>means that this component is either not installed or is unable to be >>>> >>>>used on your system (e.g., sometimes this means that shared libraries >>>> >>>>that the component requires are unable to be found/loaded). Note that >>>> >>>>Open MPI stopped checking at the first component that it did not find. >>>> >>>> >>>> >>>>Host: node153 >>>> >>>>Framework: pml >>>> >>>>Component: yalla >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>*** An error occurred in MPI_Init >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>It looks like MPI_INIT failed for some reason; your parallel process is >>>> >>>>likely to abort. 
There are many reasons that a parallel process can >>>> >>>>fail during MPI_INIT; some of which are due to configuration or environment >>>> >>>>problems. This failure appears to be an internal failure; here's some >>>> >>>>additional information (which may only be relevant to an Open MPI >>>> >>>>developer): >>>> >>>> >>>> >>>> mca_pml_base_open() failed >>>> >>>> --> Returned "Not found" (-13) instead of "Success" (0) >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>*** on a NULL communicator >>>> >>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, >>>> >>>>*** and potentially your MPI job) >>>> >>>>[node153:43979] Local abort before MPI_INIT completed successfully; not >>>>able to aggregate error messages, >>>> and not able to guarantee that all other processes were killed! >>>> >>>>------------------------------------------------------- >>>> >>>>Primary job terminated normally, but 1 process returned >>>> >>>>a non-zero exit code.. Per user-direction, the job has been aborted. >>>> >>>>------------------------------------------------------- >>>> >>>>-------------------------------------------------------------------------- >>>> >>>>mpirun detected that one or more processes exited with non-zero status, >>>>thus causing >>>>the job to be terminated. 
The first process to do so was: >>>> >>>> >>>> Process name: [[44992,1],1] >>>> >>>> Exit code: 1 >>>> >>>>-------------------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>>host:$ echo $HPCX_MPI_DIR >>>> >>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8 >>>>host:$ ompi_info | grep pml >>>> >>>> MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5) >>>> >>>> MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5) >>>> >>>> MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5) >>>> >>>> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5) >>>> >>>> MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5) >>>>host: tests$ ompi_info | grep mtl >>>> MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5) >>>> >>>>P.S. >>>>possible error in the FAQ? ( >>>>http://www.open-mpi.org/faq/?category=openfabrics#mxm ) >>>>47. Does Open MPI support MXM? >>>>............ >>>>NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 >>>>and above >>>>........... >>>>But here we have (or not...) yalla in ompi 1.8.5 >>>> >>>> >>>> >>>>Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >>>>>: >>>>>Hi Timur, >>>>> >>>>>Here it goes: >>>>> >>>>>wget >>>>>ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz >>>>> >>>>>Please let me know if it works for you and I will add 1.5.4.1 mofed to the >>>>>default distribution list. >>>>> >>>>>M >>>>> >>>>> >>>>>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov < tismagi...@mail.ru > >>>>>wrote: >>>>>>Thanks a lot. >>>>>> >>>>>>Monday, May 25, 2015, 21:28 +03:00 from Mike Dubman < >>>>>>mi...@dev.mellanox.co.il >: >>>>>> >>>>>>>will send u the link tomorrow. >>>>>>> >>>>>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov < tismagi...@mail.ru > >>>>>>>wrote: >>>>>>>>Where can I find MXM for ofed 1.5.4.1? 
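The ompi_info listing quoted above can be checked mechanically. A minimal sketch, fed with the quoted output so it runs anywhere; against a real installation you would pipe in `ompi_info | grep -e pml -e mtl` instead. Note that a component can be listed here and still fail to load on a compute node (for example, due to an OFED ABI mismatch), which is the failure mode in this thread.

```shell
#!/bin/sh
# Sketch: verify the pml/mtl components this thread depends on are
# visible. The sample text is the ompi_info output quoted above.
info='MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)'

missing=""
for comp in "pml: yalla" "pml: cm" "mtl: mxm"; do
  printf '%s\n' "$info" | grep -q "$comp" || missing="$missing $comp"
done

if [ -z "$missing" ]; then
  echo "required components visible"
else
  echo "missing:$missing"
fi
```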
>>>>>>>> >>>>>>>> >>>>>>>>Monday, May 25, 2015, 21:11 +03:00 from Mike Dubman < >>>>>>>>mi...@dev.mellanox.co.il >: >>>>>>>> >>>>>>>>>btw, the ofed on your system is 1.5.4.1 while HPCx in use is for ofed >>>>>>>>>1.5.3 >>>>>>>>> >>>>>>>>>seems like ABI issue between ofed versions >>>>>>>>> >>>>>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov < tismagi...@mail.ru >>>>>>>>>> wrote: >>>>>>>>>>I did as you said, but got an error: >>>>>>>>>> >>>>>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1 >>>>>>>>>>node1$ ./mxm_perftest >>>>>>>>>> >>>>>>>>>>Waiting for connection... >>>>>>>>>> >>>>>>>>>>Accepted connection from 10.65.0.253 >>>>>>>>>> >>>>>>>>>>[1432576262.370195] [node153:35388:0] shm.c:65 MXM WARN >>>>>>>>>>Could not open the KNEM device file at /dev/knem : No such file or >>>>>>>>>>directory. Won't use knem. >>>>>>>>>> >>>>>>>>>>Failed to create endpoint: No such device >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1 >>>>>>>>>>node2$ ./mxm_perftest node1 -t send_lat >>>>>>>>>> >>>>>>>>>>[1432576262.367523] [node158:99366:0] shm.c:65 MXM WARN >>>>>>>>>>Could not open the KNEM device file at /dev/knem : No such file or >>>>>>>>>>directory. Won't use knem. >>>>>>>>>>Failed to create endpoint: No such device >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman < >>>>>>>>>>mi...@dev.mellanox.co.il >: >>>>>>>>>>>scif is an OFA device from Intel. 
>>>>>>>>>>>can you please select export MXM_IB_PORTS=mlx4_0:1 explicitly and >>>>>>>>>>>retry >>>>>>>>>>> >>>>>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov < >>>>>>>>>>>tismagi...@mail.ru > wrote: >>>>>>>>>>>>Hi, Mike, >>>>>>>>>>>>that is what I have: >>>>>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n" >>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib >>>>>>>>>>>> >>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib >>>>>>>>>>>> >>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib >>>>>>>>>>>> >>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib >>>>>>>>>>>> +intel compiler paths >>>>>>>>>>>> >>>>>>>>>>>>$ echo $OPAL_PREFIX >>>>>>>>>>>> >>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 >>>>>>>>>>>> >>>>>>>>>>>>I don't use LD_PRELOAD. >>>>>>>>>>>> >>>>>>>>>>>>In the attached file (ompi_info.out) you will find the output of >>>>>>>>>>>>the ompi_info -l 9 command. >>>>>>>>>>>> >>>>>>>>>>>>P.S. >>>>>>>>>>>>node1 $ ./mxm_perftest >>>>>>>>>>>>node2 $ ./mxm_perftest node1 -t send_lat >>>>>>>>>>>>[1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN >>>>>>>>>>>>Could not open the KNEM device file at /dev/knem : No such file or >>>>>>>>>>>>directory. Won't use knem. ( I don't have knem) >>>>>>>>>>>>[1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN >>>>>>>>>>>>skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a >>>>>>>>>>>>Mellanox device (???) 
>>>>>>>>>>>>Failed to create endpoint: No such device >>>>>>>>>>>> >>>>>>>>>>>>$ ibv_devinfo >>>>>>>>>>>>hca_id: mlx4_0 >>>>>>>>>>>> transport: InfiniBand (0) >>>>>>>>>>>> fw_ver: 2.10.600 >>>>>>>>>>>> node_guid: 0002:c903:00a1:13b0 >>>>>>>>>>>> sys_image_guid: 0002:c903:00a1:13b3 >>>>>>>>>>>> vendor_id: 0x02c9 >>>>>>>>>>>> vendor_part_id: 4099 >>>>>>>>>>>> hw_ver: 0x0 >>>>>>>>>>>> board_id: MT_1090120019 >>>>>>>>>>>> phys_port_cnt: 2 >>>>>>>>>>>> port: 1 >>>>>>>>>>>> state: PORT_ACTIVE (4) >>>>>>>>>>>> max_mtu: 4096 (5) >>>>>>>>>>>> active_mtu: 4096 (5) >>>>>>>>>>>> sm_lid: 1 >>>>>>>>>>>> port_lid: 83 >>>>>>>>>>>> port_lmc: 0x00 >>>>>>>>>>>> >>>>>>>>>>>> port: 2 >>>>>>>>>>>> state: PORT_DOWN (1) >>>>>>>>>>>> max_mtu: 4096 (5) >>>>>>>>>>>> active_mtu: 4096 (5) >>>>>>>>>>>> sm_lid: 0 >>>>>>>>>>>> port_lid: 0 >>>>>>>>>>>> port_lmc: 0x00 >>>>>>>>>>>> >>>>>>>>>>>>Best regards, >>>>>>>>>>>>Timur. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman < >>>>>>>>>>>>mi...@dev.mellanox.co.il >: >>>>>>>>>>>>>Hi Timur, >>>>>>>>>>>>>seems that yalla component was not found in your OMPI tree. >>>>>>>>>>>>>can it be that your mpirun is not from hpcx? Can you please check >>>>>>>>>>>>>LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX that it is >>>>>>>>>>>>>pointing to the right mpirun? >>>>>>>>>>>>> >>>>>>>>>>>>>Also, could you please check that yalla is present in the >>>>>>>>>>>>>ompi_info -l 9 output? >>>>>>>>>>>>> >>>>>>>>>>>>>Thanks >>>>>>>>>>>>> >>>>>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov < >>>>>>>>>>>>>tismagi...@mail.ru > wrote: >>>>>>>>>>>>>>I can password-less ssh to all nodes: >>>>>>>>>>>>>>base$ ssh node1 >>>>>>>>>>>>>>node1$ssh node2 >>>>>>>>>>>>>>Last login: Mon May 25 18:41:23 >>>>>>>>>>>>>>node2$ssh node3 >>>>>>>>>>>>>>Last login: Mon May 25 16:25:01 >>>>>>>>>>>>>>node3$ssh node4 >>>>>>>>>>>>>>Last login: Mon May 25 16:27:04 >>>>>>>>>>>>>>node4$ >>>>>>>>>>>>>> >>>>>>>>>>>>>>Is this correct? 
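The ibv_devinfo output above already shows why mlx4_0:1 is the right value for MXM_IB_PORTS: port 1 is PORT_ACTIVE while port 2 is PORT_DOWN. A small sketch (assuming ibv_devinfo's usual text layout) that derives the setting mechanically; the sample is an abridged copy of the output quoted above.

```shell
#!/bin/sh
# Sketch: derive an MXM_IB_PORTS value by picking each PORT_ACTIVE
# port from ibv_devinfo-style output (abridged sample from this thread).
sample='hca_id: mlx4_0
        port:   1
                state:          PORT_ACTIVE (4)
        port:   2
                state:          PORT_DOWN (1)'

ports=$(printf '%s\n' "$sample" | awk '
  $1 == "hca_id:" { hca = $2 }
  $1 == "port:"   { port = $2 }
  $1 == "state:" && /PORT_ACTIVE/ { print hca ":" port }')

echo "MXM_IB_PORTS=$ports"   # prints MXM_IB_PORTS=mlx4_0:1
```

On a real node you would replace the here-string with `ibv_devinfo` itself.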
>>>>>>>>>>>>>> >>>>>>>>>>>>>>In ompi-1.9 I do not have the no-tree-spawn problem. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain < >>>>>>>>>>>>>>r...@open-mpi.org >: >>>>>>>>>>>>>> >>>>>>>>>>>>>>>I can’t speak to the mxm problem, but the no-tree-spawn issue >>>>>>>>>>>>>>>indicates that you don’t have password-less ssh authorized >>>>>>>>>>>>>>>between the compute nodes >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < >>>>>>>>>>>>>>>>tismagi...@mail.ru > wrote: >>>>>>>>>>>>>>>>Hello! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2; >>>>>>>>>>>>>>>>OFED-1.5.4.1; >>>>>>>>>>>>>>>>CentOS release 6.2; >>>>>>>>>>>>>>>>infiniband 4x FDR >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>I have two problems: >>>>>>>>>>>>>>>>1. I cannot use mxm: >>>>>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host >>>>>>>>>>>>>>>>node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 >>>>>>>>>>>>>>>>./hello >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>A requested component was not found, or was unable to be >>>>>>>>>>>>>>>>opened. This >>>>>>>>>>>>>>>>means that this component is either not installed or is unable >>>>>>>>>>>>>>>>to be >>>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared >>>>>>>>>>>>>>>>libraries >>>>>>>>>>>>>>>>that the component requires are unable to be found/loaded). >>>>>>>>>>>>>>>>Note that >>>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did >>>>>>>>>>>>>>>>not find. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Host: node14 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Framework: pml >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Component: yalla >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel >>>>>>>>>>>>>>>>process is >>>>>>>>>>>>>>>>likely to abort. There are many reasons that a parallel >>>>>>>>>>>>>>>>process can >>>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or >>>>>>>>>>>>>>>>environment >>>>>>>>>>>>>>>>problems. This failure appears to be an internal failure; >>>>>>>>>>>>>>>>here's some >>>>>>>>>>>>>>>>additional information (which may only be relevant to an Open >>>>>>>>>>>>>>>>MPI >>>>>>>>>>>>>>>>developer): >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> mca_pml_base_open() failed >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> --> Returned "Not found" (-13) instead of "Success" (0) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>> and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>killed! 
>>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>> and not able to guarantee that all other processes were >>>>>>>>>>>>>>>>killed! >>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Primary job terminated normally, but 1 process returned >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been >>>>>>>>>>>>>>>>aborted. 
>>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero >>>>>>>>>>>>>>>>status, thus causing >>>>>>>>>>>>>>>>the job to be terminated. The first process to do so was: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Process name: [[9372,1],2] >>>>>>>>>>>>>>>> Exit code: 1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message >>>>>>>>>>>>>>>>help-mca-base.txt / find-available:not-valid >>>>>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 >>>>>>>>>>>>>>>>to see all help / error messages >>>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message >>>>>>>>>>>>>>>>help-mpi-runtime / mpi_init:startup:internal-failur >>>>>>>>>>>>>>>>e >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>1.b $mpirun --mca pml yalla -host node5,node14,node28,node29 >>>>>>>>>>>>>>>>-mca plm_rsh_no_tree_spawn 1 -np 4 ./hello >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>A requested component was not found, or was unable to be >>>>>>>>>>>>>>>>opened. This >>>>>>>>>>>>>>>>means that this component is either not installed or is unable >>>>>>>>>>>>>>>>to be >>>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared >>>>>>>>>>>>>>>>libraries >>>>>>>>>>>>>>>>that the component requires are unable to be found/loaded). >>>>>>>>>>>>>>>>Note that >>>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did >>>>>>>>>>>>>>>>not find. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Host: node5 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Framework: pml >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Component: yalla >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel >>>>>>>>>>>>>>>>process is >>>>>>>>>>>>>>>>likely to abort. There are many reasons that a parallel >>>>>>>>>>>>>>>>process can >>>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or >>>>>>>>>>>>>>>>environment >>>>>>>>>>>>>>>>problems. This failure appears to be an internal failure; >>>>>>>>>>>>>>>>here's some >>>>>>>>>>>>>>>>additional information (which may only be relevant to an Open >>>>>>>>>>>>>>>>MPI >>>>>>>>>>>>>>>>developer): >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> mca_pml_base_open() failed >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> --> Returned "Not found" (-13) instead of "Success" (0) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Primary job terminated normally, but 1 process returned >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>a non-zero exit code.. 
Per user-direction, the job has been >>>>>>>>>>>>>>>>aborted. >>>>>>>>>>>>>>>>------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** An error occurred in MPI_Init >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** on a NULL communicator >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will >>>>>>>>>>>>>>>>now abort, >>>>>>>>>>>>>>>>*** and potentially your MPI job) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed >>>>>>>>>>>>>>>>successfully; not able to aggregate error messages, >>>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero >>>>>>>>>>>>>>>>status, thus causing >>>>>>>>>>>>>>>>the job to be terminated. The first process to do so was: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Process name: [[9619,1],0] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Exit code: 1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>[login:08552] 1 more process has sent help message >>>>>>>>>>>>>>>>help-mca-base.txt / find-available:not-valid >>>>>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 >>>>>>>>>>>>>>>>to see all help / error messages >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>2. I can not remove -mca plm_rsh_no_tree_spawn 1 from mpirun >>>>>>>>>>>>>>>>cmd line : >>>>>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello >>>>>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token >>>>>>>>>>>>>>>>`--tree-spawn' >>>>>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . 
./.profile; >>>>>>>>>>>>>>>>OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sourc >>>>>>>>>>>>>>>>es/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 >>>>>>>>>>>>>>>>; export OPAL_PREFIX; PATH=/gpfs/NETHOME/o >>>>>>>>>>>>>>>>ivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH >>>>>>>>>>>>>>>> ; export PA >>>>>>>>>>>>>>>>TH ; >>>>>>>>>>>>>>>>LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi >>>>>>>>>>>>>>>>-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; >>>>>>>>>>>>>>>>DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nice >>>>>>>>>>>>>>>>vt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH >>>>>>>>>>>>>>>> ; expor >>>>>>>>>>>>>>>>t DYLD_LIBRARY_PATH ; >>>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/o >>>>>>>>>>>>>>>>mpi-mellanox-v1.8/bin/orted --hnp-topo-sig >>>>>>>>>>>>>>>>2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_es >>>>>>>>>>>>>>>>s_jobid "625606656" -mca orte_ess_vpid 3 -mca >>>>>>>>>>>>>>>>orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tc >>>>>>>>>>>>>>>>p://10.65.0.105,10.64.0.105,10.67.0.105:56862 " -mca >>>>>>>>>>>>>>>>orte_hnp_uri "625606656.0; tcp://10.65.0.2,10.67.0.2,8 >>>>>>>>>>>>>>>>3.149.214.101, 10.64.0.2:54893 " --mca pml "yalla" -mca >>>>>>>>>>>>>>>>plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-s >>>>>>>>>>>>>>>>pawn' >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>This usually is caused by: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>* not finding the required libraries and/or binaries on >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> one or more nodes. 
Please check your PATH and LD_LIBRARY_PATH >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> settings, or configure OMPI with >>>>>>>>>>>>>>>>--enable-orterun-prefix-by-default >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>* lack of authority to execute on one or more specified nodes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Please verify your allocation and authorities. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>* the inability to write startup files into /tmp >>>>>>>>>>>>>>>>(--tmpdir/orte_tmpdir_base). >>>>>>>>>>>>>>>> Please check with your sys admin to determine the correct >>>>>>>>>>>>>>>>location to use. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>* compilation of the orted with dynamic libraries when static >>>>>>>>>>>>>>>>are required >>>>>>>>>>>>>>>> (e.g., on Cray). Please check your configure cmd line and >>>>>>>>>>>>>>>>consider using >>>>>>>>>>>>>>>> one of the contrib/platform definitions for your system type. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> lack of common network interfaces and/or no route found >>>>>>>>>>>>>>>>between >>>>>>>>>>>>>>>> them. Please check network connectivity (including firewalls >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> and network routing requirements). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to >>>>>>>>>>>>>>>>forcibly terminate >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Thank you for your comments. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>Best regards, >>>>>>>>>>>>>>>>Timur. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>_______________________________________________ >>>>>>>>>>>>>>>>users mailing list >>>>>>>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>>>>>>Subscription: >>>>>>>>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>Link to this post: >>>>>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2015/05/26919.php >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>_______________________________________________ >>>>>>>>>>>>>>users mailing list >>>>>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>Link to this post: >>>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2015/05/26922.php >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>-- >>>>>>>>>>>>> >>>>>>>>>>>>>Kind Regards, >>>>>>>>>>>>> >>>>>>>>>>>>>M. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>-- >>>>>>>>>>> >>>>>>>>>>>Kind Regards, >>>>>>>>>>> >>>>>>>>>>>M. >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>-- >>>>>>>>> >>>>>>>>>Kind Regards, >>>>>>>>> >>>>>>>>>M. >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>-- >>>>>>> >>>>>>>Kind Regards, >>>>>>> >>>>>>>M. >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>>-- >>>>> >>>>>Kind Regards, >>>>> >>>>>M. >>>> >>>> >>> >>> >>> >>>-- >>> >>>Kind Regards, >>> >>>M. >> >> >> >> >> >>-- >> >>Kind Regards, >> >>M.
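The root cause identified earlier in the thread — HPC-X built against OFED 1.5.3 running on an OFED 1.5.4.1 system — can be caught early by comparing the OFED tag encoded in the HPC-X directory name with the OFED on the node. A sketch using the strings from this thread; on a real node you would obtain the installed version from `ofed_info -s`, if that tool is present.

```shell
#!/bin/sh
# Sketch: compare the OFED version an HPC-X build targets (encoded in
# its directory name) with the OFED installed on the node. Both
# sample values below are taken from this thread.
hpcx_dir=hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2
node_ofed=OFED-1.5.4.1   # e.g. from `ofed_info -s`, trailing colon stripped

# Extract the OFED-x.y.z tag from the directory name.
built_for=$(printf '%s\n' "$hpcx_dir" | sed 's/.*\(OFED-[0-9.]*\).*/\1/')

if [ "$built_for" = "$node_ofed" ]; then
  echo "OFED versions match: $built_for"
else
  echo "mismatch: HPC-X built for $built_for, node runs $node_ofed"
fi
```

The mismatch branch is exactly the situation resolved in this thread by switching to the hpcx-v1.3.330 build for OFED 1.5.4.1.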
Attachments:
- hello_debugMXM_n-2_ppn-2.out (binary data)
- hello_debugMXM_n-2_ppn-2.err (binary data)
- ompi_info.out (binary data)