I'm sorry for the delay.

Here it is (I used a 5-minute time limit):
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun \
  -x LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm/debug/lib/libmxm.so \
  -x MXM_LOG_LEVEL=data \
  -x MXM_IB_PORTS=mlx4_0:1 \
  -x MXM_SHM_KCOPY_MODE=off \
  --mca pml yalla \
  --hostfile hostlist \
  ./hello 1> hello_debugMXM_n-2_ppn-2.out 2> hello_debugMXM_n-2_ppn-2.err
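(As a quick sanity check before a run like this, assuming the HPCX environment variables $HPCX_MPI_DIR and $HPCX_MXM_DIR used elsewhere in this thread are set, one can confirm that the HPCX mpirun and the debug MXM library are the ones actually being picked up:)

  which mpirun ; echo $OPAL_PREFIX            # both should point into the HPCX ompi-mellanox-v1.8 tree
  ls -l $HPCX_MXM_DIR/debug/lib/libmxm.so     # the debug library named in LD_PRELOAD above
  $HPCX_MPI_DIR/bin/ompi_info | grep -E 'pml: yalla|mtl: mxm'   # both components should be listed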
P.S.
yalla works fine with a rebuilt ompi: --with-mxm=$HPCX_MXM_DIR
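(A minimal sketch of such a rebuild, assuming an Open MPI 1.8.5 source tree and a hypothetical install prefix; only --with-mxm=$HPCX_MXM_DIR is taken from this thread, the rest are ordinary defaults:)

  cd openmpi-1.8.5
  ./configure --prefix=$HOME/ompi-1.8.5-mxm --with-mxm=$HPCX_MXM_DIR
  make -j8 && make install
  # afterwards the MXM-based components should be visible:
  $HOME/ompi-1.8.5-mxm/bin/ompi_info | grep -E 'pml: yalla|mtl: mxm'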






Tuesday, May 26, 2015, 16:22 +03:00 from Alina Sklarevich <ali...@dev.mellanox.co.il>:
>Hi Timur,
>
>HPCX has a debug version of MXM. Can you please add the following to your 
>command line with pml yalla in order to use it and attach the output? 
>"-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
>
>Also, could you please attach the entire output of 
>"$HPCX_MPI_DIR/bin/ompi_info -a" 
>
>Thank you,
>Alina. 
>
>On Tue, May 26, 2015 at 3:39 PM, Mike Dubman  < mi...@dev.mellanox.co.il > 
>wrote:
>>Alina - could you please take a look?
>>Thx
>>
>>
>>---------- Forwarded message ----------
>>From:  Timur Ismagilov < tismagi...@mail.ru >
>>Date: Tue, May 26, 2015 at 12:40 PM
>>Subject: Re[12]: [OMPI users] MXM problem
>>To: Open MPI Users < us...@open-mpi.org >
>>Cc: Mike Dubman < mi...@dev.mellanox.co.il >
>>
>>
>>It does not work on a single node:
>>
>>1) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out
>>
>>2) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out
>>
>>I've attached the yalla.out and cm_mxm.out to this email.
>>
>>
>>
>>Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>Does it work from a single node?
>>>Could you please run with the options below and attach the output?
>>>
>>> -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose 10 
>>>--debug-daemons
>>>
>>>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>1. mxm_perf_test - OK.
>>>>2. no_tree_spawn - OK.
>>>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the prebuilt ompi-1.8.5 from hpcx-v1.3.330).
>>>>3.a) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>>>MXM_SHM_KCOPY_MODE=off -host node5,node153  --mca pml cm --mca mtl mxm 
>>>>--prefix $HPCX_MPI_DIR ./hello
>>>>--------------------------------------------------------------------------
>>>>A requested component was not found, or was unable to be opened.  This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded).  Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:      node153
>>>>Framework: mtl
>>>>Component: mxm
>>>>--------------------------------------------------------------------------
>>>>[node5:113560] PML cm cannot be selected
>>>>--------------------------------------------------------------------------
>>>>No available pml components were found!
>>>>
>>>>This means that there are no components of this type installed on your
>>>>system or all the components reported that they could not be used.
>>>>
>>>>This is a fatal error; your MPI process is likely to abort.  Check the
>>>>output of the "ompi_info" command and ensure that components of this
>>>>type are available on your system.  You may also wish to check the
>>>>value of the "component_path" MCA parameter and ensure that it has at
>>>>least one directory that contains valid MCA components.
>>>>--------------------------------------------------------------------------
>>>>[node153:44440] PML cm cannot be selected
>>>>-------------------------------------------------------
>>>>Primary job  terminated normally, but 1 process returned
>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>-------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[43917,1],0]
>>>>  Exit code:    1
>>>>--------------------------------------------------------------------------
>>>>[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:none-found
>>>>
>>>>3.b) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>>>MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix 
>>>>$HPCX_MPI_DIR ./hello
>>>>--------------------------------------------------------------------------
>>>>A requested component was not found, or was unable to be opened.  This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded).  Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:      node153
>>>>Framework: pml
>>>>Component: yalla
>>>>--------------------------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>--------------------------------------------------------------------------
>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>likely to abort.  There are many reasons that a parallel process can
>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>problems.  This failure appears to be an internal failure; here's some
>>>>additional information (which may only be relevant to an Open MPI
>>>>developer):
>>>>
>>>>  mca_pml_base_open() failed
>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>--------------------------------------------------------------------------
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>[node153:43979] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>-------------------------------------------------------
>>>>Primary job  terminated normally, but 1 process returned
>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>-------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[44992,1],1]
>>>>  Exit code:    1
>>>>--------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>host:$ echo $HPCX_MPI_DIR
>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
>>>>host:$ ompi_info | grep pml
>>>>                 MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
>>>>                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
>>>>                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
>>>>                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
>>>>                 MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
>>>>host: tests$ ompi_info | grep mtl
>>>>                 MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
>>>>
>>>>P.S.
>>>>Possible error in the FAQ? (http://www.open-mpi.org/faq/?category=openfabrics#mxm)
>>>>47. Does Open MPI support MXM?
>>>>............
>>>>NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and above
>>>>...........
>>>>But here we have (or do we?) yalla in ompi 1.8.5.
>>>>
>>>>
>>>>
>>>>Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>Hi Timur,
>>>>>
>>>>>Here it goes:
>>>>>
>>>>>wget  
>>>>>ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>>>>>
>>>>>Please let me know if it works for you, and we will add the 1.5.4.1 MOFED to the default distribution list.
>>>>>
>>>>>M
>>>>>
>>>>>
>>>>>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>wrote:
>>>>>>Thanks a lot.
>>>>>>
>>>>>>Monday, May 25, 2015, 21:28 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>
>>>>>>>Will send you the link tomorrow.
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>>>wrote:
>>>>>>>>Where can I find MXM for OFED 1.5.4.1?
>>>>>>>>
>>>>>>>>
>>>>>>>>Monday, May 25, 2015, 21:11 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>
>>>>>>>>>BTW, the OFED on your system is 1.5.4.1, while the HPCX in use is built for OFED 1.5.3.
>>>>>>>>>
>>>>>>>>>Seems like an ABI issue between OFED versions.
>>>>>>>>>
>>>>>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov  < tismagi...@mail.ru 
>>>>>>>>>> wrote:
>>>>>>>>>>I did as you said, but got an error:
>>>>>>>>>>
>>>>>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>>>node1$ ./mxm_perftest
>>>>>>>>>>Waiting for connection...
>>>>>>>>>>Accepted connection from 10.65.0.253
>>>>>>>>>>[1432576262.370195] [node153:35388:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>
>>>>>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>>>>>[1432576262.367523] [node158:99366:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>>>scif is an OFA device from Intel.
>>>>>>>>>>>Can you please set export MXM_IB_PORTS=mlx4_0:1 explicitly and retry?
>>>>>>>>>>>
>>>>>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov  < 
>>>>>>>>>>>tismagi...@mail.ru > wrote:
>>>>>>>>>>>>Hi Mike,
>>>>>>>>>>>>this is what I have:
>>>>>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>>>>>>+ Intel compiler paths
>>>>>>>>>>>>
>>>>>>>>>>>>$ echo $OPAL_PREFIX
>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>>>>>
>>>>>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>>>>>
>>>>>>>>>>>>In the attached file (ompi_info.out) you will find the output of the ompi_info -l 9 command.
>>>>>>>>>>>>
>>>>>>>>>>>>P.S.
>>>>>>>>>>>>node1 $ ./mxm_perftest
>>>>>>>>>>>>node2 $ ./mxm_perftest node1 -t send_lat
>>>>>>>>>>>>[1432568685.067067] [node151:87372:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.          (I don't have knem)
>>>>>>>>>>>>[1432568685.069699] [node151:87372:0]      ib_dev.c:531  MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device                                (???)
>>>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>>>
>>>>>>>>>>>>$  ibv_devinfo                                         
>>>>>>>>>>>>hca_id: mlx4_0                                                  
>>>>>>>>>>>>        transport:                      InfiniBand (0)          
>>>>>>>>>>>>        fw_ver:                         2.10.600                
>>>>>>>>>>>>        node_guid:                      0002:c903:00a1:13b0     
>>>>>>>>>>>>        sys_image_guid:                 0002:c903:00a1:13b3     
>>>>>>>>>>>>        vendor_id:                      0x02c9                  
>>>>>>>>>>>>        vendor_part_id:                 4099                    
>>>>>>>>>>>>        hw_ver:                         0x0                     
>>>>>>>>>>>>        board_id:                       MT_1090120019           
>>>>>>>>>>>>        phys_port_cnt:                  2                       
>>>>>>>>>>>>                port:   1                                       
>>>>>>>>>>>>                        state:                  PORT_ACTIVE (4) 
>>>>>>>>>>>>                        max_mtu:                4096 (5)        
>>>>>>>>>>>>                        active_mtu:             4096 (5)        
>>>>>>>>>>>>                        sm_lid:                 1               
>>>>>>>>>>>>                        port_lid:               83              
>>>>>>>>>>>>                        port_lmc:               0x00            
>>>>>>>>>>>>                                                                
>>>>>>>>>>>>                port:   2                                       
>>>>>>>>>>>>                        state:                  PORT_DOWN (1)   
>>>>>>>>>>>>                        max_mtu:                4096 (5)        
>>>>>>>>>>>>                        active_mtu:             4096 (5)        
>>>>>>>>>>>>                        sm_lid:                 0               
>>>>>>>>>>>>                        port_lid:               0               
>>>>>>>>>>>>                        port_lmc:               0x00            
>>>>>>>>>>>>
>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>Timur.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>>>>>Hi Timur,
>>>>>>>>>>>>>It seems that the yalla component was not found in your OMPI tree.
>>>>>>>>>>>>>Can it be that your mpirun is not from HPCX? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX to make sure they point to the right mpirun?
>>>>>>>>>>>>>
>>>>>>>>>>>>>Also, could you please check that yalla is present in the 
>>>>>>>>>>>>>ompi_info -l 9 output?
>>>>>>>>>>>>>
>>>>>>>>>>>>>Thanks
>>>>>>>>>>>>>
>>>>>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov  < 
>>>>>>>>>>>>>tismagi...@mail.ru > wrote:
>>>>>>>>>>>>>>I can password-less ssh to all nodes:
>>>>>>>>>>>>>>base$ ssh node1
>>>>>>>>>>>>>>node1$ssh node2
>>>>>>>>>>>>>>Last login: Mon May 25 18:41:23 
>>>>>>>>>>>>>>node2$ssh node3
>>>>>>>>>>>>>>Last login: Mon May 25 16:25:01
>>>>>>>>>>>>>>node3$ssh node4
>>>>>>>>>>>>>>Last login: Mon May 25 16:27:04
>>>>>>>>>>>>>>node4$
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Is this correct?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>In ompi-1.9 I do not have the no-tree-spawn problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>I can’t speak to the mxm problem, but the no-tree-spawn issue 
>>>>>>>>>>>>>>>indicates that you don’t have password-less ssh authorized 
>>>>>>>>>>>>>>>between the compute nodes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < 
>>>>>>>>>>>>>>>>tismagi...@mail.ru > wrote:
>>>>>>>>>>>>>>>>Hello!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>>>>>>>>>>>OFED-1.5.4.1;
>>>>>>>>>>>>>>>>CentOS release 6.2;
>>>>>>>>>>>>>>>>infiniband 4x FDR
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>I have two problems:
>>>>>>>>>>>>>>>>1. I cannot use mxm:
>>>>>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host 
>>>>>>>>>>>>>>>>node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 
>>>>>>>>>>>>>>>>./hello
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Host:      node14
>>>>>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>>>>>developer):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>>>Primary job  terminated normally, but 1 process returned
>>>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Process name: [[9372,1],2]
>>>>>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>1.b $mpirun --mca pml yalla -host node5,node14,node28,node29 
>>>>>>>>>>>>>>>>-mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Host:      node5
>>>>>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>>>>>developer):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>>>Primary job  terminated normally, but 1 process returned
>>>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Process name: [[9619,1],0]
>>>>>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
>>>>>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile;
>>>>>>>>>>>>>>>>OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX;
>>>>>>>>>>>>>>>>PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ;
>>>>>>>>>>>>>>>>LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>>>>>>>>>>>>DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;
>>>>>>>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>>>>>>>>>>>This usually is caused by:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>* not finding the required libraries and/or binaries on
>>>>>>>>>>>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>>>>>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>>>>>>>>>>>  Please verify your allocation and authorities.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>>>>>>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>>>>>>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>>>>>>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>>>>>>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>>>>>>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>>>>>>>>>>>  and network routing requirements).
>>>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Thank you for your comments.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>>>>>Timur.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>>>users mailing list
>>>>>>>>>>>>>>>>us...@open-mpi.org
>>>>>>>>>>>>>>>>Subscription:  
>>>>>>>>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>>>Link to this post:  
>>>>>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>users mailing list
>>>>>>>>>>>>>>us...@open-mpi.org
>>>>>>>>>>>>>>Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>Link to this post:  
>>>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>-- 
>>>>>>>>>>>>>
>>>>>>>>>>>>>Kind Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>>M.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>-- 
>>>>>>>>>>>
>>>>>>>>>>>Kind Regards,
>>>>>>>>>>>
>>>>>>>>>>>M.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>-- 
>>>>>>>>>
>>>>>>>>>Kind Regards,
>>>>>>>>>
>>>>>>>>>M.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>-- 
>>>>>>>
>>>>>>>Kind Regards,
>>>>>>>
>>>>>>>M.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-- 
>>>>>
>>>>>Kind Regards,
>>>>>
>>>>>M.
>>>>
>>>>
>>>
>>>
>>>
>>>-- 
>>>
>>>Kind Regards,
>>>
>>>M.
>>
>>
>>
>>
>>
>>-- 
>>
>>Kind Regards,
>>
>>M.




Attachment: hello_debugMXM_n-2_ppn-2.out
Description: Binary data

Attachment: hello_debugMXM_n-2_ppn-2.err
Description: Binary data

Attachment: ompi_info.out
Description: Binary data
