Hi Mike,
this is what I have:
$ echo $LD_LIBRARY_PATH | tr ":" "\n"
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
+ Intel compiler paths

$ echo $OPAL_PREFIX
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8

I don't use LD_PRELOAD.
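
(To double-check that the mpirun on my PATH is the hpcx one, a quick sanity check would be something like:

$ which mpirun        # should resolve into .../hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin
$ mpirun --version    # should report the Open MPI 1.8.x build shipped with hpcx

The expected paths here are just what I would assume from the LD_LIBRARY_PATH/OPAL_PREFIX settings shown above.)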

In the attached file (ompi_info.out) you will find the output of the
ompi_info -l 9 command.
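
For reference, this is the kind of quick check I can run against that output to look for yalla (just a sketch; the exact component line format may differ between Open MPI versions):

$ ompi_info -l 9 | grep -i yalla     # should print an "MCA pml: yalla ..." line if the component was built
$ ompi_info | grep "MCA pml"         # lists all pml components ompi_info can see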

P.S.
node1 $ ./mxm_perftest
node2 $ ./mxm_perftest node1 -t send_lat
[1432568685.067067] [node151:87372:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.   (I don't have knem)
[1432568685.069699] [node151:87372:0]      ib_dev.c:531  MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device   (???)
Failed to create endpoint: No such device
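
(I have not yet tried pinning MXM to the active Mellanox port; a sketch of what I could try, assuming this MXM version honours the MXM_RDMA_PORTS variable:

$ export MXM_RDMA_PORTS=mlx4_0:1     # restrict MXM to the mlx4_0 HCA, port 1 -- the only PORT_ACTIVE port below
node1 $ ./mxm_perftest
node2 $ ./mxm_perftest node1 -t send_lat

and, for the MPI runs, forwarding the same variable with "mpirun -x MXM_RDMA_PORTS=mlx4_0:1 ...".)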

$  ibv_devinfo                                         
hca_id: mlx4_0                                                  
        transport:                      InfiniBand (0)          
        fw_ver:                         2.10.600                
        node_guid:                      0002:c903:00a1:13b0     
        sys_image_guid:                 0002:c903:00a1:13b3     
        vendor_id:                      0x02c9                  
        vendor_part_id:                 4099                    
        hw_ver:                         0x0                     
        board_id:                       MT_1090120019           
        phys_port_cnt:                  2                       
                port:   1                                       
                        state:                  PORT_ACTIVE (4) 
                        max_mtu:                4096 (5)        
                        active_mtu:             4096 (5)        
                        sm_lid:                 1               
                        port_lid:               83              
                        port_lmc:               0x00            
                                                                
                port:   2                                       
                        state:                  PORT_DOWN (1)   
                        max_mtu:                4096 (5)        
                        active_mtu:             4096 (5)        
                        sm_lid:                 0               
                        port_lid:               0               
                        port_lmc:               0x00            

Best regards,
Timur.


Monday, 25 May 2015, 19:39 +03:00 from Mike Dubman 
<mi...@dev.mellanox.co.il>:
>Hi Timur,
>it seems that the yalla component was not found in your OMPI tree.
>Can it be that your mpirun is not from hpcx? Could you please check that
>LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX are pointing to the
>right mpirun?
>
>Also, could you please check that yalla is present in the ompi_info -l 9 
>output?
>
>Thanks
>
>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>I can password-less ssh to all nodes:
>>base$ ssh node1
>>node1$ ssh node2
>>Last login: Mon May 25 18:41:23
>>node2$ ssh node3
>>Last login: Mon May 25 16:25:01
>>node3$ ssh node4
>>Last login: Mon May 25 16:27:04
>>node4$
>>
>>Is this correct?
>>
>>In ompi-1.9 I do not have the no-tree-spawn problem.
>>
>>
>>Monday, 25 May 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>>
>>>I can’t speak to the mxm problem, but the no-tree-spawn issue indicates that 
>>>you don’t have password-less ssh authorized between the compute nodes.
>>>
>>>
>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>
>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>OFED-1.5.4.1;
>>>>CentOS release 6.2;
>>>>infiniband 4x FDR
>>>>
>>>>
>>>>
>>>>I have two problems:
>>>>1. I cannot use mxm:
>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 
>>>>-mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>--------------------------------------------------------------------------
>>>>A requested component was not found, or was unable to be opened.  This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded).  Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:      node14
>>>>Framework: pml
>>>>Component: yalla
>>>>--------------------------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>--------------------------------------------------------------------------
>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>likely to abort.  There are many reasons that a parallel process can
>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>problems.  This failure appears to be an internal failure; here's some
>>>>additional information (which may only be relevant to an Open MPI
>>>>developer):
>>>>
>>>>  mca_pml_base_open() failed
>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>--------------------------------------------------------------------------
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>*** An error occurred in MPI_Init
>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>-------------------------------------------------------
>>>>Primary job  terminated normally, but 1 process returned
>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>-------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[9372,1],2]
>>>>  Exit code:    1
>>>>--------------------------------------------------------------------------
>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>
>>>>1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca 
>>>>plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>--------------------------------------------------------------------------
>>>>A requested component was not found, or was unable to be opened.  This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded).  Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:      node5
>>>>Framework: pml
>>>>Component: yalla
>>>>--------------------------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>--------------------------------------------------------------------------
>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>likely to abort.  There are many reasons that a parallel process can
>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>problems.  This failure appears to be an internal failure; here's some
>>>>additional information (which may only be relevant to an Open MPI
>>>>developer):
>>>>
>>>>  mca_pml_base_open() failed
>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>--------------------------------------------------------------------------
>>>>-------------------------------------------------------
>>>>Primary job  terminated normally, but 1 process returned
>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>-------------------------------------------------------
>>>>*** An error occurred in MPI_Init
>>>>*** on a NULL communicator
>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>***    and potentially your MPI job)
>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>--------------------------------------------------------------------------
>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>the job to be terminated. The first process to do so was:
>>>>
>>>>  Process name: [[9619,1],0]
>>>>  Exit code:    1
>>>>--------------------------------------------------------------------------
>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>
>>>>2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>--------------------------------------------------------------------------
>>>>ORTE was unable to reliably start one or more daemons.
>>>>This usually is caused by:
>>>>
>>>>* not finding the required libraries and/or binaries on
>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>
>>>>* lack of authority to execute on one or more specified nodes.
>>>>  Please verify your allocation and authorities.
>>>>
>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>  Please check with your sys admin to determine the correct location to use.
>>>>
>>>>* compilation of the orted with dynamic libraries when static are required
>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>  one of the contrib/platform definitions for your system type.
>>>>
>>>>* an inability to create a connection back to mpirun due to a
>>>>  lack of common network interfaces and/or no route found between
>>>>  them. Please check network connectivity (including firewalls
>>>>  and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>>>Thank you for your comments.
>>>> 
>>>>Best regards,
>>>>Timur.
>>>> 
>>>>
>>>>
>
>
>
>-- 
>
>Kind Regards,
>
>M.



Attachment: ompi_info.out
