Sergei,
can you run:

ibhosts

ibstat

ibdiagnet


Lord help me for being so naive, but do you have a subnet manager running?
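If no subnet manager is running, ibstat will usually show the port stuck short of the Active state, with an SM lid of 0. Below is a minimal Python sketch that runs ibstat and flags such ports. It assumes ibstat's usual text layout ("CA '...'" headers, "Port N:" blocks, then "Key: value" lines). The function names are hypothetical helpers, not part of infiniband-diags.

```python
import re
import subprocess
from typing import Dict, List, Optional


def parse_ibstat(text: str) -> List[Dict[str, str]]:
    """Split ibstat output into one dict per (CA, port) holding that port's fields."""
    ports: List[Dict[str, str]] = []
    ca: Optional[str] = None
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"CA '([^']+)'", line)
        if header:
            # New adapter block, e.g. "CA 'mlx4_0'"; adapter-level fields are ignored.
            ca, current = header.group(1), None
            continue
        port = re.match(r"Port (\d+):$", line)
        if port and ca is not None:
            # New port block, e.g. "Port 1:"; following "Key: value" lines belong to it.
            current = {"ca": ca, "port": port.group(1)}
            ports.append(current)
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def sm_problems(ports: List[Dict[str, str]]) -> List[str]:
    """Return warnings for ports that are not Active or report no subnet manager."""
    warnings: List[str] = []
    for p in ports:
        name = f"{p['ca']}:{p['port']}"
        state = p.get("State", "?")
        if state != "Active":
            warnings.append(f"{name} state is {state}, expected Active")
        if p.get("SM lid") == "0":
            warnings.append(f"{name} reports SM lid 0 (no subnet manager seen)")
    return warnings


def check_local_host() -> List[str]:
    """Run ibstat on this node (needs infiniband-diags installed) and return warnings."""
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    return sm_problems(parse_ibstat(out))
```

On a node with infiniband-diags installed, print(check_local_host()) should give an empty list when every port is Active and sees a subnet manager.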



On 1 November 2016 at 06:40, Sergei Hrushev <hrus...@gmail.com> wrote:

> Hi Jeff!
>
>> What does "ompi_info | grep openib" show?
>>
>>
> $ ompi_info | grep openib
>                  MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>
>> Additionally, Mellanox provides alternate support through their MXM
>> libraries, if you want to try that.
>>
>
> Yes, I know.
> But we already have a hybrid cluster with Open MPI, OpenMP, CUDA, Torque
> and many other libraries installed, and because it works perfectly over
> the Ethernet interconnect, my idea was to add InfiniBand support with
> minimal changes, mainly because we already have some custom-written
> software for Open MPI.
>
>
>> If that shows that you have the openib BTL plugin loaded, try running
>> with "mpirun --mca btl_base_verbose 100 ..."  That will provide additional
>> output about why / why not each point-to-point plugin is chosen.
>>
>>
> Yes, I already tried to get this info,
> and I saw in the log that rdmacm wants an IP address on the port.
> So the question in my opening message in this topic was:
>
> Is it enough for Open MPI to have RDMA only, or should IPoIB also be
> installed?
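The log line "rdmacm IP address not found on port" suggests the rdmacm connection method is looking for an IP address on the IB port, i.e. an IPoIB interface. Below is a minimal Python sketch for checking whether any IPoIB interface exists. It assumes a Linux host with sysfs, where IPoIB netdevs report link type 32 (ARPHRD_INFINIBAND) in /sys/class/net/<if>/type. The function name is a hypothetical helper.

```python
import os
from typing import List

# ARPHRD_INFINIBAND from <linux/if_arp.h>: the link type sysfs reports for IPoIB netdevs.
ARPHRD_INFINIBAND = 32


def ipoib_interfaces(sys_net: str = "/sys/class/net") -> List[str]:
    """Return names of network interfaces whose link type is InfiniBand (IPoIB)."""
    found: List[str] = []
    for name in sorted(os.listdir(sys_net)):
        type_file = os.path.join(sys_net, name, "type")
        try:
            with open(type_file) as f:
                if int(f.read().strip()) == ARPHRD_INFINIBAND:
                    found.append(name)
        except (OSError, ValueError):
            # Not an interface directory (e.g. bonding_masters) or unreadable: skip it.
            continue
    return found
```

An empty list means there is no IPoIB netdev for rdmacm to take an address from. If an interface is listed, "ip addr show <name>" shows whether an address is actually assigned to it.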
>
> The mpirun output is:
>
> [node1:02674] mca: base: components_register: registering btl components
> [node1:02674] mca: base: components_register: found loaded component openib
> [node1:02674] mca: base: components_register: component openib register function successful
> [node1:02674] mca: base: components_register: found loaded component sm
> [node1:02674] mca: base: components_register: component sm register function successful
> [node1:02674] mca: base: components_register: found loaded component self
> [node1:02674] mca: base: components_register: component self register function successful
> [node1:02674] mca: base: components_open: opening btl components
> [node1:02674] mca: base: components_open: found loaded component openib
> [node1:02674] mca: base: components_open: component openib open function successful
> [node1:02674] mca: base: components_open: found loaded component sm
> [node1:02674] mca: base: components_open: component sm open function successful
> [node1:02674] mca: base: components_open: found loaded component self
> [node1:02674] mca: base: components_open: component self open function successful
> [node1:02674] select: initializing btl component openib
> [node1:02674] openib BTL: rdmacm IP address not found on port
> [node1:02674] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
> [node1:02674] select: init of component openib returned failure
> [node1:02674] mca: base: close: component openib closed
> [node1:02674] mca: base: close: unloading component openib
> [node1:02674] select: initializing btl component sm
> [node1:02674] select: init of component sm returned failure
> [node1:02674] mca: base: close: component sm closed
> [node1:02674] mca: base: close: unloading component sm
> [node1:02674] select: initializing btl component self
> [node1:02674] select: init of component self returned success
> [node1:02674] mca: bml: Using self btl to [[16642,1],0] on node node1
> [node1:02674] mca: base: close: component self closed
> [node1:02674] mca: base: close: unloading component self
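With btl_base_verbose at 100 the log gets long, so it can help to pull out just the per-component init results and the openib BTL's own diagnostics. Below is a minimal Python sketch. It assumes the "select: init of component X returned success/failure" and "openib BTL: ..." line formats seen in the output above. The function names and the log file name mpirun.log are hypothetical.

```python
import re
from typing import Dict, List

# Matches e.g. "[node1:02674] select: init of component openib returned failure"
INIT_RE = re.compile(r"select: init of component (\S+) returned (success|failure)")


def btl_init_results(log: str) -> Dict[str, str]:
    """Map each BTL component name to 'success' or 'failure' from verbose mpirun output."""
    results: Dict[str, str] = {}
    for line in log.splitlines():
        m = INIT_RE.search(line)
        if m:
            results[m.group(1)] = m.group(2)
    return results


def openib_messages(log: str) -> List[str]:
    """Collect the text of the openib BTL's own diagnostic lines, in order."""
    return [line.split("openib BTL: ", 1)[1].strip()
            for line in log.splitlines() if "openib BTL: " in line]
```

For the run above this gives openib and sm as failure and only self as success, with the two rdmacm messages as the openib diagnostics. Typical use: btl_init_results(open("mpirun.log").read()).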
>
> Best regards,
> Sergei.
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>