Sergei, can you run:

ibhosts
ibstat
ibdiagnet

Lord help me for being so naive, but do you have a subnet manager running?

On 1 November 2016 at 06:40, Sergei Hrushev <hrus...@gmail.com> wrote:

> Hi Jeff !
>
>> What does "ompi_info | grep openib" show?
>
> $ ompi_info | grep openib
>          MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>
>> Additionally, Mellanox provides alternate support through their MXM
>> libraries, if you want to try that.
>
> Yes, I know.
> But we already have a hybrid cluster with OpenMPI, OpenMP, CUDA, Torque
> and many other libraries installed, and because it works perfectly over
> the Ethernet interconnect, my idea was to add InfiniBand support with a
> minimum of changes. Mainly because we already have some custom-written
> software for OpenMPI.
>
>> If that shows that you have the openib BTL plugin loaded, try running
>> with "mpirun --mca btl_base_verbose 100 ..." That will provide additional
>> output about why / why not each point-to-point plugin is chosen.
>
> Yes, I tried to get this info already.
> And I saw in the log that rdmacm wants an IP address on the port.
> So my question in the topic start message was:
>
> Is it enough for OpenMPI to have RDMA only, or should IPoIB also be
> installed?
>
> The mpirun output is:
>
> [node1:02674] mca: base: components_register: registering btl components
> [node1:02674] mca: base: components_register: found loaded component openib
> [node1:02674] mca: base: components_register: component openib register function successful
> [node1:02674] mca: base: components_register: found loaded component sm
> [node1:02674] mca: base: components_register: component sm register function successful
> [node1:02674] mca: base: components_register: found loaded component self
> [node1:02674] mca: base: components_register: component self register function successful
> [node1:02674] mca: base: components_open: opening btl components
> [node1:02674] mca: base: components_open: found loaded component openib
> [node1:02674] mca: base: components_open: component openib open function successful
> [node1:02674] mca: base: components_open: found loaded component sm
> [node1:02674] mca: base: components_open: component sm open function successful
> [node1:02674] mca: base: components_open: found loaded component self
> [node1:02674] mca: base: components_open: component self open function successful
> [node1:02674] select: initializing btl component openib
> [node1:02674] openib BTL: rdmacm IP address not found on port
> [node1:02674] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
> [node1:02674] select: init of component openib returned failure
> [node1:02674] mca: base: close: component openib closed
> [node1:02674] mca: base: close: unloading component openib
> [node1:02674] select: initializing btl component sm
> [node1:02674] select: init of component sm returned failure
> [node1:02674] mca: base: close: component sm closed
> [node1:02674] mca: base: close: unloading component sm
> [node1:02674] select: initializing btl component self
> [node1:02674] select: init of component self returned success
> [node1:02674] mca: bml: Using self btl to [[16642,1],0] on node node1
> [node1:02674] mca: base: close: component self closed
> [node1:02674] mca: base: close: unloading component self
>
> Best regards,
> Sergei.
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
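
In case it helps when comparing runs: a rough Python sketch that pulls the per-component result out of "--mca btl_base_verbose 100" output like the log quoted in this thread. The function name summarize_btl_init and the SAMPLE text are made up for illustration, not part of Open MPI. The script only assumes the "select: init of component X returned ..." and "openib BTL: ..." line formats visible in that log, with one log entry per line (i.e. not re-wrapped by a mail client).

```python
import re

# Matches lines such as:
#   [node1:02674] select: init of component openib returned failure
INIT_RE = re.compile(r"select: init of component (\S+) returned (success|failure)")

# Matches openib diagnostic lines such as:
#   [node1:02674] openib BTL: rdmacm IP address not found on port
NOTE_RE = re.compile(r"\] (openib BTL: .+)$")


def summarize_btl_init(log_text):
    """Return ({component: 'success' or 'failure'}, [openib diagnostics]).

    Hypothetical helper: it only understands the line formats seen in the
    quoted log and ignores everything else.
    """
    results = {}
    notes = []
    for raw in log_text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace, if present.
        line = raw.lstrip("> ").strip()
        init = INIT_RE.search(line)
        if init:
            results[init.group(1)] = init.group(2)
            continue
        note = NOTE_RE.search(line)
        if note:
            notes.append(note.group(1))
    return results, notes


# Excerpt of the log from this thread, used as sample input.
SAMPLE = """\
[node1:02674] select: initializing btl component openib
[node1:02674] openib BTL: rdmacm IP address not found on port
[node1:02674] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
[node1:02674] select: init of component openib returned failure
[node1:02674] select: initializing btl component sm
[node1:02674] select: init of component sm returned failure
[node1:02674] select: initializing btl component self
[node1:02674] select: init of component self returned success
"""

if __name__ == "__main__":
    results, notes = summarize_btl_init(SAMPLE)
    for name, status in results.items():
        print(f"{name}: {status}")
    for note in notes:
        print("note:", note)
```

On the sample above this reports openib and sm as failed and self as successful, along with the two rdmacm messages, which is the part worth watching once the port has an address or a subnet manager is confirmed running.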