Here it is:

~/openmpi/bin/mpirun -np 2 -hostfile hostfile --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 ./sendrecv

This is what I got:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   chguo-msr-linux1
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45408,1],0]) is on host: chguo-msr-linux1
  Process 2 ([[45408,1],1]) is on host: chguo-msr-linux02
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[chguo-msr-linux1:12690] *** An error occurred in MPI_Send
[chguo-msr-linux1:12690] *** reported by process [140379686961153,140376711102464]
[chguo-msr-linux1:12690] *** on communicator MPI_COMM_WORLD
[chguo-msr-linux1:12690] *** MPI_ERR_INTERN: internal error
[chguo-msr-linux1:12690] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[chguo-msr-linux1:12690] ***    and potentially your MPI job)
[chguo-msr-linux1:12684] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[chguo-msr-linux1:12684] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

On Tue, Jun 13, 2017 at 5:05 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

> Hi,
>
> Please include your full command line.
>
> Josh
>
> On Mon, Jun 12, 2017 at 6:17 PM, Chuanxiong Guo <chuanxiong....@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have two servers with Mellanox CX4-LX (50GbE Ethernet) back-to-back
>> connected. I am using Ubuntu 14-04. I have made mvapich2 work, and I can
>> confirm both roce and rocev2 work well (by packet capturing).
>>
>> But I still cannot make openmpi work with roce. I am using openmpi 2.1.1.
>> It looks that this version of openmpi does not recognize CX4-LX, which I
>> have added vendor part id 4117 to mca-btl-openib-device-params.ini, and
>> I have also updated opal/mca/common/verbs/common_verbs_port.c to support
>> CX4-LX, which has speed 64 and width 1.
>>
>> But I am still getting:
>>
>> "WARNING: There was an error initializing an OpenFabrics device.
>> Local host: chguo-msr-linux1
>>
>> Local device: mlx5_0
>> "
>> Any hint on what are missing?
>>
>> Thanks,
>> CX
>>
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel