I've tried this out and got the same problem I reported before. With the same configuration and command line, 1.6.5 works for me, but the 1.10 series does not.
Could it also be an IB configuration issue? (ib_write/read_bw/lat work fine across the two nodes.) Error output below:

[[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to: 192.168.2.22
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
wr_id 2318d80 opcode 32767 vendor error 129 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been exceeded.
"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this error
has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 20).  The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   host1
  Local device: mlx4_0
  Peer host:    192.168.2.22

You may need to consult with your system administrator to get this
problem fixed.
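As a side note, the timeout formula in the help message above can be checked, and the two MCA parameters raised on the command line, with a quick sketch (the parameter names come from the message above; the mpirun values and binary name are illustrative only, and raising the timeout usually just papers over a fabric problem rather than fixing it):

```shell
# Effective local ACK timeout for the default btl_openib_ib_timeout=20:
# 4.096 microseconds * 2^20 ~= 4.295 seconds per retry attempt.
awk 'BEGIN { printf "%.3f seconds\n", 4.096e-6 * 2^20 }'

# Illustrative invocation raising the timeout (the retry count already
# defaults to its maximum of 7); printed here rather than executed:
echo 'mpirun --mca btl_openib_ib_timeout 24 --mca btl_openib_ib_retry_count 7 ./a.out'
```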
--------------------------------------------------------------------------

-----Original Message-----
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
Sent: Friday, April 21, 2017 9:41 AM
To: devel@lists.open-mpi.org
Subject: Re: [OMPI devel] openib oob module

Folks,

fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works for me on a mlx4 cluster (Mellanox QDR)

Cheers,

Gilles

On 4/21/2017 1:31 AM, r...@open-mpi.org wrote:
> I’m not seeing any problem inside the OOB - the problem appears to be
> in the info being given to it:
>
> [host1:16244] 1 more process has sent help message
> help-mpi-btl-openib.txt / default subnet prefix
> [host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to:
> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR
> status number 12 for wr_id 112db80 opcode 32767 vendor error 129 qp_idx 0
>
> I’ve been searching, and I don’t see that help message anywhere in
> your output - not sure what happened to it. I do see this in your
> output - don’t know what it means:
>
> [host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb]
> !!!!!!!!!!!!!!!!!!!!!!!!!
>
>> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>>
>> Forgot to enable oob verbose in my last test. Here is the updated
>> output file.
>>
>> Thanks,
>> Shiqing
>>
>> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
>> Sent: Thursday, April 20, 2017 4:29 PM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] openib oob module
>>
>> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI.
>> Should be able to restore it.
>> I honestly don’t recall the bug, though :-(
>> If you want to try reviving it, you can add some debug in there (plus
>> turn on the OOB verbosity) and I’m happy to help you figure it out.
>> Ralph
>>
>> On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>>
>> Hi Ralph,
>>
>> Yes, it’s been a long time. Hope you all are doing well (I believe so :-)).
>>
>> I’m working on a virtualization project and need to run Open MPI
>> on a unikernel OS (most of OFED is missing/unsupported).
>> Actually I’m only focusing on 1.10.2, which still has oob in
>> ompi. Might it be possible to make oob work there? Or even in the
>> 1.10 branch (as Gilles mentioned)?
>> Do you have any clue about the bug in oob back then?
>>
>> Regards,
>> Shiqing
>>
>> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
>> Sent: Thursday, April 20, 2017 3:49 PM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] openib oob module
>>
>> Hi Shiqing!
>>
>> Been a long time - hope you are doing well.
>>
>> I see no way to bring the oob module back now that the BTLs are
>> in the OPAL layer - this is why it was removed: the oob is in
>> ORTE, and thus not accessible from OPAL.
>> Ralph
>>
>> On Apr 20, 2017, at 6:02 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>>
>> Dear all,
>>
>> I noticed that the openib oob module was removed a long time ago,
>> because it wasn’t working anymore and nobody seemed to need it.
>> But for some special operating systems, where rdmacm, udcm, or
>> ibcm kernel support is missing, oob may still be necessary.
>> I’m curious whether it’s possible to bring this module back. How
>> difficult would it be to fix the bug in order to make it work
>> again in the 1.10 branch or later? Thanks a lot.
>> Best Regards,
>> Shiqing
>>
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>> <output.txt>
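For anyone revisiting this thread: on builds where the oob connect module still exists (the 1.10 series, per the discussion above), the openib connection manager can be forced with the btl_openib_cpc_include MCA parameter, with OOB verbosity turned on as Ralph suggests. A dry-run sketch that prints, rather than executes, such an invocation (the host names and test binary are hypothetical, and this assumes the oob CPC actually works on your build):

```shell
# Print the invocation instead of running it, since it needs a real
# IB cluster; hypothetical hosts (host1, host2) and binary (./ring_c).
CMD="mpirun --mca btl openib,self --mca btl_openib_cpc_include oob \
     --mca oob_base_verbose 100 -host host1,host2 ./ring_c"
echo "$CMD"
```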