I’m not seeing any problem inside the OOB - the problem appears to be in the info being given to it:

[host1:16244] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to: 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 112db80 opcode 32767 vendor error 129 qp_idx 0

I’ve been searching, and I don’t see that help message anywhere in your output - not sure what happened to it. I do see this in your output - don’t know what it means:

[host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb] !!!!!!!!!!!!!!!!!!!!!!!!!

> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>
> Forgot to enable oob verbose in my last test. Here is the updated output file.
>
> Thanks,
> Shiqing
>
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
> Sent: Thursday, April 20, 2017 4:29 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] openib oob module
>
> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI. Should be able to restore it. I honestly don’t recall the bug, though :-(
>
> If you want to try reviving it, you can add some debug in there (plus turn on the OOB verbosity) and I’m happy to help you figure it out.
> Ralph
>
> On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>
> Hi Ralph,
>
> Yes, it’s been a long time. Hope you all are doing well (I believe so :-) ).
>
> I’m working on a virtualization project, and need to run Open MPI on a unikernel OS (most of OFED is missing/unsupported).
>
> Actually I’m only focusing on 1.10.2, which still has oob in ompi. Probably it might be possible to make oob work there? Or even for 1.10 branch (as Gilles mentioned)?
> Do you have any clue about the bug in oob back then?
>
> Regards,
> Shiqing
>
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
> Sent: Thursday, April 20, 2017 3:49 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] openib oob module
>
> Hi Shiqing!
>
> Been a long time - hope you are doing well.
>
> I see no way to bring the oob module back now that the BTLs are in the OPAL layer - this is why it was removed as the oob is in ORTE, and thus not accessible from OPAL.
> Ralph
>
> On Apr 20, 2017, at 6:02 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>
> Dear all,
>
> I noticed that the openib oob module was removed a long time ago, because it wasn’t working anymore and nobody seemed to need it.
> But for some special operating systems, where the rdmacm, udcm or ibcm kernel support is missing, oob may still be necessary.
>
> I’m curious if it’s possible to bring this module back? How difficult would it be to fix the bug in order to make it work again in 1.10 branch or later?
> Thanks a lot.
>
> Best Regards,
> Shiqing
>
> <output.txt>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel