The gap between these two versions is huge. I will first try to debug a
bit more in 1.10.

Regards,
Shiqing

-----Original Message-----
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Friday, April 21, 2017 4:02 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module

I’m not familiar with the openib code, but this looks to me like it may be 
caused by a change in the openib code itself. Have you looked to see what the 
diff might be between the two versions?

> On Apr 21, 2017, at 6:45 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
> 
> I've tried this out and got the same problem that I reported before.
> 
> With the same configuration and command line, 1.6.5 works for me, but the
> 1.10 series does not.
> 
> Could it also be an IB configuration issue? (ib_write/read_bw/lat work
> fine across the two nodes)
> 
> Error output below:
> 
> [[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 
> to: 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR 
> status number 12 for wr_id 2318d80 opcode 32767  vendor error 129 
> qp_idx 0
> 
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>    The total number of times that the sender wishes the receiver to
>    retry timeout, packet sequence, etc. errors before posting a
>    completion error.
> 
> This error typically means that there is something awry within the 
> InfiniBand fabric itself.  You should note the hosts on which this 
> error has occurred; it has been observed that rebooting or removing a 
> particular host from the job can sometimes resolve this issue.
> 
> Two MCA parameters can be used to control Open MPI's behavior with 
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will  
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted  
> to 20).  The actual timeout value used is calculated as:
> 
>     4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> 
> Below is some information about the host that raised the error and the 
> peer to which it was connected:
> 
>  Local host:   host1
>  Local device: mlx4_0
>  Peer host:    192.168.2.22
> 
> You may need to consult with your system administrator to get this 
> problem fixed.
> --------------------------------------------------------------------------
> 
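
As a worked example of the timeout formula in the help text above, here is a minimal Python sketch. It only evaluates 4.096 microseconds * (2^btl_openib_ib_timeout); the function name ack_timeout_seconds is made up for illustration and is not part of Open MPI.

```python
# Evaluate the local ACK timeout formula from the help text:
#   4.096 microseconds * (2 ** btl_openib_ib_timeout)
# ack_timeout_seconds is a hypothetical helper, not an Open MPI API.

BASE_MICROSECONDS = 4.096


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the local ACK timeout in seconds for the given exponent."""
    return BASE_MICROSECONDS * (2 ** ib_timeout) / 1_000_000


# Show the default from the help text (20) next to a few nearby values.
for exponent in (18, 20, 22):
    print(f"btl_openib_ib_timeout={exponent}: "
          f"{ack_timeout_seconds(exponent):.3f} s")
```

With the default of 20 this comes to roughly 4.3 seconds per timeout, and each increment of the parameter doubles it. A different value can be passed on the command line with mpirun's --mca option, for example "--mca btl_openib_ib_timeout 22".
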
> -----Original Message-----
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
> Gilles Gouaillardet
> Sent: Friday, April 21, 2017 9:41 AM
> To: devel@lists.open-mpi.org
> Subject: Re: [OMPI devel] openib oob module
> 
> Folks,
> 
> 
> fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works 
> for me on a mlx4 cluster (Mellanox QDR)
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 4/21/2017 1:31 AM, r...@open-mpi.org wrote:
>> I’m not seeing any problem inside the OOB - the problem appears to be 
>> in the info being given to it:
>> 
>> [host1:16244] 1 more process has sent help message
>> help-mpi-btl-openib.txt / default subnet prefix
>> [host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to
>> see all help / error messages
>> [[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to:
>> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR
>> status number 12 for wr_id 112db80 opcode 32767  vendor error 129
>> qp_idx 0
>> 
>> I’ve been searching, and I don’t see that help message anywhere in 
>> your output - not sure what happened to it. I do see this in your 
>> output - don’t know what it means:
>> 
>> [host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb]
>> !!!!!!!!!!!!!!!!!!!!!!!!!
>> 
>> 
>>> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing....@huawei.com 
>>> <mailto:shiqing....@huawei.com>> wrote:
>>> 
>>> Forgot to enable oob verbose in my last test. Here is the updated 
>>> output file.
>>> Thanks,
>>> Shiqing
>>> *From:* devel [mailto:devel-boun...@lists.open-mpi.org] *On Behalf
>>> Of* r...@open-mpi.org <mailto:r...@open-mpi.org>
>>> *Sent:* Thursday, April 20, 2017 4:29 PM
>>> *To:* OpenMPI Devel
>>> *Subject:* Re: [OMPI devel] openib oob module
>>> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI.
>>> Should be able to restore it. I honestly don’t recall the bug,
>>> though :-( If you want to try reviving it, you can add some debug in
>>> there (plus turn on the OOB verbosity) and I’m happy to help you
>>> figure it out.
>>> Ralph
>>> 
>>>    On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing....@huawei.com
>>>    <mailto:shiqing....@huawei.com>> wrote:
>>>    Hi Ralph,
>>>    Yes, it’s been a long time. Hope you all are doing well (I
>>>    believe so :-)).
>>>    I’m working on a virtualization project, and need to run Open MPI
>>>    on a unikernel OS (most of OFED is missing/unsupported).
>>>    Actually I’m only focusing on 1.10.2, which still has oob in
>>>    ompi. Perhaps it is possible to make oob work there? Or
>>>    even for the 1.10 branch (as Gilles mentioned)?
>>>    Do you have any clue about the bug in oob back then?
>>>    Regards,
>>>    Shiqing
>>>    *From:* devel [mailto:devel-boun...@lists.open-mpi.org] *On Behalf
>>>    Of* r...@open-mpi.org <mailto:r...@open-mpi.org>
>>>    *Sent:* Thursday, April 20, 2017 3:49 PM
>>>    *To:* OpenMPI Devel
>>>    *Subject:* Re: [OMPI devel] openib oob module
>>>    Hi Shiqing!
>>>    Been a long time - hope you are doing well.
>>>    I see no way to bring the oob module back now that the BTLs are
>>>    in the OPAL layer - this is why it was removed as the oob is in
>>>    ORTE, and thus not accessible from OPAL.
>>>    Ralph
>>> 
>>>        On Apr 20, 2017, at 6:02 AM, Shiqing Fan
>>>        <shiqing....@huawei.com <mailto:shiqing....@huawei.com>> wrote:
>>>        Dear all,
>>>        I noticed that the openib oob module was removed a long
>>>        time ago, because it was no longer working and nobody
>>>        seemed to need it.
>>>        But for some special operating systems, where rdmacm, udcm
>>>        or ibcm kernel support is missing, oob may still be necessary.
>>>        I’m curious whether it’s possible to bring this module back. How
>>>        difficult would it be to fix the bug in order to make it work
>>>        again in the 1.10 branch or later? Thanks a lot.
>>>        Best Regards,
>>>        Shiqing
>>>        _______________________________________________
>>>        devel mailing list
>>>        devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>>>        https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>> 
>>> 
>>> <output.txt>
>> 
>> 
>> 
> 

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
