It sounds like this fix should be merged in soon.

Nathan: are your other changes bug fixes, or part of your BTL revamp branch?


On Nov 4, 2014, at 5:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:

> Ok, sounds like I should let you continue the good work! :)  When do you plan 
> to merge this into ompi proper?
> 
> 
> On 11/4/2014 3:58 PM, Nathan Hjelm wrote:
>> That certainly addresses part of the problem. I am working on a complete
>> revamp of the btl RDMA interface. It contains this fix:
>> 
>> 
>> https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316
>> 
>> 
>> -Nathan
>> 
>> On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
>> 
>>> I found the bug.  Here is the fix:
>>> 
>>> [root@stevo1 openib]# git diff
>>> diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
>>> index d876e21..8a5ea82 100644
>>> --- a/opal/mca/btl/openib/btl_openib_component.c
>>> +++ b/opal/mca/btl/openib/btl_openib_component.c
>>> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
>>>          }
>>> 
>>>          /* If the MCA param was specified, skip all the checks */
>>> -        if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
>>> -                MCA_BASE_VAR_SOURCE_ENV ==
>>> -            mca_btl_openib_component.receive_queues_source) {
>>> +        if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source||
>>> +            MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
>>>              goto good;
>>>          }
>>> 
>>> 
>>> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
>>> 
>>>> I have run into the issue as well. I will open a pull request for 1.8.4
>>>> as part of a patch fixing the coalescing issues.
>>>> 
>>>> -Nathan
>>>> 
>>>> On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
>>>> 
>>>>> On 11/4/2014 2:09 PM, Steve Wise wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm running ompi top-o-tree from github and seeing an openib btl issue
>>>>>> where the qp/srq configuration is incorrect for the given device id.  This
>>>>>> works fine in 1.8.4rc1, but I see the problem in top-of-tree.  A simple 2
>>>>>> node IMB-MPI1 pingpong fails to get the ranks setup.  I see this logged:
>>>>>> 
>>>>>> /opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2 --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
>>>>>> 
>>>>>> 
>>>>> Adding this works around the issue:
>>>>> 
>>>>> --mca btl_openib_receive_queues P,65536,64
>>>>> 
>>>>> I also confirmed that opal_btl_openib_ini_query() is getting the correct
>>>>> receive_queues string from the .ini file on both nodes for the cxgb4
>>>>> device...
>>>>> 
>>>>> 
>>>>> 
>>>>>> <snip>
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> The Open MPI receive queue configuration for the OpenFabrics devices
>>>>>> on two nodes are incompatible, meaning that MPI processes on two
>>>>>> specific nodes were unable to communicate with each other.  This
>>>>>> generally happens when you are using OpenFabrics devices from
>>>>>> different vendors on the same network.  You should be able to use the
>>>>>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
>>>>>> queue configuration for all the devices in the MPI job, and therefore
>>>>>> be able to run successfully.
>>>>>> 
>>>>>>  Local host:       stevo2
>>>>>>  Local adapter:    cxgb4_0 (vendor 0x1425, part ID 21520)
>>>>>>  Local queues:     P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>>>>>> 
>>>>>>  Remote host:      stevo1
>>>>>>  Remote adapter:   (vendor 0x1425, part ID 21520)
>>>>>>  Remote queues:    P,65536,64
>>>>>> ----------------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> The stevo1 rank has the correct queue settings: P,65536,64.  For some
>>>>>> reason, stevo2 has the wrong settings, even though it has the correct
>>>>>> device id info.
>>>>>> 
>>>>>> Any suggestions on debugging this?  Like where to dig in the src to see if
>>>>>> somehow the .ini parsing is broken...
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Steve.
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> 
>>>>>> de...@open-mpi.org
>>>>>> 
>>>>>> Subscription: 
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> 
>>>>>> Link to this post:
>>>>>> 
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16179.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
