Ah, gotcha.

On Nov 4, 2014, at 5:41 PM, Steve Wise <sw...@opengridcomputing.com> wrote:

> Correct:  I don't see the bug in the 1.8.4rc1 release.
> 
> 
> On 11/4/2014 4:33 PM, Nathan Hjelm wrote:
>> Looks like there is no issue in 1.8.4 except for the message coalescing
>> bug. Ralph, Howard, and I agree that disabling message coalescing for
>> 1.8.4 is the safest way forward. We can back-port the real fix for an
>> eventual 1.8.5. Message rates no longer seem to depend on message
>> coalescing in the openib btl. We beat mvapich handily without the
>> feature.
>> 
>> -Nathan
>> 
>> On Tue, Nov 04, 2014 at 10:27:56PM +0000, Jeff Squyres (jsquyres) wrote:
>> 
>>> That sounds fine, but I think Steve's point is that he is being bitten by
>>> this bug now, so it would probably be good to include this one particular
>>> fix in 1.8.4 as well.
>>> 
>>> 
>>> On Nov 4, 2014, at 5:24 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>> 
>>> 
>>>> Going to put the RFC out today with a timeout of about 2 weeks. This
>>>> will give me some time to talk with other Open MPI developers
>>>> face-to-face at SC14.
>>>> 
>>>> If the RFC fails I will still bring that and a couple of other fixes
>>>> into the master.
>>>> 
>>>> -Nathan
>>>> 
>>>> On Tue, Nov 04, 2014 at 04:06:45PM -0600, Steve Wise wrote:
>>>> 
>>>>>   Ok, sounds like I should let you continue the good work! :)  When do you
>>>>>   plan to merge this into ompi proper?
>>>>> 
>>>>>   On 11/4/2014 3:58 PM, Nathan Hjelm wrote:
>>>>> 
>>>>> That certainly addresses part of the problem. I am working on a complete
>>>>> revamp of the btl RDMA interface. It contains this fix:
>>>>> 
>>>>> 
>>>>> https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316
>>>>> 
>>>>> 
>>>>> -Nathan
>>>>> 
>>>>> On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
>>>>> 
>>>>> I found the bug.  Here is the fix:
>>>>> 
>>>>> [root@stevo1 openib]# git diff
>>>>> diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
>>>>> index d876e21..8a5ea82 100644
>>>>> --- a/opal/mca/btl/openib/btl_openib_component.c
>>>>> +++ b/opal/mca/btl/openib/btl_openib_component.c
>>>>> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
>>>>>          }
>>>>> 
>>>>>          /* If the MCA param was specified, skip all the checks */
>>>>> -        if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
>>>>> -                MCA_BASE_VAR_SOURCE_ENV ==
>>>>> -            mca_btl_openib_component.receive_queues_source) {
>>>>> +        if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source ||
>>>>> +            MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
>>>>>              goto good;
>>>>>          }
>>>>> 
>>>>> 
>>>>> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
>>>>> 
>>>>> I have run into the issue as well. I will open a pull request for 1.8.4
>>>>> as part of a patch fixing the coalescing issues.
>>>>> 
>>>>> -Nathan
>>>>> 
>>>>> On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
>>>>> 
>>>>> On 11/4/2014 2:09 PM, Steve Wise wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I'm running ompi top-of-tree from github and seeing an openib btl issue
>>>>> where the qp/srq configuration is incorrect for the given device id.  This
>>>>> works fine in 1.8.4rc1, but I see the problem in top-of-tree.  A simple 2
>>>>> node IMB-MPI1 pingpong fails to get the ranks set up.  I see this logged:
>>>>> 
>>>>> /opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2
>>>>> --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
>>>>> 
>>>>> 
>>>>> Adding this works around the issue:
>>>>> 
>>>>> --mca btl_openib_receive_queues P,65536,64
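Folding that workaround into the mpirun invocation from above gives the full command line (same paths and hostnames as in the report; whether the job then runs of course depends on the local setup):

```shell
# Force one uniform per-peer receive queue spec on every node
# (P = per-peer queue, 65536-byte buffers, 64 of them),
# overriding the per-device value from the .ini file.
/opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2 \
    --mca btl openib,sm,self \
    --mca btl_openib_receive_queues P,65536,64 \
    /opt/ompi-trunk/bin/IMB-MPI1 pingpong
```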
>>>>> 
>>>>> I also confirmed that opal_btl_openib_ini_query() is getting the correct
>>>>> receive_queues string from the .ini file on both nodes for the cxgb4
>>>>> device...
>>>>> 
>>>>> 
>>>>> 
>>>>> <snip>
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> The Open MPI receive queue configuration for the OpenFabrics devices
>>>>> on two nodes are incompatible, meaning that MPI processes on two
>>>>> specific nodes were unable to communicate with each other.  This
>>>>> generally happens when you are using OpenFabrics devices from
>>>>> different vendors on the same network.  You should be able to use the
>>>>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
>>>>> queue configuration for all the devices in the MPI job, and therefore
>>>>> be able to run successfully.
>>>>> 
>>>>>  Local host:       stevo2
>>>>>  Local adapter:    cxgb4_0 (vendor 0x1425, part ID 21520)
>>>>>  Local queues: 
>>>>> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>>>>> 
>>>>>  Remote host:      stevo1
>>>>>  Remote adapter:   (vendor 0x1425, part ID 21520)
>>>>>  Remote queues:    P,65536,64
>>>>> ----------------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> The stevo1 rank has the correct queue settings: P,65536,64.  For some
>>>>> reason, stevo2 has the wrong settings, even though it has the correct
>>>>> device id info.
>>>>> 
>>>>> Any suggestions on debugging this?  Like where to dig in the src to see if
>>>>> somehow the .ini parsing is broken...
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Steve.
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> 
>>>>> de...@open-mpi.org
>>>>> 
>>>>> Subscription: 
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> 
>>>>> Link to this post:
>>>>> 
>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16179.php
>>>>> 
>>>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> 
>>> jsquy...@cisco.com
>>> 
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
