Ah, gotcha.

On Nov 4, 2014, at 5:41 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
> Correct: I don't see the bug in the 1.8.4rc1 release.
>
> On 11/4/2014 4:33 PM, Nathan Hjelm wrote:
>> Looks like there is no issue in 1.8.4 except for the message coalescing
>> bug. Ralph, Howard, and I agree that disabling message coalescing for
>> 1.8.4 is the safest way forward. We can back-port the real fix for an
>> eventual 1.8.5. Message rates no longer seem to depend on message
>> coalescing in the openib btl. We beat mvapich handily without the
>> feature.
>>
>> -Nathan
>>
>> On Tue, Nov 04, 2014 at 10:27:56PM +0000, Jeff Squyres (jsquyres) wrote:
>>
>>> That sounds fine, but I think Steve's point is that he is being bitten
>>> by this bug now, so it would probably be good to include this one
>>> particular fix in 1.8.4.
>>>
>>> On Nov 4, 2014, at 5:24 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>>
>>>> Going to put the RFC out today with a timeout of about 2 weeks. This
>>>> will give me some time to talk with other Open MPI developers
>>>> face-to-face at SC14.
>>>>
>>>> If the RFC fails I will still bring that and a couple of other fixes
>>>> into the master.
>>>>
>>>> -Nathan
>>>>
>>>> On Tue, Nov 04, 2014 at 04:06:45PM -0600, Steve Wise wrote:
>>>>
>>>>> Ok, sounds like I should let you continue the good work! :) When do
>>>>> you plan to merge this into ompi proper?
>>>>>
>>>>> On 11/4/2014 3:58 PM, Nathan Hjelm wrote:
>>>>>
>>>>> That certainly addresses part of the problem. I am working on a
>>>>> complete revamp of the btl RDMA interface. It contains this fix:
>>>>>
>>>>> https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316
>>>>>
>>>>> -Nathan
>>>>>
>>>>> On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
>>>>>
>>>>> I found the bug. Here is the fix:
>>>>>
>>>>> [root@stevo1 openib]# git diff
>>>>> diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
>>>>> index d876e21..8a5ea82 100644
>>>>> --- a/opal/mca/btl/openib/btl_openib_component.c
>>>>> +++ b/opal/mca/btl/openib/btl_openib_component.c
>>>>> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
>>>>>      }
>>>>>
>>>>>      /* If the MCA param was specified, skip all the checks */
>>>>> -    if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
>>>>> -         MCA_BASE_VAR_SOURCE_ENV ==
>>>>> -         mca_btl_openib_component.receive_queues_source) {
>>>>> +    if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source ||
>>>>> +        MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
>>>>>          goto good;
>>>>>      }
>>>>>
>>>>> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
>>>>>
>>>>> I have run into the issue as well. I will open a pull request for
>>>>> 1.8.4 as part of a patch fixing the coalescing issues.
>>>>>
>>>>> -Nathan
>>>>>
>>>>> On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
>>>>>
>>>>> On 11/4/2014 2:09 PM, Steve Wise wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running ompi top-of-tree from github and seeing an openib btl
>>>>> issue where the qp/srq configuration is incorrect for the given
>>>>> device id. This works fine in 1.8.4rc1, but I see the problem in
>>>>> top-of-tree. A simple 2-node IMB-MPI1 pingpong fails to get the
>>>>> ranks set up. I see this logged:
>>>>>
>>>>> /opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2 --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
>>>>>
>>>>> Adding this works around the issue:
>>>>>
>>>>>     --mca btl_openib_receive_queues P,65536,64
>>>>>
>>>>> I also confirmed that opal_btl_openib_ini_query() is getting the
>>>>> correct receive_queues string from the .ini file on both nodes for
>>>>> the cxgb4 device...
>>>>>
>>>>> <snip>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> The Open MPI receive queue configuration for the OpenFabrics devices
>>>>> on two nodes are incompatible, meaning that MPI processes on two
>>>>> specific nodes were unable to communicate with each other. This
>>>>> generally happens when you are using OpenFabrics devices from
>>>>> different vendors on the same network. You should be able to use the
>>>>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
>>>>> queue configuration for all the devices in the MPI job, and therefore
>>>>> be able to run successfully.
>>>>>
>>>>>   Local host:      stevo2
>>>>>   Local adapter:   cxgb4_0 (vendor 0x1425, part ID 21520)
>>>>>   Local queues:    P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>>>>>
>>>>>   Remote host:     stevo1
>>>>>   Remote adapter:  (vendor 0x1425, part ID 21520)
>>>>>   Remote queues:   P,65536,64
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> The stevo1 rank has the correct queue settings: P,65536,64. For some
>>>>> reason, stevo2 has the wrong settings, even though it has the correct
>>>>> device id info.
>>>>>
>>>>> Any suggestions on debugging this? Like where to dig in the source to
>>>>> see if the .ini parsing is somehow broken...
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Steve.
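For reference, the root cause in the diff quoted above is a missing comparison
on the first operand: the buggy condition parses as
MCA_BASE_VAR_SOURCE_COMMAND_LINE || (MCA_BASE_VAR_SOURCE_ENV == ...), and
since the first enum constant is nonzero the test is always true, so
init_one_device() always jumps past the receive-queue checks. Below is a
minimal standalone sketch of the same mistake; the enum values are
hypothetical stand-ins, not Open MPI's real mca_base_var_source_t:

    #include <stdio.h>

    /* Hypothetical stand-ins for Open MPI's MCA variable-source enum.
     * All that matters is that SOURCE_COMMAND_LINE is nonzero. */
    enum var_source {
        SOURCE_DEFAULT = 0,
        SOURCE_COMMAND_LINE,
        SOURCE_ENV,
        SOURCE_FILE
    };

    int main(void)
    {
        /* Simulate the failing case: the param came from a file
         * source, not from the command line or the environment. */
        enum var_source source = SOURCE_FILE;

        /* Buggy form: missing "== source" on the first operand, so it
         * parses as SOURCE_COMMAND_LINE || (SOURCE_ENV == source).
         * The nonzero constant makes the test always true. */
        if (SOURCE_COMMAND_LINE || SOURCE_ENV == source) {
            printf("buggy: checks skipped\n");
        }

        /* Fixed form: both constants are compared against the source. */
        if (SOURCE_COMMAND_LINE == source || SOURCE_ENV == source) {
            printf("fixed: checks skipped\n");
        } else {
            printf("fixed: checks run as intended\n");
        }

        return 0;
    }

With source set to SOURCE_FILE this prints "buggy: checks skipped" followed
by "fixed: checks run as intended", which lines up with the symptom in the
thread: the device checks were skipped unless the user forced
btl_openib_receive_queues on the command line (or, with the fix, via the
environment, i.e. the OMPI_MCA_btl_openib_receive_queues variable).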
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/