It sounds like this fix should be merged in soon. Nathan: are your other changes bug fixes, or part of your BTL revamp branch?
On Nov 4, 2014, at 5:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:

> Ok, sounds like I should let you continue the good work! :)  When do you plan
> to merge this into ompi proper?
>
> On 11/4/2014 3:58 PM, Nathan Hjelm wrote:
>> That certainly addresses part of the problem. I am working on a complete
>> revamp of the btl RDMA interface. It contains this fix:
>>
>> https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316
>>
>> -Nathan
>>
>> On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:
>>> I found the bug. Here is the fix:
>>>
>>> [root@stevo1 openib]# git diff
>>> diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
>>> index d876e21..8a5ea82 100644
>>> --- a/opal/mca/btl/openib/btl_openib_component.c
>>> +++ b/opal/mca/btl/openib/btl_openib_component.c
>>> @@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
>>>      }
>>>
>>>      /* If the MCA param was specified, skip all the checks */
>>> -    if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
>>> -         MCA_BASE_VAR_SOURCE_ENV ==
>>> -         mca_btl_openib_component.receive_queues_source) {
>>> +    if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source ||
>>> +        MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
>>>          goto good;
>>>      }
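For clarity on the bug: in the removed lines, the first operand of || is the enum constant MCA_BASE_VAR_SOURCE_COMMAND_LINE itself rather than a comparison against receive_queues_source. Assuming, as in Open MPI's mca_base_var source enum, that this constant is nonzero, the whole condition is always true, so init_one_device() always jumped to "good" and skipped applying the per-device receive_queues value from the .ini file. A minimal standalone sketch of the pitfall, using hypothetical stand-in names rather than the real Open MPI definitions:

/* sketch.c -- demonstrates the always-true condition; the enum values
 * here are illustrative stand-ins mirroring mca_base_var_source_t. */
#include <stdio.h>

typedef enum {
    SOURCE_DEFAULT = 0,   /* value came from the compiled-in default */
    SOURCE_COMMAND_LINE,  /* == 1, set on the command line */
    SOURCE_ENV            /* == 2, set in the environment */
} source_t;

int main(void)
{
    source_t source = SOURCE_DEFAULT;  /* no user override */

    /* Buggy form: the left operand is a nonzero constant, so the whole
     * expression is always true regardless of 'source', and the
     * device-specific checks are always skipped. */
    if (SOURCE_COMMAND_LINE || SOURCE_ENV == source) {
        printf("buggy: skips checks even when source is SOURCE_DEFAULT\n");
    }

    /* Fixed form: both operands of || are real comparisons, so the
     * checks are skipped only for genuine user overrides. */
    if (SOURCE_COMMAND_LINE == source || SOURCE_ENV == source) {
        printf("fixed: skips checks\n");
    } else {
        printf("fixed: runs the device checks\n");
    }
    return 0;
}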
>>> On 11/4/2014 3:08 PM, Nathan Hjelm wrote:
>>>> I have run into the issue as well. I will open a pull request for 1.8.4
>>>> as part of a patch fixing the coalescing issues.
>>>>
>>>> -Nathan
>>>>
>>>> On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:
>>>>> On 11/4/2014 2:09 PM, Steve Wise wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm running ompi top-of-tree from github and seeing an openib btl issue
>>>>>> where the qp/srq configuration is incorrect for the given device id.
>>>>>> This works fine in 1.8.4rc1, but I see the problem in top-of-tree. A
>>>>>> simple 2-node IMB-MPI1 pingpong fails to get the ranks set up. I see
>>>>>> this logged:
>>>>>>
>>>>>> /opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2
>>>>>> --mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong
>>>>>
>>>>> Adding this works around the issue:
>>>>>
>>>>> --mca btl_openib_receive_queues P,65536,64
>>>>>
>>>>> I also confirmed that opal_btl_openib_ini_query() is getting the correct
>>>>> receive_queues string from the .ini file on both nodes for the cxgb4
>>>>> device...
>>>>>
>>>>>> <snip>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> The Open MPI receive queue configuration for the OpenFabrics devices
>>>>>> on two nodes are incompatible, meaning that MPI processes on two
>>>>>> specific nodes were unable to communicate with each other. This
>>>>>> generally happens when you are using OpenFabrics devices from
>>>>>> different vendors on the same network. You should be able to use the
>>>>>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
>>>>>> queue configuration for all the devices in the MPI job, and therefore
>>>>>> be able to run successfully.
>>>>>>
>>>>>>   Local host:      stevo2
>>>>>>   Local adapter:   cxgb4_0 (vendor 0x1425, part ID 21520)
>>>>>>   Local queues:    P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>>>>>>
>>>>>>   Remote host:     stevo1
>>>>>>   Remote adapter:  (vendor 0x1425, part ID 21520)
>>>>>>   Remote queues:   P,65536,64
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> The stevo1 rank has the correct queue settings: P,65536,64. For some
>>>>>> reason, stevo2 has the wrong settings, even though it has the correct
>>>>>> device id info.
>>>>>>
>>>>>> Any suggestions on debugging this? Like where to dig in the src to see
>>>>>> if somehow the .ini parsing is broken...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Steve.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
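For reference when reading the "Local queues"/"Remote queues" values above: a btl_openib_receive_queues string is a colon-separated list of queue specifications, and each specification is a comma-separated list whose first field is the queue type (P for per-peer, S for shared receive queue), followed by numeric parameters such as buffer size and queue depth. A minimal sketch, not Open MPI's actual parser, that just splits one of the reported strings into its components:

/* split_queues.c -- illustrative only; strtok_r() is POSIX. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Part of the queue string reported for stevo2 above. */
    char spec[] = "P,128,256,192,128:S,2048,1024,1008,64";
    char *save_queue = NULL;

    /* ':' separates individual queue specifications. */
    for (char *queue = strtok_r(spec, ":", &save_queue); queue != NULL;
         queue = strtok_r(NULL, ":", &save_queue)) {
        char *save_field = NULL;
        /* The first comma-separated field is the queue type. */
        char *type = strtok_r(queue, ",", &save_field);
        printf("queue type %s, parameters:", type);
        for (char *field = strtok_r(NULL, ",", &save_field); field != NULL;
             field = strtok_r(NULL, ",", &save_field)) {
            printf(" %s", field);
        }
        printf("\n");
    }
    return 0;
}

Read this way, the mismatch above is that stevo1 advertised a single per-peer queue (P,65536,64) while stevo2 advertised the four-queue default; the one-line fix in init_one_device() appears to resolve exactly that, by letting the .ini receive_queues value take effect on both nodes.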