Correct: I don't see the bug in the 1.8.4rc1 release.


On 11/4/2014 4:33 PM, Nathan Hjelm wrote:
Looks like there is no issue in 1.8.4 except for the message coalescing
bug. Ralph, Howard, and I agree that disabling message coalescing for
1.8.4 is the safest way forward. We can back-port the real fix for an
eventual 1.8.5. Message rates no longer seem to depend on message
coalescing in the openib btl; we beat mvapich handily without the
feature.

-Nathan

On Tue, Nov 04, 2014 at 10:27:56PM +0000, Jeff Squyres (jsquyres) wrote:
That sounds fine, but I think Steve's point is that he is being bitten by this
bug now, so it would probably be good to include this one particular fix
in 1.8.4 as well.


On Nov 4, 2014, at 5:24 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

Going to put the RFC out today with a timeout of about 2 weeks. This
will give me some time to talk with other Open MPI developers
face-to-face at SC14.

If the RFC fails I will still bring that and a couple of other fixes
into the master.

-Nathan

On Tue, Nov 04, 2014 at 04:06:45PM -0600, Steve Wise wrote:
   Ok, sounds like I should let you continue the good work! :)  When do you
   plan to merge this into ompi proper?

   On 11/4/2014 3:58 PM, Nathan Hjelm wrote:

That certainly addresses part of the problem. I am working on a complete
revamp of the btl RDMA interface. It contains this fix:

https://github.com/hjelmn/ompi/commit/66fa429e306beb9fca59da0a4554e9b98d788316

-Nathan

On Tue, Nov 04, 2014 at 03:27:23PM -0600, Steve Wise wrote:

I found the bug.  Here is the fix:

[root@stevo1 openib]# git diff
diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
index d876e21..8a5ea82 100644
--- a/opal/mca/btl/openib/btl_openib_component.c
+++ b/opal/mca/btl/openib/btl_openib_component.c
@@ -1960,9 +1960,8 @@ static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
          }

          /* If the MCA param was specified, skip all the checks */
-        if ( MCA_BASE_VAR_SOURCE_COMMAND_LINE ||
-                MCA_BASE_VAR_SOURCE_ENV ==
-            mca_btl_openib_component.receive_queues_source) {
+        if (MCA_BASE_VAR_SOURCE_COMMAND_LINE == mca_btl_openib_component.receive_queues_source ||
+            MCA_BASE_VAR_SOURCE_ENV == mca_btl_openib_component.receive_queues_source) {
              goto good;
          }


On 11/4/2014 3:08 PM, Nathan Hjelm wrote:

I have run into the issue as well. I will open a pull request for 1.8.4
as part of a patch fixing the coalescing issues.

-Nathan

On Tue, Nov 04, 2014 at 02:50:30PM -0600, Steve Wise wrote:

On 11/4/2014 2:09 PM, Steve Wise wrote:

Hi,

I'm running ompi top-of-tree from github and seeing an openib btl issue
where the qp/srq configuration is incorrect for the given device id.  This
works fine in 1.8.4rc1, but I see the problem in top-of-tree.  A simple
two-node IMB-MPI1 pingpong fails to get the ranks set up.  I see this logged:

/opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2
--mca btl openib,sm,self /opt/ompi-trunk/bin/IMB-MPI1 pingpong


Adding this works around the issue:

--mca btl_openib_receive_queues P,65536,64
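For reference, the workaround slots into the same mpirun invocation shown earlier (hosts and install paths from my setup):

/opt/ompi-trunk/bin/mpirun --allow-run-as-root --np 2 --host stevo1,stevo2 \
    --mca btl openib,sm,self \
    --mca btl_openib_receive_queues P,65536,64 \
    /opt/ompi-trunk/bin/IMB-MPI1 pingpong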

I also confirmed that opal_btl_openib_ini_query() is getting the correct
receive_queues string from the .ini file on both nodes for the cxgb4
device...



<snip>

--------------------------------------------------------------------------

The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other.  This
generally happens when you are using OpenFabrics devices from
different vendors on the same network.  You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

  Local host:       stevo2
  Local adapter:    cxgb4_0 (vendor 0x1425, part ID 21520)
  Local queues: 
P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

  Remote host:      stevo1
  Remote adapter:   (vendor 0x1425, part ID 21520)
  Remote queues:    P,65536,64
----------------------------------------------------------------------------


The stevo1 rank has the correct queue settings: P,65536,64.  For some
reason, stevo2 has the wrong settings, even though it has the correct
device id info.

Any suggestions on debugging this?  Like where to dig in the src to see if
somehow the .ini parsing is broken...


Thanks,

Steve.
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post:
http://www.open-mpi.org/community/lists/devel/2014/11/16179.php






--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



