[OMPI devel] BML changes

2015-02-26 Thread Rolf vandeVaart
This message is mostly for Nathan, but figured I would go with the wider 
distribution. I have noticed some different behaviour that I assume started 
with this change.


https://github.com/open-mpi/ompi/commit/4bf7a207e90997e75ba1c60d9d191d9d96402d04


I am noticing that the openib BTL will also be used for on-node communication 
even though the sm (or smcuda) BTL is also available. I think that, with the 
aforementioned change, the openib BTL is listed as an available BTL that 
supports RDMA. Looking at the bml_endpoint in the debugger, it appears that 
the sm BTL is listed as the eager and send BTL, but openib is listed as the 
RDMA BTL. Looking at the logic in pml_ob1_sendreq.h, it looks like we can end 
up selecting the openib BTL for some of the communication. I ran with various 
verbosity settings and saw that this was happening. With v1.8, we only appear 
to use the sm (or smcuda) BTL.
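
To make the observation above concrete, here is a deliberately simplified toy
model in Python. It is not the logic in pml_ob1_sendreq.h: the Endpoint class,
the pick_btl function, the EAGER_LIMIT threshold and the "v1.8-like" endpoint
are all hypothetical names and values chosen for illustration. It only shows
how a per-peer endpoint that lists sm for eager/send but openib for RDMA could
end up routing large on-node messages through openib.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Toy stand-in for a per-peer endpoint with three BTL lists."""
    eager: list = field(default_factory=list)
    send: list = field(default_factory=list)
    rdma: list = field(default_factory=list)


EAGER_LIMIT = 4096  # hypothetical threshold in bytes, for illustration only


def pick_btl(ep, nbytes):
    """Small messages use the eager list; larger ones prefer the RDMA
    list and fall back to the send list when no RDMA BTL is listed."""
    if nbytes <= EAGER_LIMIT:
        return ep.eager[0]
    if ep.rdma:
        return ep.rdma[0]
    return ep.send[0]


# On-node peer as described above: sm for eager/send, openib listed for RDMA.
after_change = Endpoint(eager=["sm"], send=["sm"], rdma=["openib"])
# Assumed "v1.8-like" peer: openib is not listed for RDMA at all.
v18_like = Endpoint(eager=["sm"], send=["sm"], rdma=[])

for size in (1024, 1 << 20):
    print(size, pick_btl(after_change, size), pick_btl(v18_like, size))
```

In this model the small message stays on sm in both cases, while the 1 MiB
message goes to openib only for the endpoint that lists it as an RDMA BTL.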


I am wondering if this was intentional with this change or maybe a side effect.


Rolf




Re: [OMPI devel] BML changes

2015-03-11 Thread Nathan Hjelm

Definitely a side-effect, though it could be beneficial in some cases, as
the RDMA engine in the HCA may be faster than memcpy for messages larger
than a certain size. I don't know how best to fix this, as I need all
RDMA-capable BTLs to be listed for RMA. I thought about adding another list
to track BTLs that have both RMA and atomics, but that would increase the
memory footprint of Open MPI by a factor of nranks.
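
A rough way to see the memory concern is the back-of-the-envelope sketch
below. It assumes that every process keeps one endpoint per peer rank and
that the extra list costs a fixed number of bytes per endpoint; the function
name and the 16-byte figure are hypothetical, not measured from Open MPI.

```python
def extra_list_bytes(nranks, bytes_per_endpoint=16):
    """Estimate the cost of one additional per-endpoint list.

    Assumes one endpoint per peer in every process; bytes_per_endpoint
    is an illustrative guess, not a value taken from Open MPI.
    """
    per_process = nranks * bytes_per_endpoint  # grows linearly with nranks
    job_wide = nranks * per_process            # summed over all processes
    return per_process, job_wide


for n in (1024, 65536):
    per_process, job_wide = extra_list_bytes(n)
    print(f"nranks={n}: {per_process} bytes per process, {job_wide} bytes job-wide")
```

Under these assumptions the per-process cost scales with nranks, and the
total over the job scales with nranks squared.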

-Nathan



Re: [OMPI devel] BML changes

2015-03-11 Thread Howard Pritchard
My experience with DMA engines located on the other side of a PCIe Gen3 x16
bus from the CPUs is that, for a couple of ranks doing large transfers
between each other on a node, using the DMA engine looks good. But once
there are multiple ranks exchanging data (for example, up to 32 ranks on a
dual-socket Haswell node, not using HT), using the DMA engine of the NIC
is not such a good idea.

Howard




Re: [OMPI devel] BML changes

2015-03-11 Thread Nathan Hjelm

In that case we should find a way to eliminate this behavior. I will
take a look later this week and see if there is a workable solution.

-Nathan



Re: [OMPI devel] BML changes

2015-03-11 Thread Atchley, Scott
We have some new Power8 nodes with dual-port FDR HCAs. I have not tested 
same-node Verbs throughput. Using Linux’s Cross Memory Attach (CMA), I can get 
30 GB/s for 2 MB messages between two cores, and then it drops off to ~12 GB/s. 
The PCIe Gen3 x16 slots should max out at ~15 GB/s. I agree that when more 
than two processes are communicating, shared memory throughput will go higher 
while the PCIe link stays capped at ~15 GB/s.
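
The ~15 GB/s figure can be reproduced from the PCIe Gen3 signalling rate
(8 GT/s per lane with 128b/130b encoding). The sketch below also divides that
link rate evenly among communicating pairs; the even split and the function
names are assumptions for illustration, and packet overhead on the link would
lower the real numbers further.

```python
def pcie_gen3_bandwidth_gbs(lanes=16):
    """Raw per-direction bandwidth of a PCIe Gen3 link in GB/s."""
    gt_per_s = 8.0        # Gen3 signalling rate per lane, in GT/s
    encoding = 128 / 130  # 128b/130b line encoding efficiency
    return gt_per_s * encoding * lanes / 8  # bits -> bytes


def per_pair_share(link_gbs, pairs):
    """Bandwidth per pair if all pairs share one link equally (assumed)."""
    return link_gbs / pairs


link = pcie_gen3_bandwidth_gbs()
print(f"x16 link: {link:.2f} GB/s")
for pairs in (1, 4, 16):
    print(f"{pairs} pair(s): {per_pair_share(link, pairs):.2f} GB/s each")
```

With 32 ranks on a node forming 16 pairs, each pair would get roughly 1 GB/s
through the HCA under this model, which is far below the CMA numbers above.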

Scott
