Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
| Remember that the point of IB and other operating-system bypass devices
| is that the driver is not involved in the fast path of sending /
| receiving.  One of the side-effects of that design point is that
| userspace does all the allocation of send / receive buffers.

That's a good point. It was not clear to me who was allocating the memory,
and with what logic. But for IB it definitely makes sense that the user
provides pointers to their own memory.
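
For reference, a minimal sketch of what that looks like with the verbs API:
the application (or the MPI library) allocates the buffer itself and only
*registers* it with the HCA. Device selection and error handling are
simplified, and this is only an illustration of the idea, not how Open MPI's
openib BTL is actually structured.

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* The application (or MPI library) allocates the payload buffer... */
    size_t len = 1 << 20;                 /* 1 MiB send/receive buffer */
    void *buf = malloc(len);
    if (!buf) return 1;

    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no HCA found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* ...and merely registers it so the HCA can DMA directly to/from those
     * pages.  The driver never allocates the payload memory, so whichever
     * NUMA node 'buf' lives on was decided entirely in user space
     * (malloc / first touch / numactl). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes at %p, lkey=0x%x rkey=0x%x\n",
           len, buf, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}

(Compile with -libverbs.)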

thanks
Michael




On Mon, Jul 8, 2013 at 1:07 PM, Jeff Squyres (jsquyres)
wrote:

> On Jul 8, 2013, at 2:01 PM, Brice Goglin  wrote:
>
> > The driver doesn't allocate much memory here. Maybe some small control
> buffers, but nothing significantly involved in large message transfer
> performance. Everything critical here is allocated by user-space (either
> MPI lib or application), so we just have to make sure we bind the process
> memory properly. I used hwloc-bind to do that.
>
> +1
>
> Remember that the point of IB and other operating-system bypass devices is
> that the driver is not involved in the fast path of sending / receiving.
>  One of the side-effects of that design point is that userspace does all
> the allocation of send / receive buffers.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
| The driver doesn't allocate much memory here. Maybe some small control
| buffers, but nothing significantly involved in large message transfer
| performance. Everything critical here is allocated by user-space (either
| MPI lib or application), so we just have to make sure we bind the
| process memory properly. I used hwloc-bind to do that.

I see ... So the user-level process (the application or the MPI library)
sets aside memory (malloc?) and the OFED/IB stack then sets up RDMA
messaging with addresses pointing back to that user physical memory. I
guess that before running the MPI benchmark you set a *data* memory
allocation policy so that the pages were "owned" by the other socket?
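
Something along the following lines would be one way to do that. This is a
rough sketch against the hwloc 2.x C API (the hwloc-bind / numactl
command-line tools give the same per-process effect with their membind
options); the node index and buffer size are just placeholders.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Pick the NUMA node the pages should live on -- e.g. node 1, the
     * socket the HCA is *not* attached to in the scenario above. */
    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 1);
    if (!node) { fprintf(stderr, "no such NUMA node\n"); return 1; }

    /* Allocate 64 MiB bound to that node's memory. */
    size_t len = 64UL << 20;
    void *buf = hwloc_alloc_membind(topo, len, node->nodeset,
                                    HWLOC_MEMBIND_BIND,
                                    HWLOC_MEMBIND_BYNODESET);
    if (!buf) { perror("hwloc_alloc_membind"); return 1; }

    /* ... use 'buf' as the benchmark's send/receive buffer ... */

    hwloc_free(topo, buf, len);
    hwloc_topology_destroy(topo);
    return 0;
}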

| Note that we have seen larger issues on older platforms. You basically
| just need a big HCA and PCI link on a not-so-big machine. Not very
| common fortunately with today's QPI links between Sandy Bridge sockets,
| which are quite big compared to the PCIe Gen3 x8 links to the HCA. On
| old AMD platforms (and modern Intels with big GPUs), issues are not that
| uncommon (we've seen up to 40% DMA bandwidth difference there).

The issue that has been observed is with PCIe gen3 I/O traffic which, say,
reads data off of the HCA and has to store it to memory that belongs to the
other socket. In that case the PCIe data has to cross the QPI links on
Sandy Bridge to reach the other socket. It has been speculated that the QPI
links were NOT provisioned to transfer more than about 1 GiB/sec of PCIe
data alongside the regular inter-NUMA memory traffic. It may be the case
that Intel has since re-provisioned QPI to accommodate more PCIe traffic.

Thanks again
Michael



On Mon, Jul 8, 2013 at 1:01 PM, Brice Goglin  wrote:

>  The driver doesn't allocate much memory here. Maybe some small control
> buffers, but nothing significantly involved in large message transfer
> performance. Everything critical here is allocated by user-space (either
> MPI lib or application), so we just have to make sure we bind the process
> memory properly. I used hwloc-bind to do that.
>
> Note that we have seen larger issues on older platforms. You basically
> just need a big HCA and PCI link on a not-so-big machine. Not very common
> fortunately with today's QPI links between Sandy Bridge sockets, which are
> quite big compared to the PCIe Gen3 x8 links to the HCA. On old AMD platforms
> (and modern Intels with big GPUs), issues are not that uncommon (we've seen
> up to 40% DMA bandwidth difference there).
>
> Brice
>
>
>
> Le 08/07/2013 19:44, Michael Thomadakis a écrit :
>
>  Hi Brice,
>
>  thanks for testing this out.
>
>  How did you make sure that the pinned pages used by the I/O adapter
> mapped to the "other" socket's memory controller ? Is pining the MPI binary
> to a socket sufficient to pin the space used for MPI I/O as well to that
> socket? I think this is something done by and at the HCA device driver
> level.
>
>  Anyways, as long as the memory performance difference is a the levels
> you mentioned then there is no "big" issue. Most likely the device driver
> get space from the same numa domain that of the socket the HCA is attached
> to.
>
>  Thanks for trying it out
> Michael
>
>
>
>
>
>
>  On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin wrote:
>
>>  On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
>> throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
>> the right socket (and latency increases from 0.8 to 1.4us). Of course
>> that's pingpong only, things will be worse on a memory-overloaded machine.
>> But I don't expect things to be "less worse" if you do an intermediate copy
>> through the memory near the HCA: you would overload the QPI link as much as
>> here, and you would overload the CPU even more because of the additional
>> copies.
>>
>> Brice
>>
>>
>>
>> Le 08/07/2013 18:27, Michael Thomadakis a écrit :
>>
>> People have mentioned that they experience unexpected slow downs in
>> PCIe_gen3 I/O when the pages map to a socket different from the one the HCA
>> connects to. It is speculated that the inter-socket QPI is not provisioned
>> to transfer more than 1GiB/sec for PCIe_gen 3 traffic. This situation may
>> not be in effect on all SandyBrige or IvyBridge systems.
>>
>>  Have you measured anything like this on you systems as well? That would
>> require using physical memory mapped to the socket w/o HCA exclusively for
>> MPI messaging.
>>
>>  Mike
>>
>>
>> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>>
>>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis <
>>> drmichaelt7...@gmail.com> wrote:
>>>
>>> > The issue is that when you read or write PCIe_gen 3 dat to a non-local
>>> NUMA memory, SandyBridge will use the inter-socket QPIs to get this data
>>> across to the other socket. I think there is considerable limitation in
>>> PCIe I/O traffic data going over the inter-socket QPI. One way to get
>>> around this is for reads to 

Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 8, 2013, at 2:01 PM, Brice Goglin  wrote:

> The driver doesn't allocate much memory here. Maybe some small control 
> buffers, but nothing significantly involved in large message transfer 
> performance. Everything critical here is allocated by user-space (either MPI 
> lib or application), so we just have to make sure we bind the process memory 
> properly. I used hwloc-bind to do that.

+1

Remember that the point of IB and other operating-system bypass devices is that 
the driver is not involved in the fast path of sending / receiving.  One of the 
side-effects of that design point is that userspace does all the allocation of 
send / receive buffers.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
Cisco hasn't been involved in IB for several years, so I can't comment on that 
directly.

That being said, our Cisco VIC devices are PCI gen *2*, but they are x16 (not 
x8).  We can get full bandwidth out of our 2*10Gb device from remote NUMA nodes 
on E5-2690-based machines (Sandy Bridge) for large messages.  In the lab, we 
have... tweaked... versions of those devices that give significantly higher 
bandwidth (it's no secret that the ASIC on these devices is capable of 80Gb).

We haven't looked for this specific issue, but I can confirm that we have seen 
the bandwidth that we expected out of our devices.


On Jul 8, 2013, at 12:27 PM, Michael Thomadakis  
wrote:

> People have mentioned that they experience unexpected slow downs in PCIe_gen3 
> I/O when the pages map to a socket different from the one the HCA connects 
> to. It is speculated that the inter-socket QPI is not provisioned to transfer 
> more than 1GiB/sec for PCIe_gen 3 traffic. This situation may not be in 
> effect on all SandyBrige or IvyBridge systems.
> 
> Have you measured anything like this on you systems as well? That would 
> require using physical memory mapped to the socket w/o HCA exclusively for 
> MPI messaging.
> 
> Mike
> 
> 
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)  
> wrote:
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis  
> wrote:
> 
> > The issue is that when you read or write PCIe_gen 3 dat to a non-local NUMA 
> > memory, SandyBridge will use the inter-socket QPIs to get this data across 
> > to the other socket. I think there is considerable limitation in PCIe I/O 
> > traffic data going over the inter-socket QPI. One way to get around this is 
> > for reads to buffer all data into memory space local to the same socket and 
> > then transfer them by code across to the other socket's physical memory. 
> > For writes the same approach can be used with intermediary process copying 
> > data.
> 
> Sure, you'll cause congestion across the QPI network when you do non-local 
> PCI reads/writes.  That's a given.
> 
> But I'm not aware of a hardware limitation on PCI-requested traffic across 
> QPI (I could be wrong, of course -- I'm a software guy, not a hardware guy).  
> A simple test would be to bind an MPI process to a far NUMA node and run a 
> simple MPI bandwidth test and see if to get better/same/worse bandwidth 
> compared to binding an MPI process on a near NUMA socket.
> 
> But in terms of doing intermediate (pipelined) reads/writes to local NUMA 
> memory before reading/writing to PCI, no, Open MPI does not do this.  Unless 
> there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure 
> why you would do this -- it would likely add considerable complexity to the 
> code and it would definitely lead to higher overall MPI latency.
> 
> Don't forget that the MPI paradigm is for the application to provide the 
> send/receive buffer.  Meaning: MPI doesn't (always) control where the buffer 
> is located (particularly for large messages).
> 
> > I was wondering if OpenMPI does anything special memory mapping to work 
> > around this.
> 
> Just what I mentioned in the prior email.
> 
> > And if with Ivy Bridge (or Haswell) he situation has improved.
> 
> Open MPI doesn't treat these chips any different.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Brice Goglin
The driver doesn't allocate much memory here. Maybe some small control
buffers, but nothing significantly involved in large message transfer
performance. Everything critical here is allocated by user-space (either
MPI lib or application), so we just have to make sure we bind the
process memory properly. I used hwloc-bind to do that.
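
For the record, a rough sketch of the programmatic equivalent of that kind of
binding (hwloc 2.x API assumed; the exact hwloc-bind command line used for the
measurements isn't shown here) -- bind both the execution and the memory of
the current process to one package:

#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* First package (socket) in the machine. */
    hwloc_obj_t pkg = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PACKAGE, 0);
    if (!pkg) return 1;

    /* Run only on that package's cores... */
    hwloc_set_cpubind(topo, pkg->cpuset, HWLOC_CPUBIND_PROCESS);
    /* ...and allocate new memory from its local NUMA node(s), so the
     * application's buffers end up near (or far from) the HCA depending
     * on which package is chosen. */
    hwloc_set_membind(topo, pkg->cpuset, HWLOC_MEMBIND_BIND,
                      HWLOC_MEMBIND_PROCESS);

    /* ... run the benchmark from here (or exec it) ... */

    hwloc_topology_destroy(topo);
    return 0;
}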

Note that we have seen larger issues on older platforms. You basically
just need a big HCA and PCI link on a not-so-big machine. Not very
common fortunately with today's QPI links between Sandy Bridge sockets,
which are quite big compared to the PCIe Gen3 x8 links to the HCA. On old AMD
platforms (and modern Intels with big GPUs), issues are not that
uncommon (we've seen up to 40% DMA bandwidth difference there).

Brice



Le 08/07/2013 19:44, Michael Thomadakis a écrit :
> Hi Brice, 
>
> thanks for testing this out.
>
> How did you make sure that the pinned pages used by the I/O adapter
> mapped to the "other" socket's memory controller ? Is pining the MPI
> binary to a socket sufficient to pin the space used for MPI I/O as
> well to that socket? I think this is something done by and at the HCA
> device driver level. 
>
> Anyways, as long as the memory performance difference is a the levels
> you mentioned then there is no "big" issue. Most likely the device
> driver get space from the same numa domain that of the socket the HCA
> is attached to. 
>
> Thanks for trying it out
> Michael
>
>
>
>
>
>
> On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin wrote:
>
> On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
> throughput drop from 6000 to 5700MB/s when the memory isn't
> allocated on the right socket (and latency increases from 0.8 to
> 1.4us). Of course that's pingpong only, things will be worse on a
> memory-overloaded machine. But I don't expect things to be "less
> worse" if you do an intermediate copy through the memory near the
> HCA: you would overload the QPI link as much as here, and you
> would overload the CPU even more because of the additional copies.
>
> Brice
>
>
>
> Le 08/07/2013 18:27, Michael Thomadakis a écrit :
>> People have mentioned that they experience unexpected slow downs
>> in PCIe_gen3 I/O when the pages map to a socket different from
>> the one the HCA connects to. It is speculated that the
>> inter-socket QPI is not provisioned to transfer more than
>> 1GiB/sec for PCIe_gen 3 traffic. This situation may not be in
>> effect on all SandyBrige or IvyBridge systems.
>>
>> Have you measured anything like this on you systems as well? That
>> would require using physical memory mapped to the socket w/o HCA
>> exclusively for MPI messaging.
>>
>> Mike
>>
>>
>> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) wrote:
>>
>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis wrote:
>>
>> > The issue is that when you read or write PCIe_gen 3 dat to
>> a non-local NUMA memory, SandyBridge will use the
>> inter-socket QPIs to get this data across to the other
>> socket. I think there is considerable limitation in PCIe I/O
>> traffic data going over the inter-socket QPI. One way to get
>> around this is for reads to buffer all data into memory space
>> local to the same socket and then transfer them by code
>> across to the other socket's physical memory. For writes the
>> same approach can be used with intermediary process copying data.
>>
>> Sure, you'll cause congestion across the QPI network when you
>> do non-local PCI reads/writes.  That's a given.
>>
>> But I'm not aware of a hardware limitation on PCI-requested
>> traffic across QPI (I could be wrong, of course -- I'm a
>> software guy, not a hardware guy).  A simple test would be to
>> bind an MPI process to a far NUMA node and run a simple MPI
>> bandwidth test and see if to get better/same/worse bandwidth
>> compared to binding an MPI process on a near NUMA socket.
>>
>> But in terms of doing intermediate (pipelined) reads/writes
>> to local NUMA memory before reading/writing to PCI, no, Open
>> MPI does not do this.  Unless there is a PCI-QPI bandwidth
>> constraint that we're unaware of, I'm not sure why you would
>> do this -- it would likely add considerable complexity to the
>> code and it would definitely lead to higher overall MPI latency.
>>
>> Don't forget that the MPI paradigm is for the application to
>> provide the send/receive buffer.  Meaning: MPI doesn't
>> (always) control where the buffer is located (particularly
>> for large messages).
>>
>> > I was wondering if OpenMPI 

Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
Hi Brice,

thanks for testing this out.

How did you make sure that the pinned pages used by the I/O adapter mapped
to the "other" socket's memory controller? Is pinning the MPI binary to a
socket sufficient to also pin the space used for MPI I/O to that socket?
I think this is something done by, and at the level of, the HCA device driver.

Anyway, as long as the memory performance difference is at the levels you
mentioned, there is no "big" issue. Most likely the device driver gets
space from the same NUMA domain as the socket the HCA is attached to.

Thanks for trying it out
Michael






On Mon, Jul 8, 2013 at 11:45 AM, Brice Goglin  wrote:

>  On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
> throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
> the right socket (and latency increases from 0.8 to 1.4us). Of course
> that's pingpong only, things will be worse on a memory-overloaded machine.
> But I don't expect things to be "less worse" if you do an intermediate copy
> through the memory near the HCA: you would overload the QPI link as much as
> here, and you would overload the CPU even more because of the additional
> copies.
>
> Brice
>
>
>
> Le 08/07/2013 18:27, Michael Thomadakis a écrit :
>
> People have mentioned that they experience unexpected slow downs in
> PCIe_gen3 I/O when the pages map to a socket different from the one the HCA
> connects to. It is speculated that the inter-socket QPI is not provisioned
> to transfer more than 1GiB/sec for PCIe_gen 3 traffic. This situation may
> not be in effect on all SandyBrige or IvyBridge systems.
>
>  Have you measured anything like this on you systems as well? That would
> require using physical memory mapped to the socket w/o HCA exclusively for
> MPI messaging.
>
>  Mike
>
>
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis 
>> wrote:
>>
>> > The issue is that when you read or write PCIe_gen 3 dat to a non-local
>> NUMA memory, SandyBridge will use the inter-socket QPIs to get this data
>> across to the other socket. I think there is considerable limitation in
>> PCIe I/O traffic data going over the inter-socket QPI. One way to get
>> around this is for reads to buffer all data into memory space local to the
>> same socket and then transfer them by code across to the other socket's
>> physical memory. For writes the same approach can be used with intermediary
>> process copying data.
>>
>>  Sure, you'll cause congestion across the QPI network when you do
>> non-local PCI reads/writes.  That's a given.
>>
>> But I'm not aware of a hardware limitation on PCI-requested traffic
>> across QPI (I could be wrong, of course -- I'm a software guy, not a
>> hardware guy).  A simple test would be to bind an MPI process to a far NUMA
>> node and run a simple MPI bandwidth test and see if to get
>> better/same/worse bandwidth compared to binding an MPI process on a near
>> NUMA socket.
>>
>> But in terms of doing intermediate (pipelined) reads/writes to local NUMA
>> memory before reading/writing to PCI, no, Open MPI does not do this.
>>  Unless there is a PCI-QPI bandwidth constraint that we're unaware of, I'm
>> not sure why you would do this -- it would likely add considerable
>> complexity to the code and it would definitely lead to higher overall MPI
>> latency.
>>
>> Don't forget that the MPI paradigm is for the application to provide the
>> send/receive buffer.  Meaning: MPI doesn't (always) control where the
>> buffer is located (particularly for large messages).
>>
>> > I was wondering if OpenMPI does anything special memory mapping to work
>> around this.
>>
>>  Just what I mentioned in the prior email.
>>
>> > And if with Ivy Bridge (or Haswell) he situation has improved.
>>
>>  Open MPI doesn't treat these chips any different.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> ___
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Brice Goglin
On a dual E5 2650 machine with FDR cards, I see the IMB Pingpong
throughput drop from 6000 to 5700MB/s when the memory isn't allocated on
the right socket (and latency increases from 0.8 to 1.4us). Of course
that's pingpong only, things will be worse on a memory-overloaded
machine. But I don't expect things to be "less worse" if you do an
intermediate copy through the memory near the HCA: you would overload
the QPI link as much as here, and you would overload the CPU even more
because of the additional copies.

Brice



Le 08/07/2013 18:27, Michael Thomadakis a écrit :
> People have mentioned that they experience unexpected slow downs in
> PCIe_gen3 I/O when the pages map to a socket different from the one
> the HCA connects to. It is speculated that the inter-socket QPI is not
> provisioned to transfer more than 1GiB/sec for PCIe_gen 3 traffic.
> This situation may not be in effect on all SandyBrige or IvyBridge
> systems.
>
> Have you measured anything like this on you systems as well? That
> would require using physical memory mapped to the socket w/o HCA
> exclusively for MPI messaging.
>
> Mike
>
>
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) wrote:
>
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis wrote:
>
> > The issue is that when you read or write PCIe_gen 3 dat to a
> non-local NUMA memory, SandyBridge will use the inter-socket QPIs
> to get this data across to the other socket. I think there is
> considerable limitation in PCIe I/O traffic data going over the
> inter-socket QPI. One way to get around this is for reads to
> buffer all data into memory space local to the same socket and
> then transfer them by code across to the other socket's physical
> memory. For writes the same approach can be used with intermediary
> process copying data.
>
> Sure, you'll cause congestion across the QPI network when you do
> non-local PCI reads/writes.  That's a given.
>
> But I'm not aware of a hardware limitation on PCI-requested
> traffic across QPI (I could be wrong, of course -- I'm a software
> guy, not a hardware guy).  A simple test would be to bind an MPI
> process to a far NUMA node and run a simple MPI bandwidth test and
> see if to get better/same/worse bandwidth compared to binding an
> MPI process on a near NUMA socket.
>
> But in terms of doing intermediate (pipelined) reads/writes to
> local NUMA memory before reading/writing to PCI, no, Open MPI does
> not do this.  Unless there is a PCI-QPI bandwidth constraint that
> we're unaware of, I'm not sure why you would do this -- it would
> likely add considerable complexity to the code and it would
> definitely lead to higher overall MPI latency.
>
> Don't forget that the MPI paradigm is for the application to
> provide the send/receive buffer.  Meaning: MPI doesn't (always)
> control where the buffer is located (particularly for large messages).
>
> > I was wondering if OpenMPI does anything special memory mapping
> to work around this.
>
> Just what I mentioned in the prior email.
>
> > And if with Ivy Bridge (or Haswell) he situation has improved.
>
> Open MPI doesn't treat these chips any different.
>
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
People have mentioned that they experience unexpected slowdowns in
PCIe gen3 I/O when the pages map to a socket different from the one the HCA
connects to. It is speculated that the inter-socket QPI is not provisioned
to transfer more than 1 GiB/sec of PCIe gen3 traffic. This situation may
not apply to all Sandy Bridge or Ivy Bridge systems.

Have you measured anything like this on your systems as well? That would
require using, exclusively for MPI messaging, physical memory mapped to the
socket without the HCA.

Mike


On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres)  wrote:

> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis 
> wrote:
>
> > The issue is that when you read or write PCIe_gen 3 dat to a non-local
> NUMA memory, SandyBridge will use the inter-socket QPIs to get this data
> across to the other socket. I think there is considerable limitation in
> PCIe I/O traffic data going over the inter-socket QPI. One way to get
> around this is for reads to buffer all data into memory space local to the
> same socket and then transfer them by code across to the other socket's
> physical memory. For writes the same approach can be used with intermediary
> process copying data.
>
> Sure, you'll cause congestion across the QPI network when you do non-local
> PCI reads/writes.  That's a given.
>
> But I'm not aware of a hardware limitation on PCI-requested traffic across
> QPI (I could be wrong, of course -- I'm a software guy, not a hardware
> guy).  A simple test would be to bind an MPI process to a far NUMA node and
> run a simple MPI bandwidth test and see if to get better/same/worse
> bandwidth compared to binding an MPI process on a near NUMA socket.
>
> But in terms of doing intermediate (pipelined) reads/writes to local NUMA
> memory before reading/writing to PCI, no, Open MPI does not do this.
>  Unless there is a PCI-QPI bandwidth constraint that we're unaware of, I'm
> not sure why you would do this -- it would likely add considerable
> complexity to the code and it would definitely lead to higher overall MPI
> latency.
>
> Don't forget that the MPI paradigm is for the application to provide the
> send/receive buffer.  Meaning: MPI doesn't (always) control where the
> buffer is located (particularly for large messages).
>
> > I was wondering if OpenMPI does anything special memory mapping to work
> around this.
>
> Just what I mentioned in the prior email.
>
> > And if with Ivy Bridge (or Haswell) he situation has improved.
>
> Open MPI doesn't treat these chips any different.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 8, 2013, at 11:35 AM, Michael Thomadakis  
wrote:

> The issue is that when you read or write PCIe_gen 3 dat to a non-local NUMA 
> memory, SandyBridge will use the inter-socket QPIs to get this data across to 
> the other socket. I think there is considerable limitation in PCIe I/O 
> traffic data going over the inter-socket QPI. One way to get around this is 
> for reads to buffer all data into memory space local to the same socket and 
> then transfer them by code across to the other socket's physical memory. For 
> writes the same approach can be used with intermediary process copying data.

Sure, you'll cause congestion across the QPI network when you do non-local PCI 
reads/writes.  That's a given.

But I'm not aware of a hardware limitation on PCI-requested traffic across QPI 
(I could be wrong, of course -- I'm a software guy, not a hardware guy).  A 
simple test would be to bind an MPI process to a far NUMA node, run a simple 
MPI bandwidth test, and see if you get better/same/worse bandwidth compared to 
binding the MPI process to a near NUMA socket.
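
A bare-bones version of such a test might look like the sketch below (not IMB,
just an illustrative ping-pong): run it twice, once with the process near the
HCA and once bound to the far NUMA node (e.g. with hwloc-bind or mpirun's
binding options), and compare the reported numbers.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN   (4 << 20)     /* 4 MiB messages */
#define ITERS 100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Touch the buffer so its pages are placed according to the NUMA
     * binding of this process (first-touch policy). */
    char *buf = malloc(LEN);
    memset(buf, 0, LEN);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t = MPI_Wtime() - t0;
    if (rank == 0)
        printf("~%.0f MB/s per direction\n",
               (double)ITERS * LEN / (t / 2.0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}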

But in terms of doing intermediate (pipelined) reads/writes to local NUMA 
memory before reading/writing to PCI, no, Open MPI does not do this.  Unless 
there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure why 
you would do this -- it would likely add considerable complexity to the code 
and it would definitely lead to higher overall MPI latency.

Don't forget that the MPI paradigm is for the application to provide the 
send/receive buffer.  Meaning: MPI doesn't (always) control where the buffer is 
located (particularly for large messages).

> I was wondering if OpenMPI does anything special memory mapping to work 
> around this.

Just what I mentioned in the prior email.

> And if with Ivy Bridge (or Haswell) he situation has improved.

Open MPI doesn't treat these chips any differently.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Michael Thomadakis
Hi Jeff,

thanks for the reply.

The issue is that when you read or write PCIe gen3 data to non-local NUMA
memory, Sandy Bridge will use the inter-socket QPI to get the data across
to the other socket. I think there is a considerable limitation on PCIe I/O
traffic going over the inter-socket QPI. One way to get around this, for
reads, is to buffer all data into memory space local to the same socket and
then transfer it in code across to the other socket's physical memory. For
writes the same approach can be used, with an intermediary process copying
the data.

I was wondering if Open MPI does any special memory mapping to work around
this, and whether the situation has improved with Ivy Bridge (or Haswell).

thanks
Mike


On Mon, Jul 8, 2013 at 9:57 AM, Jeff Squyres (jsquyres)
wrote:

> On Jul 6, 2013, at 4:59 PM, Michael Thomadakis 
> wrote:
>
> > When your stack runs on Sandy Bridge nodes attached to HCAs over PCIe gen3,
> > do you pay any special attention to which socket/memory controller the
> > physical memory of the communication buffers belongs to?
> >
> > For instance, if the HCA is attached to the PCIe gen3 lanes of Socket 1, do
> > you do anything special when the read/write buffers map to physical memory
> > belonging to Socket 2? Or do you avoid using buffers mapping to memory that
> > belongs to (is accessible via) the other socket?
>
> It is not *necessary* to ensure that buffers are NUMA-local to the PCI
> device that they are writing to, but it certainly results in lower latency
> to read/write to PCI devices (regardless of flavor) that are attached to an
> MPI process' local NUMA node.  The Hardware Locality (hwloc) tool "lstopo"
> can print a pretty picture of your server to show you where your PCI busses
> are connected.
>
> For TCP, Open MPI will use all TCP devices that it finds by default
> (because it is assumed that latency is so high that NUMA locality doesn't
> matter).  The openib (OpenFabrics) transport will use the "closest" HCA
> ports that it can find to each MPI process.
>
> In our upcoming Cisco ultra low latency BTL, it defaults to using the
> closest Cisco VIC ports that it can find for short messages (i.e., to
> minimize latency), but uses all available VICs for long messages (i.e., to
> maximize bandwidth).
>
> > Has this situation improved with Ivy Bridge systems or Haswell?
>
> It's the same overall architecture (i.e., NUMA).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Question on handling of memory for communications

2013-07-08 Thread Jeff Squyres (jsquyres)
On Jul 6, 2013, at 4:59 PM, Michael Thomadakis  wrote:

> When your stack runs on Sandy Bridge nodes attached to HCAs over PCIe gen3, do 
> you pay any special attention to which socket/memory controller the physical 
> memory of the communication buffers belongs to?
> 
> For instance, if the HCA is attached to the PCIe gen3 lanes of Socket 1, do you 
> do anything special when the read/write buffers map to physical memory 
> belonging to Socket 2? Or do you avoid using buffers mapping to memory that 
> belongs to (is accessible via) the other socket?

It is not *necessary* to ensure that buffers are NUMA-local to the PCI 
device that they are writing to, but it certainly results in lower latency to 
read/write to PCI devices (regardless of flavor) that are attached to an MPI 
process' local NUMA node.  The Hardware Locality (hwloc) tool "lstopo" can 
print a pretty picture of your server to show you where your PCI busses are 
connected.
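
For anyone who prefers code to lstopo's picture, roughly the same locality
information can be queried through hwloc's OpenFabrics helper. The following
is only a sketch (it assumes a hwloc installation that ships
hwloc/openfabrics-verbs.h), not what Open MPI does internally:

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>
#include <hwloc/openfabrics-verbs.h>
#include <infiniband/verbs.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no HCA found\n"); return 1; }

    /* Ask hwloc which CPUs are physically close to the first HCA --
     * the same information lstopo draws graphically. */
    hwloc_cpuset_t near = hwloc_bitmap_alloc();
    hwloc_ibv_get_device_cpuset(topo, devs[0], near);

    char *s;
    hwloc_bitmap_asprintf(&s, near);
    printf("%s is local to cpuset %s\n", ibv_get_device_name(devs[0]), s);
    free(s);

    hwloc_bitmap_free(near);
    ibv_free_device_list(devs);
    hwloc_topology_destroy(topo);
    return 0;
}

(Compile with -lhwloc -libverbs.)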

For TCP, Open MPI will use all TCP devices that it finds by default (because it 
is assumed that latency is so high that NUMA locality doesn't matter).  The 
openib (OpenFabrics) transport will use the "closest" HCA ports that it can 
find to each MPI process.  

In our upcoming Cisco ultra low latency BTL, it defaults to using the closest 
Cisco VIC ports that it can find for short messages (i.e., to minimize 
latency), but uses all available VICs for long messages (i.e., to maximize 
bandwidth).

> Has this situation improved with Ivy Bridge systems or Haswell?

It's the same overall architecture (i.e., NUMA).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/