Cisco hasn't been involved in IB for several years, so I can't comment on that 
directly.

That being said, our Cisco VIC devices are PCIe gen *2*, but they are x16 (not 
x8).  We can get full bandwidth out of our 2x10Gb device from remote NUMA nodes 
on E5-2690-based machines (Sandy Bridge) for large messages.  In the lab, we 
have... tweaked... versions of those devices that give significantly higher 
bandwidth (it's no secret that the ASIC on these devices is capable of 80Gb).

We haven't looked for this specific issue, but I can confirm that we have seen 
the bandwidth that we expected out of our devices.
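
If it helps, the kind of check I have in mind is just a large-message 
ping-pong bound near the HCA's socket in one run and on the far socket in 
another.  Here's a minimal sketch -- the message size, iteration count, host 
names, and binding options are only illustrative, so adjust them for your 
Open MPI version:

/* bw.c -- minimal MPI large-message ping-pong bandwidth check (sketch).
 * Run one rank on each of two hosts so the traffic actually crosses the
 * HCA, binding the ranks first near and then far from the HCA's socket,
 * e.g. (option names vary by Open MPI version; --report-bindings shows
 * what you actually got):
 *   mpirun -np 2 -host hostA,hostB --report-bindings --bind-to core ./bw
 * or wrap the launch in numactl --cpunodebind/--membind to force the
 * far NUMA node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (4 * 1024 * 1024)   /* 4 MiB "large message" */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_SIZE);
    /* Touch the pages so first-touch binds them to the NUMA node we are
     * actually running on. */
    for (size_t i = 0; i < MSG_SIZE; i++)
        buf[i] = (char) i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n",
               2.0 * ITERS * MSG_SIZE / (t1 - t0) / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Comparing the two runs shows whether far-NUMA placement costs you anything on 
a given box; on our E5-2690 systems it has not.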


On Jul 8, 2013, at 12:27 PM, Michael Thomadakis <drmichaelt7...@gmail.com> 
wrote:

> People have mentioned that they experience unexpected slowdowns in PCIe gen 3 
> I/O when the pages map to a socket different from the one the HCA connects 
> to. It is speculated that the inter-socket QPI is not provisioned to transfer 
> more than 1 GiB/sec of PCIe gen 3 traffic. This situation may not be in 
> effect on all Sandy Bridge or Ivy Bridge systems.
> 
> Have you measured anything like this on your systems as well? That would 
> require the MPI messaging to use only physical memory mapped to the socket 
> without the HCA.
> 
> Mike
> 
> 
> On Mon, Jul 8, 2013 at 10:52 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> On Jul 8, 2013, at 11:35 AM, Michael Thomadakis <drmichaelt7...@gmail.com> 
> wrote:
> 
> > The issue is that when you read or write PCIe gen 3 data to non-local NUMA 
> > memory, Sandy Bridge will use the inter-socket QPI to get this data across 
> > to the other socket. I think there is a considerable limitation on PCIe I/O 
> > traffic going over the inter-socket QPI. One way to get around this is 
> > for reads to buffer all data into memory space local to the same socket and 
> > then transfer it in software across to the other socket's physical memory. 
> > For writes, the same approach can be used, with an intermediary process 
> > copying the data.
> 
> Sure, you'll cause congestion across the QPI network when you do non-local 
> PCI reads/writes.  That's a given.
> 
> But I'm not aware of a hardware limitation on PCI-requested traffic across 
> QPI (I could be wrong, of course -- I'm a software guy, not a hardware guy).  
> A simple test would be to bind an MPI process to a far NUMA node, run a 
> simple MPI bandwidth test, and see whether you get better/same/worse 
> bandwidth compared to binding the MPI process to a near NUMA socket.
> 
> But in terms of doing intermediate (pipelined) reads/writes to local NUMA 
> memory before reading/writing to PCI, no, Open MPI does not do this.  Unless 
> there is a PCI-QPI bandwidth constraint that we're unaware of, I'm not sure 
> why you would do this -- it would likely add considerable complexity to the 
> code and it would definitely lead to higher overall MPI latency.
> 
> Don't forget that the MPI paradigm is for the application to provide the 
> send/receive buffer.  Meaning: MPI doesn't (always) control where the buffer 
> is located (particularly for large messages).
> 
> > I was wondering if Open MPI does any special memory mapping to work 
> > around this.
> 
> Just what I mentioned in the prior email.
> 
> > And if the situation has improved with Ivy Bridge (or Haswell).
> 
> Open MPI doesn't treat these chips any differently.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
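
For completeness, the staging/pipelining idea Michael describes above would 
look roughly like the sketch below.  This is only an illustration of where 
the extra copy goes -- it is NOT something Open MPI implements, the hand-off 
to the NIC is left as a comment, the chunk size is arbitrary, and "node 0" is 
just an assumed HCA-local NUMA node:

/* staging.c -- sketch of copying a (possibly far-socket) user buffer
 * through a bounce buffer allocated on the HCA's NUMA node, one chunk at
 * a time.  Requires libnuma; compile with:  cc staging.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (512 * 1024)

static void staged_copy(const char *user_buf, size_t len, int hca_node)
{
    char *bounce = numa_alloc_onnode(CHUNK, hca_node);
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        memcpy(bounce, user_buf + off, n);  /* extra CPU copy across QPI */
        /* ...here each chunk would be handed to the NIC for DMA... */
    }
    numa_free(bounce, CHUNK);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    size_t len = 8 * 1024 * 1024;   /* pretend this is the user's buffer */
    char *src = malloc(len);
    memset(src, 1, len);
    staged_copy(src, len, 0);       /* 0 = assumed HCA-local node */
    free(src);
    return 0;
}

Every chunk pays for a CPU copy on top of the DMA, which is exactly the extra 
complexity and latency I mentioned above.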


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

