[OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed

2010-05-13 Thread Oskar Enoksson
Sorry for crossposting, I already posted this report to the users list,
but the developers list is probably more relevant.

I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server
(quad-core, 2.27GHz). The interconnect is 4x QDR InfiniBand (Mellanox
ConnectX).

I have compiled and installed Open MPI 1.4.2, configured with
"--with-libnuma --with-sge" and built with gcc 4.4 and "-march=native
-O3". The kernel is 2.6.32.12, which I compiled myself. The system is
CentOS 5.4 with gridengine 6.2u5, and the installed OFED stack is 1.5.1.
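In other words, roughly the following (the exact configure invocation
is reconstructed from the options above, so treat it as a sketch):

  ./configure --with-libnuma --with-sge CC=gcc CFLAGS="-march=native -O3"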

The problem is that I get very bad performance unless I explicitly
exclude the "sm" btl, and I can't figure out why. I have searched the
web and the Open MPI mailing lists; I have seen reports about
non-optimal performance, but my results are far worse than any I have
found.

I run the "mpi_stress" program with different packet lengths. I run on a
single server using 8 slots so that all eight cores on one server are
occupied, just to see the loopback/shared memory performance.
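For reference, the runs look like this (I've omitted the mpi_stress
arguments; the relevant part is the btl selection):

  mpirun -np 8 -mca btl self,openib ./mpi_stress
  mpirun -np 8 -mca btl self,sm ./mpi_stress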

When I use "-mca btl self,openib" I get pretty good results, between
450MB/s and 700MB/s depending on the packet length. When I use "-mca
btl self,sm" or "-mca btl self,sm,openib" I get just 9MB/s for 1MB
packets and 1.5MB/s for 10kB packets. Following the FAQ, I have tried
tweaking btl_sm_num_fifos=8 and btl_sm_eager_limit=65536, which improves
things to 30MB/s for 1MB packets and 5MB/s for 10kB packets. With
"-mca mpi_paffinity_alone 1" I gain another 20% speedup.

But this is still pretty lousy; I had expected several GB/s. What is
going on? Any hints? I thought these CPUs had excellent shared-memory
bandwidth over QuickPath.

Hyperthreading is enabled, if that is relevant. The locked-memory limit
is 500MB and the stack limit is 64MB.

Please help!
Thanks
/Oskar



Re: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed

2010-05-13 Thread Christopher Samuel
On 13/05/10 20:56, Oskar Enoksson wrote:

> The problem is that I get very bad performance unless I
> explicitly exclude the "sm" btl and I can't figure out why.

Recently someone reported issues that were traced back to the fact
that the files sm uses for mmap() were in a /tmp which was NFS-mounted;
moving those files to another directory with the orte_tmpdir_base
MCA parameter fixed that issue for them.

Could it be something similar for you?
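If so, something like this is a quick test (the directory below is
just an example; any node-local filesystem will do):

  mpirun -mca orte_tmpdir_base /dev/shm ...

and "mount" will tell you whether /tmp is NFS on your nodes.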

cheers,
Chris
-- 
  Christopher Samuel - Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computational Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.unimelb.edu.au/


[OMPI devel] RFC: Remove all other paffinity components

2010-05-13 Thread Jeff Squyres
WHAT: Remove all non-hwloc paffinity components.

WHY: The hwloc component supports all those systems.

WHERE: opal/mca/paffinity/[^hwloc|base] directories

WHEN: for 1.5.1

TIMEOUT: Tuesday call, May 25 (yes, about 2 weeks from now -- let hwloc soak 
for a while...)

-

MORE DETAILS:

As you probably noticed, I have (finally) committed the "hwloc" paffinity 
component to the trunk and removed the "linux" (i.e., PLPA) paffinity component:

https://svn.open-mpi.org/trac/ompi/changeset/23125
https://svn.open-mpi.org/trac/ompi/changeset/23126

hwloc supports all systems that OMPI supports (and several that OMPI doesn't!) 
-- more specifically, it supports all the other systems that we have paffinity 
components for (darwin, linux, posix, solaris, windows).  It can therefore 
fully replace all the other paffinity components.

Indeed, the new hwloc's default priority is higher than all of the other 
current paffinity components, so over the next week or two, it'll be a good 
soak test to see if it is working properly.  Once we get any kinks worked out, 
I propose removing all the other paffinity components and leaving only hwloc.
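(You can check which paffinity components are built on a given machine with 
ompi_info, and if you want to push the soak test one way or the other, the 
component priority can be overridden with the usual MCA parameter -- the 
parameter name below just follows the standard framework_component_priority 
convention, so double-check it with ompi_info:)

  ompi_info | grep paffinity
  mpirun -mca paffinity_hwloc_priority 100 ...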

That being said, we might as well leave the paffinity framework around, even if 
it only has one component left, simply on the argument that someday Open MPI 
may support a platform that hwloc does not.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] RFC: move hwloc code base to opal/hwloc

2010-05-13 Thread Jeff Squyres
WHAT: hwloc is currently embedded in opal/mca/paffinity/hwloc/hwloc -- move it 
to be a first-class citizen in opal/hwloc.

WHY: Let other portions of the OPAL, ORTE, and OMPI code bases use hwloc 
services (remember that hwloc provides detailed topology information, not just 
processor binding).

WHERE: Move opal/mca/paffinity/hwloc/hwloc to opal/hwloc, and adjust associated 
configury

WHEN: For v1.5.1

TIMEOUT: Tuesday call, May 25

-

MORE DETAILS:

The hwloc code base is *much* more powerful and useful than PLPA -- it provides 
a wealth of information that PLPA did not.  Specifically, hwloc provides data 
structures detailing the internal topology of a server: you can see cache line 
sizes, NUMA layouts, sockets, cores, hardware threads, etc.
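For a taste of what the API gives you, here's a minimal standalone sketch 
(hwloc 1.x names; compile with -lhwloc):

  #include <hwloc.h>
  #include <stdio.h>

  int main(void)
  {
      hwloc_topology_t topo;

      /* discover the topology of the machine we're running on */
      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* how many sockets / cores does this server have? */
      printf("%d sockets, %d cores\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET),
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));

      /* walk up from the first core and print the caches above it */
      hwloc_obj_t obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
      for (; obj != NULL; obj = obj->parent)
          if (obj->type == HWLOC_OBJ_CACHE)
              printf("L%u cache: %llu bytes\n", obj->attr->cache.depth,
                     (unsigned long long) obj->attr->cache.size);

      hwloc_topology_destroy(topo);
      return 0;
  }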

This information should be available to the entire OMPI code base -- not just 
locked up in a paffinity component.  Putting hwloc up in opal/hwloc makes it 
available everywhere.  Developers can just call the hwloc_*() functions, and 
OMPI's build system will automatically do all the right symbol-shifting if the 
embedded hwloc is used in OMPI (and not symbol-shift if an external hwloc is 
used, obviously).  It's magically delicious!

One immediate use that I'd like to see is to have the openib BTL use hwloc's 
ibv functionality to find "nearby" HCAs (right now, you can only do this with 
rankfiles).
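hwloc ships a verbs helper header for exactly this kind of thing; here's a 
rough sketch of what the BTL could do (assuming hwloc/openfabrics-verbs.h and 
its hwloc_ibv_get_device_cpuset() helper; bind_near_first_hca() is just an 
illustrative name, and error handling is omitted):

  #include <hwloc.h>
  #include <hwloc/openfabrics-verbs.h>
  #include <infiniband/verbs.h>

  /* Bind the calling process to the CPUs nearest the first HCA. */
  static void bind_near_first_hca(hwloc_topology_t topo)
  {
      int n;
      struct ibv_device **devs = ibv_get_device_list(&n);

      if (devs != NULL && n > 0) {
          hwloc_cpuset_t set = hwloc_cpuset_alloc();

          /* cpuset of the processors close to this device */
          if (hwloc_ibv_get_device_cpuset(topo, devs[0], set) == 0)
              hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

          hwloc_cpuset_free(set);
      }
      if (devs != NULL)
          ibv_free_device_list(devs);
  }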

I can foresee other components using cache line size information to help tune 
performance (e.g., sm btl and sm coll...?).

To be clear: there will still be an hwloc paffinity component.  It just won't 
embed its own copy of hwloc anymore.  It'll use the hwloc services provided by 
the OMPI build system, just like the rest of the OPAL / ORTE / OMPI code bases.

There will also be an option to compile hwloc out altogether -- some stubs will 
be left that return ERR_NOT_SUPPORTED, or somesuch (details TBD).  The reason 
for this is that there are some systems where processor affinity and NUMA 
information aren't relevant (e.g., embedded systems).  Memory footprint is key 
in such systems; hwloc would simply take up valuable RAM.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Remove all other paffinity components

2010-05-13 Thread Christopher Samuel
On 14/05/10 10:20, Jeff Squyres wrote:

> That being said, we might as well leave the paffinity
> framework around, even if it only has one component left,
> simply on the argument that someday Open MPI may support
> a platform that hwloc does not.

Sounds good to me.

cheers!
Chris
-- 
  Christopher Samuel - Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computational Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.unimelb.edu.au/