[OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed
Sorry for crossposting; I already posted this report to the users list, but the developers list is probably more relevant.

I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server (quad-core, 2.27GHz). The interconnect is 4xQDR InfiniBand (Mellanox ConnectX). I have compiled and installed Open MPI 1.4.2. Open MPI was compiled with "--with-libnuma --with-sge" using gcc 4.4 and "-march=native -O3". The kernel is 2.6.32.12, which I compiled myself. The system is CentOS 5.4. I use gridengine 6.2u5. The installed OFED stack is 1.5.1.

The problem is that I get very bad performance unless I explicitly exclude the "sm" btl, and I can't figure out why. I have searched the web and the Open MPI mailing lists. I have seen reports about non-optimal performance, but my results are far worse than any other report I have found.

I run the "mpi_stress" program with different packet lengths. I run on a single server using 8 slots so that all eight cores on one server are occupied, just to see the loopback/shared memory performance.

When I use "-mca btl self,openib" I get pretty good results, between 450MB/s and 700MB/s depending on the packet length. When I use "-mca btl self,sm" or "-mca btl self,sm,openib" I get just 9MB/s for 1MB packets and 1.5MB/s for 10kB packets.

Following the FAQ, I have tried setting btl_sm_num_fifos=8 and btl_sm_eager_limit=65536, which improves things to 30MB/s for 1MB packets and 5MB/s for 10kB packets. With "-mca mpi_paffinity_alone 1" I gain another 20% speedup.

But this is still pretty lousy. I had expected several GB/s. What is going on? Any hints? I thought these CPUs had excellent shared-memory bandwidth over QuickPath.

Hyperthreading is enabled, if that is relevant. The locked-memory limit is 500MB and the stack limit is 64MB.

Please help!
Thanks
/Oskar
Re: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed
On 13/05/10 20:56, Oskar Enoksson wrote:

> The problem is that I get very bad performance unless I
> explicitly exclude the "sm" btl and I can't figure out why.

Recently someone reported issues that were traced back to the files that sm uses for mmap() being in a /tmp that was NFS-mounted. Moving those files to another directory with the orte_tmpdir_base MCA parameter fixed the issue for them.

Could it be something similar in your case?

cheers,
Chris

-- 
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
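One quick way to test the NFS hypothesis is to look up which filesystem backs the temporary directory. The Python sketch below assumes a Linux node with /proc/mounts, and assumes the session files live under $TMPDIR or /tmp unless orte_tmpdir_base says otherwise. The function names are made up for illustration, and mount points containing spaces (escaped in /proc/mounts) are not handled.

```python
import os

NETWORK_FS_TYPES = {"nfs", "nfs4"}


def filesystem_type(path, mounts_text):
    """Return the fstype of the longest mount point that contains `path`.

    `mounts_text` is the content of /proc/mounts (device, mount point, type, ...).
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


def on_network_fs(path, mounts_text):
    """True if `path` sits on an NFS mount according to `mounts_text`."""
    return filesystem_type(path, mounts_text) in NETWORK_FS_TYPES


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    tmpdir = os.path.realpath(os.environ.get("TMPDIR", "/tmp"))
    with open("/proc/mounts") as mounts:
        text = mounts.read()
    print(f"{tmpdir}: type={filesystem_type(tmpdir, text)}, "
          f"network={on_network_fs(tmpdir, text)}")
```

If the directory turns out to be NFS-backed, rerunning the benchmark with orte_tmpdir_base pointed at a node-local directory would confirm or rule out this explanation.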
[OMPI devel] RFC: Remove all other paffinity components
WHAT: Remove all non-hwloc paffinity components.

WHY: The hwloc component supports all those systems.

WHERE: opal/mca/paffinity/[^hwloc|base] directories

WHEN: for 1.5.1

TIMEOUT: Tuesday call, May 25 (yes, about 2 weeks from now -- let hwloc soak for a while...)

MORE DETAILS:

As you probably noticed, I have (finally) committed the "hwloc" paffinity component to the trunk and removed the "linux" (i.e., PLPA) paffinity component:

https://svn.open-mpi.org/trac/ompi/changeset/23125
https://svn.open-mpi.org/trac/ompi/changeset/23126

hwloc supports all systems that OMPI supports (and several that OMPI doesn't!) -- more specifically, it supports all the other systems that we have paffinity components for (darwin, linux, posix, solaris, windows). It can therefore fully replace all the other paffinity components.

Indeed, the new hwloc's default priority is higher than all of the other current paffinity components, so over the next week or two, it'll be a good soak test to see if it is working properly. Once we get any kinks worked out, I propose removing all the other paffinity components and leaving only hwloc.

That being said, we might as well leave the paffinity framework around, even if it only has one component left, simply on the argument that someday Open MPI may support a platform that hwloc does not.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] RFC: move hwloc code base to opal/hwloc
WHAT: hwloc is currently embedded in opal/mca/paffinity/hwloc/hwloc -- move it to be a first class citizen in opal/hwloc.

WHY: Let other portions of the OPAL, ORTE, and OMPI code bases use hwloc services (remember that hwloc provides detailed topology information, not just processor binding).

WHERE: Move opal/mca/paffinity/hwloc/hwloc to opal/hwloc, and adjust associated configury

WHEN: For v1.5.1

TIMEOUT: Tuesday call, May 25

MORE DETAILS:

The hwloc code base is *much* more powerful and useful than PLPA -- it provides a wealth of information that PLPA did not. Specifically: hwloc provides data structures detailing the internal topology of a server. You can see cache line sizes, NUMA layouts, sockets, cores, hardware threads, etc.

This information should be available to the entire OMPI code base -- not just locked up in a paffinity component. Putting hwloc up in opal/hwloc makes it available everywhere. Developers can just call hwloc_, and OMPI's build system will automatically do all the right symbol-shifting if the embedded hwloc is used in OMPI (and not symbol-shift if an external hwloc is used, obviously). It's magically delicious!

One immediate use that I'd like to see is to have the openib BTL use hwloc's ibv functionality to find "nearby" HCAs (right now, you can only do this with rankfiles). I can foresee other components using cache line size information to help tune performance (e.g., sm btl and sm coll...?).

To be clear: there will still be an hwloc paffinity component. It just won't embed its own copy of hwloc anymore. It'll use the hwloc services provided by the OMPI build system, just like the rest of the OPAL / ORTE / OMPI code bases.

There will also be an option to compile hwloc out altogether -- some stubs will be left that return ERR_NOT_SUPPORTED, or somesuch (details TBD). The reason for this is that there are some systems where processor affinity and NUMA information aren't relevant (e.g., embedded systems). Memory footprint is key in such systems; hwloc would simply take up valuable RAM.

-- 
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] RFC: Remove all other paffinity components
On 14/05/10 10:20, Jeff Squyres wrote:

> That being said, we might as well leave the paffinity
> framework around, even if it only has one component left,
> simply on the argument that someday Open MPI may support
> a platform that hwloc does not.

Sounds good to me.

cheers!
Chris