On Mar 24, 2008, at 6:29 AM, Mark Kosmowski wrote:
> I have a successful ompi installation and my software runs across my
> humble cluster of three dual-Opteron (single-core) nodes on OpenSUSE
> 10.2. I'm planning to upgrade some RAM soon and have been thinking of
> playing with affinity, since each CPU will have its own DIMMs after
> the upgrade. I have read the FAQ and know to use "--mca
> mpi_paffinity_alone 1" to enable affinity.
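For reference, a minimal sketch of what that looks like on the command
line (the application name ./myapp and the process count are just
placeholders):

  # launch 6 ranks with processor affinity enabled
  mpirun -np 6 --mca mpi_paffinity_alone 1 ./myapp

The same parameter can also be set persistently by adding the line
"mpi_paffinity_alone = 1" to $HOME/.openmpi/mca-params.conf.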
> It looks like I am running ompi 1.1.4 (see below).
>
>   mark@LT:~> ompi_info | grep affinity
>        MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.4)
>        MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.4)
>        MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1.4)
> Does this old version of ompi do a good job of implementing affinity,
> or would it behoove me to use the current version if I am interested
> in trying affinity?
It's the same level of affinity support as in the 1.2 series.
There are a few affinity upgrades in development, some of which will
hit for the v1.3 series, some of which will be later:
- upgrade to a newer embedded version of PLPA; this probably won't
affect you much (will be in v1.3)
- allow assigning MPI processes to specific socket/core combinations
via a file specification (will be in v1.3; a sketch follows this list)
- have some "better" launch support such that resource managers that
implement their own affinity controls (e.g., SLURM) can directly set
the affinity for MPI processes (some future version; probably won't be
ready for v1.3).
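Purely as an illustration of that file-based mapping (the syntax is not
final, so the file name, the host names, and the "rank N=host
slot=socket:core" format below are all assumptions):

  # hypothetical rankfile: pin each rank to a specific socket:core pair
  rank 0=node1 slot=0:0
  rank 1=node1 slot=1:0
  rank 2=node2 slot=0:0
  rank 3=node2 slot=1:0

  # launched with something like:
  mpirun -np 4 --rankfile my_rankfile ./myapp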
> What sorts of time gains do people typically see with affinity? (I'm
> a chemistry student running plane-wave solid-state calculation
> software, if that helps with the question.)
As with everything, it depends. :-)
- If you're just running one MPI process per core and you only have
one core per socket, you might simply see a "smoothing" of results --
meaning that multiple runs of the same job will have slightly more
consistent timing results (e.g., less "jitter" in the timings)
- If you have a NUMA architecture (e.g., AMD) and have multiple NICs,
you can play games to get the MPI processes that are actually doing the
communicating to be "close" to the NIC in the internal host topology.
If your app is using a lot of short messages over low-latency
interconnects, this can make a difference. If you're using TCP, it
likely won't make much of a difference. :-)
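Once affinity is on, it's easy to sanity check that the binding actually
took effect; on Linux, taskset can report a running process's CPU
affinity (the PID below is just a placeholder):

  # show the CPU list an MPI process is bound to
  taskset -cp 12345

With mpi_paffinity_alone working, each rank should report a single CPU
rather than the machine's full set.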
> Lastly, two of the three machines will have all of their DIMM slots
> populated by equal-sized DIMMs. However, one of my machines has two
> processors, each of which has four DIMM slots. This machine will
> be getting 4 @ 1 GB DIMMs and 2 @ 2 GB DIMMs. I am assuming that the
> best thing for affinity would be to put all of the 1 GB DIMMs on one
> processor and the 2 GB DIMMs on the other, and to put the 2 GB DIMMs
> in slots 0 and 1. Does it matter which processor gets which set of
> DIMMs?
It depends on what your application is doing. You generally want to
have enough "local" RAM for the [MPI] processes that will be running
on each socket.
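If you want to double check how the RAM ends up split between the two
sockets after the upgrade, numactl (if it's installed) will show the
memory attached to each NUMA node:

  # list each NUMA node's CPUs, total memory, and free memory
  numactl --hardware

That makes it easy to confirm how much RAM is actually local to each
processor once the DIMMs are in place.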
--
Jeff Squyres
Cisco Systems