Hello all. I apologize if this has been addressed in the FAQ or on the mailing list, but I spent a fair amount of time searching both and found no direct answers.
I use Open MPI, currently version 1.3.2, on an 8-socket, quad-core AMD Opteron machine, so 32 cores in total. The computer runs a modern 2.6-family Linux kernel. I don't at present use a resource manager like SLURM, since there is at most one other user and we don't step on each other's toes.

What I find is that when I launch MPI jobs, the processes are not packed onto the cores optimally. I would expect OMPI to place tasks so that they fill all four cores of one socket, then as many cores as necessary on the next socket, and so on. For example, if I run six jobs, each needing 4 processors, I can see as I start them up that the processes of a given job get distributed without regard to NUMA locality: two of them might land on socket A, one on socket B, and the fourth on socket C. Since I have dynamic clocking enabled, I can verify this by looking at /proc/cpuinfo (checking which cores are clocked up when the system is otherwise quiescent), or by running top with the per-CPU display turned on.

Obviously, in terms of maximizing performance, this is bad. Once I get up to about five of the 4-processor jobs, I see computational throughput degrade heavily; my hypothesis is that there is heavy contention on the HyperTransport links.

I saw the processor and memory affinity options, but those seem to address a different problem, namely keeping processes pinned to specific resources. I want that too, but it's not the same issue as above. So, I guess I have several questions:

1. Is there any way to have Open MPI automatically tell Linux, via its affinity and NUMA-related APIs, that the OMPI processes should be scheduled so that they fill the cores of particular sockets, using adjacent sockets where possible?

2. I think the rankfile may be the way for me to address this issue, but do I need a different rankfile for each job?
The FAQ shows the ability to wildcard the "core" number/ID field. Is there a way to wildcard the socket field but not the core field, that is, to tell OMPI "I don't care which socket you choose, but the job should always be mapped onto the cores of a single socket"? That might not make sense for a job using more than the number of cores per socket, but it would be useful in the other cases. For a job needing more than 4 processes on a quad-core machine, it probably makes sense to tell OMPI specifically which sockets to use as well, to keep the number of processor hops as small as possible.

3. If my understanding is correct, and a rankfile will help me solve this problem, can I safely turn on processor and memory affinity so that the different OMPI jobs I launch manually will not vie for affinity on the same processor cores and memory regions?

Thank you.
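P.S. To make question 2 concrete, here is the kind of per-job rankfile I have in mind, based on my reading of the FAQ's rankfile syntax (the hostname "node0" and the socket:core numbers are just placeholders for my machine; I may have details of the syntax wrong):

```
rank 0=node0 slot=0:0
rank 1=node0 slot=0:1
rank 2=node0 slot=0:2
rank 3=node0 slot=0:3
```

launched with something like "mpirun -np 4 -rf rankfile.job1 ./my_app". The problem is that a second concurrent job would then need its own file pinning its ranks to socket 1, a third to socket 2, and so on, which is why I'm asking whether the socket field can be wildcarded instead of the core field.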