Thank you all for your help. --bind-to-core increased the cluster performance by approximately 10%, so together with the improvements from implementing Open-MX, the performance now scales within expectations - not linearly, but much better than with the original setup.
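In case it helps anyone else searching the archives, the kind of invocation I am running now looks roughly like the following; the hostfile name, rank count, and binary are placeholders rather than my exact setup:

    # 2 nodes x 4 cores, one rank per core, each rank pinned to a core
    mpirun -np 8 --hostfile hosts --bind-to-core ./my_solver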
On 30 January 2014 20:43, Tim Prince <n...@aol.com> wrote:

> On 1/29/2014 11:30 PM, Ralph Castain wrote:
>
> > On Jan 29, 2014, at 7:56 PM, Victor <victor.ma...@gmail.com> wrote:
> >
> > > Thanks for the insights Tim. I was aware that the CPUs will choke
> > > beyond a certain point. From memory, on my machine this happens with
> > > 5 concurrent MPI jobs with the benchmark that I am using.
> > >
> > > My primary question was about scaling between the nodes. I was not
> > > getting close to double the performance when running MPI jobs across
> > > two 4-core nodes. It may be better now since I have Open-MX in place,
> > > but I have not repeated the benchmarks yet since I need to get one
> > > simulation job done asap.
> >
> > Some of that may be due to the expected loss of performance when you
> > switch from shared memory to inter-node transports. While it is true
> > about saturation of the memory path, what you reported could be more
> > consistent with that transition - i.e., it isn't unusual to see
> > applications perform better when run on a single node, depending upon
> > how they are written, up to a certain size of problem (which your code
> > may not be hitting).
> >
> > > Regarding your mention of setting affinities and MPI ranks, do you
> > > have specific (as in syntactically specific, since I am a novice and
> > > easily confused...) examples of how I may want to set affinities to
> > > get the Westmere node performing better?
> >
> > mpirun --bind-to-core -cpus-per-rank 2 ...
> >
> > will bind each MPI rank to 2 cores. Note that this will definitely *not*
> > be a good idea if you are running more than two threads in your process -
> > if you are, then set --cpus-per-rank to the number of threads, keeping in
> > mind that you want things to break evenly across the sockets. In other
> > words, if you have two 6-core/socket Westmeres on the node, then you
> > either want to run 6 processes at cpus-per-rank=2 if each process runs 2
> > threads, or 4 processes with cpus-per-rank=3 if each process runs 3
> > threads, or 2 processes with no cpus-per-rank but --bind-to-socket
> > instead of --bind-to-core for any other thread number > 3.
> >
> > You would not want to run any other number of processes on the node, or
> > else the binding pattern will cause a single process to split its threads
> > across the sockets - which will definitely hurt performance.
>
> -cpus-per-rank 2 is an effective choice for this platform. As Ralph said,
> it should work automatically for 2 threads per rank.
> Ralph's point about not splitting a process across sockets is an important
> one. Even splitting a process across internal busses, which would happen
> with 3 threads per process, seems problematic.
>
> --
> Tim Prince
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
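P.S. To make sure I have read the Westmere binding advice above correctly, here is how I understand the three layouts as concrete command lines. The application name is a placeholder, and I am assuming the threads come from OpenMP, so OMP_NUM_THREADS would be set to match in each case:

    # 2 threads per rank: 6 ranks, each bound to 2 cores
    mpirun -np 6 --bind-to-core -cpus-per-rank 2 ./app
    # 3 threads per rank: 4 ranks, each bound to 3 cores
    mpirun -np 4 --bind-to-core -cpus-per-rank 3 ./app
    # more than 3 threads per rank: 2 ranks, one bound to each socket
    mpirun -np 2 --bind-to-socket ./app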