FWIW: 1.5.5 still doesn't support binding to NUMA regions, for example - and the script doesn't really do anything more than bind to cores. I believe only the trunk provides a more comprehensive set of binding options.
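For illustration, a wrapper that binds each rank to a whole NUMA node instead of a single core could look something like the sketch below. This is not something Open MPI provides: the helper name `numa_node_for_rank`, the `NUM_NODES` variable, and the round-robin mapping are all assumptions, and the node count of 4 is taken from the lscpu output quoted further down. It relies only on the standard numactl options `--cpunodebind` and `--membind`.

```bash
#!/bin/bash
# Hypothetical wrapper: bind each MPI rank to one NUMA node (CPUs and memory)
# rather than to a single core. Assumes numactl is installed.

# Assumption: 4 NUMA nodes, as in the lscpu output in this thread.
NUM_NODES=${NUM_NODES:-4}

# Map a node-local rank to a NUMA node, round-robin.
numa_node_for_rank() {
    local rank=$1 nodes=$2
    echo $(( rank % nodes ))
}

# Only launch when a program was given on the command line.
if [ "$#" -gt 0 ]; then
    # OMPI_COMM_WORLD_NODE_RANK is the variable used by the script quoted below;
    # fall back to 0 if it is not set.
    node=$(numa_node_for_rank "${OMPI_COMM_WORLD_NODE_RANK:-0}" "$NUM_NODES")
    exec numactl --cpunodebind="$node" --membind="$node" "$@"
fi
```

It would be run as 'mpirun ... ./numa_wrap.sh ./YOUR_PROG' (the script name is made up). With 32 ranks and 4 nodes, round-robin puts 8 ranks on each node; a block mapping (rank / 8) would instead keep neighbouring ranks on the same node, and which of the two is better probably depends on the code's communication pattern.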
Given the described NUMA layout, I suspect bind-to-NUMA is going to make the biggest difference.

On Mar 30, 2012, at 6:17 AM, Pavel Mezentsev wrote:

> You can try running using this script:
>
> #!/bin/bash
> s=$(($OMPI_COMM_WORLD_NODE_RANK))
> numactl --physcpubind=$((s)) --localalloc ./YOUR_PROG
>
> Instead of 'mpirun ... ./YOUR_PROG', run 'mpirun ... ./SCRIPT'.
>
> I tried this with openmpi-1.5.4 and it helped.
>
> Best regards, Pavel Mezentsev
>
> P.S. openmpi-1.5.5 binds processes correctly, so you can try it as well.
>
> 2012/3/30 Ralph Castain <r...@open-mpi.org>
> I think you'd have much better luck using the developer's trunk, as the
> binding there is much better - e.g., you can bind to NUMA instead of just
> cores. The 1.4 binding is pretty limited.
>
> http://www.open-mpi.org/nightly/trunk/
>
> On Mar 30, 2012, at 5:02 AM, Ricardo Fonseca wrote:
>
> > Hi guys
> >
> > I'm benchmarking our (well tested) parallel code on an AMD-based system,
> > featuring 2x AMD Opteron(TM) Processor 6276, with 16 cores each for a total
> > of 32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.
> >
> > When I run a single core job the performance is as expected. However, when
> > I run with 32 processes the performance drops to about 60% (when compared
> > with other systems running the exact same problem, so this is not a code
> > scaling issue). I think this may have to do with core binding / NUMA, but I
> > haven't been able to get any improvement out of the bind-* mpirun options.
> >
> > Any suggestions?
> >
> > Thanks in advance,
> > Ricardo
> >
> > P.S: Here's the output of lscpu
> >
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                32
> > On-line CPU(s) list:   0-31
> > Thread(s) per core:    2
> > Core(s) per socket:    8
> > CPU socket(s):         2
> > NUMA node(s):          4
> > Vendor ID:             AuthenticAMD
> > CPU family:            21
> > Model:                 1
> > Stepping:              2
> > CPU MHz:               2300.045
> > BogoMIPS:              4599.38
> > Virtualization:        AMD-V
> > L1d cache:             16K
> > L1i cache:             64K
> > L2 cache:              2048K
> > L3 cache:              6144K
> > NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
> > NUMA node1 CPU(s):     16,18,20,22,24,26,28,30
> > NUMA node2 CPU(s):     1,3,5,7,9,11,13,15
> > NUMA node3 CPU(s):     17,19,21,23,25,27,29,31
> >
> > ---
> > Ricardo Fonseca
> >
> > Associate Professor
> > GoLP - Grupo de Lasers e Plasmas
> > Instituto de Plasmas e Fusão Nuclear
> > Instituto Superior Técnico
> > Av. Rovisco Pais
> > 1049-001 Lisboa
> > Portugal
> >
> > tel: +351 21 8419202
> > fax: +351 21 8464455
> > web: http://golp.ist.utl.pt/
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
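
As a rough check of what the quoted core-binding script does on this machine, the NUMA lines from the lscpu output above can be used to look up which node each CPU id belongs to. The sketch below is only an illustration: the function name `numa_node_of_cpu` is made up, and it handles only plain comma-separated CPU lists as in this output (not ranges such as 0-7).

```bash
#!/bin/bash
# Sketch: find which NUMA node a CPU id belongs to, reading lscpu-style
# "NUMA nodeN CPU(s): a,b,c" lines from stdin.
# Assumption: CPU lists are comma-separated ids only, with no ranges.
numa_node_of_cpu() {
    local cpu=$1
    awk -v cpu="$cpu" '
        /^NUMA node[0-9]+ CPU\(s\):/ {
            node = $2; sub(/^node/, "", node)      # "node2" -> "2"
            n = split($NF, ids, ",")               # last field is the CPU list
            for (i = 1; i <= n; i++)
                if (ids[i] == cpu) { print node; exit }
        }'
}

# NUMA lines copied from the lscpu output in this thread.
layout='NUMA node0 CPU(s): 0,2,4,6,8,10,12,14
NUMA node1 CPU(s): 16,18,20,22,24,26,28,30
NUMA node2 CPU(s): 1,3,5,7,9,11,13,15
NUMA node3 CPU(s): 17,19,21,23,25,27,29,31'

# With --physcpubind=$rank, rank N runs on CPU N; show the node for a few ranks.
for rank in 0 1 2 3; do
    node=$(printf '%s\n' "$layout" | numa_node_of_cpu "$rank")
    echo "rank $rank -> cpu $rank -> node $node"
done
```

With this layout, ranks 0 to 3 land on nodes 0, 2, 0, 2: consecutive ranks alternate between two NUMA nodes because the CPU numbering is interleaved.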