Dear Meep users and developers,
I'm seeing strange scaling behaviour with meep-mpi compiled with
IntelMPI on our cluster. Going from 1 to 2 processors gives almost
ideal scaling (the runtime is nearly halved, as shown below for
various problem sizes), but the scaling becomes very weak beyond 2
processors. I should add that the meep-mpi results agree with the
ones I get on my PC with meep-serial, so our compilation seems to be
all right.
nb_proc   runtime (res=20)   runtime (res=40)   runtime (res=60)   runtime (res=80)
1         20.5               135                449                1086
2         11.47              73                 230                551
4         11.52              68                 219                530
8         12.9               67                 222                528
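As a quick sanity check, here are the speedups and parallel
efficiencies implied by the res=80 column of the table above (the
arithmetic is mine, the runtimes are from the table):

```python
# Speedup and parallel efficiency from the res=80 runtimes above.
runtimes = {1: 1086.0, 2: 551.0, 4: 530.0, 8: 528.0}  # seconds per nb_proc

for nproc, t in runtimes.items():
    speedup = runtimes[1] / t     # ideal value would be nproc
    efficiency = speedup / nproc  # ideal value would be 1.0
    print(f"np={nproc}: speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```

This prints an efficiency of ~0.99 at 2 processors but only ~0.51 at
4 and ~0.26 at 8, which is what I mean by "the scaling becomes very
weak".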
Here are some more details for a job of ~3 GB (a 3D problem). Below
are the stats obtained when requesting 4 processors:
mpirun -np 4 meep-mpi res=100 runtime0=2 norm-run?=true slab3D.ctl
-------------------------------------------------------------------------
Mem: 16411088k total, 4015216k used, 12395872k free, 256k buffers
Swap: 0k total, 0k used, 0k free, 283692k cached
PID PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
18175 25 0 353m 221m 6080 R 99.8 1.4 1:10.41 1 meep-mpi
18174 25 0 354m 222m 6388 R 100.2 1.4 1:10.41 6 meep-mpi
18172 25 0 1140m 1.0g 7016 R 99.8 6.3 1:10.41 2 meep-mpi
18173 25 0 1140m 1.0g 6804 R 99.5 6.3 1:10.40 4 meep-mpi
Tasks: 228 total, 5 running, 222 sleeping, 0 stopped, 1 zombie
Cpu1 : 23.9%us, 76.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 23.3%us, 76.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
[...]
-------------------------------------------------------------------------
What we see here is that, while all the processors are running flat
out, CPUs 1 and 6 (the ones running the two processes that are light
on memory) spend only about 1/4 of their time in user code and about
3/4 in system time -- normally I/O, but here probably MPI
communications. This explains why I don't get shorter runtimes with
more than 2 processors: we have a fairly clear load-balance issue.
Have you experienced this kind of situation? I was wondering if there
are meep-mpi parameters I can set to influence the domain
decomposition into chunks in a helpful way.
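To put a rough number on the imbalance: if the RES column of the top
output above is taken as a proxy for chunk size (an assumption on my
part -- resident memory also includes buffers, so this is only an
estimate), the slowest rank sets the pace and bounds the achievable
speedup:

```python
# Rough load-imbalance estimate from the RES column of top.
# Assumption: resident memory is roughly proportional to chunk size,
# and hence to the work done per MPI rank.
res_mb = [221, 222, 1024, 1024]  # resident set sizes of the 4 ranks, in MB

total = sum(res_mb)
biggest = max(res_mb)

# With perfect balance each rank would hold total/4; since the rank
# with the largest chunk sets the pace, the speedup over 1 process
# is bounded by roughly:
speedup_bound = total / biggest
print(f"speedup bound with this decomposition: {speedup_bound:.2f}")
```

This gives a bound of about 2.4, which is at least consistent with
the ~2x speedup I actually observe at 4 processors.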
I can send more details if needed.
Thanks in advance!
Best regards,
Guillaume Demésy
_______________________________________________
meep-discuss mailing list
meep-discuss@ab-initio.mit.edu
http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss