I'm seeing strange scaling behavior with meep-mpi compiled against Intel MPI on our cluster. Going from 1 to 2 processors gives almost ideal scaling (the runtime is nearly halved, as shown below for several problem sizes), but there is essentially no further gain with more than 2 processors. I should add that the meep-mpi results agree with the ones I get on my PC with meep-serial, so our build seems to be fine.

nb_proc  runtime-res=20   runtime-res=40     runtime-res=60  runtime-res=80
    1          20.5             135                449             1086
    2          11.47             73                230              551
    4          11.52             68                219              530
    8          12.9              67                222              528
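
To make the plateau explicit, here is a minimal sketch that converts the runtimes in the table above into speedup and parallel efficiency relative to the 1-process run. The numbers are copied from the table; the function name `scaling` is just an illustrative helper, not part of meep.

```python
# Runtimes copied from the table above, keyed by resolution, then by
# number of processes (same units as reported by the runs).
runtimes = {
    20: {1: 20.5, 2: 11.47, 4: 11.52, 8: 12.9},
    40: {1: 135, 2: 73, 4: 68, 8: 67},
    60: {1: 449, 2: 230, 4: 219, 8: 222},
    80: {1: 1086, 2: 551, 4: 530, 8: 528},
}


def scaling(times):
    """Return {nproc: (speedup, efficiency)} relative to the 1-process run."""
    t1 = times[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in times.items()}


for res, times in runtimes.items():
    for nproc, (speedup, eff) in scaling(times).items():
        print(f"res={res:3d} nproc={nproc}: "
              f"speedup={speedup:5.2f} efficiency={eff:6.1%}")
```

At res=80, for instance, this gives a speedup of about 1.97 on 2 processes but only about 2.06 on 8, i.e. roughly 26% efficiency.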

Here are some more details for a 3D job of about 3 GB. Below are the stats obtained when requesting 4 processors:
mpirun -np 4 meep-mpi res=100 runtime0=2 norm-run?=true slab3D.ctl

Mem:  16411088k total,  4015216k used, 12395872k free,      256k buffers
Swap:        0k total,        0k used,        0k free,   283692k cached
18175    25   0  353m 221m 6080 R  99.8  1.4   1:10.41  1  meep-mpi
18174    25   0  354m 222m 6388 R 100.2  1.4   1:10.41  6  meep-mpi
18172    25   0 1140m 1.0g 7016 R  99.8  6.3   1:10.41  2  meep-mpi
18173    25   0 1140m 1.0g 6804 R  99.5  6.3   1:10.40  4  meep-mpi

Tasks: 228 total,   5 running, 222 sleeping,   0 stopped,   1 zombie
Cpu1  : 23.9%us, 76.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 23.3%us, 76.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

What we see here is that all four processors are running flat out, but on CPUs 1 and 6 (which run the two processes that are light on memory) only about 1/4 of the time is spent in user code and about 3/4 in system time. System time normally means I/O, but here it is probably MPI communication, presumably those two processes waiting for the two heavier ones. This would explain why I don't get shorter runtimes with more than 2 processors.
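
As a rough cross-check, the imbalance can be quantified from the resident memory of the four processes in the listing above. This is only a sketch under a strong assumption: that the work per process scales with its resident memory, which I have not verified. The "1.0g" entries are rounded in the listing, so 1024 MB is approximate, and `imbalance` is just an illustrative helper.

```python
# Resident memory (RES column) of the four meep-mpi processes, in MB,
# keyed by PID. "1.0g" is rounded in the listing, so 1024 is approximate.
res_mb = {18175: 221, 18174: 222, 18172: 1024, 18173: 1024}


def imbalance(loads):
    """Return (max/mean load ratio, best-case speedup = total/max).

    Assumes the work per process is proportional to its load value.
    """
    values = list(loads.values())
    total = sum(values)
    mean = total / len(values)
    return max(values) / mean, total / max(values)


ratio, bound = imbalance(res_mb)
print(f"max/mean load ratio: {ratio:.2f}")
print(f"best-case speedup on {len(res_mb)} processes: {bound:.2f}")
```

Under that assumption the heaviest process carries about 1.64 times the mean load, and the best achievable speedup on 4 processes would be about 2.4, which is in line with the plateau near 2 seen in the timings.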

So we have a fairly clear load-balance issue. Have you experienced this kind of situation? I was wondering whether there are meep-mpi parameters I can set to influence how the domain is decomposed into chunks.

I can send more details if needed.

Thanks in advance!

Best regards,

Guillaume Demésy

