Hi Shavkat,

Thanks a lot for your quick answer.
1) You are right, I didn't mention the layout of the cluster. For now, I'm working on one node of a larger machine based on Intel's Nehalem architecture (I'm trying to evaluate performance on a single node before using more). Each node has 16 GB of RAM for 8 cores (2 GB per core).

3) I gave global runtimes for the overall simulation in my previous message. However, the performance screenshots were taken after the "real" work had started. As for I/O, I don't output anything, and my only input is the averaged epsilon structure. By the way, the subpixel averaging step itself scales well (82.4 s with 1 proc, 46.9 s with 2 procs, 9.4 s with 8 procs).

2) Have you ever witnessed this kind of unbalanced behavior (unbalanced memory use, I mean) in your own runs?

Thanks again.

Best regards,

Guillaume

Nizamov Shawkat <nizamov.shaw...@gmail.com> wrote:

1) You didn't provide any details on the layout of your cluster. It is
hard to guess whether you have several dual-core machines with 16 GB of
memory each, or whether each machine has 6 cores and you are running on
only one of them.
2) I recall that with a 4-core (or was it 8-core?) Athlon in a single PC
there was no acceleration beyond 3 cores. In my case the limitation was
most probably memory bandwidth, which does not seem to be your problem:
even then, all of my cores were still running at almost 100%.
3) Are you sure that you are timing the actual simulation and not the
initialization? Populating memory with the epsilons is only partly
parallel, in the sense that each core populates just its own piece of the
simulation space. If the cell is uniformly filled this completes quickly,
but if it has some structure, especially if subpixel averaging is turned
on, it may take much longer, and during that time the other cores simply
wait. Print some debug information like "structure initialization" and
"simulation started" (a rough sketch follows below) and compare the
timing distribution. From runtime0=2 I conclude that your simulation
time is actually rather short.
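
For example, something along these lines in the ctl file (just a sketch: I assume runtime0 is a define-param in your slab3D.ctl, and I use Guile's built-in current-time only to get coarse, 1-second-resolution wall-clock timestamps):

(define t0 (current-time))   ; wall-clock seconds (Guile built-in)
(init-fields)                ; forces the structure/epsilon setup (incl. subpixel averaging) now
(print "structure initialization: " (- (current-time) t0) " s\n")
(define t1 (current-time))
(run-until runtime0)         ; the actual FDTD time stepping
(print "time stepping: " (- (current-time) t1) " s\n")

run-until would trigger the initialization by itself anyway; calling init-fields explicitly just separates the two phases in the timings.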

With best regards,
Shawkat Nizamov

2010/7/15, gdem...@physics.utoronto.ca <gdem...@physics.utoronto.ca>:
Dear Meep users and developers,

I'm getting strange scaling performance using meep-mpi compiled with
IntelMPI on our cluster. When I go from 1 to 2 processors, I get
almost ideal scaling (i.e. the runtime is almost halved, as shown
below for various problem sizes), but the scaling becomes very weak
when using more than 2 processors. I should say that the meep-mpi
results agree with the ones I get on my PC with meep-serial (in other
words, our compilation seems all right).

nb_proc  runtime(res=20)  runtime(res=40)   runtime(res=60)  runtime(res=80)
     1          20.5             135                449             1086
     2          11.47             73                230              551
     4          11.52             68                219              530
     8          12.9              67                222              528
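
Expressed as parallel efficiency E(p) = T(1)/(p*T(p)), the res=80 column gives roughly:

  2 procs: 1086/551 ≈ 1.97x speedup  (~99% efficiency)
  4 procs: 1086/530 ≈ 2.05x speedup  (~51% efficiency)
  8 procs: 1086/528 ≈ 2.06x speedup  (~26% efficiency)

so essentially all of the gain comes from the first two processes.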

Let me give some more details for a job of ~3 GB (a 3D problem). I
am showing below the stats obtained when requesting 4 processors:
mpirun -np 4 meep-mpi res=100 runtime0=2 norm-run?=true slab3D.ctl

-------------------------------------------------------------------------
Mem:  16411088k total,  4015216k used, 12395872k free,      256k buffers
Swap:        0k total,        0k used,        0k free,   283692k cached
     PID    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+    P  COMMAND
18175    25   0  353m 221m 6080 R  99.8  1.4   1:10.41  1  meep-mpi
18174    25   0  354m 222m 6388 R 100.2  1.4   1:10.41  6  meep-mpi
18172    25   0 1140m 1.0g 7016 R  99.8  6.3   1:10.41  2  meep-mpi
18173    25   0 1140m 1.0g 6804 R  99.5  6.3   1:10.40  4  meep-mpi

Tasks: 228 total,   5 running, 222 sleeping,   0 stopped,   1 zombie
Cpu1  : 23.9%us, 76.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu6  : 23.3%us, 76.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu4  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
[...]
-------------------------------------------------------------------------

So what we see here is that while the processors are all running flat
out, CPUs 1 and 6 (which run the two processes that are light on
memory) spend only about 1/4 of their time in user code and 3/4 in
system time (normally I/O, but here probably MPI communication). This
explains why I don't get shorter runtimes with more than 2 processors.

So we have a fairly clear load-balance issue; have you experienced
this kind of situation? I was wondering whether there are meep-mpi
parameters I can set to influence the domain decomposition into chunks
in a helpful way.
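
For what it's worth, the resident memory in the top output above is ~1.0 GB for two of the processes versus ~220 MB for the other two, a ratio of roughly 4.5:1, so the chunks themselves appear to be very unequal in size.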

I can send more details if needed.

Thanks in advance!

Best regards,

Guillaume Demésy
