MPI rank distribution (e.g., 8 ranks per node or 16 ranks per node) is usually 
managed by workload managers such as Slurm or PBS through your job scripts, which 
is outside PETSc's control.
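
For reference, here is a minimal sketch of the kind of Slurm batch script that fixes the rank layout (8 ranks on each of 4 nodes). The --nodes/--ntasks-per-node directives and srun are standard Slurm; the module line and any account/partition settings are site-specific assumptions you would need to fill in:

  #!/bin/bash
  #SBATCH --nodes=4              # number of nodes
  #SBATCH --ntasks-per-node=8    # MPI ranks per node
  #SBATCH --time=00:30:00
  # module load ...              # site-specific PETSc/MPI modules (assumption)
  srun ./ex45 -ksp_monitor -log_view   # launches nodes x ntasks-per-node ranks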

From: Amin Sadeghi <aminthefr...@gmail.com>
Date: Wednesday, March 25, 2020 at 4:40 PM
To: Junchao Zhang <junchao.zh...@gmail.com>
Cc: Mark Adams <mfad...@lbl.gov>, PETSc users list <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] Poor speed up for KSP example 45

Junchao, thank you for doing the experiment. I guess TACC Frontera nodes have 
higher memory bandwidth than Compute Canada's Graham (maybe a more modern CPU 
architecture, although I'm not familiar with which hardware features affect 
memory bandwidth).

Mark, I did as you suggested. As you suspected, running make streams yielded 
consistent results, indicating that memory bandwidth saturates at around 8 
MPI processes. I then ran the experiment on multiple nodes, requesting only 8 
cores per node, and here are the results:

1 node (8 cores total): 17.5s, 6X speedup
2 nodes (16 cores total): 13.5s, 7X speedup
3 nodes (24 cores total): 9.4s, 10X speedup
4 nodes (32 cores total): 8.3s, 12X speedup
5 nodes (40 cores total): 7.0s, 14X speedup
6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
7 nodes (56 cores total): 4.3s, 23X speedup
8 nodes (64 cores total): 3.7s, 27X speedup

Note: as you can see, the 6-node experiment showed extremely poor scaling; I 
suspect it was an outlier, perhaps due to a network problem.

I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and 
here's the result:

2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]

So, it turns out that for a fixed number of cores (64 in our case), much better 
speedups (27X vs. 16X) can be achieved if the cores are spread across more nodes 
rather than packed onto fewer nodes.

Anyway, I really appreciate all your input.

One final question: from what I understand of Mark's comment, PETSc is currently 
blind to the memory hierarchy. Is it feasible to make PETSc aware of inter- and 
intra-node communication so that partitioning is done to maximize performance? 
Or, to put it differently, is this something the PETSc devs have their eyes on 
for the future?


Sincerely,
Amin


On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
I repeated your experiment on one node of TACC Frontera:
1 rank: 85.0s
16 ranks: 8.2s, 10x speedup
32 ranks: 5.7s, 15x speedup

--Junchao Zhang


On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <mfad...@lbl.gov> wrote:
Also, a better test is to see where streams pretty much saturates, then run that 
many processes per node and do the same test while increasing the number of nodes. 
This will tell you how well your network communication is doing.
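
As an illustration, here is a hedged sketch of that node-scaling test on a Slurm system, keeping 8 ranks per node (the streams saturation point reported earlier in the thread) and growing the node count; the exact flags and the use of sbatch --wrap are assumptions about the site setup:

  # fix ranks per node at the streams saturation point, grow the node count
  for NODES in 1 2 4 8; do
    sbatch --nodes=$NODES --ntasks-per-node=8 \
           --wrap "srun ./ex45 -ksp_monitor -log_view"
  done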

But this result has a lot of stuff in "network communication" that can be 
further evaluated. The worst thing about this, I would think, is that the 
partitioning is blind to the memory hierarchy of inter- and intra-node 
communication. The next thing to do is run with an initial grid that puts one 
cell per node, then do uniform refinement until you have one cell per process 
(e.g., one refinement step using 8 processes per node), partition to get one 
cell per process, then do uniform refinement to get a reasonably sized local 
problem. Alas, this is not easy to do, but it is doable.
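
For the structured-grid ex45 used in this thread, the closest analogue one can sketch is to start from a small coarse DMDA and let uniform refinement build up the local problem size. The -da_grid_* and -da_refine names are the standard DMDA command-line options (check ./ex45 -help), and the particular sizes below are arbitrary assumptions:

  # small coarse grid, uniformly refined so each rank ends up with a
  # reasonably sized local problem
  srun --ntasks-per-node=8 ./ex45 -da_grid_x 8 -da_grid_y 8 -da_grid_z 8 \
       -da_refine 5 -ksp_monitor -log_view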

On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
I would guess that you are saturating the memory bandwidth. After you build 
PETSc (make all), it suggests that you test it (make test) and that you run 
streams (make streams).

I see Matt answered, but let me add that when you run make streams you will see 
the memory rate for 1, 2, 3, ... NP processes. If your machine is decent you 
should see very good speedup at the beginning, and then it will start to saturate. 
You are seeing about 50% of perfect speedup at 16 processes. I would expect you 
will see something similar with streams. Without knowing your machine, your 
results look typical.
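
For concreteness, streams can be run from the PETSc tree roughly like this; NPMAX is the documented make variable for the largest process count to test, while the PETSC_DIR/PETSC_ARCH values are placeholders for your own build:

  cd $PETSC_DIR
  make PETSC_DIR=$PETSC_DIR PETSC_ARCH=$PETSC_ARCH streams NPMAX=32
  # reports the measured memory bandwidth for 1..32 MPI processes,
  # which shows where the per-node bandwidth saturates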

On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefr...@gmail.com> wrote:
Hi,

I ran KSP example 45 on a single node with 32 cores and 125GB memory using 1, 
16 and 32 MPI processes. Here's a comparison of the time spent during KSP.solve:

- 1 MPI process: ~98 sec, speedup: 1X
- 16 MPI processes: ~12 sec, speedup: ~8X
- 32 MPI processes: ~11 sec, speedup: ~9X

Since the problem size is large enough (8M unknowns), I expected a speedup much 
closer to 32X, rather than 9X. Is this expected? If yes, how can it be improved?
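
For reference, runs of this kind can be launched and logged roughly as follows (a sketch only: the ~200^3 grid giving about 8M unknowns and the log file names are illustrative assumptions, not taken from the attached logs; -log_view is the standard PETSc logging option):

  # run from the ex45 tutorial directory
  mpiexec -n 1  ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -log_view > log_np1.txt
  mpiexec -n 16 ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -log_view > log_np16.txt
  mpiexec -n 32 ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -log_view > log_np32.txt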

I've attached three log files for more details.

Sincerely,
Amin
