On Wed, Mar 25, 2020 at 6:40 PM Fande Kong <fdkong...@gmail.com> wrote:
>
> On Wed, Mar 25, 2020 at 12:18 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>> Also, a better test is to see where streams pretty much saturates, then
>> run that many processes per node and do the same test while increasing
>> the number of nodes. This will tell you how well your network
>> communication is doing.
>>
>> But this result lumps a lot of things into "network communication" that
>> can be evaluated further. The worst of it, I would think, is that the
>> partitioning is blind to the memory hierarchy of inter-node and
>> intra-node communication.
>>
>
> Hierarchical partitioning was designed for this purpose.
> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/MatOrderings/MATPARTITIONINGHIERARCH.html#MATPARTITIONINGHIERARCH
>

That's fantastic!

> Fande,
>
>
>> The next thing to do is run with an initial grid that puts one cell per
>> node, then do uniform refinement until you have one cell per process
>> (e.g., one refinement step using 8 processes per node), partition to get
>> one cell per process, and then do uniform refinement to get a reasonably
>> sized local problem. Alas, this is not easy to do, but it is doable.
>>
>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> I would guess that you are saturating the memory bandwidth. After you
>>> build PETSc (make all), it will suggest that you test it (make test) and
>>> that you run streams (make streams).
>>>
>>> I see Matt answered, but let me add that when you run streams you will
>>> see the memory rate for 1, 2, 3, ..., NP processes. If your machine is
>>> decent, you should see very good speedup at the beginning, and then it
>>> will start to saturate. You are seeing about 50% of perfect speedup at
>>> 16 processes; I would expect that you will see something similar with
>>> streams. Without knowing your machine, your results look typical.
>>>
>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefr...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I ran KSP example 45 on a single node with 32 cores and 125 GB of
>>>> memory, using 1, 16, and 32 MPI processes. Here's a comparison of the
>>>> time spent in KSPSolve:
>>>>
>>>> - 1 MPI process: ~98 sec, speedup: 1X
>>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>>
>>>> Since the problem size is large enough (8M unknowns), I expected a
>>>> speedup much closer to 32X, rather than 9X. Is this expected? If so,
>>>> how can it be improved?
>>>>
>>>> I've attached three log files for more details.
>>>>
>>>> Sincerely,
>>>> Amin
>>>>
>>>
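
For anyone who wants to try the hierarchical partitioner programmatically,
here is a minimal sketch (not tested); it assumes the
MatPartitioningHierarchicalSetNcoarseparts / MatPartitioningHierarchicalSetNfineparts
setters described on the manual page above, and A is an already-assembled
adjacency matrix; nnodes and ncores are illustrative names:

    #include <petscmat.h>

    /* Sketch: build a two-level partition of the graph of A, first into
       nnodes coarse parts (one per compute node), then each coarse part
       into ncores fine parts (one per core), so that the heaviest edge
       cuts land on intra-node links rather than the network. */
    static PetscErrorCode HierarchicalPartition(Mat A, PetscInt nnodes, PetscInt ncores, IS *is)
    {
      MatPartitioning part;
      PetscErrorCode  ierr;

      PetscFunctionBeginUser;
      ierr = MatPartitioningCreate(PetscObjectComm((PetscObject)A),&part);CHKERRQ(ierr);
      ierr = MatPartitioningSetAdjacency(part,A);CHKERRQ(ierr);
      ierr = MatPartitioningSetType(part,MATPARTITIONINGHIERARCH);CHKERRQ(ierr);
      ierr = MatPartitioningHierarchicalSetNcoarseparts(part,nnodes);CHKERRQ(ierr); /* inter-node level */
      ierr = MatPartitioningHierarchicalSetNfineparts(part,ncores);CHKERRQ(ierr);   /* intra-node level */
      ierr = MatPartitioningApply(part,is);CHKERRQ(ierr); /* (*is)[i] = new owner of row i */
      ierr = MatPartitioningDestroy(&part);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

The same partitioner should also be selectable at runtime with
-mat_partitioning_type hierarch; the option names for the coarse and fine
part counts are listed on the manual page above.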
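As a concrete starting point for the streams advice above: PETSc ships the
benchmark, and, assuming a standard build (NPMAX caps the process counts the
sweep tries; 32 just matches the node in question), it can be run as

    cd $PETSC_DIR
    make streams NPMAX=32

The reported bandwidth should rise steeply over the first few processes and
then flatten; the process count where it flattens is roughly how many MPI
ranks per node a memory-bound solve can use productively, and the number to
hold fixed while adding nodes.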
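For completeness, a sketch of how the three runs in the original post can be
reproduced. The grid options are an assumption on my part: ex45's default
DMDA grid is small, and a 201^3 grid gives roughly 8M unknowns; -log_view
prints the timing tables the attached logs came from:

    cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
    make ex45
    mpiexec -n 1  ./ex45 -da_grid_x 201 -da_grid_y 201 -da_grid_z 201 -log_view
    mpiexec -n 16 ./ex45 -da_grid_x 201 -da_grid_y 201 -da_grid_z 201 -log_view
    mpiexec -n 32 ./ex45 -da_grid_x 201 -da_grid_y 201 -da_grid_z 201 -log_view

If the MFlop/s per rank in rows like MatMult and VecAXPY drop as ranks are
added while the iteration count stays flat, the stall is memory bandwidth
rather than the algorithm.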