In case someone wants to learn more about the hierarchical partitioning 
algorithm, here is a reference:

https://arxiv.org/pdf/1809.02666.pdf

Thanks 

Fande 
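
As a rough illustration of why node-aware partitioning matters, here is a toy sketch in Python. It is not the algorithm from the reference above and not PETSc's implementation; the 8x8 grid, the 4 nodes with 4 processes each, and the names count_internode_edges, flat_node and hierarchical_node are all assumptions made up for this example. Each process owns the same 2x2 block of cells in both cases; only the grouping of blocks onto nodes differs.

```python
N = 8                       # toy 8x8 grid of cells (assumption)
BLOCK = 2                   # each process owns a 2x2 block -> 16 processes
PROCS_PER_ROW = N // BLOCK  # 4 process blocks per grid row
PROCS_PER_NODE = 4          # 4 nodes with 4 processes each (assumption)


def count_internode_edges(n, node_of_cell):
    """Count grid edges whose two end cells live on different nodes."""
    cut = 0
    for i in range(n):
        for j in range(n):
            if i + 1 < n and node_of_cell(i, j) != node_of_cell(i + 1, j):
                cut += 1
            if j + 1 < n and node_of_cell(i, j) != node_of_cell(i, j + 1):
                cut += 1
    return cut


def flat_node(i, j):
    """Node-blind layout: number the blocks row by row, fill nodes in rank order."""
    rank = (i // BLOCK) * PROCS_PER_ROW + (j // BLOCK)
    return rank // PROCS_PER_NODE


def hierarchical_node(i, j):
    """Two-level layout: split the grid into 4 quadrants (one per node) first."""
    half = N // 2
    return (i // half) * 2 + (j // half)


flat_cut = count_internode_edges(N, flat_node)
hier_cut = count_internode_edges(N, hierarchical_node)
print(f"inter-node edges, node-blind layout:   {flat_cut}")
print(f"inter-node edges, hierarchical layout: {hier_cut}")
```

In this toy setup the node-blind layout gives each node a long strip of the grid, while the two-level layout gives each node a compact quadrant, so fewer cell-to-cell edges cross a node boundary.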


> On Mar 25, 2020, at 5:18 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> 
> 
> 
>> On Wed, Mar 25, 2020 at 6:40 PM Fande Kong <fdkong...@gmail.com> wrote:
>>> 
>>> 
>>>> On Wed, Mar 25, 2020 at 12:18 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>> Also, a better test is to see where streams pretty much saturates, then 
>>>> run that many processes per node and repeat the same test while 
>>>> increasing the number of nodes. This will tell you how well your network 
>>>> communication is doing.
>>>> 
>>>> But this result has a lot of stuff in "network communication" that can be 
>>>> further evaluated. The worst thing about this, I would think, is that the 
>>>> partitioning is blind to the memory hierarchy of inter- and intra-node 
>>>> communication.
>>> 
>>> Hierarchical partitioning was designed for this purpose. 
>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/MatOrderings/MATPARTITIONINGHIERARCH.html#MATPARTITIONINGHIERARCH
>>> 
>> 
>> That's fantastic!
>>  
>> Fande,
>>  
>>> The next thing to do is run with an initial grid that puts one cell per 
>>> node and then do uniform refinement until you have one cell per process 
>>> (e.g., one refinement step using 8 processes per node), partition to get 
>>> one cell per process, then do uniform refinement to get a reasonably 
>>> sized local problem. Alas, this is not easy to do, but it is doable.
>>> 
>>>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>> I would guess that you are saturating the memory bandwidth. After you make 
>>>> PETSc (make all) it will suggest that you test it (make test) and suggest 
>>>> that you run streams (make streams).
>>>> 
>>>> I see Matt answered but let me add that when you make streams you will 
>>>> see the memory rate for 1,2,3, ... NP processes. If your machine is 
>>>> decent you should see very good speedup at the beginning and then it will 
>>>> start to saturate. You are seeing about 50% of perfect speedup at 16 
>>>> processes. I would expect that you will see something similar with streams. 
>>>> Without knowing your machine, your results look typical.
>>>> 
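
The saturation behaviour described above can be pictured with a deliberately crude model: speedup grows linearly with the number of processes until the memory bus is saturated, then stays flat. The sketch below is only that model, not a measurement and not the output of make streams. The function name modeled_speedup and the saturation point of 9 processes are hypothetical, picked only to mimic the ~9X plateau reported in this thread.

```python
def modeled_speedup(nprocs, saturation_procs):
    """Bandwidth-bound speedup: linear until the memory bus saturates, flat after."""
    return float(min(nprocs, saturation_procs))


# Hypothetical saturation point, chosen only to mimic the ~9X plateau above.
SATURATION_PROCS = 9

for p in (1, 2, 4, 8, 16, 32):
    s = modeled_speedup(p, SATURATION_PROCS)
    print(f"{p:2d} processes: modeled speedup {s:4.1f}X, efficiency {s / p:6.1%}")
```

Real machines saturate gradually rather than at a sharp knee, so measured streams numbers are the thing to trust; the model only shows why adding processes on one node stops helping once bandwidth is exhausted.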
>>>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefr...@gmail.com> 
>>>>> wrote:
>>>>> Hi,
>>>>> 
>>>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory 
>>>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent 
>>>>> during KSP.solve:
>>>>> 
>>>>> - 1 MPI process: ~98 sec, speedup: 1X
>>>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>>> 
>>>>> Since the problem size is large enough (8M unknowns), I expected a 
>>>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how 
>>>>> can it be improved?
>>>>> 
>>>>> I've attached three log files for more details. 
>>>>> 
>>>>> Sincerely,
>>>>> Amin
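
For reference, the speedup and parallel efficiency behind the numbers in the original question can be worked out as follows. This is a minimal sketch: the timings are the approximate KSP.solve times quoted above, and the helper name scaling_table is made up for this example.

```python
# Approximate KSP.solve times (seconds) as reported in the original question.
timings = {1: 98.0, 16: 12.0, 32: 11.0}


def scaling_table(timings):
    """Return (nprocs, speedup, efficiency) rows relative to the 1-process run."""
    base = timings[1]
    rows = []
    for nprocs in sorted(timings):
        speedup = base / timings[nprocs]   # how many times faster than 1 process
        efficiency = speedup / nprocs      # fraction of perfect (linear) speedup
        rows.append((nprocs, speedup, efficiency))
    return rows


for nprocs, speedup, efficiency in scaling_table(timings):
    print(f"{nprocs:2d} processes: speedup {speedup:4.1f}X, efficiency {efficiency:6.1%}")
```

The efficiency column is what the "about 50% of perfect speedup at 16 processes" remark refers to; going from 16 to 32 processes barely changes the solve time, so the efficiency drops by roughly half again.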
