> On Oct 3, 2020, at 9:12 PM, Matthew Knepley <knep...@gmail.com> wrote:
> 
> On Sat, Oct 3, 2020 at 1:49 PM Stefano Zampini <stefano.zamp...@gmail.com 
> <mailto:stefano.zamp...@gmail.com>> wrote:
> 
>     There is a MATPARTITIONINGHIERARCH (man page) that Fande provided, which 
> significantly helped scale up the problems he was working on.
> 
>    Barry
> 
> The scaling issue with DMPlex is the one-to-all pattern of communication that 
> happens when distributing an original sequential mesh. 
> MATPARTITIONINGHIERARCH won't fix the issue.
> In order to get reasonable performance when distributing a sequential mesh 
> on a large number of processes, you need at least two stages of partitioning: 
> an initial one from the sequential mesh to a mesh with one process per node, 
> migrate the PLEX data, then partition on each node separately, and migrate 
> the data again.
> 
> Just to make sure I understand completely: you partition a serial mesh (SELF) 
> onto one process per node (1PROC), then refine, and repartition the new 
> mesh onto the whole machine (WORLD). Thus I 
> need three communicators, right? And also a method for moving a Plex on a 
> subcomm onto the larger comm, using 0 parts on the new ranks.

This is how the loop over the stages looks in pseudocode (PartitionerSetStage() is the 
stage-selection call of our private multilevel partitioner, not PETSc API):

DMPlexDistributeML(DM dm, PetscInt overlap, PetscSF *migration, DM *newdm)
for (i = 0; i < nstages; ++i) {
  DMPlexGetPartitioner(dm, &p);
  PartitionerSetStage(p, i);               /* restrict this stage to its MPI_Group */
  DMPlexDistribute(dm, 0, &sft, &dmt);     /* migrate the Plex data, no overlap yet */
  PetscSFCompose(sf, sft, &sfnew);         /* accumulate the migration SF (sf = sft at stage 0) */
  sf = sfnew; dm = dmt;
  if (i == nstages - 1 && overlap > 0) DMPlexDistributeOverlap(dm, overlap, &sfo, &dmo);
}
*migration = sf; *newdm = dm;

I thought about including the refinement step within the stages, but it turns 
out this is not doable right now if you want to get back a usable migration SF 
(and we need it to migrate our data). For a general solution, we need hooks for 
migrating user-defined data and for generating user-defined data while refining.
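
As an illustration only (a minimal sketch, not our actual code; dm, origSection and 
origVec are hypothetical names for the original mesh and data laid out on it), the 
composed migration SF can move such user data with DMPlexDistributeField():

  DM           dm, newdm;             /* dm holds the original mesh, assumed created earlier */
  PetscSF      migration;
  PetscSection origSection, newSection;
  Vec          origVec, newVec;
  PetscInt     overlap = 1;

  DMPlexDistributeML(dm, overlap, &migration, &newdm);     /* the routine sketched above */
  PetscSectionCreate(PetscObjectComm((PetscObject)newdm), &newSection);
  VecCreate(PetscObjectComm((PetscObject)newdm), &newVec);
  /* Pushes origVec (layout origSection) through the migration SF into newVec/newSection */
  DMPlexDistributeField(newdm, migration, origSection, origVec, newSection, newVec);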

The partitioner is defined with a series of MPI_Groups that identify the 
processes involved at each stage. The temporary meshes that are generated in the 
loop, however, are always defined on the same global communicator.
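
For concreteness, this is one way the stage groups could be built (a sketch under the 
assumption that stage 0 targets one rank per compute node and the last stage targets 
every rank of PETSC_COMM_WORLD; it is not the code in our repo):

  MPI_Comm    nodecomm;
  MPI_Group   worldgrp, stage0grp;
  PetscMPIInt wrank, wsize, noderank, isleader, nleaders = 0, *flags, *leaders;

  MPI_Comm_rank(PETSC_COMM_WORLD, &wrank);
  MPI_Comm_size(PETSC_COMM_WORLD, &wsize);
  /* One communicator per shared-memory node; rank 0 of each node is its "leader" */
  MPI_Comm_split_type(PETSC_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
  MPI_Comm_rank(nodecomm, &noderank);
  isleader = (noderank == 0);

  /* Every process learns which world ranks are node leaders */
  PetscMalloc2(wsize, &flags, wsize, &leaders);
  MPI_Allgather(&isleader, 1, MPI_INT, flags, 1, MPI_INT, PETSC_COMM_WORLD);
  for (PetscMPIInt r = 0; r < wsize; ++r) if (flags[r]) leaders[nleaders++] = r;

  /* Stage 0 group: the node leaders; last stage group: the whole world group */
  MPI_Comm_group(PETSC_COMM_WORLD, &worldgrp);
  MPI_Group_incl(worldgrp, nleaders, leaders, &stage0grp);
  PetscFree2(flags, leaders);   /* free nodecomm and the groups when done */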

 



> 
>   Thanks,
> 
>      Matt
>  
> 
>> On Oct 3, 2020, at 10:04 AM, Matthew Knepley <knep...@gmail.com 
>> <mailto:knep...@gmail.com>> wrote:
>> 
>> On Sat, Oct 3, 2020 at 10:51 AM Stefano Zampini <stefano.zamp...@gmail.com 
>> <mailto:stefano.zamp...@gmail.com>> wrote:
>> 
>> 
>> 
>> Secondly, I'd like to add a multilevel "simple" partitioning in DMPlex to 
>> optimize communication. I am thinking that I can create a mesh with 'nnodes' 
>> cells and distribute that to 'nnodes*procs_node' processes with a "spread" 
>> distribution (the default seems to be "compact"). Then refine that enough 
>> to get 'procs_node' more cells and then use a simple partitioner again to put 
>> one cell on each process, in such a way that locality is preserved (not 
>> sure how that would work). Then refine from there on each proc for a scaling 
>> study.
>> 
>> 
>> Mark
>> 
>> For multilevel partitioning you need custom code, since what kills 
>> performance with one-to-all patterns in DMPlex is the actual communication 
>> of the mesh data.
>> However, you can always generate a mesh to have one cell per process, and 
>> then refine from there.
>> 
>> I have coded a multilevel partitioner that works quite well for general 
>> meshes; we have it in a private repo with Lisandro. In my experience, the 
>> benefits of the multilevel scheme start at around 4K processes. If you 
>> plan very large runs (say > 32K cores), then you definitely want a multistage 
>> scheme.
>> 
>> We never contributed the code since it requires some boilerplate code to run 
>> through the stages of the partitioning and move the data.
>> If you are using hexas, you can always define your own "shell" partitioner 
>> producing box decompositions.
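
A minimal sketch of that shell-partitioner idea (assumptions: the serial mesh dm was 
created earlier with all cells on rank 0, and consecutive cell numbers are spatially 
close, as in a lexicographically numbered box of hexes; this is illustrative, not the 
private partitioner):

  DM               dm, dmDist;
  PetscPartitioner part;
  PetscInt         cStart, cEnd, nCells, nParts, *sizes, *points;
  PetscMPIInt      commSize;
  PetscSF          sf;

  MPI_Comm_size(PETSC_COMM_WORLD, &commSize);
  nParts = commSize;                              /* one part per rank; use fewer parts for a first stage */
  DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd);  /* locally owned cells (all on rank 0 here) */
  nCells = cEnd - cStart;
  PetscMalloc2(nParts, &sizes, nCells, &points);
  for (PetscInt p = 0; p < nParts; ++p) sizes[p] = nCells/nParts + (p < nCells % nParts ? 1 : 0);
  for (PetscInt c = 0; c < nCells; ++c) points[c] = cStart + c;  /* contiguous cell blocks -> box-like parts */
  DMPlexGetPartitioner(dm, &part);
  PetscPartitionerSetType(part, PETSCPARTITIONERSHELL);
  PetscPartitionerShellSetPartition(part, nParts, sizes, points);
  DMPlexDistribute(dm, 0, &sf, &dmDist);          /* migrates the mesh to the prescribed decomposition */
  PetscFree2(sizes, points);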
>> 
>> I could integrate it if you want to stop maintaining it there :) It sounds 
>> really useful.
>> 
>>   Thanks,
>> 
>>      Matt
>>  
>> Another option is to generate the meshes up front in serial, and then use 
>> the parallel HDF5 reader that Vaclav and Matt put together.
>>  
>> The point here is to get communication patterns that look like an 
>> (idealized) well-partitioned application. (I suppose I could take an array of 
>> factors, the product of which is the number of processors, and generalize 
>> this in a loop for any number of memory levels, or make an oct-tree).
>> 
>> Any thoughts?
>> Thanks,
>> Mark
>> 
>> 
>> 
>> 
>> -- 
>> Stefano
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
