On Dec 6, 2013, at 6:28 PM, Garth N. Wells wrote:

> On 2013-12-05 08:57, Mikael Mortensen wrote:
>> On Dec 4, 2013, at 6:58 PM, Garth N. Wells wrote:
>>> On 2013-12-04 17:15, Anders Logg wrote:
>>>> On Tue, Dec 03, 2013 at 10:12:22AM +0100, Mikael Mortensen wrote:
>>>>> On Dec 2, 2013, at 9:17 PM, Anders Logg wrote:
>>>>>> I don't see a simple way around this (other than inventing a new
>>>>>> algorithm - the only way I can think of is to create a third data
>>>>>> structure with a regular grid predictively distributed onto the
>>>>>> different processors and then communicate via that grid) so I don't
>>>>>> mind this being added.
>>>>> Great.
>>>>> I could probably create a very inefficient parallel interpolation
>>>>> routine by looping over all partitions of the distributed mesh1, one
>>>>> by one, collectively (i.e., broadcast mesh1's partition on process 0
>>>>> to all processes, run collectively over it so that u0.eval or
>>>>> evaluate_dofs can be called collectively, and when finished move on
>>>>> to the partition on process 1). That could work for any space, I guess?
>>>> I meant communicating via a third independent grid constructed in a
>>>> way such that each cell only needs to communicate with a subset of
>>>> that third grid.
>>> I would create a bounding box(es) of the mesh on each process, and then 
>>> build a search tree for the boxes. Then do
>> That's a good idea, if I understand correctly: use all_gathered
>> bounding boxes for the mesh partitions to find the process that owns
>> the interpolation point
> 
> The bounding boxes can be used to determine which processes *might* own a 
> particular point. In the simple case, each process could store a copy of all 
> bounding boxes. For really large problems, we might want a more elaborate 
> distributed tree design.

I have implemented this now. There is a significant speed up when using a large 
number of processors (I've tried up to 32).
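
To sketch the idea (illustrative only; the names below are placeholders and
not the actual code attached to the issue): each process all_gathers the local
bounding boxes once, and a point is then tested against all of them to find
the ranks that *might* own it.

// Sketch: which processes might own point p? (placeholder names, 3D)
#include <mpi.h>
#include <array>
#include <vector>

// Local mesh bounding box: xmin, ymin, zmin, xmax, ymax, zmax
typedef std::array<double, 6> BoundingBox;

// Gather every process' bounding box so each rank can test points locally
std::vector<BoundingBox> gather_bounding_boxes(const BoundingBox& local_box,
                                               MPI_Comm comm)
{
  int size;
  MPI_Comm_size(comm, &size);
  std::vector<BoundingBox> all_boxes(size);
  MPI_Allgather(local_box.data(), 6, MPI_DOUBLE,
                all_boxes.data(), 6, MPI_DOUBLE, comm);
  return all_boxes;
}

// Return the ranks whose bounding box contains p (candidate owners only;
// the point still has to be located in an actual cell on those ranks)
std::vector<int> candidate_owners(const std::array<double, 3>& p,
                                  const std::vector<BoundingBox>& boxes)
{
  std::vector<int> ranks;
  for (std::size_t r = 0; r < boxes.size(); ++r)
  {
    const BoundingBox& b = boxes[r];
    if (p[0] >= b[0] && p[1] >= b[1] && p[2] >= b[2] &&
        p[0] <= b[3] && p[1] <= b[4] && p[2] <= b[5])
      ranks.push_back(static_cast<int>(r));
  }
  return ranks;
}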

> 
>> and then only call u0.eval after found.
> 
> The function call u0.eval(p) should ask the process that might own the point 
> 'p' to do the eval, and to send back the result.
> 
>> I can
>> then reuse in_bounding_box from PeriodicBoundaryComputation, right?
>> I'll have a closer look.
> 
> Maybe. Most likely it should be spun out as a separate class. I think you 
> could register this as a task/issue on Bitbucket.

In the current implementation I use both the lt_coordinate struct and 
in_bounding_box from PeriodicBoundaryComputation. Not sure how this should be 
organized in a separate class though. 

There is a considerable speed-up for mixed spaces if I create a map from 
interpolation points to the dofs living at that point (e.g., a VectorFunctionSpace 
has three dofs at each point in 3D). This reduces the calls to eval to one per 
interpolation point and lowers the MPI traffic. However, the std::map is slower 
than a regular std::vector for single spaces. I'm not sure yet what the best way 
to combine these two is. Any suggestions are most welcome.
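
For illustration, the kind of map I mean looks roughly like the sketch below.
The lt_coordinate comparator here is a simplified stand-in for the one in
PeriodicBoundaryComputation, and the other names are placeholders, not the
actual code.

#include <map>
#include <vector>

// Simplified stand-in for the lt_coordinate struct used in
// PeriodicBoundaryComputation: lexicographic "less than" with a tolerance
struct lt_coordinate
{
  bool operator()(const std::vector<double>& x,
                  const std::vector<double>& y) const
  {
    const double tol = 1.0e-12;
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i)
    {
      if (x[i] < y[i] - tol) return true;
      if (x[i] > y[i] + tol) return false;
    }
    return false;
  }
};

typedef std::vector<double> Point;  // dof coordinate (length = gdim)
typedef std::map<Point, std::vector<std::size_t>, lt_coordinate> PointToDofsMap;

// Group the dofs of a (possibly mixed) space by the physical point they sit
// at, so that eval is called once per point rather than once per dof
PointToDofsMap group_dofs_by_point(const std::vector<double>& dof_coords,
                                   std::size_t gdim)
{
  PointToDofsMap point_to_dofs;
  const std::size_t num_dofs = dof_coords.size() / gdim;
  for (std::size_t dof = 0; dof < num_dofs; ++dof)
  {
    Point x(dof_coords.begin() + dof*gdim, dof_coords.begin() + (dof + 1)*gdim);
    point_to_dofs[x].push_back(dof);  // e.g. 3 dofs per point for a 3D vector space
  }
  return point_to_dofs;
}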

I need to clean up the code a bit, but the current version is attached to issue 
162.
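
For what it's worth, the round-robin point search described in step 3 of my
original mail (quoted below) boils down to something like the sketch below. It
only shows the communication pattern (placeholder names, flattened
coordinates); the evaluation with u0.eval, the bookkeeping of which points
were found, and returning results to the owning rank are omitted.

#include <mpi.h>
#include <vector>

// Pass the batch of not-yet-found points around the ring of processes. After
// size-1 passes every point has visited every rank exactly once.
void ring_search(std::vector<double>& not_found, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  const int dst = (rank + 1) % size;
  const int src = (rank - 1 + size) % size;

  for (int pass = 1; pass < size; ++pass)
  {
    // Exchange batch sizes, then the coordinates, each in one sendrecv
    int n_send = static_cast<int>(not_found.size());
    int n_recv = 0;
    MPI_Sendrecv(&n_send, 1, MPI_INT, dst, 0,
                 &n_recv, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);

    std::vector<double> incoming(n_recv);
    MPI_Sendrecv(not_found.data(), n_send, MPI_DOUBLE, dst, 1,
                 incoming.data(), n_recv, MPI_DOUBLE, src, 1,
                 comm, MPI_STATUS_IGNORE);

    // Here the received points would be searched for in the local mesh and
    // evaluated with u0.eval; the ones still not found form the next batch
    not_found = incoming;
  }
}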

Mikael

> 
>> Are you sure this isn't handled automatically by the bounding box trees
>> created the first time u0.eval is called?
>>> for (CellIterator cell(*_mesh); !cell.end(); ++cell)
>>> {
>>>  // If interpolation point is on this process, do as we do now
>>>  // If interpolation point is *not* on this process, insert
>>>  // into cache to be fetched
>>> }
>>> // Call function to retrieve non-local data. This function should use
>>> // the search tree.
>> Not quite sure I follow. AFAIK it is still not possible to do as
>> before, since restrict_as_ufc_function calls element.evaluate_dofs and
>> as such the entire cell must be on both processes, not just the
>> interpolation point. I would follow if the above were an iteration over
>> interpolation points, though, like it already is in my code.
> 
> 
> Yes, I'm thinking in terms of iteration over interpolation points.
> 
>> For Lagrange elements there is really no reason to do the cell loop.
> 
> For P1 yes, for higher order it may be necessary. It could be made efficient 
> by going over all mesh entities.
> 
> Garth
> 
>> I
>> forgot to mention in the previous mail that the nonmatching routine
>> attached to the issue is 4-5 (!!!) times faster in 3D than the regular
>> interpolate for P1 elements. That is because the loop is over vertices
>> and not over cells, and there are far fewer callbacks to eval: just
>> one per dof, which is optimal.
>> Mikael
>>> The constrained/periodic bc code uses bounding boxes to locate constrained 
>>> vertices/edges/facets that are off-process. It doesn't use a tree for 
>>> searching, but it should. Some abstract function for finding the 
>>> off-process owner of a point or mesh entity would be very helpful.
>>> Garth
>>>> --
>>>> Anders
>>>>>> Since not all elements are supported, I think this function should
>>>>>> have a separate name for now, perhaps
>>>>>> 
>>>>>>  interpolate_lagrange()
>>>>>> 
>>>>>> My suggestion would be that the documentation explains that this
>>>>>> function works in parallel for non-matching meshes (in contrast to the
>>>>>> regular interpolate) but only supports Lagrange spaces.
>>>>> Sounds good to me.
>>>>> Mikael
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> This is in reference to Issue 162:
>>>>>>> https://bitbucket.org/fenics-project/dolfin/issue/162/interpolation-nonmatching-mesh-in-parallel
>>>>>>> 
>>>>>>> Currently it is not possible in parallel to interpolate a dolfin
>>>>>>> Function u0 defined on one mesh onto a Function in a FunctionSpace of
>>>>>>> the same type defined on a different mesh. I would like to be able to
>>>>>>> do this
>>>>>>> 
>>>>>>> mesh0 = UnitSquareMesh(100, 100)
>>>>>>> mesh1 = UnitSquareMesh(20, 20)
>>>>>>> V0 = FunctionSpace(mesh0, "CG", 1)
>>>>>>> V1 = FunctionSpace(mesh1, "CG", 1)
>>>>>>> u0 = interpolate(Expression("x[0]*x[1]"), V0)
>>>>>>> u0.update()
>>>>>>> u = interpolate_nonmatching_mesh(u0, V1)
>>>>>>> 
>>>>>>> or probably overloaded like this
>>>>>>> 
>>>>>>> u = interpolate(u0, V1)
>>>>>>> 
>>>>>>> Now there are probably many different ways this can be implemented. The 
>>>>>>> most
>>>>>>> obvious approach that comes to mind is simply to overload 
>>>>>>> `Function::eval` with
>>>>>>> a parallel version. There are many reasons why this doesn’t quite seem 
>>>>>>> to work.
>>>>>>> Consider the current `interpolate` implementation
>>>>>>> 
>>>>>>> FunctionSpace::interpolate(GenericVector& expansion_coefficients,
>>>>>>>                            const GenericFunction& v)
>>>>>>> …
>>>>>>> // Iterate over mesh and interpolate on each cell
>>>>>>> UFCCell ufc_cell(*_mesh);
>>>>>>> for (CellIterator cell(*_mesh); !cell.end(); ++cell)
>>>>>>> {
>>>>>>>   // Update to current cell
>>>>>>>   ufc_cell.update(*cell);
>>>>>>> 
>>>>>>>   // Restrict function to cell
>>>>>>>   v.restrict(&cell_coefficients[0], *_element, *cell, ufc_cell);
>>>>>>> 
>>>>>>>   // Tabulate dofs
>>>>>>>   const std::vector<dolfin::la_index>& cell_dofs
>>>>>>>     = _dofmap->cell_dofs(cell->index());
>>>>>>> 
>>>>>>>   // Copy dofs to vector
>>>>>>>   expansion_coefficients.set(&cell_coefficients[0],
>>>>>>>                              _dofmap->cell_dimension(cell->index()),
>>>>>>>                              &cell_dofs[0]);
>>>>>>> }
>>>>>>> 
>>>>>>> On each process the cell loop runs over the part of the V1 mesh that
>>>>>>> lives on that process. `v.restrict` calls `restrict_as_ufc_function`
>>>>>>> since V0 and V1 have different meshes; `restrict_as_ufc_function` calls
>>>>>>> `element.evaluate_dofs`, which in turn calls `u0.eval` for each dof on
>>>>>>> the cell. The major problem that makes it difficult to extend this to
>>>>>>> parallel is that neither the cell iteration nor `eval` is collective.
>>>>>>> Since the coordinates of the dofs on a cell of the V1 mesh could live
>>>>>>> on any other CPU for the V0 mesh, `eval` would need to be called
>>>>>>> collectively with some try/catch clause and with a lot of broadcasting
>>>>>>> (of coordinates and results) to obtain what is required. Alternatively,
>>>>>>> the non-collective `eval` calls that fail must be captured and handled
>>>>>>> with MPI communication later.
>>>>>>> 
>>>>>>> Faced with these problems, my solution has instead been to create a
>>>>>>> stand-alone version for parallel interpolation on nonmatching meshes,
>>>>>>> rather than trying to overload `eval` or fit it into the current
>>>>>>> `interpolate` framework. The code that performs the interpolation is
>>>>>>> attached to issue 162. It is fairly well documented and should,
>>>>>>> hopefully, be easy to follow. The algorithm is basically like this:
>>>>>>> 
>>>>>>> 1) On each process, tabulate all coordinates of the dofs in the
>>>>>>> function space we’re interpolating to, i.e., V1 in the above code:
>>>>>>> 
>>>>>>> x = V1.dofmap()->tabulate_all_coordinates(mesh1)
>>>>>>> 
>>>>>>> 2) Create a map from dof number in V1 to component number in the mixed
>>>>>>> space.
>>>>>>> 
>>>>>>> This is necessary to be able to handle mixed spaces. I do not think I
>>>>>>> have found the best solution to this problem, so any help here would be
>>>>>>> much appreciated. I just realized that the current solution does not
>>>>>>> work for special spaces like BDM or RT.
>>>>>>> 
>>>>>>> 3) Compute `u0.eval` for all points x computed in 1).
>>>>>>> 
>>>>>>> Problem here is that u0 and u have different meshes and as such a point 
>>>>>>> in u
>>>>>>> will not necessarily, or not even likely, be found on the same 
>>>>>>> processor for
>>>>>>> u0. Hence the dofs coordinates must be passed around and searched on 
>>>>>>> all ranks
>>>>>>> until found. When found the result is returned to the process that owns 
>>>>>>> the
>>>>>>> dof. (Alternatively, but slower, `u0.eval` could simply be called
>>>>>>> collectively.) A major issue here for performance is that all points 
>>>>>>> not found
>>>>>>> on one process are collected in a vector and passed on to the
>>>>>>> next process through just one single MPI::send_recv. All points not 
>>>>>>> found are
>>>>>>> sent and points not found by process with one lower rank are received.
>>>>>>> 
>>>>>>> 4) Finally, set all values in u using the dof to component map.
>>>>>>> 
>>>>>>> 
>>>>>>> The problem is inherently hard to scale, since the efficiency depends
>>>>>>> on how the two different meshes are distributed. If they happen to
>>>>>>> overlap well, the routine may be faster on 2 CPUs than on one, but in
>>>>>>> the end that depends on the mesh partitioning, which you cannot really
>>>>>>> control, at least not with the default mesh partitioner. Consider
>>>>>>> running a problem on 2 CPUs. Each dof must then be searched for on
>>>>>>> average 1.5 times (half are found on this CPU and the rest on the
>>>>>>> other). Since each process's mesh is half the size of the original,
>>>>>>> the total cost is approximately 1.5*0.5 = 0.75. So, not counting any
>>>>>>> cost of MPI communication, you should expect the computational time to
>>>>>>> drop to 75% of the original. However, the MPI communication is
>>>>>>> extensive (half the mesh is sent back and forth), so you should
>>>>>>> honestly not expect any speedup at all. The corresponding estimate for
>>>>>>> N CPUs is (N+1)/(2N), plus MPI communication.
>>>>>>> 
>>>>>>> If you have any comments on the implementation or, even better,
>>>>>>> suggestions on how to improve it, then please let me know by following
>>>>>>> up on this mail. If you like the current solution and you’d like to see
>>>>>>> this implemented soon in dolfin, then please vote for issue 162 on
>>>>>>> Bitbucket :-)
>>>>>>> 
>>>>>>> Meanwhile, the `interpolate_nonmatching_mesh` function is available at
>>>>>>> https://github.com/mikaem/fenicstools
>>>>>>> 
>>>>>>> Best regards
>>>>>>> 
>>>>>>> Mikael
>>>>>>> 
>>>>>>> 
>>>>>> 

_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics
