Hi,
This is in reference to Issue 162
https://bitbucket.org/fenics-project/dolfin/issue/162/interpolation-nonmatching-mesh-in-parallel
Currently it is not possible in parallel to interpolate a dolfin Function u0
defined on one mesh to a Function in a FunctionSpace of the same type defined
on a different mesh. I would like to be able to do this:
    from dolfin import *

    # Two nonmatching meshes, with function spaces of the same type
    mesh0 = UnitSquareMesh(100, 100)
    mesh1 = UnitSquareMesh(20, 20)
    V0 = FunctionSpace(mesh0, "CG", 1)
    V1 = FunctionSpace(mesh1, "CG", 1)

    # Source function on the first mesh
    u0 = interpolate(Expression("x[0]*x[1]"), V0)
    u0.update()  # update ghosted dof values in parallel

    # Interpolate u0 onto V1, across nonmatching (distributed) meshes
    u = interpolate_nonmatching_mesh(u0, V1)
or, preferably, overloaded like this:

    u = interpolate(u0, V1)
Now there are probably many different ways this can be implemented. The most
obvious approach that comes to mind is simply to overload `Function::eval`
with a parallel version, but there are several reasons why this does not
quite work. Consider the current `interpolate` implementation:
    FunctionSpace::interpolate(GenericVector& expansion_coefficients,
                               const GenericFunction& v)
    …
      // Iterate over mesh and interpolate on each cell
      UFCCell ufc_cell(*_mesh);
      for (CellIterator cell(*_mesh); !cell.end(); ++cell)
      {
        // Update to current cell
        ufc_cell.update(*cell);

        // Restrict function to cell
        v.restrict(&cell_coefficients[0], *_element, *cell, ufc_cell);

        // Tabulate dofs
        const std::vector<dolfin::la_index>& cell_dofs
          = _dofmap->cell_dofs(cell->index());

        // Copy dofs to vector
        expansion_coefficients.set(&cell_coefficients[0],
                                   _dofmap->cell_dimension(cell->index()),
                                   &cell_dofs[0]);
      }
The cell loop runs on each process over the part of the V1 mesh that lives on
that process. Since V0 and V1 have different meshes, `v.restrict` calls
`restrict_as_ufc_function`, which calls `element.evaluate_dofs`, which in
turn calls `u0.eval` for each dof on the cell. The major problem that makes
this difficult to extend to parallel is that neither the cell iteration nor
`eval` is collective. Since the coordinates of a dof on a cell of the V1 mesh
may lie inside a cell of the V0 mesh owned by any other process, `eval` would
need to be called collectively, with a try/catch clause and a lot of
broadcasting (of coordinates and results) to obtain what is required.
Alternatively, each non-collective `eval` that fails must be caught and the
failed points handled with MPI afterwards.
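To make the collective option concrete, here is a minimal sketch (my own
illustration, not code from dolfin or from the attachment) of what such a
collective point evaluation would have to do, written in Python with mpi4py;
`collective_eval` is a hypothetical helper:

    from dolfin import *
    from mpi4py import MPI as mpi

    comm = mpi.COMM_WORLD

    def collective_eval(u0, x):
        """Evaluate u0 at point x, which may lie on any rank (hypothetical)."""
        try:
            value = u0(x)    # raises if x is outside the local mesh part
            owner = comm.rank
        except RuntimeError:
            value = None
            owner = -1
        # Agree on one rank that found the point, then broadcast its value
        root = comm.allreduce(owner, op=mpi.MAX)
        if root < 0:
            raise RuntimeError("Point not found on any rank")
        return comm.bcast(value, root=root)

Note that every rank must make the same number of such calls, while the cell
loop above runs a different number of iterations on each rank, and that is
exactly the problem.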
Faced with these problems, my solution has instead been to create a
stand-alone function for parallel interpolation on nonmatching meshes, rather
than trying to overload `eval` or to fit it into the current `interpolate`
framework. The code that performs the interpolation is attached to issue 162.
It is fairly well documented and should, hopefully, be easy to follow. The
algorithm is basically as follows (a condensed Python sketch is given after
the list):
1) On each process, tabulate the coordinates of all dofs in the function
space we are interpolating to, i.e., V1 in the code above:

    x = V1.dofmap()->tabulate_all_coordinates(mesh1)
2) Create a map from dof number in V1 to component number in the mixed
space. This is necessary to be able to handle mixed spaces. I do not think I
have found the best solution to this problem, so any help here would be much
appreciated. I just realized that the current solution does not work for
special spaces like BDM or RT.
3) Compute `u0.eval` for all points x from step 1). The problem here is that
u0 and u have different meshes, so a point owned by this process for u will
not necessarily, and not even likely, be found on the same process for u0.
Hence the dof coordinates must be passed around and searched on all ranks
until found, and once found, the result is returned to the process that owns
the dof. (Alternatively, but slower, `u0.eval` could simply be called
collectively.) Important for performance is that all points not found on one
process are collected in a vector and passed on to the next process through a
single MPI::send_recv: each process sends the points it has not yet found and
receives the points not found by the process with one lower rank.
4) Finally, set all values in u using the dof-to-component map from step 2).
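Here is the promised condensed sketch, for a scalar CG space only (so the
mixed-space bookkeeping of step 2) is omitted), with mpi4py handling the ring
communication. All names are illustrative, and it assumes, as the attached
code does, that the ordering of `tabulate_all_coordinates` matches the local
ordering of the dof vector; the complete implementation is the one attached
to issue 162:

    import numpy as np
    from dolfin import *
    from mpi4py import MPI as mpi

    def interpolate_nonmatching_sketch(u0, V1):
        comm = mpi.COMM_WORLD
        rank, nprocs = comm.rank, comm.size
        mesh1 = V1.mesh()

        # 1) Coordinates of all dofs owned by this rank in V1
        gdim = mesh1.geometry().dim()
        x = V1.dofmap().tabulate_all_coordinates(mesh1).reshape((-1, gdim))

        # 3) Evaluate u0 locally where possible and collect the misses
        values = np.zeros(len(x))
        found = np.zeros(len(x), dtype=bool)
        for i, xi in enumerate(x):
            try:
                values[i] = u0(xi)  # fails if xi is outside the local u0 mesh
                found[i] = True
            except RuntimeError:
                pass

        # Ring pass: ship all unresolved points to rank+1, receive from
        # rank-1, evaluate what we can, repeat; after nprocs hops the
        # package is back home carrying the results.
        miss = np.where(~found)[0]
        data = dict(ids=miss, points=x[miss], vals=np.zeros(len(miss)),
                    hit=np.zeros(len(miss), dtype=bool))
        src, dst = (rank - 1) % nprocs, (rank + 1) % nprocs
        for _ in range(nprocs):
            data = comm.sendrecv(data, dest=dst, source=src)
            for j in np.where(~data["hit"])[0]:
                try:
                    data["vals"][j] = u0(data["points"][j])
                    data["hit"][j] = True
                except RuntimeError:
                    pass

        # Results are back on the owning rank (points found nowhere stay 0)
        values[data["ids"]] = data["vals"]

        # 4) Copy the values into a new Function on V1
        u = Function(V1)
        u.vector().set_local(values)
        u.vector().apply("insert")
        return u

Note the single sendrecv per hop: the unresolved points travel together in
one message rather than one message per point, which is the point made in
step 3).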
The problem is inherently unscalable, since the efficiency depends on how the
two different meshes are distributed. If the partitions happen to overlap
well, the routine may be faster with 2 CPUs than with one, but in the end
that depends on the mesh partitioning, which you cannot really control, at
least not with the default mesh partitioner. Consider running a problem on 2
CPUs. Each dof must be searched on average 1.5 times: half are found on the
local CPU after one search, the other half on the second CPU after two. Since
each process' mesh is half the size of the original, the total search cost is
approximately 1.5 * 0.5 = 0.75. So, not counting any cost of MPI
communication, you should expect to reduce the computational time to 75% of
the serial time. However, the MPI communications are extensive (half the
points are sent back and forth), so you should honestly not expect any
speedup at all. For N CPUs the estimate is similarly (1 + 2 + ... + N) / N =
(N + 1) / 2 searches per dof, each over a mesh of size 1/N, i.e.,
(N + 1) / N / 2 of the serial cost, plus MPI communications.
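Evaluating that formula for a few process counts (search cost relative to
serial, communication excluded):

    # Search cost relative to serial, (N + 1) / (2 N), MPI costs excluded
    for N in (1, 2, 4, 8, 16):
        print("N = %2d: %.3f" % (N, (N + 1.0) / (2.0 * N)))
    # -> 1.000, 0.750, 0.625, 0.562, 0.531: the saving saturates at 1/2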
If you have any comments on the implementation or, even better, suggestions
on how to improve it, then please let me know by following up on this mail.
If you like the current solution and would like to see it implemented in
dolfin soon, then please vote for issue 162 on bitbucket :-)
Meanwhile, the `interpolate_nonmatching_mesh` function is available on
https://github.com/mikaem/fenicstools
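For reference, usage along the lines of the first snippet in this mail looks
like this (assuming fenicstools is on the PYTHONPATH and exports the function
under the name used above):

    from dolfin import *
    from fenicstools import interpolate_nonmatching_mesh

    mesh0 = UnitSquareMesh(100, 100)
    mesh1 = UnitSquareMesh(20, 20)
    u0 = interpolate(Expression("x[0]*x[1]"), FunctionSpace(mesh0, "CG", 1))
    u = interpolate_nonmatching_mesh(u0, FunctionSpace(mesh1, "CG", 1))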
Best regards
Mikael