On Sun, May 18, 2008 at 08:46:11PM +0200, Murtazo Nazarov wrote:

> > On Sat, May 17, 2008 at 04:40:48PM +0200, Johan Hoffman wrote:
> >
> >> If the PETSc sparse matrix structure works as I expect, then to insert/add an element you go to the particular row and search all non-zero entries of that row until you find your column. So the complexity would be #dofs x #row non-zero entries.
> >>
> >> With a block matrix you can optimize this by just searching for the first entry of the block.
> >>
> >> I am surprised that "add" should totally dominate the assembly algorithm. Is this reasonable?
> >
> > It's remarkable but it certainly isn't any surprise, see here for example:
> >
> >   http://simula.no/research/scientific/publications/Simula.SC.46
> >
> > This wasn't always the case. We've worked quite hard over the last years to speed up the computation of the element matrix (tabulate_tensor), to speed up the computation of the dof map (tabulate_dofs) and to speed up the time it takes to access geometry (the new Mesh classes). So if the one remaining step (A.add()) now dominates, it means we have been successful. It also means that to gain more, we need to improve on A.add().
>
> Good! It seems FFC does a pretty good job, since only ~15% of the assembly time is spent in tabulate_tensor and tabulate_dofs. The most time, ~70%, is taken by A.add().
>
> > But we also need to remember that
> >
> > 1. Solve time may dominate assembly anyway, so that's where we should optimize.
> >
> > 2. Assembling the action instead of the operator removes the A.add() bottleneck.
> >
> > As mentioned before, we are experimenting with iterating locally over cells sharing common dofs and combining batches of element tensors before inserting into the global sparse matrix row by row. Let's see how it goes.
> >
> > Some other comments:
> >
> > Murtazo: It's interesting that femLego is that much faster. It would be interesting to see exactly why it is faster. Can you take a simple example (maybe even Poisson) and profile both femLego and DOLFIN on the same mesh and get detailed results for tabulate_tensor, tabulate_dofs and A.add()? If femLego is faster on A.add(), then what linear algebra backend is it using?
>
> Yes, the test we did is a simple 2D Poisson problem on a unit square mesh, and the assembly in FemLego is 3 times faster, because A.add() is done in the way I described in previous mails. The linear algebra package is AZTEC. Perhaps DOLFIN should be much faster than FemLego if A.add() is the same, since FFC is much faster than a quadrature rule.
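For concreteness, the kind of row search Johan describes at the top of this message (and that shows up as std::lower_bound in Dag's profiles further down) might look roughly like the following minimal sketch. The names are invented and this is not the actual uBLAS/DOLFIN code; it assumes the sparsity pattern has already been preallocated so that every (i, j) entry exists:

  // Minimal sketch of adding one entry into a CSR-like sparse matrix:
  // the cost per entry is a search over the non-zeros of that row,
  // repeated n^2 times per cell during assembly. Invented names.
  #include <algorithm>
  #include <cstddef>
  #include <vector>

  struct CSRMatrix
  {
    std::vector<std::size_t> row_ptr;  // size nrows + 1
    std::vector<std::size_t> columns;  // size nnz, sorted within each row
    std::vector<double> values;        // size nnz

    void add(std::size_t i, std::size_t j, double value)
    {
      // Binary search over the columns of row i; this is where a
      // profile would show std::lower_bound.
      std::vector<std::size_t>::iterator first = columns.begin() + row_ptr[i];
      std::vector<std::size_t>::iterator last  = columns.begin() + row_ptr[i + 1];
      std::vector<std::size_t>::iterator it    = std::lower_bound(first, last, j);
      values[it - columns.begin()] += value;   // entry (i, j) assumed to exist
    }
  };

Whether the search is linear or binary, every A.add() pays a per-entry lookup that tabulate_tensor and tabulate_dofs do not, which is why it can dominate once those have been optimized.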
I thought AZTEC was just solvers. I mean what sparse matrix format is used, and what does the interface look like for communicating the exact position for insertion?

> > Murtazo: It seems you suggest we should basically store some kind of index/pointer for where to write directly to memory when adding the entries. This has two problems: (i) we need to store quite a few of these indices (n^2*num_cells, where n is the local space dimension), and (ii) it requires cooperation from the linear algebra backend. We would need to ask PETSc to tell us where in its private data structures it inserted the entries.
>
> Yes, maybe there is a better way to do it. If we store the global indices of A it will be in total A.nz()*numVertices*num_components, but still we will have a speedup, which is more important in some cases.

That doesn't look correct to me. I think we would need n^2*num_cells. We iterate over the cells, and for each cell we have n^2 entries and we need to know where to insert each one of those.

--
Anders

> /murtazo
>
> > Dag: A test suite/benchmark would be very good to have. We need to have something in bench/fem/assembly that we can trust and run regularly to make sure we don't have regressions.
> >
> > Jed: You seem to know your way around PETSc, so any specific recommendations regarding the initialization of matrices in PETScMatrix.cpp are appreciated.
> >
> >> Dag: the first assembly should typically involve an initialization (at least for a non-linear problem); is this part of your test? If so, I think it is strange that the first pass and the reassembly do not differ more (with the reassembly being faster).
> >>
> >> For a PETSc matrix I do not recognize the std::lower_bound; is this uBLAS-specific?
> >>
> >> /Johan
> >>
> >> > It seems this thread got a bit derailed yesterday :(
> >> >
> >> > I've done some more careful profiling:
> >> >   *) Full assemble once
> >> >   *) Assemble matrix 30 times without reset, in order to amortize the time for initialization.
> >> >
> >> > The call graph shows that std::lower_bound is called from add:
> >> >   dolfin::GenericMatrix::add ->
> >> >     dolfin::uBlasMatrix<>::add ->
> >> >       std::lower_bound
> >> >
> >> > In this assembly add+children takes 89% of the time, with tabulate_tensor taking roughly 9%. Full (gprof) profile attached. Murtazo is probably right that the performance numbers are virtually the same with PETSc. I will hook it up and try (and let you know if this is not the case).
> >> >
> >> > I do appreciate the complexity of inserting elements into a sparse matrix, and I do _not_ claim to know better when it comes to the assembler architecture.
> >> >
> >> > Still, as I vary the size of the mesh I get this performance metric virtually constant:
> >> >   Assembled 7.3e+05 non-zero matrix elements per second (first pass)
> >> >   Assembled 1.4e+06 non-zero matrix elements per second (re-assemble)
> >> >
> >> > Is this a sensible metric? If so, is it well understood how the DOLFIN assembler performs? If not, I could put together a test suite for a range of forms (2D/3D, simple/mixed element, simple/complicated expressions in the form, etc).
> >> >
> >> > To restate my question: How should I verify that the assembler is performing as expected here? Am I looking at some unexpected overhead in this assembly (we all know how hard this can be to spot with C++)?
> >> > Thanks!
> >> > /Dag
> >> >
> >> > Garth N. Wells wrote:
> >> >>
> >> >> Anders Logg wrote:
> >> >>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:
> >> >>>>> Hello!
> >> >>>>>
> >> >>>>> I'm looking at a "suspiciously slow" assembly and would like to determine what is going on. In general, what should one expect the most time-consuming step to be?
> >> >>>>>
> >> >>>>> This is what my gprof looks like:
> >> >>>>>
> >> >>>>>   Time:
> >> >>>>>   61.97%  unsigned int const* std::lower_bound
> >> >>>>>   25.84%  dolfin::uBlasMatrix<...>::add
> >> >>>>>    8.27%  UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
> >> >>>>>    1.1%   dolfin::uBlasMatrix<...>::init
> >> >>>
> >> >>> Where is lower_bound used? From within uBlasMatrix::add or is it in building the sparsity pattern?
> >> >>
> >> >> I suspect that it's either in building the sparsity pattern or initialising the uBLAS matrix. The matrix structure is initialised by running across rows and inserting a zero. uBLAS doesn't provide a mechanism for initialising the underlying data structures directly for a sparse matrix.
> >> >>
> >> >> Dag: could you run the same test using PETSc as the backend?
> >> >>
> >> >>>> I got these numbers also. I understand that it is very painful in large computations.
> >> >>>>
> >> >>>> I see what the problem is with adding into the stiffness matrix A. Searching for the position of the element which needs to be added takes a very long time, especially if you are solving big problems with thousands of unknowns and repeating the assembly a lot of times!
> >> >>>
> >> >>> If you know a good way to avoid inserting entries into a sparse matrix during assembly, please tell me... :-)
> >> >>>
> >> >>> If the assembly is costly, you might want to try assembling the action of it instead and send that to a Krylov solver. Inserting into a vector is much easier than into a sparse matrix.
> >> >>>
> >> >>>> One way could be finding the global indices of the matrix A once and using them in the assembly process. This way we avoid searching for the element position, which makes the process significantly faster. But there is a problem: somehow I cannot get access to the global index of a cell in A and change it directly instead of using MatSetValues (in PETSc).
> >> >>>
> >> >>> I don't understand what you suggest here. We do precompute the sparsity pattern of the matrix and use that to preallocate, but I don't know of any other way to insert entries than MatSetValues.
> >> >>
> >> >> I doubt insertion is the real problem, especially as Dag noted that subsequent assembly operations take only half the time since the matrix is already initialised.
> >> >>
> >> >> PETSc (and no doubt Trilinos) do offer some assembly possibilities that we haven't yet exploited because they require a reorganisation of the dof map.
> >> >>
> >> >> Garth
> >> >>
> >> >>>> I am pretty sure that we may speed up the A.set() and A.get() processes as well by the above method.
> >> >>>
> >> >>> Please explain.
> >> >>>
> >> >>>> I am not sure how the dof map used to get the row and column indices of the cells is implemented. We could avoid repeating this operation as well.
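As a rough sketch of the idea Murtazo describes above (find the positions in A once, then write directly), and of the storage Anders points out it needs (one offset per local entry, i.e. n^2 per cell and n^2*num_cells in total), something like the following could work. The names are invented, and it assumes the backend exposes its packed value array, which is exactly the cooperation problem mentioned earlier in this thread:

  // Sketch only: precompute, once, where each local entry (i, j) of each
  // cell lands in the backend's packed value array (n^2 offsets per cell,
  // n^2 * num_cells in total), then make reassembly a direct write with
  // no search. Assumes a fixed sparsity pattern and access to the raw
  // storage, which PETSc does not expose publicly.
  #include <cstddef>
  #include <vector>

  struct CellInsertionPlan
  {
    std::vector<std::size_t> offsets;  // length n * n for a cell with n local dofs
  };

  void add_cell_tensor(std::vector<double>& values,     // backend's non-zero values
                       const CellInsertionPlan& plan,   // precomputed for this cell
                       const double* element_tensor,    // n x n, row-major
                       std::size_t n)
  {
    // No row search: each local entry goes straight to its stored position.
    for (std::size_t k = 0; k < n * n; ++k)
      values[plan.offsets[k]] += element_tensor[k];
  }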
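For comparison, a minimal sketch of the standard path Anders refers to: preallocate from the precomputed sparsity pattern, then insert each element tensor with a single MatSetValues call. This is not DOLFIN's PETScMatrix.cpp, just the bare PETSc calls; PetscInitialize is assumed to have been called and error codes are ignored for brevity:

  #include <petscmat.h>

  // Preallocate a sequential AIJ matrix from a precomputed sparsity
  // pattern (number of non-zeros per row), so no reallocation is needed
  // during assembly.
  Mat create_preallocated_matrix(PetscInt size, const PetscInt* nnz_per_row)
  {
    Mat A;
    MatCreateSeqAIJ(PETSC_COMM_SELF, size, size, 0, nnz_per_row, &A);
    return A;
  }

  // Insert one element tensor with a single call; PETSc locates the
  // entries from the row/column dof indices.
  void add_element_tensor(Mat A, PetscInt n,
                          const PetscInt* rows, const PetscInt* cols,
                          const PetscScalar* block)  // n x n, row-major
  {
    MatSetValues(A, n, rows, n, cols, block, ADD_VALUES);
  }

  // Finalize before using the matrix in a solver.
  void finalize_matrix(Mat A)
  {
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  }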
> >> >>> This is already implemented (but maybe not used). DofMap handles this. It wraps the generated ufc::dof_map code and may pretabulate (and possibly reorder) the dofs.
> >> >>>
> >> >>>> We did some comparisons with another free FEM toolbox, FemLego; the assembly process in DOLFIN is 3 times slower than in FemLego in 2D. I believe this number will increase in 3D. FemLego uses a quadrature rule for computing integrals.
> >> >>>
> >> >>> Can you benchmark the various parts of the assembly to see what causes the slowdown:
> >> >>>
> >> >>> 1. Is it tabulate_tensor?
> >> >>> 2. Is it tabulate_dofs?
> >> >>> 3. Is it A.add()?
> >> >>> 4. Something else?
> >> >>>
> >> >>>> I hope some PETSc guys will help us to make these improvements. Any other ideas are welcome!
> >> >>>
> >> >>> We are currently experimenting with collecting and preprocessing batches of entries before inserting into the global sparse matrix in the hope of speeding up the assembly, but we don't have any results yet.
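As a minimal sketch of the per-part benchmarking asked for above, one can accumulate the time spent in each phase separately over all cells. The three calls inside the loop are left as placeholder comments, since they stand for the generated UFC code and the linear algebra backend:

  #include <cstddef>
  #include <ctime>

  // Accumulated CPU time per assembly phase, in seconds.
  struct PhaseTimes { double tabulate_dofs; double tabulate_tensor; double add; };

  double seconds_since(std::clock_t t0)
  {
    return double(std::clock() - t0) / CLOCKS_PER_SEC;
  }

  void assemble_with_timings(std::size_t num_cells, PhaseTimes& t)
  {
    t.tabulate_dofs = t.tabulate_tensor = t.add = 0.0;
    for (std::size_t c = 0; c < num_cells; ++c)
    {
      std::clock_t t0 = std::clock();
      // dof_map.tabulate_dofs(...);      // placeholder: generated dof map
      t.tabulate_dofs += seconds_since(t0);

      t0 = std::clock();
      // integral.tabulate_tensor(...);   // placeholder: generated element tensor
      t.tabulate_tensor += seconds_since(t0);

      t0 = std::clock();
      // A.add(...);                      // placeholder: backend insertion
      t.add += seconds_since(t0);
    }
  }

Dividing the number of non-zero entries assembled by the total time then gives the "non-zero matrix elements per second" metric Dag reports above.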
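Finally, a sketch of the "assemble the action" alternative mentioned a couple of times in this thread: instead of inserting element tensors into a sparse matrix, each element tensor is applied to the local part of x and the result accumulated into a global vector y, which a Krylov solver can use as the action of A. The names are invented; in practice tabulate_dofs and tabulate_tensor would supply the per-cell data on the fly rather than from stored arrays:

  #include <cstddef>
  #include <vector>

  // Compute y = A*x cell by cell, without ever forming the sparse matrix.
  // Inserting into a vector needs no sparsity-pattern search.
  void assemble_action(std::vector<double>& y,             // global result, size N, zeroed by caller
                       const std::vector<double>& x,       // global vector, size N
                       std::size_t num_cells, std::size_t n,
                       const std::vector<std::vector<std::size_t> >& cell_dofs,  // per-cell dof indices
                       const std::vector<std::vector<double> >& cell_tensors)    // per-cell n x n tensors, row-major
  {
    for (std::size_t c = 0; c < num_cells; ++c)
    {
      const std::vector<std::size_t>& dofs = cell_dofs[c];
      const std::vector<double>& Ae = cell_tensors[c];
      for (std::size_t i = 0; i < n; ++i)
      {
        double yi = 0.0;
        for (std::size_t j = 0; j < n; ++j)
          yi += Ae[i*n + j] * x[dofs[j]];
        y[dofs[i]] += yi;
      }
    }
  }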
