On Sun, May 18, 2008 at 08:46:11PM +0200, Murtazo Nazarov wrote:

> > On Sat, May 17, 2008 at 04:40:48PM +0200, Johan Hoffman wrote:
> >
> >> If the PETSc sparse matrix structure works as I expect, then to insert/add an element you go to the particular row and search all non-zero entries of that row until you find your column. So the complexity would be #dofs x #row non-zero entries.
> >>
> >> With a block matrix you can optimize this by just searching for the first entry of the block.
> >>
> >> I am surprised that "add" should totally dominate the assembly algorithm. Is this reasonable?
> >
> > It's remarkable but it certainly isn't any surprise, see here for example:
> >
> >   http://simula.no/research/scientific/publications/Simula.SC.46
> >
> > This wasn't always the case. We've worked quite hard over the last years to speed up the computation of the element matrix (tabulate_tensor), to speed up the computation of the dof map (tabulate_dofs) and to speed up the time it takes to access geometry (the new Mesh classes). So if the one remaining step (A.add()) now dominates, it means we have been successful. It also means that to gain more, we need to improve on A.add().
>
> Good! It seems FFC does a pretty good job, since only ~15% of the assembly time is spent in tabulate_tensor and tabulate_dofs. The most time, ~70%, is taken by A.add().
>
> > But we also need to remember that
> >
> > 1. Solve time may dominate assembly anyway, so that's where we should optimize.
> >
> > 2. Assembling the action instead of the operator removes the A.add() bottleneck.
> >
> > As mentioned before, we are experimenting with iterating locally over cells sharing common dofs and combining batches of element tensors before inserting into the global sparse matrix row by row. Let's see how it goes.
> >
> > Some other comments:
> >
> > Murtazo: It's interesting that femLego is that much faster. It would be interesting to see exactly why it is faster. Can you take a simple example (maybe even Poisson) and profile both femLego and DOLFIN on the same mesh and get detailed results for tabulate_tensor, tabulate_dofs and A.add()? If femLego is faster on A.add(), then what linear algebra backend is it using?
>
> Yes, the test we did is a simple 2D Poisson problem on a unit square mesh, and the assembly in FemLego is 3 times faster, because A.add() is done in the way I described in previous mails. The linear algebra package is AZTEC. Perhaps DOLFIN should be much faster than FemLego if A.add() is the same, since FFC is much faster than a quadrature rule.
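For concreteness, the kind of row search Johan describes at the top of this message (and that shows up as std::lower_bound in Dag's profiles further down) might look roughly like the following minimal sketch. The names are invented and this is not the actual uBLAS/DOLFIN code; it assumes the sparsity pattern has already been preallocated so that every (i, j) entry exists:

  // Minimal sketch of adding one entry into a CSR-like sparse matrix:
  // the cost per entry is a search over the non-zeros of that row,
  // repeated n^2 times per cell during assembly. Invented names.
  #include <algorithm>
  #include <cstddef>
  #include <vector>

  struct CSRMatrix
  {
    std::vector<std::size_t> row_ptr;  // size nrows + 1
    std::vector<std::size_t> columns;  // size nnz, sorted within each row
    std::vector<double> values;        // size nnz

    void add(std::size_t i, std::size_t j, double value)
    {
      // Binary search over the columns of row i; this is where a
      // profile would show std::lower_bound.
      std::vector<std::size_t>::iterator first = columns.begin() + row_ptr[i];
      std::vector<std::size_t>::iterator last  = columns.begin() + row_ptr[i + 1];
      std::vector<std::size_t>::iterator it    = std::lower_bound(first, last, j);
      values[it - columns.begin()] += value;   // entry (i, j) assumed to exist
    }
  };

Whether the search is linear or binary, every A.add() pays a per-entry lookup that tabulate_tensor and tabulate_dofs do not, which is why it can dominate once those have been optimized.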
I thought AZTEC was just solvers. I mean what sparse matrix format is used, and what does the interface look like for communicating the exact position for insertion?

> > Murtazo: It seems you suggest we should basically store some kind of index/pointer for where to write directly to memory when adding the entries. This has two problems: (i) we need to store quite a few of these indices (n^2*num_cells, where n is the local space dimension), and (ii) it requires cooperation from the linear algebra backend. We would need to ask PETSc to tell us where in its private data structures it inserted the entries.
>
> Yes, maybe there is a better way to do it. If we store the global indices of A it will be in total A.nz()*numVertices*num_components, but still we will have a speedup, which is more important in some cases.

That doesn't look correct to me. I think we would need n^2*num_cells. We iterate over the cells, and for each cell we have n^2 entries and we need to know where to insert each one of those.

--
Anders

> /murtazo
>
> > Dag: A test suite/benchmark would be very good to have. We need to have something in bench/fem/assembly that we can trust and run regularly to make sure we don't have regressions.
> >
> > Jed: You seem to know your way around PETSc, so any specific recommendations regarding the initialization of matrices in PETScMatrix.cpp are appreciated.
> >
> >> Dag: the first assembly should typically involve an initialization (at least for a non-linear problem); is this part of your test? If so, I think it is strange that the first pass and the reassembly do not differ more (with the reassembly being faster).
> >>
> >> For a PETSc matrix I do not recognize the std::lower_bound; is this uBLAS-specific?
> >>
> >> /Johan
> >>
> >> > It seems this thread got a bit derailed yesterday :(
> >> >
> >> > I've done some more careful profiling:
> >> >   *) Full assemble once
> >> >   *) Assemble matrix 30 times without reset, in order to amortize the time for initialization.
> >> >
> >> > The call graph shows that std::lower_bound is called from add:
> >> >   dolfin::GenericMatrix::add ->
> >> >     dolfin::uBlasMatrix<>::add ->
> >> >       std::lower_bound
> >> >
> >> > In this assembly add+children takes 89% of the time, with tabulate_tensor taking roughly 9%. Full (gprof) profile attached. Murtazo is probably right that the performance numbers are virtually the same with PETSc. I will hook it up and try (and let you know if this is not the case).
> >> >
> >> > I do appreciate the complexity of inserting elements into a sparse matrix, and I do _not_ claim to know better when it comes to the assembler architecture.
> >> >
> >> > Still, as I vary the size of the mesh I get this performance metric virtually constant:
> >> >   Assembled 7.3e+05 non-zero matrix elements per second (first pass)
> >> >   Assembled 1.4e+06 non-zero matrix elements per second (re-assemble)
> >> >
> >> > Is this a sensible metric? If so, is it well understood how the DOLFIN assembler performs? If not, I could put together a test suite for a range of forms (2D/3D, simple/mixed element, simple/complicated expressions in the form, etc).
> >> >
> >> > To restate my question: How should I verify that the assembler is performing as expected here? Am I looking at some unexpected overhead in this assembly (we all know how hard this can be to spot with C++)?
> >> > Thanks!
> >> > /Dag
> >> >
> >> > Garth N. Wells wrote:
> >> >>
> >> >> Anders Logg wrote:
> >> >>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:
> >> >>>>> Hello!
> >> >>>>>
> >> >>>>> I'm looking at a "suspiciously slow" assembly and would like to determine what is going on. In general, what should one expect the most time-consuming step to be?
> >> >>>>>
> >> >>>>> This is what my gprof looks like:
> >> >>>>>
> >> >>>>>   Time:
> >> >>>>>   61.97%  unsigned int const* std::lower_bound
> >> >>>>>   25.84%  dolfin::uBlasMatrix<...>::add
> >> >>>>>    8.27%  UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
> >> >>>>>    1.1%   dolfin::uBlasMatrix<...>::init
> >> >>>
> >> >>> Where is lower_bound used? From within uBlasMatrix::add or is it in building the sparsity pattern?
> >> >>
> >> >> I suspect that it's either in building the sparsity pattern or initialising the uBLAS matrix. The matrix structure is initialised by running across rows and inserting a zero. uBLAS doesn't provide a mechanism for initialising the underlying data structures directly for a sparse matrix.
> >> >>
> >> >> Dag: could you run the same test using PETSc as the backend?
> >> >>
> >> >>>> I got these numbers also. I understand that it is very painful in large computations.
> >> >>>>
> >> >>>> I see what the problem is with adding into the stiffness matrix A. Searching for the position of the element which needs to be added takes a very long time, especially if you are solving big problems with thousands of unknowns and repeating the assembly a lot of times!
> >> >>>
> >> >>> If you know a good way to avoid inserting entries into a sparse matrix during assembly, please tell me... :-)
> >> >>>
> >> >>> If the assembly is costly, you might want to try assembling the action of it instead and send that to a Krylov solver. Inserting into a vector is much easier than into a sparse matrix.
> >> >>>
> >> >>>> One way could be finding the global indices of the matrix A once and using them in the assembly process. This way we avoid searching for the element position, which makes the process significantly faster. But there is a problem: somehow I cannot get access to the global index of a cell in A and change it directly instead of using MatSetValues (in PETSc).
> >> >>>
> >> >>> I don't understand what you suggest here. We do precompute the sparsity pattern of the matrix and use that to preallocate, but I don't know of any other way to insert entries than MatSetValues.
> >> >>
> >> >> I doubt insertion is the real problem, especially as Dag noted that subsequent assembly operations take only half the time since the matrix is already initialised.
> >> >>
> >> >> PETSc (and no doubt Trilinos) do offer some assembly possibilities that we haven't yet exploited because they require a reorganisation of the dof map.
> >> >>
> >> >> Garth
> >> >>
> >> >>>> I am pretty sure that we may speed up the A.set() and A.get() processes as well by the above method.
> >> >>>
> >> >>> Please explain.
> >> >>>
> >> >>>> I am not sure how the dof map used to get the row and column indices of the cells is implemented. We could avoid repeating this operation as well.
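As a rough sketch of the idea Murtazo describes above (find the positions in A once, then write directly), and of the storage Anders points out it needs (one offset per local entry, i.e. n^2 per cell and n^2*num_cells in total), something like the following could work. The names are invented, and it assumes the backend exposes its packed value array, which is exactly the cooperation problem mentioned earlier in this thread:

  // Sketch only: precompute, once, where each local entry (i, j) of each
  // cell lands in the backend's packed value array (n^2 offsets per cell,
  // n^2 * num_cells in total), then make reassembly a direct write with
  // no search. Assumes a fixed sparsity pattern and access to the raw
  // storage, which PETSc does not expose publicly.
  #include <cstddef>
  #include <vector>

  struct CellInsertionPlan
  {
    std::vector<std::size_t> offsets;  // length n * n for a cell with n local dofs
  };

  void add_cell_tensor(std::vector<double>& values,     // backend's non-zero values
                       const CellInsertionPlan& plan,   // precomputed for this cell
                       const double* element_tensor,    // n x n, row-major
                       std::size_t n)
  {
    // No row search: each local entry goes straight to its stored position.
    for (std::size_t k = 0; k < n * n; ++k)
      values[plan.offsets[k]] += element_tensor[k];
  }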
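For comparison, a minimal sketch of the standard path Anders refers to: preallocate from the precomputed sparsity pattern, then insert each element tensor with a single MatSetValues call. This is not DOLFIN's PETScMatrix.cpp, just the bare PETSc calls; PetscInitialize is assumed to have been called and error codes are ignored for brevity:

  #include <petscmat.h>

  // Preallocate a sequential AIJ matrix from a precomputed sparsity
  // pattern (number of non-zeros per row), so no reallocation is needed
  // during assembly.
  Mat create_preallocated_matrix(PetscInt size, const PetscInt* nnz_per_row)
  {
    Mat A;
    MatCreateSeqAIJ(PETSC_COMM_SELF, size, size, 0, nnz_per_row, &A);
    return A;
  }

  // Insert one element tensor with a single call; PETSc locates the
  // entries from the row/column dof indices.
  void add_element_tensor(Mat A, PetscInt n,
                          const PetscInt* rows, const PetscInt* cols,
                          const PetscScalar* block)  // n x n, row-major
  {
    MatSetValues(A, n, rows, n, cols, block, ADD_VALUES);
  }

  // Finalize before using the matrix in a solver.
  void finalize_matrix(Mat A)
  {
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  }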
> >> >>> This is already implemented (but maybe not used). DofMap handles this. It wraps the generated ufc::dof_map code and may pretabulate (and possibly reorder) the dofs.
> >> >>>
> >> >>>> We did some comparisons with another free FEM toolbox, FemLego; the assembly process in DOLFIN is 3 times slower than in FemLego in 2D. I believe this number will increase in 3D. FemLego uses a quadrature rule for computing integrals.
> >> >>>
> >> >>> Can you benchmark the various parts of the assembly to see what causes the slowdown:
> >> >>>
> >> >>> 1. Is it tabulate_tensor?
> >> >>> 2. Is it tabulate_dofs?
> >> >>> 3. Is it A.add()?
> >> >>> 4. Something else?
> >> >>>
> >> >>>> I hope some PETSc guys will help us to make these improvements. Any other ideas are welcome!
> >> >>>
> >> >>> We are currently experimenting with collecting and preprocessing batches of entries before inserting into the global sparse matrix in the hope of speeding up the assembly, but we don't have any results yet.
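As a minimal sketch of the per-part benchmarking asked for above, one can accumulate the time spent in each phase separately over all cells. The three calls inside the loop are left as placeholder comments, since they stand for the generated UFC code and the linear algebra backend:

  #include <cstddef>
  #include <ctime>

  // Accumulated CPU time per assembly phase, in seconds.
  struct PhaseTimes { double tabulate_dofs; double tabulate_tensor; double add; };

  double seconds_since(std::clock_t t0)
  {
    return double(std::clock() - t0) / CLOCKS_PER_SEC;
  }

  void assemble_with_timings(std::size_t num_cells, PhaseTimes& t)
  {
    t.tabulate_dofs = t.tabulate_tensor = t.add = 0.0;
    for (std::size_t c = 0; c < num_cells; ++c)
    {
      std::clock_t t0 = std::clock();
      // dof_map.tabulate_dofs(...);      // placeholder: generated dof map
      t.tabulate_dofs += seconds_since(t0);

      t0 = std::clock();
      // integral.tabulate_tensor(...);   // placeholder: generated element tensor
      t.tabulate_tensor += seconds_since(t0);

      t0 = std::clock();
      // A.add(...);                      // placeholder: backend insertion
      t.add += seconds_since(t0);
    }
  }

Dividing the number of non-zero entries assembled by the total time then gives the "non-zero matrix elements per second" metric Dag reports above.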
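Finally, a sketch of the "assemble the action" alternative mentioned a couple of times in this thread: instead of inserting element tensors into a sparse matrix, each element tensor is applied to the local part of x and the result accumulated into a global vector y, which a Krylov solver can use as the action of A. The names are invented; in practice tabulate_dofs and tabulate_tensor would supply the per-cell data on the fly rather than from stored arrays:

  #include <cstddef>
  #include <vector>

  // Compute y = A*x cell by cell, without ever forming the sparse matrix.
  // Inserting into a vector needs no sparsity-pattern search.
  void assemble_action(std::vector<double>& y,             // global result, size N, zeroed by caller
                       const std::vector<double>& x,       // global vector, size N
                       std::size_t num_cells, std::size_t n,
                       const std::vector<std::vector<std::size_t> >& cell_dofs,  // per-cell dof indices
                       const std::vector<std::vector<double> >& cell_tensors)    // per-cell n x n tensors, row-major
  {
    for (std::size_t c = 0; c < num_cells; ++c)
    {
      const std::vector<std::size_t>& dofs = cell_dofs[c];
      const std::vector<double>& Ae = cell_tensors[c];
      for (std::size_t i = 0; i < n; ++i)
      {
        double yi = 0.0;
        for (std::size_t j = 0; j < n; ++j)
          yi += Ae[i*n + j] * x[dofs[j]];
        y[dofs[i]] += yi;
      }
    }
  }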
