If the petsc sparse matrix structure works as I expect; to insert/add an element you go to the particular row and search all non-zero entries of that row until you find your column. So the complexity would be = #dofs x #row non-zero entries
With a block matrix you can optimize this by just searching for the first entry of the block. I am surprised that "add" should totally dominate the assembly algorithm. Is this reasonable? Dag: the first assembly should typically involve an initialization (at least for a non-linear problem), is this part of you test? If so, I think it is strange that the difference between the first pass and the reassemble do not differ (with a faster reassemble). For a PETSc matrix I do not recognize the std::lower_bound, is this ublas specific? /Johan > It seems this thread got a bit derailed yesterday :( > > I've done some more careful profiling: > *) Full assemble once > *) Assemble matrix 30 times without reset > in order to amortize the time for initialization. > > The call graph shows that std::lower_bound is called from add: > dolfin::GenericMatrix::add -> > dolfin::uBlasMatrix<>:add -> > std::lower_bound > > In this assembly add+children takes 89% of the time, and tabulate_tensor > taking roughly 9%. Full (gprof) profile attached. Murtazo is probably > right that the performance numbers are virtually the same with PETSc. I > will hook it up and try (and let you know if this is not the case). > > I do appreciate the complexity of inserting elements into a sparse > matrix, and I do _not_ claim to know better when it comes to the > assembler architecture. > > Still, as I vary the size of the mesh I get this performance metric > virtually constant: > Assembled 7.3e+05 non-zero matrix elements per second (first pass) > Assembled 1.4e+06 non-zero matrix elements per second (re-assemble). > > Is this a sensible metric? If so, is it well understood how the DOLFIN > assembler performs? If not, I could put together a test-suite for a > range of forms (2/3D, simple/mixed element, simple/complicated > expressions in the form etc). > > To restate my question: How should I verify that the assembler is > performing as expected here? Am I looking at some unexpected overhead in > this assembly (we all know how hard this can be to spot with C++)? > > Thanks! > /Dag > > Garth N. Wells wrote: >> >> Anders Logg wrote: >>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote: >>>>> Hello! >>>>> >>>>> I'm looking at a "suspiciously slow" assembly and would like to >>>>> determine what is going on. In general, what should one expect the >>>>> most >>>>> time-consuming step to be? >>>>> >>>>> This is what my gprof looks like: >>>>> >>>>> Time: >>>>> 61.97% unsigned int const* std::lower_bound >>>>> 25.84% dolfin::uBlasMatrix<...>::add >>>>> 8.27% >>>>> UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor >>>>> 1.1% dolfin::uBlasMatrix<...>::init >>> Where is lower_bound used? From within uBlasMatrix::add or is it in >>> building the sparsity pattern? >>> >> >> I suspect that it's either in building the sparsity pattern or >> initialising the uBLAS matrix. The matrix structure is initialised by >> running across rows and inserting a zero. uBLAS doesn't provide a >> mechanism for initialising the underlying data structures directly for a >> sparse matrix. >> >> Dag: could you you run the same test using PETSc as the backend? >> >>>> I got these numbers also. I understand that it is very painful in >>>> large >>>> computations. >>>> >>>> I see what is a problem with adding into the stiffness matrix A. >>>> Searching >>>> the position of the element which needs to be added takes very long >>>> time, >>>> especially if you are solving big problems with thousands unknowns and >>>> repeating the assembling a lot of times! >>> If you know a good way to avoid inserting entries into a sparse matrix >>> during assembly, please tell me... :-) >>> >>> If the assembly is costly, you might want to try assembling the action >>> of it instead and send that to a Krylov solver. Inserting into a >>> vector is much easier than into a sparse matrix. >>> >>>> One way could be finding the global indices of the matrix A once, and >>>> use >>>> it in the assembly process. By this way we avoid of searching the >>>> element >>>> position and it makes the process significantly fast. But, there is a >>>> problem: somehow I cannot get access to the global index of cell in >>>> the A >>>> and change it instead of using MatSetValues (in PETSc). >>> I don't understand what you suggest here. We do precompute the >>> sparsity pattern of the matrix and use that to preallocate, but I >>> don't know of any other way to insert entries than MatSetValues. >>> >> >> I doubt insertion is the real problem, especially as Dag noted that >> subsequent assembly operations take only half the time since the matrix >> is already initialised. >> >> PETSc (and no doubt Trilinos) do offer some assembly possibilities that >> we haven't yet exploited because they require a reorganisation of the >> dof map. >> >> Garth >> >>>> I am pretty sure that we may speed up the A.set() and A.get() >>>> processes as >>>> well by the above method. >>> Please explain. >>> >>>> I am not sure how the dofmap to get rows and cols indices of the cells >>>> is >>>> implemented. We could avoid repeating this operation as well. >>> This is already implemented (but maybe not used). DofMap handles this. >>> It wraps the generated ufc::dof_map code and may pretabulate (and >>> possibly reorder) the dofs. >>> >>>> We did some comparison with another free fem toolbox, FemLego, the >>>> assembly process in Dolfin is 3 times slower than FemLego in 2D. I >>>> believe >>>> this number will increase in 3D. FemLego uses quadrature rule for >>>> computing integrals. >>> Can you benchmark the various parts of the assembly to see what causes >>> the slowdown: >>> >>> 1. Is it tabulate_tensor? >>> 2. Is it tabulate_dofs? >>> 3. Is it A.add()? >>> 4. Something else? >>> >>>> I hope some PETSc guys will help us to do this improvements. Any other >>>> ideas are welcome! >>> We are currently experimenting with collecting and preprocessing >>> batches of entries before inserting into the global sparse matrix in >>> hope of speeding up the assembly but we don't have any results yet. >>> >> _______________________________________________ >> DOLFIN-dev mailing list >> [email protected] >> http://www.fenics.org/mailman/listinfo/dolfin-dev > _______________________________________________ > DOLFIN-dev mailing list > [email protected] > http://www.fenics.org/mailman/listinfo/dolfin-dev > _______________________________________________ DOLFIN-dev mailing list [email protected] http://www.fenics.org/mailman/listinfo/dolfin-dev
