Hi Wolfgang,

Thank you for your reply.

The reason the solver takes 90% of the run time is that my problem allows me to perform the assembly on the domain once, up front, and to assemble only the contributions from the boundary at each time step. (This is also why I can perform the LU decomposition of the matrix up front for this model.) With the one-off LU decomposition, the solver takes about 50% of the run time, which is what you suggest an ILU or AMG preconditioner might also allow. Yes, my problem is around 90,000 DoFs in 2D, so UMFPACK is ideal, but I wanted to show results that compare the other alternatives.

Thanks for your comments on the SSOR implementations in deal.II and Trilinos. It certainly seems worth the extra work of assembling the full SSOR preconditioner for my problem, given the performance deficit the parallel solver starts from.

Regarding controlling the threads: I wasn't sure which method ultimately controlled the solver threads, so yes, I simply set both to the same value (or rather that value + 1 in the case of TBB threads, because one thread remains idle in TBB). Thanks for confirming that the TBB threads are the main ones.

Thanks again,
Michael

On Mon, Jan 16, 2012 at 2:06 AM, Wolfgang Bangerth <[email protected]> wrote:
>
>> I performed the tests with the very specific aim of demonstrating that
>> for the type of problem I am dealing with, (1) the solver generally
>> requires over 90% of the run time, and hence is the major area that
>> should be optimized,
>
> I'd say that's only true if you have a poor preconditioner (e.g. SSOR).
> With good preconditioners (e.g. AMG or an ILU) you should be able to get
> that down to 50% or so.
>
>> Under these idealized conditions, the Trilinos CG solver with SSOR
>> preconditioner performed very well in terms of the speed-up attained.
>> The maximum deviation from a linear speed-up was 30% for up to 8
>> processors; for processors 1-4 it was around 10%. (These were measured
>> on a single Xeon 8-core chip. I am waiting for a job with two 4-core
>> chips to run so that I can show the (expected) performance drop as
>> off-chip communication affects the results. As I said, idealized
>> conditions.)
>>
>> My surprise came when using the deal.II CG solver with SSOR as the
>> preconditioner. My results for a single processor took slightly less
>> than half the time the Trilinos solver required when using one MPI
>> process, which is great, but I found virtually no speed-up from 1
>> thread to 8 threads.
>
> Right, but I suspect that that is because deal.II and Trilinos disagree
> on what SSOR means. In deal.II, we apply SSOR to the entire matrix, i.e.
> it is a sequential algorithm, because you need the result of the
> previous row's operation to substitute into the current row. I suspect
> that what Trilinos means is that it chops the matrix into a number of
> blocks and applies the SSOR algorithm to each of these blocks. An
> alternative viewpoint is that the matrix is subdivided into BxB blocks
> and the SSOR method is applied only to the B diagonal blocks; since
> these do not couple with each other, this creates the potential for
> parallelization, at the cost that the preconditioner is worse than one
> that considers the entire matrix.
>
>> I (mistakenly?) thought that the deal.II vmult method was threaded and
>> should have shown at least some speed-up.
>
> The SparseMatrix::vmult function is; but the PreconditionSSOR::vmult
> function isn't.
>
>> For the record, I controlled the number of threads deal.II used by
>> explicitly editing source/base/multithread_info.cc to set n_cpus to my
>> desired value.
>
> This is a somewhat outdated way of doing things, since most of the
> library has been converted to the tasks framework instead of explicit
> threads. The variable you set does not affect the parallelization of
> SparseMatrix::vmult, for example. However, you can control how many
> tasks should be created at once using a method described in
>
> http://www.dealii.org/developer/doxygen/deal.II/group__threads.html#MTTaskThreads
>
> I think you've already found this. I suppose you set this to the same
> value as for the threads?
>
>> I also used UMFPACK to solve the system. On the Xeon 8-core chip, the
>> deal.II CG solver + SSOR preconditioner beat UMFPACK by about 20% when
>> I reinitialized UMFPACK every time I needed to solve the system. On my
>> laptop the opposite occurs, and UMFPACK beats CG by about 16%. I would
>> expect the difference lies in the versions of BLAS I am using on the
>> different machines. When I initialize the UMFPACK matrix only once and
>> then reuse it for the remainder of the time steps (which my test case
>> allows, but which in general cannot be done), UMFPACK is an order of
>> magnitude faster than the rest, perhaps unsurprisingly.
>
> Yes, I think this is a general observation. For problems with fewer
> than 100,000 DoFs, UMFPACK is generally the fastest method. Your
> problem falls into this category.
>
> Cheers
> W.
>
> --
> ------------------------------------------------------------------------
> Wolfgang Bangerth    email: [email protected]
>                      www:   http://www.math.tamu.edu/~bangerth/
>
> _______________________________________________
> dealii mailing list
> http://poisson.dealii.org/mailman/listinfo/dealii
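P.S. The full-sweep vs. block-sweep SSOR distinction discussed in the thread can be made concrete with a small sketch. The following is plain Python, not deal.II or Trilinos code; the matrix, the block size, and the function names (`ssor_apply`, `block_ssor_apply`) are made up for the illustration, and omega = 1 (symmetric Gauss-Seidel) is used to keep the formulas short. The point is that the full sweep has a row-by-row data dependence and is therefore sequential, while the block variant applies the same sweep to independent diagonal blocks, which could each run on their own thread at the cost of ignoring the couplings between blocks (a weaker preconditioner).

```python
def ssor_apply(A, r):
    """Apply z = M^{-1} r with M = (D+L) D^{-1} (D+U), omega = 1.

    The forward sweep needs y[j] for all j < i, and the backward sweep
    needs z[j] for all j > i: each row uses the previous row's result,
    which is why the full sweep is inherently sequential.
    """
    n = len(A)
    # Forward solve (D + L) y = r
    y = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * y[j] for j in range(i))
        y[i] = (r[i] - s) / A[i][i]
    # Scale by D
    y = [A[i][i] * y[i] for i in range(n)]
    # Backward solve (D + U) z = y
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (y[i] - s) / A[i][i]
    return z


def block_ssor_apply(A, r, block_size):
    """Apply the SSOR sweep independently to each diagonal block of A.

    The blocks do not couple, so each ssor_apply call below could run on
    its own thread; the price is that entries of A outside the diagonal
    blocks are ignored, i.e. the preconditioner is weaker than the full
    sweep over the whole matrix.
    """
    n = len(A)
    z = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        Ab = [row[start:end] for row in A[start:end]]
        z[start:end] = ssor_apply(Ab, r[start:end])
    return z
```

For a matrix that happens to be block diagonal, the two versions agree exactly; for a matrix with off-block couplings, they differ, and the difference is exactly what the block variant gives up in exchange for parallelism.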
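P.P.S. The "factorize once, reuse every time step" observation about UMFPACK can also be sketched in a few lines. This is a toy dense LU in plain Python (no pivoting, which is fine for the diagonally dominant example in the comments; the function names are invented for the illustration), not UMFPACK or deal.II code. The point is only the cost split: the factorization is the expensive O(n^3) step and is done once, while each solve is a cheap O(n^2) pair of triangular substitutions done every time step, which is why reusing the factorization across time steps can be an order of magnitude faster when the matrix does not change.

```python
def lu_factor(A):
    """Return (L, U) with A = L*U, L unit lower triangular.

    This is the expensive step; for a fixed matrix it only needs to be
    done once, up front.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U


def lu_solve(L, U, b):
    """Solve A x = b given A = L*U: forward, then backward substitution.

    This is the cheap step that can be repeated for every right-hand
    side, i.e. every time step.
    """
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


# The usage pattern from the thread, schematically:
#   L, U = lu_factor(A)           # once, up front
#   for rhs in time_step_rhs:     # every time step
#       x = lu_solve(L, U, rhs)
```

In deal.II terms this corresponds to building the SparseDirectUMFPACK object (the factorization) once and then only calling its solve routine inside the time loop, rather than re-initializing it at every step.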
