> I performed the tests with the very specific aim of demonstrating that
> for the type of problem I am dealing with, (1) the solver generally
> requires over 90% of the run time, and hence is the major area that
> should be optimized,
I'd say that's only true if you have a poor preconditioner (e.g. SSOR).
With good preconditioners (e.g. AMG or an ILU) you should be able to get
that down to 50% or so.
> Under these idealized conditions, the Trilinos CG solver with SSOR
> preconditioner performed very well in terms of the speedup attained. The
> maximum deviation from linear speedup was 30% for up to 8 processors;
> for 1-4 processors it was around 10%. (These were measured on a single
> 8-core Xeon chip. I am waiting for a job with two 4-core chips to run
> so that I can show the (expected) performance drop as off-chip
> communication affects the results. As I said, idealized conditions.)
> My surprise came when using the deal.II CG solver with SSOR as
> preconditioner. My results for a single processor took slightly less
> than half the time the Trilinos solver required when using one MPI
> process, which is great, but I found virtually no speedup from 1
> thread to 8 threads.
Right, but I suspect that is because deal.II and Trilinos disagree on
what SSOR means. In deal.II, we apply SSOR to the entire matrix, i.e.,
it is a sequential algorithm because you need the result of the
previous row's operation to substitute into the current row. I suspect
that what Trilinos means is that it chops the matrix into a number of
blocks and applies the SSOR algorithm to each of these blocks. An
alternative viewpoint is that the matrix is subdivided into BxB blocks
and the SSOR method is applied only to the B diagonal blocks; since
these do not couple with each other, this creates the potential for
parallelization, at the cost that the preconditioner is worse than one
that considers the entire matrix.
> I (mistakenly?) thought that the deal.II vmult
> method was threaded and should have shown at least some speed up.
The SparseMatrix::vmult function is, but the PreconditionSSOR::vmult
function isn't.
> For the record, I controlled the number of threads deal.II used by
> explicitly editing source/base/multithread_info.cc to set n_cpus to my
> desired value.
This is sort of an outdated way of doing things since most of the
library has been converted to the tasks framework instead of explicit
threads. The variable you set does not affect the parallelization of
SparseMatrix::vmult, for example. However, you can control how many
tasks should be created at once using a method described in
http://www.dealii.org/developer/doxygen/deal.II/group__threads.html#MTTaskThreads
(I think you've already found this). I suppose you set this to the same
value as for threads?
> I also used UMFPACK to solve the system. On the 8-core Xeon chip, the
> deal.II CG solver + SSOR preconditioner beat UMFPACK by about 20% when
> I reinitialized UMFPACK every time I needed to solve the system. On my
> laptop the opposite occurs and UMFPACK beats CG by about 16%. I would
> expect the difference lies in the versions of BLAS I am using on the
> different machines. When I only factorize the matrix with UMFPACK once
> and then reuse the factorization for the remainder of the time steps
> (which my test case allows, but which in general cannot be done),
> UMFPACK is an order of magnitude faster than the rest, perhaps
> unsurprisingly.
Yes, I think this is a general observation. For problems with fewer
than 100,000 DoFs, UMFPACK is generally the fastest method. Your
problem would fall into this category.
Cheers
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth email: [email protected]
www: http://www.math.tamu.edu/~bangerth/
_______________________________________________
dealii mailing list http://poisson.dealii.org/mailman/listinfo/dealii