Yes, I'm solving systems involving SparseMatrices.
I follow the hints at
https://eigen.tuxfamily.org/dox/TopicMultiThreading.html.
By scaling I mean halving the CPU time when doubling the number of threads.
I know that this optimal efficiency is achievable with MPI-based
distributed-memory parallelism if the preconditioner is scalable, e.g.
diagonal or block-diagonal preconditioners. This is what I usually get with
PETSc.
Clearly, matrices and vectors are distributed in memory based on some
partitioning of the problem that must be carried out in a preliminary
phase.
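
On the Eigen side, what I have in mind is simply CG with the built-in
diagonal (Jacobi) preconditioner; a minimal sketch (the matrix and
right-hand side are assumed to come from the assembly phase, and the
tolerance is illustrative):

#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>

// Sketch: CG on the full matrix (Lower|Upper) with a diagonal preconditioner.
void solveWithJacobiCG(const Eigen::SparseMatrix<double>& A,
                       const Eigen::VectorXd& b, Eigen::VectorXd& x)
{
  Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
                           Eigen::Lower | Eigen::Upper,
                           Eigen::DiagonalPreconditioner<double>> cg;
  cg.setTolerance(1e-8);  // illustrative tolerance
  cg.compute(A);
  x = cg.solve(b);
}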

Thanks to increasing core counts in modern CPUs, it would be great to avoid
distributed memory for some problems and combine distributed and shared
memory for others. It is easy enough to obtain good scaling for matrix
assembly using std::async and std::future, not a big deal. Based on your
expertise, do you think that matrix-vector products require specific
knowledge of the matrix sparsity pattern?
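
For instance, this is the kind of row-block split I have in mind, along the
lines of what I do for assembly; a minimal sketch, assuming a RowMajor
matrix (the helper name and task count are illustrative):

#include <Eigen/Sparse>
#include <algorithm>
#include <future>
#include <vector>

// Sketch: y = A * x with disjoint row blocks handled by std::async tasks.
Eigen::VectorXd parallelSpMV(const Eigen::SparseMatrix<double, Eigen::RowMajor>& A,
                             const Eigen::VectorXd& x, int nTasks)
{
  Eigen::VectorXd y(A.rows());
  std::vector<std::future<void>> tasks;
  const Eigen::Index chunk = (A.rows() + nTasks - 1) / nTasks;
  for (int t = 0; t < nTasks; ++t) {
    const Eigen::Index begin = t * chunk;
    const Eigen::Index end = std::min<Eigen::Index>(begin + chunk, A.rows());
    tasks.emplace_back(std::async(std::launch::async, [&, begin, end] {
      for (Eigen::Index i = begin; i < end; ++i) {
        double sum = 0.0;
        for (Eigen::SparseMatrix<double, Eigen::RowMajor>::InnerIterator it(A, i); it; ++it)
          sum += it.value() * x[it.col()];
        y[i] = sum;  // each task writes only its own rows
      }
    }));
  }
  for (auto& f : tasks) f.get();
  return y;
}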

Best regards
Lorenzo


On Tue, Sep 14, 2021, 20:04 Rasmus Munk Larsen <[email protected]> wrote:

> Hi Lorenzo,
>
> Thanks for the input. For Krylov subspace solvers, most of the work is
> usually in the matrix-vector multiplies (and possibly the preconditioner if
> you provide your own), so I'm not sure what you mean by scaling in this
> case. Are you running, e.g., conjugate gradients on a dense Eigen::Matrix
> (I assume not) or a sparse Eigen::SparseMatrix? In the latter case, most of
> the scaling would be in the sparse matrix-vector product. The vector
> operations (dot product, norms etc.) in the Krylov solvers are not
> multi-threaded in Eigen AFAICT.
>
> This type of multi-threading _is_ available in the Tensor library if you
> use a ThreadPoolDevice for evaluation; see, for example,
> https://gitlab.com/libeigen/eigen/-/blob/master/unsupported/test/cxx11_tensor_thread_pool.cpp#L50
> However, adding this type of general multi-threaded evaluation to matrix
> expressions in Eigen Core by unifying it with Tensor is a much longer-term
> project that nobody is working on, AFAICT, although it has been on the
> wishlist for a long time.
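>
> A minimal sketch of that pattern (the sizes, the expression, and the thread
> count below are illustrative only):
>
>   #define EIGEN_USE_THREADS
>   #include <unsupported/Eigen/CXX11/Tensor>
>
>   int main() {
>     Eigen::ThreadPool pool(4);                 // worker threads
>     Eigen::ThreadPoolDevice device(&pool, 4);
>     Eigen::Tensor<float, 2> a(2000, 2000), b(2000, 2000), c(2000, 2000);
>     a.setRandom();
>     b.setRandom();
>     // Assigning through .device(...) evaluates the expression on the pool.
>     c.device(device) = a + b * 0.5f;
>     return 0;
>   }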
>
> Best regards,
>   Rasmus
>
> On Mon, Sep 13, 2021 at 11:52 PM Lorenzo Botti <[email protected]>
> wrote:
>
>> Dear all,
>> I hope that this new implementation also improves the scaling of Eigen
>> solvers, e.g. conjugate gradient and biconjugate gradient. What I usually
>> get is a factor-of-2 speed-up, independently of the number of threads.
>>
>> Thanks for the effort.
>> Best regards.
>> Lorenzo
>>
>>
>>
>> On Mon, Sep 13, 2021, 17:52 Rasmus Munk Larsen <[email protected]>
>> wrote:
>>
>>> Hi Peter,
>>>
>>> I would recommend that you hold off on this change for a bit. We are
>>> planning on moving the non-blocking threadpool from unsupported into core
>>> in the near future and to use that as the basis for multi-threading without
>>> depending on OpenMP.
>>>
>>> Rasmus
>>>
>>> On Mon, Sep 13, 2021 at 5:21 AM Peter <[email protected]> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I'm currently playing with the sparse matrix implementation within
>>>> Eigen,
>>>> namely Eigen::SparseMatrix<double, Eigen::ColMajor> combined with a
>>>> self-adjoint view.
>>>>
>>>> In my application I need the multiplication of a symmetric sparse
>>>> matrix with a dense matrix,
>>>> where the dimensions of the matrices are of the order of up to a few
>>>> tens of thousands.
>>>>
>>>> I tried to parallelize the dot product in SparseSelfAdjointView.h by
>>>> copying an OpenMP directive
>>>> from other parts of Eigen, and I tried to parallelize the outer loop
>>>> directly
>>>> with std::execution::par, see below.
>>>>
>>>> In summary, I do not see any effect of the parallelization.
>>>> Before digging deeper into it and building a minimal working example,
>>>> has anyone already achieved this?
>>>>
>>>> Could one actually call the corresponding MKL routine directly, or are
>>>> the internal storage schemes different?
>>>>
>>>> Best regards
>>>> Peter
>>>>
>>>> P.S.: This is what I tried, as a diff against the current master branch:
>>>>
>>>> :~/Eigen/eigen/Eigen/src/SparseCore$ git diff SparseSelfAdjointView.h
>>>> diff --git a/Eigen/src/SparseCore/SparseSelfAdjointView.h b/Eigen/src/SparseCore/SparseSelfAdjointView.h
>>>> index 0302ef3a4..91e7d495b 100644
>>>> --- a/Eigen/src/SparseCore/SparseSelfAdjointView.h
>>>> +++ b/Eigen/src/SparseCore/SparseSelfAdjointView.h
>>>> @@ -10,6 +10,11 @@
>>>>  #ifndef EIGEN_SPARSE_SELFADJOINTVIEW_H
>>>>  #define EIGEN_SPARSE_SELFADJOINTVIEW_H
>>>>
>>>> +#include <iostream> /// only for debugging
>>>> +#include <boost/range/irange.hpp>
>>>> +
>>>> +#include <execution>
>>>> +
>>>>  #include "./InternalHeaderCheck.h"
>>>>
>>>>  namespace Eigen {
>>>> @@ -295,8 +300,19 @@ inline void sparse_selfadjoint_time_dense_product(const SparseLhsType& lhs, cons
>>>>    SparseLhsTypeNested lhs_nested(lhs);
>>>>    LhsEval lhsEval(lhs_nested);
>>>>
>>>> +  Eigen::initParallel();
>>>> +  Index threads = Eigen::nbThreads();
>>>> +
>>>> +  std::cout << "\ndot threads: "<< threads << " rhs-cols: " << rhs.cols() << std::endl;
>>>> +
>>>> +  // #pragma omp parallel for
>>>> +  // #pragma omp parallel for schedule(dynamic,(rhs.cols()+threads*4-1)/(threads*4)) num_threads(threads)
>>>> +  // for (Index k=0; k<rhs.cols(); ++k)
>>>> +
>>>> +  auto r = boost::irange(rhs.cols());
>>>> +
>>>> +  std::for_each(std::execution::par, r.begin(), r.end(),  [&](Index k)
>>>>    // work on one column at once
>>>> -  for (Index k=0; k<rhs.cols(); ++k)
>>>>    {
>>>>      for (Index j=0; j<lhs.outerSize(); ++j)
>>>>      {
>>>> @@ -330,6 +346,7 @@ inline void sparse_selfadjoint_time_dense_product(const SparseLhsType& lhs, cons
>>>>          res.coeffRef(j,k) += alpha * i.value() * rhs.coeff(j,k);
>>>>      }
>>>>    }
>>>> +  );
>>>>  }
>>>>
>>>>
>>>>
>>>>
