Hi Barry,

Thank you for the information provided!

Do you think a different BLAS implementation will affect the multithreading 
performance of some vector operations in GMRES in PETSc?

I am now using OpenBLAS but didn't see much improvement when multithreading 
is enabled. Do you think other implementations such as Netlib BLAS or Intel 
MKL will help?

Best,
Yongzhong

From: Barry Smith <bsm...@petsc.dev>
Date: Monday, April 22, 2024 at 4:20 PM
To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>, petsc-ma...@mcs.anl.gov 
<petsc-ma...@mcs.anl.gov>, Piero Triverio <piero.trive...@utoronto.ca>
Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's 
KSPSolver

   PETSc-provided solvers do not directly use threads.

   The BLAS used by LAPACK and PETSc may use threads depending on what BLAS is 
being used and how it was configured.

   Some of the vector operations in GMRES in PETSc use BLAS that can use 
threads, including axpy, dot, etc. For sufficiently large problems, the use of 
threaded BLAS can help with these routines, but not significantly for the 
solver.
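
   For example, if you want to see how much a threaded BLAS helps these kernels on 
your machine, a small standalone sketch along the following lines isolates the 
BLAS-backed VecAXPY() and VecDot() calls (the vector size and repetition count 
below are arbitrary, not taken from your setup):

#include <petscksp.h>
#include <petsctime.h>

/* Hypothetical sketch: time the BLAS-backed vector kernels in isolation so the
   effect of OPENBLAS_NUM_THREADS (or -blas_num_threads) can be measured.
   Error checking is omitted for brevity. */
int main(int argc, char **args) {
    Vec            x, y;
    PetscScalar    dot, alpha = 2.0;
    PetscInt       i, N = 10000000, reps = 100; /* arbitrary size and repetitions */
    PetscLogDouble t1, t2;

    PetscInitialize(&argc, &args, NULL, NULL);
    VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, N, &x);
    VecDuplicate(x, &y);
    VecSet(x, 1.0);
    VecSet(y, 2.0);

    PetscTime(&t1);
    for (i = 0; i < reps; i++) {
        VecAXPY(y, alpha, x);   /* maps to BLAS axpy on the local array */
        VecDot(x, y, &dot);     /* maps to BLAS dot on the local array  */
    }
    PetscTime(&t2);
    PetscPrintf(PETSC_COMM_WORLD, "Time for %d axpy+dot pairs: %f seconds (last dot = %g)\n",
                (int)reps, t2 - t1, (double)PetscRealPart(dot));

    VecDestroy(&x);
    VecDestroy(&y);
    PetscFinalize();
    return 0;
}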

   Dense matrix-vector products MatMult() and dense matrix direct solvers PCLU 
use BLAS and thus can benefit from threading. The benefit can be significant 
for large enough problems with good hardware, especially with PCLU.
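
   As a rough illustration (the sizes and matrix entries below are arbitrary, and 
this assumes a single MPI rank so the dense LU goes through PETSc's built-in 
LAPACK path), a dense system solved with PCLU looks like:

#include <petscksp.h>

/* Hypothetical sketch: dense LU through PCLU calls LAPACK, so a threaded BLAS
   can speed up the factorization. Error checking is omitted for brevity. */
int main(int argc, char **args) {
    Mat         A;
    Vec         x, b;
    KSP         ksp;
    PC          pc;
    PetscInt    i, j, n = 4000, Istart, Iend;   /* arbitrary dense n x n system */
    PetscScalar v;

    PetscInitialize(&argc, &args, NULL, NULL);
    MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, NULL, &A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        for (j = 0; j < n; j++) {
            v = (i == j) ? (PetscScalar)n : (PetscScalar)1.0; /* diagonally dominant */
            MatSetValue(A, i, j, v, INSERT_VALUES);
        }
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPPREONLY);   /* no Krylov iterations, just the direct solve */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCLU);           /* dense LU factorization via LAPACK */
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}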

   If you run with -blas_view, PETSc tries to report information about the 
threading of the BLAS. You can also use -blas_num_threads <n> to set the number 
of threads, which is equivalent to setting the environment variable. For dense 
solvers you can vary the number of threads and run with -log_view to see what 
it helps to improve and what it does not affect.
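
   For example (the executable name here is just a placeholder for your 
application):

    ./your_app -blas_num_threads 8 -blas_view -log_view

then compare the times of the BLAS-heavy events in the -log_view summary as you 
change the thread count.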




On Apr 22, 2024, at 4:06 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:

Hello all,

I am writing to ask if PETSc’s KSPSolver makes use of OpenMP/multithreading, 
specifically when performing iterative solutions with the GMRES algorithm.

The question arose when I was running a large numerical program based on the 
boundary element method. I used PETSc's GMRES algorithm in KSPSolve to solve a 
shell matrix system iteratively. I observed that threads were being utilized, 
controlled by the OPENBLAS_NUM_THREADS environment variable. However, I noticed 
no significant performance difference between running the solver with multiple 
threads versus a single thread.

Could you please confirm whether GMRES in KSPSolve leverages multithreading, and 
also whether it is influenced by the multithreading of the low-level math 
libraries such as BLAS and LAPACK? If so, how can I enable multithreading 
effectively to see noticeable improvements in solution times when using GMRES? 
If not, why do I observe that threads are being used during the GMRES solutions?

For reference, I am using PETSc version 3.16.0, configured as follows:

./configure PETSC_ARCH=config-release --with-scalar-type=complex 
--with-fortran-kernels=1 --with-debugging=0 COPTFLAGS="-O3 -march=native" 
CXXOPTFLAGS="-O3 -march=native" FOPTFLAGS="-O3 -march=native" --with-cxx=g++ 
--download-openmpi --download-superlu --download-opencascade 
--with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB} 
--with-threadsafety --with-log=0 --with-openmp

To simplify the diagnosis of potential issues, I have also written a small 
example program that uses GMRES to solve a sparse matrix system derived from a 
2D Poisson problem discretized with the finite difference method. I observed 
the same behavior with this code. The code is as follows:

#include <petscksp.h>
#include <petsctime.h>  /* PetscTime() */

/* Monitor function to print iteration number and residual norm */
PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void *ctx) {
    PetscErrorCode ierr;
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", n, (double)rnorm);
    CHKERRQ(ierr);
    return 0;
}

int main(int argc, char **args) {
    Vec x, b, x_true, e;
    Mat A;
    KSP ksp;
    PetscErrorCode ierr;
    PetscInt i, j, Ii, J, n = 500; // Size of the grid n x n
    PetscInt Istart, Iend, ncols;
    PetscScalar v;
    PetscMPIInt rank;
    PetscInitialize(&argc, &args, NULL, NULL);
    PetscLogDouble t1, t2;     // Variables for timing
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    // Create vectors and matrix
    ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x); CHKERRQ(ierr);
    ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
    ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);

    // Set true solution as all ones
    ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);

    // Create and assemble matrix A for the 2D Laplacian using 5-point stencil
    ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
    ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n); CHKERRQ(ierr);
    ierr = MatSetFromOptions(A); CHKERRQ(ierr);
    ierr = MatSetUp(A); CHKERRQ(ierr);
    ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
    for (Ii = Istart; Ii < Iend; Ii++) {
        i = Ii / n; // Row index
        j = Ii % n; // Column index
        v = -4.0;
        ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES); CHKERRQ(ierr);
        if (i > 0) { // South
            J = Ii - n;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
        if (i < n - 1) { // North
            J = Ii + n;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
        if (j > 0) { // West
            J = Ii - 1;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
        if (j < n - 1) { // East
            J = Ii + 1;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

    // Compute the RHS corresponding to the true solution
    ierr = MatMult(A, x_true, b); CHKERRQ(ierr);

    // Set up and solve the linear system
    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
    ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT); CHKERRQ(ierr);

    /* Set up the monitor */
    ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);

    // Start timing
    PetscTime(&t1);

    ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);

    // Stop timing
    PetscTime(&t2);

    // Compute error
    ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
    ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
    PetscReal norm_error, norm_true;
    ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
    ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
    PetscReal relative_error = norm_error / norm_true;
    if (rank == 0) { // Print only from the first MPI process
        PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 / ||x_true||_2: %g\n", (double)relative_error);
    }

    // Output the wall time taken for KSPSolve
    PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 - t1);

    // Cleanup
    ierr = VecDestroy(&x); CHKERRQ(ierr);
    ierr = VecDestroy(&b); CHKERRQ(ierr);
    ierr = VecDestroy(&x_true); CHKERRQ(ierr);
    ierr = VecDestroy(&e); CHKERRQ(ierr);
    ierr = MatDestroy(&A); CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
    PetscFinalize();
    return 0;
}
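
For the timings below I run the example on a single MPI process and vary only 
the OpenBLAS thread count, roughly like this (the executable name is just a 
placeholder):

OPENBLAS_NUM_THREADS=4 ./poisson_gmres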

Here are some profiling results for the GMRES solution.

OPENBLAS_NUM_THREADS = 1,  iteration steps = 859, solution time = 16.1 s
OPENBLAS_NUM_THREADS = 2,  iteration steps = 859, solution time = 16.3 s
OPENBLAS_NUM_THREADS = 4,  iteration steps = 859, solution time = 16.7 s
OPENBLAS_NUM_THREADS = 8,  iteration steps = 859, solution time = 16.8 s
OPENBLAS_NUM_THREADS = 16, iteration steps = 859, solution time = 17.8 s

I am using one workstation with an Intel® Core™ i9-11900K processor (8 cores, 
16 threads). Note that I am not launching multiple MPI processes with 
mpirun/mpiexec, so the number of MPI processes should default to 1; please 
correct me if I am wrong.

Thank you in advance!

Sincerely,
Yongzhong

-----------------------------------------------------------
Yongzhong Li
PhD student | Electromagnetics Group
Department of Electrical & Computer Engineering
University of Toronto
http://www.modelics.org
