Yes, only the routines that can explicitly use BLAS are multi-threaded.

PETSc does support using any MPI linear solver from a sequential (or OpenMP) main program through the PCMPI construct (https://petsc.org/release/manualpages/PC/PCMPI/). I am finishing up better support in the branch barry/2023-09-15/fix-log-pcmpi.
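For illustration, a minimal run-line sketch of this server mode, with option names taken from the PCMPI manual page linked above (the executable name and rank count are placeholders):

    mpiexec -n 4 ./myprog -mpi_linear_solver_server \
        -ksp_type preonly -pc_type mpi \
        -mpi_ksp_type gmres -mpi_pc_type bjacobi

The main program still runs on a single rank; the remaining ranks act as a solver server to which each KSPSolve() hands its linear system.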
Barry

> On Apr 23, 2024, at 3:59 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>
> Thanks Barry! Does this mean that the sparse matrix-vector products, which actually constitute the majority of the computations in my GMRES routine in PETSc, don't utilize multithreading? Only basic vector operations such as VecAXPY and VecDot or dense matrix operations in PETSc will benefit from multithreading, is that correct?
>
> Best,
> Yongzhong
>
> From: Barry Smith <bsm...@petsc.dev>
> Date: Tuesday, April 23, 2024 at 3:35 PM
> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio <piero.trive...@utoronto.ca>
> Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's KSPSolver
>
> Intel MKL or OpenBLAS are the best bet, but for vector operations they will not be significant since the vector operations do not dominate the computations.
>
> On Apr 23, 2024, at 3:23 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>
> Hi Barry,
>
> Thank you for the information provided!
>
> Do you think a different BLAS implementation will affect the multithreading performance of some vector operations in GMRES in PETSc?
>
> I am now using OpenBLAS but didn't see much improvement when multithreading is enabled; do you think another implementation such as Netlib or Intel MKL will help?
>
> Best,
> Yongzhong
>
> From: Barry Smith <bsm...@petsc.dev>
> Date: Monday, April 22, 2024 at 4:20 PM
> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio <piero.trive...@utoronto.ca>
> Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's KSPSolver
>
> PETSc provided solvers do not directly use threads.
>
> The BLAS used by LAPACK and PETSc may use threads depending on what BLAS is being used and how it was configured.
> Some of the vector operations in GMRES in PETSc use BLAS that can use threads, including axpy, dot, etc. For sufficiently large problems, the use of threaded BLAS can help with these routines, but not significantly for the solver.
>
> Dense matrix-vector products MatMult() and dense matrix direct solvers PCLU use BLAS and thus can benefit from threading. The benefit can be significant for large enough problems with good hardware, especially with PCLU.
>
> If you run with -blas_view, PETSc tries to report information about the threading of the BLAS. You can also use -blas_num_threads <n> to set the number of threads, equivalent to setting the environment variable. For dense solvers you can vary the number of threads and run with -log_view to see what it helps to improve and what it does not affect.
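For example, a sketch of such an experiment using the options mentioned above (the executable name is a placeholder, and -blas_num_threads assumes a PETSc version recent enough to provide it):

    OPENBLAS_NUM_THREADS=8 ./myprog -blas_view -log_view
    ./myprog -blas_num_threads 8 -blas_view -log_view    # same effect, letting PETSc set the count

Comparing the -log_view tables of these runs against a single-threaded run shows which operations the extra threads actually speed up.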
> On Apr 22, 2024, at 4:06 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>
> Hello all,
>
> I am writing to ask if PETSc's KSPSolve makes use of OpenMP/multithreading, specifically when performing iterative solutions with the GMRES algorithm.
>
> The question appeared when I was running a large numerical program based on the boundary element method. I used PETSc's GMRES algorithm in KSPSolve to solve a shell matrix system iteratively. I observed that threads were being utilized, controlled by the OPENBLAS_NUM_THREADS environment variable. However, I noticed no significant performance difference between running the solver with multiple threads versus a single thread.
>
> Could you please confirm whether GMRES in KSPSolve leverages multithreading, and whether it is influenced by the multithreading of low-level math libraries such as BLAS and LAPACK? If so, how can I enable multithreading effectively to see noticeable improvements in solution times when using GMRES? If not, why do I observe that threads are being used during the GMRES solutions?
>
> For reference, I am using PETSc version 3.16.0, configured as follows:
>
> ./configure PETSC_ARCH=config-release --with-scalar-type=complex --with-fortran-kernels=1 --with-debugging=0 COPTFLAGS="-O3 -march=native" CXXOPTFLAGS="-O3 -march=native" FOPTFLAGS="-O3 -march=native" --with-cxx=g++ --download-openmpi --download-superlu --download-opencascade --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB} --with-threadsafety --with-log=0 --with-openmp
>
> To simplify the diagnosis of potential issues, I have also written a small example program that uses GMRES to solve a sparse matrix system derived from a 2D Poisson problem with the finite difference method. I found similar behavior with this piece of code. The code is as follows:
>
> #include <petscksp.h>
>
> /* Monitor function to print iteration number and residual norm */
> PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void *ctx) {
>   PetscErrorCode ierr;
>   ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", n, (double)rnorm);
>   CHKERRQ(ierr);
>   return 0;
> }
>
> int main(int argc, char **args) {
>   Vec            x, b, x_true, e;
>   Mat            A;
>   KSP            ksp;
>   PetscErrorCode ierr;
>   PetscInt       i, j, Ii, J, n = 500; // Size of the grid n x n
>   PetscInt       Istart, Iend, ncols;
>   PetscScalar    v;
>   PetscMPIInt    rank;
>   PetscLogDouble t1, t2; // Variables for timing
>
>   PetscInitialize(&argc, &args, NULL, NULL);
>   MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>
>   // Create vectors
>   ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x); CHKERRQ(ierr);
>   ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
>   ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);
>
>   // Set true solution as all ones
>   ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);
>
>   // Create and assemble matrix A for the 2D Laplacian using a 5-point stencil
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n); CHKERRQ(ierr);
>   ierr = MatSetFromOptions(A); CHKERRQ(ierr);
>   ierr = MatSetUp(A); CHKERRQ(ierr);
>   ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
>   for (Ii = Istart; Ii < Iend; Ii++) {
>     i = Ii / n; // Row index
>     j = Ii % n; // Column index
>     v = -4.0;
>     ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES); CHKERRQ(ierr);
>     if (i > 0) {     // South neighbor
>       J = Ii - n; v = 1.0;
>       ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
>     }
>     if (i < n - 1) { // North neighbor
>       J = Ii + n; v = 1.0;
>       ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
>     }
>     if (j > 0) {     // West neighbor
>       J = Ii - 1; v = 1.0;
>       ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
>     }
>     if (j < n - 1) { // East neighbor
>       J = Ii + 1; v = 1.0;
>       ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
>     }
>   }
>   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>
>   // Compute the RHS corresponding to the true solution
>   ierr = MatMult(A, x_true, b); CHKERRQ(ierr);
>
>   // Set up the linear solver: GMRES with relative tolerance 1e-5
>   ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
>   ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
>   ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
>   ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT); CHKERRQ(ierr);
>
>   /* Set up the monitor */
>   ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);
>
>   // Time the solve
>   PetscTime(&t1);
>   ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
>   PetscTime(&t2);
>
>   // Compute relative error ||x - x_true||_2 / ||x_true||_2
>   ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
>   ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
>   PetscReal norm_error, norm_true;
>   ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
>   ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
>   PetscReal relative_error = norm_error / norm_true;
>   if (rank == 0) { // Print only from the first MPI process
>     PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 / ||x_true||_2: %g\n", (double)relative_error);
>   }
>
>   // Output the wall time taken by KSPSolve
>   PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 - t1);
>
>   // Cleanup
>   ierr = VecDestroy(&x); CHKERRQ(ierr);
>   ierr = VecDestroy(&b); CHKERRQ(ierr);
>   ierr = VecDestroy(&x_true); CHKERRQ(ierr);
>   ierr = VecDestroy(&e); CHKERRQ(ierr);
>   ierr = MatDestroy(&A); CHKERRQ(ierr);
>   ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
>   PetscFinalize();
>   return 0;
> }
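(Consistent with the replies above: the assembled AIJ matrix's MatMult does not call BLAS at all, so only the BLAS-level vector kernels inside GMRES can use the OpenBLAS threads, and the flat timings below are the expected outcome. A minimal sketch of the MPI route to speeding this example up, assuming the executable is named ./poisson:

    mpiexec -n 8 ./poisson

The code already creates its Vec and Mat objects on PETSC_COMM_WORLD with PETSC_DECIDE sizing, so it runs distributed without modification.)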
> Here are some profiling results for the GMRES solution (wall-clock time in seconds):
>
>   OPENBLAS_NUM_THREADS = 1,  iteration steps = 859, solution time = 16.1
>   OPENBLAS_NUM_THREADS = 2,  iteration steps = 859, solution time = 16.3
>   OPENBLAS_NUM_THREADS = 4,  iteration steps = 859, solution time = 16.7
>   OPENBLAS_NUM_THREADS = 8,  iteration steps = 859, solution time = 16.8
>   OPENBLAS_NUM_THREADS = 16, iteration steps = 859, solution time = 17.8
>
> I am using one workstation with an Intel® Core™ i9-11900K processor (8 cores, 16 threads). Note that I am not launching multiple MPI processes with mpirun/mpiexec, so the number of MPI processes should default to 1; correct me if I am wrong.
>
> Thank you in advance!
>
> Sincerely,
> Yongzhong
>
> -----------------------------------------------------------
> Yongzhong Li
> PhD student | Electromagnetics Group
> Department of Electrical & Computer Engineering
> University of Toronto
> http://www.modelics.org