Do you need Fortran? If not, just run configure again with --with-fc=0 
--with-sowing=0 added.

   If you do need Fortran, please send configure.log.
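
   For example, starting from the configure line quoted below, it would look 
something like this (just a sketch; with --with-fc=0 the Fortran-only options 
FOPTFLAGS and --with-fortran-kernels=1 are dropped, and the OpenBLAS paths are 
assumed to be the same as in your environment):

   ./configure PETSC_ARCH=config-release --with-scalar-type=complex \
       --with-debugging=0 COPTFLAGS="-O3 -march=native" CXXOPTFLAGS="-O3 -march=native" \
       --with-cxx=g++ --download-openmpi --download-superlu --download-opencascade \
       --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB} \
       --with-threadsafety --with-log=0 --with-openmp --with-fc=0 --with-sowing=0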


> On Apr 29, 2024, at 3:45 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> 
> wrote:
> 
> Hi Barry,
> 
> Thanks for your reply. I checked out the git branch 
> barry/2023-09-15/fix-log-pcmpi but got some errors when configuring PETSc; 
> below is the error message:
> 
> =============================================================================================
>                          Configuring PETSc to compile on your system
> =============================================================================================
> =============================================================================================
>                                      ***** WARNING *****
>   Found environment variable: MAKEFLAGS=s -j14 --jobserver-auth=3,5. Ignoring 
> it! Use
>   "./configure MAKEFLAGS=$MAKEFLAGS" if you really want to use this value
> =============================================================================================
> =============================================================================================
>                   Running configure on SOWING; this may take several minutes
> =============================================================================================
>  
> *********************************************************************************************
>            UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for 
> details):
> ---------------------------------------------------------------------------------------------
>                               Error running configure on SOWING
> *********************************************************************************************
> My configure command was:
> 
> ./configure PETSC_ARCH=config-release --with-scalar-type=complex 
> --with-fortran-kernels=1 --with-debugging=0 COPTFLAGS="-O3 -march=native" 
> CXXOPTFLAGS="-O3 -march=native" FOPTFLAGS="-O3 -march=native" --with-cxx=g++ 
> --download-openmpi --download-superlu --download-opencascade 
> --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB} 
> --with-threadsafety --with-log=0 --with-openmp
> 
> I didn’t have this issue when I configured PETSc using the tarball release 
> download. Any suggestions on this?
> 
> Thanks and regards,
> Yongzhong
>  
> From: Barry Smith <bsm...@petsc.dev>
> Date: Saturday, April 27, 2024 at 12:54 PM
> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov
> Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in 
> PETSc's KSPSolver
> 
>  
>    You should use the git branch barry/2023-09-15/fix-log-pcmpi. It is still 
> a work in progress but much better than what is currently in the main PETSc 
> branch.
>  
>    By default, the MPI linear solver server requires 10,000 unknowns per MPI 
> process, so for smaller problems it will only run on one MPI rank and list 
> Sequential in your output. In general, you need on the order of at least 
> 10,000 unknowns per MPI process to get good speedup. You can control this with 
>  
>    -mpi_linear_solver_server_minimum_count_per_rank <number_unknowns>
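>  
>    For example, a sketch of how this could be combined with the ex1 run from 
> your earlier message (the threshold value here is only illustrative, chosen 
> small so that even the tiny ex1 system gets split across ranks):
>  
>    mpiexec -n 4 ./ex1 -mpi_linear_solver_server -mpi_linear_solver_server_view \
>        -mpi_linear_solver_server_minimum_count_per_rank 2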
>  
> Regarding the report of 1 iteration, that is fixed in the branch listed above.
>  
>   Barry
>  
> On Apr 26, 2024, at 5:11 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>  
> Hi Barry,
> 
> Thanks, I am interested in this PCMPI solution provided by PETSc!
> 
> I tried src/ksp/ksp/tutorials/ex1.c, with PETSc configured as follows:
> 
> ./configure PETSC_ARCH=config-debug --with-scalar-type=complex 
> --with-fortran-kernels=1 --with-debugging=0 --with-logging=0 --with-cxx=g++ 
> --download-mpich --download-superlu --download-opencascade 
> --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB}
> 
> In the Linux terminal, I ran:
> 
> mpiexec -n 4 ./ex1 -mpi_linear_solver_server -mpi_linear_solver_server_view
>  
> However, I found the output a bit strange:
> 
> Norm of error 1.23629e-15, Iterations 1
> MPI linear solver server statistics:
>     Ranks        KSPSolve()s     Mats        KSPs       Avg. Size      Avg. Its
>   Sequential         3                         2            10           1
> Norm of error 1.23629e-15, Iterations 1
> MPI linear solver server statistics:
>     Ranks        KSPSolve()s     Mats        KSPs       Avg. Size      Avg. Its
>   Sequential         3                         2            10           1
> Norm of error 1.23629e-15, Iterations 1
> MPI linear solver server statistics:
>     Ranks        KSPSolve()s     Mats        KSPs       Avg. Size      Avg. Its
>   Sequential         3                         2            10           1
> Norm of error 1.23629e-15, Iterations 1
> MPI linear solver server statistics:
>     Ranks        KSPSolve()s     Mats        KSPs       Avg. Size      Avg. Its
>   Sequential         3                         2            10           1
> 
> It seems that MPI started four processes, but they all did the same thing, 
> and I am confused why the Ranks column shows Sequential. Is this the expected 
> output when mpi_linear_solver_server is turned on?
> 
> And if I run ex1 without any command-line options, I get 
> 
> Norm of error 2.47258e-15, Iterations 5
> 
> It looks like KSPSolve uses 5 iterations to reach convergence, so why does it 
> report 1 iteration when mpi_linear_solver_server is enabled?
> 
> I hope to get some help on these issues, thank you!
> 
> Sincerely,
> Yongzhong
> 
> 
> 
>  
> From: Barry Smith <bsm...@petsc.dev>
> Date: Tuesday, April 23, 2024 at 5:15 PM
> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio 
> <piero.trive...@utoronto.ca>
> Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in 
> PETSc's KSPSolver
> 
>  
>   Yes, only the routines that can explicitly use BLAS have multi-threading.
>  
>    PETSc does support using any MPI linear solver from a sequential (or 
> OpenMP) main program using the 
> https://urldefense.us/v3/__https://petsc.org/release/manualpages/PC/PCMPI/*pcmpi__;Iw!!G_uCfscf7eWS!fEO7IZL3jQV-DOjKPiTVpQNQTTK1IngytmxtSmu5Zqk9hiDabFw2LqAQ93_tN8hG1wEk886ENFA3bM-EGAKIuSI$
>   construct. I am finishing up better support in the branch 
> barry/2023-09-15/fix-log-pcmpi
>  
>   Barry
>  
>  
>  
>  
>  
>  
> 
> On Apr 23, 2024, at 3:59 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>  
> Thanks Barry! Does this mean that the sparse matrix-vector products, which 
> actually constitute the majority of the computations in my GMRES routine in 
> PETSc, don’t utilize multithreading? Only basic vector operations such as 
> VecAXPY and VecDot or dense matrix operations in PETSc will benefit from 
> multithreading, is that correct?
> 
> Best,
> Yongzhong
>  
> From: Barry Smith <bsm...@petsc.dev>
> Date: Tuesday, April 23, 2024 at 3:35 PM
> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio 
> <piero.trive...@utoronto.ca>
> Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in 
> PETSc's KSPSolver
> 
>  
>    Intel MKL or OpenBLAS are the best bet, but for vector operations the 
> gains will not be significant since the vector operations do not dominate the 
> computations.
>  
> 
> On Apr 23, 2024, at 3:23 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>  
> Hi Barry,
> 
> Thank you for the information provided!
> 
> Do you think a different BLAS implementation will affect the multithreading 
> performance of the vector operations in GMRES in PETSc?
>  
> I am now using OpenBLAS but didn’t see much improvement when multithreading 
> is enabled; do you think other implementations such as Netlib or Intel MKL 
> will help?
> 
> Best,
> Yongzhong
>  
> From: Barry Smith <bsm...@petsc.dev>
> Date: Monday, April 22, 2024 at 4:20 PM
> To: Yongzhong Li <yongzhong...@mail.utoronto.ca>
> Cc: petsc-users@mcs.anl.gov, petsc-ma...@mcs.anl.gov, Piero Triverio 
> <piero.trive...@utoronto.ca>
> Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in 
> PETSc's KSPSolver
> 
>  
>    PETSc-provided solvers do not directly use threads. 
>  
>    The BLAS used by LAPACK and PETSc may use threads depending on what BLAS 
> is being used and how it was configured. 
>  
>    Some of the vector operations in GMRES in PETSc use BLAS that can use 
> threads, including axpy, dot, etc. For sufficiently large problems, the use 
> of threaded BLAS can help with these routines, but not significantly for the 
> solver. 
>  
>    Dense matrix-vector products MatMult() and dense matrix direct solvers 
> PCLU use BLAS and thus can benefit from threading. The benefit can be 
> significant for large enough problems with good hardware, especially with 
> PCLU. 
>  
>    If you run with -blas_view, PETSc tries to indicate information about the 
> threading of BLAS. You can also use -blas_num_threads <n> to set the number 
> of threads, equivalent to setting the environment variable. For dense 
> solvers you can vary the number of threads and run with -log_view to see what 
> it helps to improve and what it does not affect.
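>  
>    For example (only a sketch; the executable name and thread counts are 
> placeholders, not something specific to your code):
>  
>    ./your_gmres_example -blas_view -blas_num_threads 1 -log_view
>    ./your_gmres_example -blas_view -blas_num_threads 8 -log_view
>  
>    Comparing the -log_view output of the two runs shows which operations the 
> threaded BLAS actually speeds up and which it leaves unchanged.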
>  
>  
>  
> 
> On Apr 22, 2024, at 4:06 PM, Yongzhong Li <yongzhong...@mail.utoronto.ca> wrote:
>  
> Hello all,
>  
> I am writing to ask if PETSc’s KSPSolver makes use of OpenMP/multithreading, 
> specifically when performing iterative solutions with the GMRES algorithm.
>  
> The question arose when I was running a large numerical program based on the 
> boundary element method. I used PETSc's GMRES algorithm in KSPSolve to solve 
> a shell matrix system iteratively. I observed that threads were being 
> utilized, controlled by the OPENBLAS_NUM_THREADS environment variable. 
> However, I noticed no significant performance difference between running the 
> solver with multiple threads versus a single thread.
> 
> Could you please confirm whether GMRES in KSPSolve leverages multithreading, 
> and whether it is influenced by the multithreading of the low-level math 
> libraries such as BLAS and LAPACK? If so, how can I enable multithreading 
> effectively to see noticeable improvements in solution times when using 
> GMRES? If not, why do I observe that threads are being used during the GMRES 
> solutions?
>  
> For reference, I am using PETSc version 3.16.0, configured as follows:
> 
> ./configure PETSC_ARCH=config-release --with-scalar-type=complex 
> --with-fortran-kernels=1 --with-debugging=0 COPTFLAGS="-O3 -march=native" 
> CXXOPTFLAGS="-O3 -march=native" FOPTFLAGS="-O3 -march=native" --with-cxx=g++ 
> --download-openmpi --download-superlu --download-opencascade 
> --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB} 
> --with-threadsafety --with-log=0 --with-openmp
> 
> To simplify the diagnosis of potential issues, I have also written a small 
> example program that uses GMRES to solve a sparse matrix system derived from 
> a 2D Poisson problem discretized with the finite difference method. I 
> observed similar behavior with this code. The code is as follows:
> 
> #include <petscksp.h>
> 
> /* Monitor function to print iteration number and residual norm */
> PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void *ctx) {
>     PetscErrorCode ierr;
>     ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", 
> n, (double)rnorm);
>     CHKERRQ(ierr);
>     return 0;
> }
> 
> int main(int argc, char **args) {
>     Vec x, b, x_true, e;
>     Mat A;
>     KSP ksp;
>     PetscErrorCode ierr;
>     PetscInt i, j, Ii, J, n = 500; // Size of the grid n x n
>     PetscInt Istart, Iend, ncols;
>     PetscScalar v;
>     PetscMPIInt rank;
>     PetscInitialize(&argc, &args, NULL, NULL);
>     PetscLogDouble t1, t2;     // Variables for timing
>     MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
> 
>     // Create vectors and matrix
>     ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x); 
> CHKERRQ(ierr);
>     ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
>     ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);
> 
>     // Set true solution as all ones
>     ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);
> 
>     // Create and assemble matrix A for the 2D Laplacian using 5-point stencil
>     ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n); 
> CHKERRQ(ierr);
>     ierr = MatSetFromOptions(A); CHKERRQ(ierr);
>     ierr = MatSetUp(A); CHKERRQ(ierr);
>     ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
>     for (Ii = Istart; Ii < Iend; Ii++) {
>         i = Ii / n; // Row index
>         j = Ii % n; // Column index
>         v = -4.0;
>         ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES); 
> CHKERRQ(ierr);
>         if (i > 0) { // South
>             J = Ii - n;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); 
> CHKERRQ(ierr);
>         }
>         if (i < n - 1) { // North
>             J = Ii + n;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); 
> CHKERRQ(ierr);
>         }
>         if (j > 0) { // West
>             J = Ii - 1;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); 
> CHKERRQ(ierr);
>         }
>         if (j < n - 1) { // East
>             J = Ii + 1;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); 
> CHKERRQ(ierr);
>         }
>     }
>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
> 
>     // Compute the RHS corresponding to the true solution
>     ierr = MatMult(A, x_true, b); CHKERRQ(ierr);
> 
>     // Set up and solve the linear system
>     ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
>     ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
>     ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
>     ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, 
> PETSC_DEFAULT); CHKERRQ(ierr);
> 
>     /* Set up the monitor */
>     ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);
> 
>     // Start timing
>     PetscTime(&t1);
> 
>     ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
> 
>     // Stop timing
>     PetscTime(&t2);
> 
>     // Compute error
>     ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
>     ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
>     PetscReal norm_error, norm_true;
>     ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
>     ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
>     PetscReal relative_error = norm_error / norm_true;
>     if (rank == 0) { // Print only from the first MPI process
>         PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 / ||x_true||_2: %g\n", (double)relative_error);
>     }
> 
>     // Output the wall time taken for KSPSolve
>     PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 - t1);
> 
>     // Cleanup
>     ierr = VecDestroy(&x); CHKERRQ(ierr);
>     ierr = VecDestroy(&b); CHKERRQ(ierr);
>     ierr = VecDestroy(&x_true); CHKERRQ(ierr);
>     ierr = VecDestroy(&e); CHKERRQ(ierr);
>     ierr = MatDestroy(&A); CHKERRQ(ierr);
>     ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
>     PetscFinalize();
>     return 0;
> }
> 
> Here are some profiling results for the GMRES solution (times in seconds):
> 
> OPENBLAS_NUM_THREADS = 1, iteration steps  = 859, solution time = 16.1
> OPENBLAS_NUM_THREADS = 2, iteration steps  = 859, solution time = 16.3
> OPENBLAS_NUM_THREADS = 4, iteration steps  = 859, solution time = 16.7
> OPENBLAS_NUM_THREADS = 8, iteration steps  = 859, solution time = 16.8
> OPENBLAS_NUM_THREADS = 16, iteration steps  = 859, solution time = 17.8
> 
> I am using one workstation with an Intel® Core™ i9-11900K processor (8 cores, 
> 16 threads). Note that I am not using multiple MPI processes (no 
> mpirun/mpiexec), so the default number of MPI processes should be 1; correct 
> me if I am wrong.
>  
> Thank you in advance!
> 
> Sincerely,
> Yongzhong
>  
> -----------------------------------------------------------
> Yongzhong Li
> PhD student | Electromagnetics Group
> Department of Electrical & Computer Engineering
> University of Toronto
> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!fEO7IZL3jQV-DOjKPiTVpQNQTTK1IngytmxtSmu5Zqk9hiDabFw2LqAQ93_tN8hG1wEk886ENFA3bM-EQf0srRU$
