Hi Petsc-Users,
I am trying to understand an issue where PetscCommDuplicate() calls are taking an increasing percentage of time as I run a fixed-sized problem on more processes. I am using the FEM to solve the steady-state heat transfer equation (K.x = q) using a PC direct solver, like MUMPS. I am running on the NERSC Cray X30, which has two Xeon's per node with 12 cores each, and profiling the code using CrayPat sampling. On a typical problem (1E+6 finite elements), running on a single node: -for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate (on process 1, but on the root it is less), and (for reference) 9% of total time is for MUMPS. -for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate (on every process except the root, where it is <1%), and 9-10% of total time is for MUMPS. What is the large PetscCommDuplicate time connected to, an increasing number of messages (tags)? Would using fewer MatSetValues() and VecSetValues() calls (with longer message lengths) alleviate this? For reference, the PETSc calling sequence in the code is as follows. // Create the solution and RHS vectors ierr = VecCreate(petscData->mpicomm,&mesh->hpx); ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution"); ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh ierr = VecSetFromOptions(mesh->hpx); // allow run time options ierr = VecDuplicate(mesh->hpx,&q); // create the RHS vector // Create the stiffnexx matrix ierr = MatCreate(petscData->mpicomm,&K); ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns); ierr = MatSetType(K,MATAIJ); // default sparse type // Do preallocation ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL); ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL); ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE); ierr = MatSetUp(K); // Create and set up the KSP context as a PreConditioner Only (Direct) Solution ierr = KSPCreate(petscData->mpicomm,&ksp); ierr = KSPSetOperators(ksp,K,K); ierr = KSPSetType(ksp,KSPPREONLY); // Set the temperature vector ierr = VecSet(mesh->hpx,mesh->Tmin); // Set the default PC method as MUMPS ierr = KSPGetPC(ksp,&pc); // extract the preconditioner ierr = PCSetType(pc,PCLU); // set pc options ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS); ierr = KSPSetFromOptions(ksp); // Set the values for the K matrix and q vector // which involves a lot of these calls ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES); // 1 call per matrix row (equation) ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES); // 1 call per element ierr = VecAssemblyBegin(q); ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY); ierr = VecAssemblyEnd(q); ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY); // Solve ////////////////////////////////////// ierr = KSPSolve(ksp,q,mesh->hpx); ... *Note that the code evenly divides the finite elements over the total number of processors, and I am using ghosting of the FE vertices vector to handle the vertices that are needed on more than 1 process. Thanks in advance for your help, Matt Overholt CapeSym, Inc. --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus