On Sat, Jul 13, 2019 at 2:39 PM Mohammed Mostafa via petsc-users <petsc-users@mcs.anl.gov> wrote:
> I am generating the matrix using the finite volume method.
> I basically loop over the face list, instead of looping over the cells, to
> avoid evaluating the flux of each cell face twice.
> So I figured I would store the coefficients in a temp container (in this
> case a CSR sparse matrix) and then loop over rows to set them in the PETSc
> matrix.
> I know it looks like a waste of memory and copy overhead, but for now I
> can't think of a better way.

You can use PETSc's MatSetValues, just use ADD_VALUES instead of
INSERT_VALUES. Our matrix class is pretty much the same as your "temp
container", so I would use PETSc directly. And I see "Maximum nonzeros in
any row is 4" and "Maximum nonzeros in any row is 1". I guess the "1"s are
from the off-diagonal block matrix, but should the 4 be 2*D + 1?
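In a face loop that would look something like the sketch below (untested;
the face list and coefficient arrays here are placeholder names, not taken
from your code):

    /* Accumulate each face's flux contribution into the owner and
       neighbour rows; ADD_VALUES sums with whatever is already there. */
    for (PetscInt f = 0; f < nFaces; ++f) {
      PetscInt    rows[2] = {face_owner[f], face_neigh[f]}; /* global cell ids */
      PetscScalar v[4]    = { flux_coef[f], -flux_coef[f],
                             -flux_coef[f],  flux_coef[f]}; /* 2x2 face stencil */
      MatSetValues(A, 2, rows, 2, rows, v, ADD_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

With proper preallocation, repeated additions into the same row are cheap,
so no temp container is needed.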
> For now I will try the two routines from the master branch,
> MatCreateMPIAIJWithArrays() and MatUpdateMPIAIJWithArrays(),
> and send the logs.
>
> Thanks
> Kamra
>
> On Sun, Jul 14, 2019 at 1:51 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>>    How are you generating entries in your matrix? Finite differences,
>> finite element, finite volume, something else?
>>
>>    If you are using finite differences you generally generate an entire
>> row at a time and call MatSetValues() once per row. With finite elements
>> you generate an element at a time and ADD_VALUES for a block of rows and
>> columns.
>>
>>    I don't know why generating directly in CSR format would be faster
>> than calling MatSetValues() once per row, but anyway, if you have the
>> matrix in CSR format you can use
>>
>>    MatCreateMPIAIJWithArrays() (and, in the master branch of the
>> repository, MatUpdateMPIAIJWithArrays())
>>
>> to build the matrix the first time, and then "refill" it with numerical
>> values each new time. There are a few other optimizations related to
>> matrix insertion in the master branch you might also benefit from.
>>    Generally for problems with multiple "times" or "linear solve steps"
>> we use two stages, the first to track the initial setup and first time
>> step, and the other to capture all the other steps (since the extra
>> overhead is only in the first step). You could make a new stage for each
>> time step but I don't think that is needed.
>>
>>    After you have this going, send us the new log summary.
>>
>>    Barry
>>
>> > On Jul 13, 2019, at 11:20 AM, Mohammed Mostafa via petsc-users
>> <petsc-users@mcs.anl.gov> wrote:
>> >
>> > I am sorry, but I don't see what you mean by small times.
>> > Although mat assembly is relatively smaller, the cost of MatSetValues
>> is still significant. The same can be said for vec assembly. Combined,
>> vec/mat assembly and MatSetValues constitute about 50% of the total cost
>> of matrix construction.
>> >
>> > So is this a problem with my matrix setup / preallocation?
>> >
>> > Or is this a hardware issue, where for whatever reason the copy is
>> overly slow? The code was run on a single node.
>> >
>> > Or is this function-call overhead, since MatSetValues is being called
>> 1M times inside the for loop (170k times in each process)?
>> >
>> > Thanks, Kamra
>> >
>> > On Sun, Jul 14, 2019 at 12:41 AM Matthew Knepley <knep...@gmail.com>
>> wrote:
>> > On Sat, Jul 13, 2019 at 9:56 AM Mohammed Mostafa
>> <mo7ammedmost...@gmail.com> wrote:
>> > Hello Matt,
>> >
>> > I revised my code and changed the way I create the rhs vector.
>> > Previously I was using VecCreateGhost just in case I need the ghost
>> values, but for now I changed that to
>> > VecCreateMPI(.......)
>> > So maybe that was the cause of the scatter.
>> > I am attaching a new log output with this email.
>> >
>> > Okay, the times are now very small. How does it scale up?
>> >
>> >   Thanks,
>> >
>> >      Matt
>> >
>> > Also, regarding how I fill my PETSc matrix:
>> > In my code I fill a temp CSR-format matrix, because otherwise I would
>> need MatSetValue to fill the PETSc Mat element by element, which is not
>> recommended in the PETSc manual and is probably very expensive due to
>> function-call overhead.
>> > So after I create my matrix in CSR format, I fill the PETSc Mat A as
>> follows:
>> >
>> > for (i = 0; i < nMatRows; i++) {
>> >   offset    = CSR_iptr[i];   /* start of row i in the CSR arrays */
>> >   row_index = row_gIndex[i]; /* global row index */
>> >   nj        = Eqn_nj[i];     /* number of nonzeros in this row */
>> >   MatSetValues(PhiEqnSolver.A, 1, &row_index, nj, CSR_jptr + offset,
>> >                CSR_vptr + offset, INSERT_VALUES);
>> > }
>> >
>> > After that:
>> >
>> > VecAssemblyBegin(RHS);
>> > VecAssemblyEnd(RHS);
>> > MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>> > MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>> >
>> > I don't believe I am doing anything special. If possible, I would like
>> to set the whole CSR matrix at once in one command.
>> > I took a look at the code for MatSetValues, and if I am understanding
>> it correctly (hopefully), I think I could do it, maybe by modifying it
>> or creating a new routine entirely for this purpose,
>> > i.e. MatSetValuesFromCSR(.....)
>> > Or is there a particular reason why it has to be this way?
>> >
>> > I also tried KSP ex3, slightly tweaked to add a logging stage around
>> the assembly and MatSetValues, and I am attaching the modified example
>> here as well. In this example the matrix stash is not empty (meaning
>> off-processor values are being set), but the timings are for roughly the
>> same matrix size. The command I used is
>> > mpirun -np 6 ./mod_ksp_ex3 -m 1000 -log_view -info
>> >
>> > Regards,
>> > Kamra
>> >
>> > On Sat, Jul 13, 2019 at 1:43 PM Matthew Knepley <knep...@gmail.com>
>> wrote:
>> > On Fri, Jul 12, 2019 at 10:51 PM Mohammed Mostafa
>> <mo7ammedmost...@gmail.com> wrote:
>> > Hello Matt,
>> > Attached is the entire dumped log output using -log_view and -info.
>> >
>> > In matrix construction, it looks like you have a mixture of load
>> imbalance (see the imbalance in the Begin events) and lots of Scatter
>> messages in your assembly. We turn off MatSetValues() logging by default
>> since it is usually called many times, but you can explicitly turn it
>> back on if you want. I don't think that is the problem here. It is easy
>> to see from examples (say SNES ex5) that it is not the major time sink.
>> What is the Scatter doing?
>> >
>> >   Thanks,
>> >
>> >      Matt
>> >
>> > Thanks,
>> > Kamra
>> >
>> > On Fri, Jul 12, 2019 at 9:23 PM Matthew Knepley <knep...@gmail.com>
>> wrote:
>> > On Fri, Jul 12, 2019 at 5:19 AM Mohammed Mostafa via petsc-users
>> <petsc-users@mcs.anl.gov> wrote:
>> > Hello all,
>> > I have a few questions regarding PETSc.
>> >
>> > Please send the entire output of a run with all the logging turned on,
>> using -log_view and -info.
>> >
>> >   Thanks,
>> >
>> >      Matt
>> >
>> > Question 1:
>> > For the profiling, is it possible to show only the user-defined log
>> events in the breakdown of each stage in -log_view?
>> > I tried deactivating all the built-in class IDs (MAT, VEC, KSP, PC):
>> >
>> > PetscLogEventExcludeClass(MAT_CLASSID);
>> > PetscLogEventExcludeClass(VEC_CLASSID);
>> > PetscLogEventExcludeClass(KSP_CLASSID);
>> > PetscLogEventExcludeClass(PC_CLASSID);
>> >
>> > which should "deactivate event logging for a PETSc object class in
>> every stage" according to the manual. However, I still see them in the
>> stage breakdown:
>> >
>> > --- Event Stage 1: Matrix Construction
>> >
>> > BuildTwoSidedF     4 1.0 2.7364e-02 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0  18 0   0   0   0    0
>> > VecSet             1 1.0 4.5300e-06 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0   0 0   0   0   0    0
>> > VecAssemblyBegin   2 1.0 2.7344e-02 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0  18 0   0   0   0    0
>> > VecAssemblyEnd     2 1.0 8.3447e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0   0 0   0   0   0    0
>> > VecScatterBegin    2 1.0 7.5102e-05 1.7 0.00e+00 0.0 3.6e+01 2.1e+03 0.0e+00  0 0 3 0 0   0 0  50  80   0    0
>> > VecScatterEnd      2 1.0 3.5286e-05 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0   0 0   0   0   0    0
>> > MatAssemblyBegin   2 1.0 8.8930e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0   0 0   0   0   0    0
>> > MatAssemblyEnd     2 1.0 1.3566e-02 1.1 0.00e+00 0.0 3.6e+01 5.3e+02 8.0e+00  0 0 3 0 6  10 0  50  20 100    0
>> > AssembleMats       2 1.0 3.9774e-02 1.7 0.00e+00 0.0 7.2e+01 1.3e+03 8.0e+00  0 0 7 0 6  28 0 100 100 100    0  # USER EVENT
>> > myMatSetValues     2 1.0 2.6931e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0  19 0   0   0   0    0  # USER EVENT
>> > setNativeMat       1 1.0 3.5613e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0  24 0   0   0   0    0  # USER EVENT
>> > setNativeMatII     1 1.0 4.7023e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0  28 0   0   0   0    0  # USER EVENT
>> > callScheme         1 1.0 2.2333e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0 0   2 0   0   0   0    0  # USER EVENT
>> >
>> > Also, is it possible to clear the logs so that I can write a separate
>> profiling output file for each timestep (since I am solving a transient
>> problem and I want to know how the performance changes as time goes by)?
>> > ---------------------------------------------------------------------
>> > Question 2:
>> > Regarding MatSetValues:
>> > Right now I am writing a finite volume code. Due to an algorithm
>> requirement I have to write the matrix into a local native format (array
>> of arrays) and then loop through rows and use MatSetValues to set the
>> elements in Mat A:
>> >
>> > MatSetValues(A, 1, &row, nj, j_index, coefvalues, INSERT_VALUES);
>> >
>> > but it is very slow and it is killing my performance, although the
>> matrix was properly set up using
>> >
>> > MatCreateAIJ(PETSC_COMM_WORLD, this->local_size, this->local_size,
>> >              PETSC_DETERMINE, PETSC_DETERMINE, -1, d_nnz, -1, o_nnz, &A);
>> >
>> > with d_nnz and o_nnz properly assigned, so no mallocs occur during
>> MatSetValues and all inserted values are local, with no off-processor
>> values.
>> > So my question is: is it possible to set multiple rows at once,
>> hopefully all of them? I checked the manual, and MatSetValues can only
>> set a dense matrix block, and going row by row seems expensive.
>> > Or perhaps is it possible to copy all rows directly into the
>> underlying matrix data? As I mentioned, all values are local and there
>> are no off-processor values (the stash is 0):
>> >
>> > [0] VecAssemblyBegin_MPI_BTS(): Stash has 0 entries, uses 0 mallocs.
>> > [0] VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
>> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [2] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [3] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [4] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [5] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 186064 X 186064; storage space: 0 unneeded,743028 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 186062; storage space: 0 unneeded,742972 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186062) < 0.6. Do not use CompressedRow routines.
>> > [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4
>> > [2] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186064) < 0.6. Do not use CompressedRow routines.
>> > [4] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 186063; storage space: 0 unneeded,743093 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 186062; storage space: 0 unneeded,743036 used
>> > [4] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [4] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4
>> > [4] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186063) < 0.6. Do not use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [5] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 186062; storage space: 0 unneeded,742938 used
>> > [5] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [5] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4
>> > [5] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186062) < 0.6. Do not use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186062) < 0.6. Do not use CompressedRow routines.
>> > [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 186063; storage space: 0 unneeded,743049 used
>> > [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4
>> > [3] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186063) < 0.6. Do not use CompressedRow routines.
>> > [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 186064 X 685; storage space: 0 unneeded,685 used
>> > [4] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 649; storage space: 0 unneeded,649 used
>> > [4] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [4] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>> > [4] MatCheckCompressedRow(): Found the ratio (num_zerorows 185414)/(num_localrows 186063) > 0.6. Use CompressedRow routines.
>> > [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>> > [2] MatCheckCompressedRow(): Found the ratio (num_zerorows 185379)/(num_localrows 186064) > 0.6. Use CompressedRow routines.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 1011; storage space: 0 unneeded,1011 used
>> > [5] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 1137; storage space: 0 unneeded,1137 used
>> > [5] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [5] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>> > [5] MatCheckCompressedRow(): Found the ratio (num_zerorows 184925)/(num_localrows 186062) > 0.6. Use CompressedRow routines.
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>> > [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 658; storage space: 0 unneeded,658 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 648; storage space: 0 unneeded,648 used
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 185051)/(num_localrows 186062) > 0.6. Use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 185414)/(num_localrows 186062) > 0.6. Use CompressedRow routines.
>> > [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>> > [3] MatCheckCompressedRow(): Found the ratio (num_zerorows 185405)/(num_localrows 186063) > 0.6. Use CompressedRow routines.
>> > ---------------------------------------------------------------------
>> > Question 3:
>> > If all inserted matrix and vector data are local, what part of the
>> vec/mat assembly consumes time? MatSetValues and MatAssembly consume
>> more time than building the matrix itself, and not only on the first
>> MAT_FINAL_ASSEMBLY.
>> >
>> > For context, the matrix in the above is nearly 1M x 1M, partitioned
>> over six processes, and it was NOT built using DM.
>> >
>> > Finally, the configure options are:
>> >
>> > PETSC_ARCH=release3 -with-debugging=0 COPTFLAGS="-O3 -march=native
>> -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native"
>> FOPTFLAGS="-O3 -march=native -mtune=native" --with-cc=mpicc
>> --with-cxx=mpicxx --with-fc=mpif90 --download-metis --download-hypre
>> >
>> > Sorry for such a long question, and thanks in advance.
>> > Thanks
>> > M. Kamra
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/