Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-28 Thread Mark Adams
On Wed, Jan 26, 2022 at 2:51 PM Barry Smith wrote: > > I have added a mini-MR to print out the key so we can see if it is 0 or > some crazy number. https://gitlab.com/petsc/petsc/-/merge_requests/4766 > Well, after all of our MRs (Junchao's in particular) I am not seeing this MPI error. So

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
Valgrind was not useful. Just an MPI abort message with 128 process output. Can we merge my MR and I can test your branch. On Wed, Jan 26, 2022 at 2:51 PM Barry Smith wrote: > > I have added a mini-MR to print out the key so we can see if it is 0 or > some crazy number.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Barry Smith
I have added a mini-MR to print out the key so we can see if it is 0 or some crazy number. https://gitlab.com/petsc/petsc/-/merge_requests/4766 Note that the table data structure is not sent through MPI so if MPI is the culprit it is not just that MPI is putting incorrect (or no)
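
A minimal sketch of the kind of diagnostic described above, assuming the older PetscTable hash path used in MatAssembly; it is not the contents of the linked MR, and the function and variable names are hypothetical:

#include <petscsys.h>
#include <petscctable.h>

/* Sketch only: report the key before the hash-table lookup that fails in
   MatAssemblyEnd, so a zero or corrupted key (e.g. from a bad GPU-aware MPI
   receive) is visible next to the error.  All names here are hypothetical. */
static PetscErrorCode CheckedTableLookup(PetscTable table, PetscInt key, PetscInt *loc)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  if (key <= 0) {  /* PetscTableFind() rejects keys <= 0 */
    ierr = PetscPrintf(PETSC_COMM_SELF, "suspect table key: %" PetscInt_FMT "\n", key);CHKERRQ(ierr);
  }
  ierr = PetscTableFind(table, key, loc);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}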

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
On Wed, Jan 26, 2022 at 2:32 PM Justin Chang wrote: > rocgdb requires "-ggdb" in addition to "-g" > Ah, OK. > > What happens if you lower AMD_LOG_LEVEL to something like 1 or 2? I was > hoping AMD_LOG_LEVEL could at least give you something like a "stacktrace" > showing what the last

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
> > > Are the crashes reproducible in the same place with identical runs? > > I have not seen my reproducer work, and it fails in MatAssemblyEnd with a table entry not being found. I can't tell if it is the same error every time.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Justin Chang
rocgdb requires "-ggdb" in addition to "-g" What happens if you lower AMD_LOG_LEVEL to something like 1 or 2? I was hoping AMD_LOG_LEVEL could at least give you something like a "stacktrace" showing what the last successful HIP/HSA call was. I believe it should also show line numbers in the code.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
On Wed, Jan 26, 2022 at 1:54 PM Justin Chang wrote: > Couple suggestions: > > 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will tell > you everything that's happening at the HIP level (memcpy's, mallocs, kernel > execution time, etc) > Humm, My reproducer uses 2 nodes and

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
On Wed, Jan 26, 2022 at 2:25 PM Mark Adams wrote: > I have used valgrind here. I did not run it on this MPI error. I will. > > On Wed, Jan 26, 2022 at 10:56 AM Barry Smith wrote: > >> >> Any way to run with valgrind (or a HIP variant of valgrind)? It looks >> like a memory corruption issue

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
I have used valgrind here. I did not run it on this MPI error. I will. On Wed, Jan 26, 2022 at 10:56 AM Barry Smith wrote: > > Any way to run with valgrind (or a HIP variant of valgrind)? It looks > like a memory corruption issue and tracking down exactly when the > corruption begins is 3/4's

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Justin Chang
Couple suggestions: 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will tell you everything that's happening at the HIP level (memcpy's, mallocs, kernel execution time, etc) 2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that we officially support. There are

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Barry Smith
Any way to run with valgrind (or a HIP variant of valgrind)? It looks like a memory corruption issue and tracking down exactly when the corruption begins is 3/4's of the way to finding the exact cause. Are the crashes reproducible in the same place with identical runs? > On Jan 26, 2022,

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
I think it is an MPI bug. It works with GPU aware MPI turned off. I am sure Summit will be fine. We have had users fix this error by switching their MPI. On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang wrote: > I don't know if this is due to bugs in petsc/kokkos backend. See if you > can run 6

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Junchao Zhang
I don't know if this is due to bugs in petsc/kokkos backend. See if you can run 6 nodes (48 mpi ranks). If it fails, then run the same problem on Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of our own. --Junchao Zhang On Wed, Jan 26, 2022 at 8:44 AM Mark Adams

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
I am not able to reproduce this with a small problem. 2 nodes or less refinement works. This is from the 8 node test, the -dm_refine 5 version. I see that it comes from PtAP. This is on the fine grid. (I was thinking it could be on a reduced grid with idle processors, but no) [15]PETSC ERROR:

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-26 Thread Mark Adams
The GPU aware MPI is dying going 1 to 8 nodes, 8 processes per node. I will make a minimum reproducer. start with 2 nodes, one process on each node. On Tue, Jan 25, 2022 at 10:19 PM Barry Smith wrote: > > So the MPI is killing you in going from 8 to 64. (The GPU flop rate > scales almost

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith
So the MPI is killing you in going from 8 to 64. (The GPU flop rate scales almost perfectly, but the overall flop rate is only half of what it should be at 64). > On Jan 25, 2022, at 9:24 PM, Mark Adams wrote: > > It looks like we have our instrumentation and job configuration in decent >

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
> > > Note that Mark's logs have been switching back and forth between > -use_gpu_aware_mpi and changing number of ranks -- we won't have that > information if we do manual timing hacks. This is going to be a routine > thing we'll need on the mailing list and we need the provenance to go with >

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
Here are two runs, without and with -log_view, respectively. My new timer is "Solve time = " About 10% difference On Tue, Jan 25, 2022 at 12:53 PM Mark Adams wrote: > BTW, a -device_view would be great. > > On Tue, Jan 25, 2022 at 12:30 PM Mark Adams wrote: > >> >> >> On Tue, Jan 25, 2022

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Barry Smith writes: >> What is the command line option to turn >> PetscLogGpuTimeBegin/PetscLogGpuTimeEnd into a no-op even when -log_view is >> on? I know it'll mess up attribution, but it'll still tell us how long the >> solve took. > > We don't have an API for this yet. It is slightly

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
BTW, a -device_view would be great. On Tue, Jan 25, 2022 at 12:30 PM Mark Adams wrote: > > > On Tue, Jan 25, 2022 at 11:56 AM Jed Brown wrote: > >> Barry Smith writes: >> >> > Thanks Mark, far more interesting. I've improved the formatting to >> make it easier to read (and fixed width font

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith
> On Jan 25, 2022, at 12:25 PM, Jed Brown wrote: > > Barry Smith writes: > >>> On Jan 25, 2022, at 11:55 AM, Jed Brown wrote: >>> >>> Barry Smith writes: >>> Thanks Mark, far more interesting. I've improved the formatting to make it easier to read (and fixed width font for

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
On Tue, Jan 25, 2022 at 11:56 AM Jed Brown wrote: > Barry Smith writes: > > > Thanks Mark, far more interesting. I've improved the formatting to > make it easier to read (and fixed width font for email reading) > > > > * Can you do same run with say 10 iterations of Jacobi PC? > > > > *

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Barry Smith writes: >> On Jan 25, 2022, at 11:55 AM, Jed Brown wrote: >> >> Barry Smith writes: >> >>> Thanks Mark, far more interesting. I've improved the formatting to make it >>> easier to read (and fixed width font for email reading) >>> >>> * Can you do same run with say 10

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith
> On Jan 25, 2022, at 11:55 AM, Jed Brown wrote: > > Barry Smith writes: > >> Thanks Mark, far more interesting. I've improved the formatting to make it >> easier to read (and fixed width font for email reading) >> >> * Can you do same run with say 10 iterations of Jacobi PC? >> >> *

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Barry Smith writes: > Thanks Mark, far more interesting. I've improved the formatting to make it > easier to read (and fixed width font for email reading) > > * Can you do same run with say 10 iterations of Jacobi PC? > > * PCApply performance (looks like GAMG) is terrible! Problems too

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Barry Smith
Thanks Mark, far more interesting. I've improved the formatting to make it easier to read (and fixed width font for email reading) * Can you do same run with say 10 iterations of Jacobi PC? * PCApply performance (looks like GAMG) is terrible! Problems too small? * VecScatter time is

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
> > > > > VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 1 1 0 0 0 235882 290088 0 0.00e+00 0 > 0.00e+00 100 > > VecScatterBegin 200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 > 1.0e+00 2 0 99 79 0 19 0 100 100 0

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Jed Brown
Mark Adams writes: > adding Suyash, > > I found the/a problem. Using ex56, which has a crappy decomposition, using > one MPI process/GPU is much faster than using 8 (64 total). (I am looking > at ex13 to see how much of this is due to the decomposition) > If you only use 8 processes it seems

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-25 Thread Mark Adams
adding Suyash, I found the/a problem. Using ex56, which has a crappy decomposition, using one MPI process/GPU is much faster than using 8 (64 total). (I am looking at ex13 to see how much of this is due to the decomposition) If you only use 8 processes it seems that all 8 are put on the first

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
January 24, 2022 at 2:24 PM > To: Justin Chang <jychan...@gmail.com> > Cc: petsc-dev@mcs.anl.gov > Subject: Re: [petsc-dev] Kokkos/Crusher perforance > > For th

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Munson, Todd via petsc-dev
l.gov" Subject: Re: [petsc-dev] Kokkos/Crusher perforance For this, to start, someone can run src/vec/vec/tutorials/performance.c and compare the performance to that in the technical report Evaluation of PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: Ve

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
For this, to start, someone can run src/vec/vec/tutorials/performance.c and compare the performance to that in the technical report Evaluation of PETSc on a Heterogeneous Architecture \\ the OLCF Summit System \\ Part I: Vector Node Performance. Google to find. One does not have to and

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
On Mon, Jan 24, 2022 at 2:57 PM Justin Chang wrote: > My name has been called. > > Mark, if you're having issues with Crusher, please contact Veronica > Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in > those emails > I have worked with Veronica before. I'll ask Todd if we

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Justin Chang
Also, do you guys have an OLCF liaison? That's actually your better bet if you do. Performance issues with ROCm/Kokkos are pretty common in apps besides just PETSc. We have several teams actively working on rectifying this. However, I think performance issues can be identified more quickly if we had

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Justin Chang
My name has been called. Mark, if you're having issues with Crusher, please contact Veronica Vergara (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails On Mon, Jan 24, 2022 at 1:49 PM Barry Smith wrote: > > > On Jan 24, 2022, at 2:46 PM, Mark Adams wrote: > > Yea,

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
> On Jan 24, 2022, at 2:46 PM, Mark Adams wrote: > > Yea, CG/Jacobi is as close to a benchmark code as we could want. I could run > this on one processor to get cleaner numbers. > > Is there a designated ECP technical support contact? Mark, you've forgotten you work for DOE. There isn't

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
Yea, CG/Jacobi is as close to a benchmark code as we could want. I could run this on one processor to get cleaner numbers. Is there a designated ECP technical support contact? On Mon, Jan 24, 2022 at 2:18 PM Barry Smith wrote: > > I think you should contact the crusher ECP technical support

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
I think you should contact the Crusher ECP technical support team and tell them you are getting dismal performance and ask if you should expect better. Don't waste time flogging a dead horse. > On Jan 24, 2022, at 2:16 PM, Matthew Knepley wrote: > > On Mon, Jan 24, 2022 at 2:11 PM Junchao

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Matthew Knepley
On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang wrote: > > > On Mon, Jan 24, 2022 at 12:55 PM Mark Adams wrote: > >> >> >> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang >> wrote: >> >>> Mark, I think you can benchmark individual vector operations, and once >>> we get reasonable profiling

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Junchao Zhang
On Mon, Jan 24, 2022 at 12:55 PM Mark Adams wrote: > > > On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang > wrote: > >> Mark, I think you can benchmark individual vector operations, and once we >> get reasonable profiling results, we can move to solvers etc. >> > > Can you suggest a code to run or

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang wrote: > Mark, I think you can benchmark individual vector operations, and once we > get reasonable profiling results, we can move to solvers etc. > Can you suggest a code to run or are you suggesting making a vector benchmark code? > > --Junchao

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Junchao Zhang
Mark, I think you can benchmark individual vector operations, and once we get reasonable profiling results, we can move to solvers etc. --Junchao Zhang On Mon, Jan 24, 2022 at 12:09 PM Mark Adams wrote: > > > On Mon, Jan 24, 2022 at 12:44 PM Barry Smith wrote: > >> >> Here except for

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
On Mon, Jan 24, 2022 at 12:44 PM Barry Smith wrote: > > Here except for VecNorm the GPU is used effectively in that most of the > time is spent doing real work on the GPU > > VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 > 4.0e+02 0 1 0 0 20 9 1 0 0

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
Here except for VecNorm the GPU is used effectively in that most of the time is spent doing real work on the GPU VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100 Even the

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
> > Mark, can we compare with Spock? > Looks much better. This puts two processes/GPU because there are only 4. DM Object: box 8 MPI processes type: plex box in 3 dimensions: Number of 0-cells per rank: 274625 274625 274625 274625 274625 274625 274625 274625 Number of 1-cells per rank:

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Barry Smith
Not clear how to interpret: the "gpu" FLOP rates for dot and norm are a good amount higher (exact details of where the log functions are located can affect this), but their overall flop rates are not much better. Scatter is better without GPU MPI. How much of this is noise, need to see

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-24 Thread Mark Adams
> Mark, > > Can you run both with GPU aware MPI? > > Perlmutter fails with GPU aware MPI. I think there are known problems with this that are being worked on. And here is Crusher with GPU aware MPI. DM Object: box 8 MPI processes type: plex box in 3 dimensions: Number of 0-cells per

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Junchao Zhang
On Sun, Jan 23, 2022 at 11:22 PM Barry Smith wrote: > > > On Jan 24, 2022, at 12:16 AM, Junchao Zhang > wrote: > > > > On Sun, Jan 23, 2022 at 10:44 PM Barry Smith wrote: > >> >> Junchao, >> >> Without GPU aware MPI, is it moving the entire vector to the CPU and >> doing the scatter and

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jed Brown
Barry Smith writes: > We should make it easy to turn off the logging and synchronizations (from > PetscLogGpu) for everything Vec and below, and everything Mat and below to > remove all the synchronizations needed for the low level timing. I think we > can do that by having PetscLogGpu

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith
> On Jan 24, 2022, at 12:16 AM, Junchao Zhang wrote: > > > > On Sun, Jan 23, 2022 at 10:44 PM Barry Smith > wrote: > > Junchao, > > Without GPU aware MPI, is it moving the entire vector to the CPU and > doing the scatter and moving everything back or does

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Junchao Zhang
On Sun, Jan 23, 2022 at 10:44 PM Barry Smith wrote: > > Junchao, > > Without GPU aware MPI, is it moving the entire vector to the CPU and > doing the scatter and moving everything back or does it just move up > exactly what needs to be sent to the other ranks and move back exactly what >

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith
> On Jan 23, 2022, at 11:47 PM, Jed Brown wrote: > > Barry Smith via petsc-dev writes: > >> The PetscLogGpuTimeBegin()/End was written by Hong so it works with events >> to get a GPU timing, it is not supposed to include the CPU kernel launch >> times or the time to move the scalar

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jed Brown
Barry Smith writes: > Norm, AXPY, pointwisemult roughly the same. These are where I think we need to start. The bandwidth they are achieving is supposed to be possible with just one chiplet. Mark, can we compare with Spock?

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jed Brown
Barry Smith via petsc-dev writes: > The PetscLogGpuTimeBegin()/End was written by Hong so it works with events > to get a GPU timing, it is not supposed to include the CPU kernel launch times > or the time to move the scalar arguments to the GPU. It may not be perfect > but it is the best we

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith
Junchao, Without GPU aware MPI, is it moving the entire vector to the CPU and doing the scatter and moving everything back or does it just move up exactly what needs to be sent to the other ranks and move back exactly what it received from other ranks? It is moving 4.74e+02 * 1e+6
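
For illustration, a minimal sketch of the second option Barry describes (pack on the GPU, stage only the needed entries through host memory, then post a plain MPI send); this is not a claim about how PetscSF actually implements it, and PackKernel, the buffer arguments, and ScatterSendHostStaged are hypothetical names:

#include <hip/hip_runtime.h>
#include <mpi.h>

/* Gather the entries that must be sent into a contiguous device buffer. */
__global__ void PackKernel(const double *x, const int *idx, int n, double *buf)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = x[idx[i]];
}

/* Stage only the packed entries through the host, then post a normal MPI send;
   the full device vector never crosses the host link. */
void ScatterSendHostStaged(const double *x_d, const int *idx_d, int nsend,
                           double *sendbuf_d, double *sendbuf_h,
                           int peer, MPI_Comm comm, MPI_Request *req)
{
  hipLaunchKernelGGL(PackKernel, dim3((nsend + 255) / 256), dim3(256), 0, 0,
                     x_d, idx_d, nsend, sendbuf_d);   /* gather on the GPU */
  hipMemcpy(sendbuf_h, sendbuf_d, nsend * sizeof(double),
            hipMemcpyDeviceToHost);                   /* move only the packed data */
  MPI_Isend(sendbuf_h, nsend, MPI_DOUBLE, peer, 0, comm, req);
}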

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Mark Adams
Ugh, try again. Still a big difference, but less. Mat-vec does not change much. On Sun, Jan 23, 2022 at 7:12 PM Barry Smith wrote: > > You have debugging turned on on Crusher but not Perlmutter > > On Jan 23, 2022, at 6:37 PM, Mark Adams wrote: > > * Perlmutter is roughly 5x faster than

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith via petsc-dev
> On Jan 23, 2022, at 10:47 PM, Jacob Faibussowitsch > wrote: > >> The outer LogEventBegin/End captures the entire time, including copies, >> kernel launches etc. > > Not if the GPU call is asynchronous. To time the call the stream must also be > synchronized with the host. The only way to

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Jacob Faibussowitsch
> The outer LogEventBegin/End captures the entire time, including copies, > kernel launches etc. Not if the GPU call is asynchronous. To time the call the stream must also be synchronized with the host. The only way to truly time only the kernel calls themselves is to wrap the actual call
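
A minimal sketch of the event-based timing being described, written directly against the HIP runtime; the launch callback is a hypothetical stand-in for whatever kernel is being measured:

#include <hip/hip_runtime.h>

/* Time only the device work submitted between the two events on `stream`;
   the host must synchronize on the stop event before the elapsed time is
   valid, which is the synchronization being discussed here. */
float TimeKernelOnStream(hipStream_t stream, void (*launch)(hipStream_t))
{
  hipEvent_t start, stop;
  float      ms = 0.0f;

  hipEventCreate(&start);
  hipEventCreate(&stop);
  hipEventRecord(start, stream);  /* marker before the asynchronous launch */
  launch(stream);                 /* caller launches the kernel(s) to time */
  hipEventRecord(stop, stream);   /* marker after the launch               */
  hipEventSynchronize(stop);      /* host waits for the stop marker        */
  hipEventElapsedTime(&ms, start, stop);
  hipEventDestroy(start);
  hipEventDestroy(stop);
  return ms;                      /* milliseconds of device time only      */
}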

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith
> On Jan 23, 2022, at 10:01 PM, Junchao Zhang wrote: > > > > On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang > wrote: > > > > On Sat, Jan 22, 2022 at 5:00 PM Barry Smith > wrote: > > The GPU flop rate (when 100 percent flops on

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Junchao Zhang
On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang wrote: > > > > On Sat, Jan 22, 2022 at 5:00 PM Barry Smith wrote: > >> >> The GPU flop rate (when 100 percent flops on the GPU) should always be >> higher than the overall flop rate (the previous column). For large problems >> they should be

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Barry Smith
You have debugging turned on on Crusher but not Perlmutter > On Jan 23, 2022, at 6:37 PM, Mark Adams wrote: > > * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test. > (small) > This is with 8 processes. > > * The next largest version of this test, 16M eq total and 8

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Mark Adams
* Perlmutter is roughly 5x faster than Crusher on the one node 2M eq test. (small) This is with 8 processes. * The next largest version of this test, 16M eq total and 8 processes, fails in memory allocation in the mat-mult setup in the Kokkos Mat. * If I try to run with 64 processes on

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-23 Thread Mark Adams
On Sat, Jan 22, 2022 at 6:22 PM Barry Smith wrote: > >I cleaned up Mark's last run and put it in a fixed-width font. I > realize this may be too difficult but it would be great to have identical > runs to compare with on Summit. > I was planning on running this on Perlmutter today, as well

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
> On Jan 22, 2022, at 10:00 PM, Junchao Zhang wrote: > > > > > On Sat, Jan 22, 2022 at 5:00 PM Barry Smith > wrote: > > The GPU flop rate (when 100 percent flops on the GPU) should always be > higher than the overall flop rate (the previous column). For large

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Junchao Zhang
On Sat, Jan 22, 2022 at 5:00 PM Barry Smith wrote: > > The GPU flop rate (when 100 percent flops on the GPU) should always be > higher than the overall flop rate (the previous column). For large problems > they should be similar, for small problems the GPU one may be much higher. > > If the

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
I am not arguing for a rickety set of scripts, I am arguing that doing more is not so easy and it is only worth doing if the underlying benchmark is worth the effort. > On Jan 22, 2022, at 8:08 PM, Jed Brown wrote: > > Yeah, I'm referring to the operational aspect of data management, not

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Yeah, I'm referring to the operational aspect of data management, not benchmark design (which is hard and even Sam had years working with Mark and me on HPGMG to refine that). If you run libCEED BPs (which use PETSc), you can run one command srun -N ./bps -ceed

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
I submit it is actually a good amount of additional work and requires real creativity and very good judgment; it is not a good intro or undergrad project; especially for someone without a huge amount of hands-on experience already. Look who had to do the new SpecHPC multigrid benchmark. The

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
This isn't so much more or less work, but work in more useful places. Maybe this is a good undergrad or intro project to make a clean workflow for these experiments. Barry Smith writes: > Performance studies are enormously difficult to do well; which is why there > are so few good ones out

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
I cleaned up Mark's last run and put it in a fixed-width font. I realize this may be too difficult but it would be great to have identical runs to compare with on Summit. As Jed noted Scatter takes a long time but the pack and unpack take no time? Is this not timed if using Kokkos?

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
The GPU flop rate (when 100 percent flops on the GPU) should always be higher than the overall flop rate (the previous column). For large problems they should be similar, for small problems the GPU one may be much higher. If the CPU one is higher (when 100 percent flops on the GPU)

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
Performance studies are enormously difficult to do well; which is why there are so few good ones out there. And unless you fall into the LINPACK benchmark or hit upon Streams the rewards of doing an excellent job are pretty thin. Even Streams was not properly maintained for many years, you

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jacob Faibussowitsch
> I suggested years ago that -log_view automatically print useful information > about the GPU setup (when GPUs are used) but everyone seemed comfortable with > the lack of information so no one improved it. FWIW, PetscDeviceView() does a bit of what you want (it just dumps all of

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
We could create a communicator for the MPI ranks in the first shared-memory node, then enumerate their mapping (NUMA and core affinity, and what GPUs they see). Barry Smith writes: > I suggested years ago that -log_view automatically print useful information > about the GPU setup (when
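
A minimal sketch of the node-local communicator suggested here, using only standard MPI; the core-affinity and GPU-visibility queries are left as a comment since their exact calls are an assumption:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm nodecomm;
  int      wrank, nrank, nsize, len;
  char     host[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
  /* ranks that share a memory domain (a node) land in the same communicator */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
  MPI_Comm_rank(nodecomm, &nrank);
  MPI_Comm_size(nodecomm, &nsize);
  MPI_Get_processor_name(host, &len);
  printf("world rank %d -> node-local rank %d of %d on %s\n", wrank, nrank, nsize, host);
  /* core affinity and visible GPUs (e.g. sched_getcpu, hipGetDeviceCount)
     would be queried and printed here as well */
  MPI_Comm_free(&nodecomm);
  MPI_Finalize();
  return 0;
}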

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Barry Smith writes: >> On Jan 22, 2022, at 12:15 PM, Jed Brown wrote: >> Barry, when you did the tech reports, did you make an example to reproduce >> on other architectures? Like, run this one example (it'll run all the >> benchmarks across different sizes) and then run this script on the

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
I suggested years ago that -log_view automatically print useful information about the GPU setup (when GPUs are used) but everyone seemed comfortable with the lack of information so no one improved it. I think for a small number of GPUs -log_view should just print details and for a larger

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Barry Smith
> On Jan 22, 2022, at 12:15 PM, Jed Brown wrote: > > Mark Adams writes: > >> as far as streams, does it know to run on the GPU? You don't specify >> something like -G 1 here for GPUs. I think you just get them all. > > No, this isn't GPU code. BabelStream is a common STREAM suite for

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams writes: > On Sat, Jan 22, 2022 at 12:29 PM Jed Brown wrote: > >> Mark Adams writes: >> >> >> >> >> >> >> >> >> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 >> 0.0e+00 >> >> 0.0e+00 0 0 0 0 0 5 1 0 0 0 22515 70608 0 0.00e+00 >> 0 >> >> 0.00e+00

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
And I have a new MR if you want to see what I've done so far.

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
On Sat, Jan 22, 2022 at 12:29 PM Jed Brown wrote: > Mark Adams writes: > > >> > >> > >> > >> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 > 0.0e+00 > >> 0.0e+00 0 0 0 0 0 5 1 0 0 0 22515 70608 0 0.00e+00 > 0 > >> 0.00e+00 100 > >> > VecScatterBegin

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
So where are we as far as timers? See the latest examples (with 160 CHARACTER) Jed, "(I don't trust these timings)." what do you think? No sense in doing an MR if it is still nonsense. On Sat, Jan 22, 2022 at 12:16 PM Jed Brown wrote: > Mark Adams writes: > > > as far as streams, does it know

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams writes: >> >> >> >> > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 5 1 0 0 0 22515 70608 0 0.00e+00 0 >> 0.00e+00 100 >> > VecScatterBegin 400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 >> 0.0e+00 0 0 62

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
> > > > > VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 5 1 0 0 0 22515 70608 0 0.00e+00 0 > 0.00e+00 100 > > VecScatterBegin 400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 > 0.0e+00 0 0 62 54 0 2 0 100 100 0 0

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams writes: > as far as streams, does it know to run on the GPU? You don't specify > something like -G 1 here for GPUs. I think you just get them all. No, this isn't GPU code. BabelStream is a common STREAM suite for different programming models, though I think it doesn't support MPI

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
I am getting some funny timings and I'm trying to figure it out. I figure the GPU flop rates are a bit higher because the timers are inside of the CPU timers, but *some are a lot bigger or inverted* --- Event Stage 2: KSP Solve only MatMult 400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
On Sat, Jan 22, 2022 at 10:25 AM Jed Brown wrote: > Mark Adams writes: > > > On Fri, Jan 21, 2022 at 9:55 PM Barry Smith wrote: > > > >> > >> Interesting, Is this with all native Kokkos kernels or do some kokkos > >> kernels use rocm? > >> > > > > Ah, good question. I often run with tpl=0 but

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Junchao Zhang
On Sat, Jan 22, 2022 at 10:04 AM Mark Adams wrote: > Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End() right? > No, PetscLogGpuTime() does not know the flops of the caller. > > On Fri, Jan 21, 2022 at 9:47 PM Barry Smith wrote: > >> >> Mark, >> >> Fix the logging before

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End() right? On Fri, Jan 21, 2022 at 9:47 PM Barry Smith wrote: > > Mark, > > Fix the logging before you run more. It will help with seeing the true > disparity between the MatMult and the vector ops. > > > On Jan 21, 2022, at

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Jed Brown
Mark Adams writes: > On Fri, Jan 21, 2022 at 9:55 PM Barry Smith wrote: > >> >> Interesting, Is this with all native Kokkos kernels or do some kokkos >> kernels use rocm? >> > > Ah, good question. I often run with tpl=0 but I did not specify here on > Crusher. In looking at the log files I see

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
I should be able to add this profiling now. On Fri, Jan 21, 2022 at 10:48 PM Junchao Zhang wrote: > > > > On Fri, Jan 21, 2022 at 8:08 PM Barry Smith wrote: > >> >> Junchao, Mark, >> >> Some of the logging information is non-sensible, MatMult says all >> flops are done on the GPU (last

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-22 Thread Mark Adams
On Fri, Jan 21, 2022 at 9:55 PM Barry Smith wrote: > > Interesting, Is this with all native Kokkos kernels or do some kokkos > kernels use rocm? > Ah, good question. I often run with tpl=0 but I did not specify here on Crusher. In looking at the log files I see

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Junchao Zhang
On Fri, Jan 21, 2022 at 8:08 PM Barry Smith wrote: > > Junchao, Mark, > > Some of the logging information is non-sensible, MatMult says all > flops are done on the GPU (last column) but the GPU flop rate is zero. > > It looks like MatMult_SeqAIJKokkos() is missing >

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith
Interesting. Is this with all native Kokkos kernels or do some Kokkos kernels use ROCm? I ask because VecNorm is 4 times higher than VecDot (I would not expect that) and VecAXPY is less than 1/4 the performance of VecAYPX (I would not expect that either). MatMult 400 1.0 1.0288e+00 1.0

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith
Mark, Fix the logging before you run more. It will help with seeing the true disparity between the MatMult and the vector ops. > On Jan 21, 2022, at 9:37 PM, Mark Adams wrote: > > Here is one with 2M / GPU. Getting better. > > On Fri, Jan 21, 2022 at 9:17 PM Barry Smith

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Mark Adams
> > >But in particular look at the VecTDot and VecNorm CPU flop > rates compared to the GPU, much lower, this tells me the MPI_Allreduce is > likely hurting performance in there also a great deal. It would be good to > see a single MPI rank job to compare to see performance without the

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Mark Adams
Here is one with 2M / GPU. Getting better. On Fri, Jan 21, 2022 at 9:17 PM Barry Smith wrote: > >Matt is correct, vectors are way too small. > >BTW: Now would be a good time to run some of the Report I benchmarks on > Crusher to get a feel for the kernel launch times and performance on

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith
Matt is correct, vectors are way too small. BTW: Now would be a good time to run some of the Report I benchmarks on Crusher to get a feel for the kernel launch times and performance on VecOps. Also Report 2. Barry > On Jan 21, 2022, at 7:58 PM, Matthew Knepley wrote: > > On

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Barry Smith
Junchao, Mark, Some of the logging information is non-sensible, MatMult says all flops are done on the GPU (last column) but the GPU flop rate is zero. It looks like MatMult_SeqAIJKokkos() is missing PetscLogGpuTimeBegin()/End() in fact all the operations in aijkok.kokkos.cxx
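
For reference, a minimal sketch of the logging pattern said to be missing; the wrapper is hypothetical, and the GPU flops are logged separately from the GPU timer since the timer does not know the caller's flop count (cf. Junchao's reply elsewhere in the thread):

#include <petscsys.h>
#include <petsclog.h>

/* Sketch of the instrumentation pattern: the GPU timer brackets only the
   device work, and the GPU flops are logged on their own.  The wrapper name,
   the commented-out kernel call, and the 2.0*nnz flop estimate are hypothetical. */
PetscErrorCode MatMultDeviceLoggedSketch(PetscLogEvent event, PetscLogDouble nnz)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogEventBegin(event, 0, 0, 0, 0);CHKERRQ(ierr);
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);      /* start the event-based GPU timer */
  /* ... launch the device SpMV here ... */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);        /* stop the GPU timer              */
  ierr = PetscLogGpuFlops(2.0 * nnz);CHKERRQ(ierr); /* flops logged separately         */
  ierr = PetscLogEventEnd(event, 0, 0, 0, 0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}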

Re: [petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Matthew Knepley
On Fri, Jan 21, 2022 at 6:41 PM Mark Adams wrote: > I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian > (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it > MI200?). > This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI > are similar

[petsc-dev] Kokkos/Crusher perforance

2022-01-21 Thread Mark Adams
I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?). This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are similar (mat-vec is a little faster w/o, the total is about the same,
