Ok thanks. And looking at the actual tests that I am doing more carefully, I am not seeing much difference in the numbers that I care about with or without the GPU timers.
And FWIW, from a user-experience point of view, having -log_view present you with a sea of NaNs by default is a bit disheartening for a first-time user. And then they have to send an email and we have to respond ...

On Mon, May 16, 2022 at 9:36 PM Barry Smith <bsm...@petsc.dev> wrote:

> Mark,
>
> Yes, Jed has already asked for ways to select certain operations at runtime that would get timed via the GPU timer. I am not sure of the ideal way to handle it. One possibility is PetscLogGpuTimeEvent(event number), which indicates the events you want the time logged for. Another is PetscLogGpuTimeObjectID(object ID), which indicates that all events for a particular object class would be logged. In either or both cases the PetscLogGPUTime() calls would need to become smarter, determining when they are turned on and off depending on the current state of the logging. Since the events are nested and there can be multiple objects related to the timings, I don't have a clear plan in my mind.
>
> Barry
>
> On May 16, 2022, at 7:31 PM, Mark Adams <mfad...@lbl.gov> wrote:
>
> I am not sure I understand the logic; we print the ratio of max/min. I report max and look at the ratio to see if I might be catching some load imbalance or whatever. Is there a problem with that workflow?
>
> I assume there is, or you would not have done this, so can I add a method that I think also has legitimate values?
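The per-event opt-in idea above could behave roughly like the toy sketch below. Note that PetscLogGpuTimeEvent()/PetscLogGpuTimeObjectID() are only proposals from this discussion, not existing PETSc APIs, and this is plain Python, not PETSc logging code:

```python
import math

# Toy sketch only: the opt-in set and function below are hypothetical,
# modeling the proposed selective GPU timing, not real PETSc behavior.
gpu_timed_events = {"KSPSolve", "SNESSolve"}  # events the user opted into

def log_event_time(name, measured_seconds):
    """Report a time only for opted-in events; everything else gets NaN,
    mirroring what -log_view prints for events lacking GPU timing."""
    return measured_seconds if name in gpu_timed_events else math.nan

print(log_event_time("KSPSolve", 4.0514e-04))       # the measured time
print(math.isnan(log_event_time("VecTDot", 1e-5)))  # True
```

The open question from the thread remains: with nested events, an inner event's opt-in status would have to be reconciled with that of every enclosing event.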
> I did observe a 15% hit on my whole stage with the GPU timers, but I would still like to take my chances with:
>
> Landau Operator 29 1.0 1.1200e-02 1.0 4.54e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3 0 0 0 0  40 4 0 0 0  405 509 0 0.00e+00 0 0.00e+00 100
> Landau Jacobian 15 1.0 7.8189e-03 1.0 4.54e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2 0 0 0 0  28 4 0 0 0  580 731 0 0.00e+00 0 0.00e+00 100
> Landau Mass     14 1.0 3.3759e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1 0 0 0 0  12 0 0 0 0    0   0 0 0.00e+00 0 0.00e+00   0
>
> These methods require barriers at the end, before they call MatSetValuesCOO, which is < 1% of these total matrix times, so I trust the data without the GPU timers set.
>
> (The rest of the stage is the rest of the time step, which is mostly the linear solver, and I guess that is where the 15% comes from, but I have not dug into it.)
>
> Thanks,
> Mark
>
>> On Wed, Apr 27, 2022 at 11:06 AM Barry Smith <bsm...@petsc.dev> wrote:
>>
>> Only KSPSolve, SNESSolve, and TSSolve will have legitimate values. "High-level" functions like PtAP can be asynchronous (meaning the GPU returns control to the CPU, which goes on to do more work) before they are complete; hence their timings would be incorrect, and so they must be labeled with a special marker rather than given numbers for time.
>>
>> Jed reports a possible 10% smaller time for KSPSolve() in this mode in some of his runs, compared to timing the inner operations correctly, and thus feels it is the best default.
>>
>> Barry
>>
>> On Apr 27, 2022, at 10:08 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> On Tue, Apr 26, 2022 at 8:00 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>> The current nan output has to be replaced to get the column alignment correct; I just didn't feel like making that change in the same MR as well.
>>>
>>> Something like "Unknown", or anything else that fits in the column space, would be fine. It just means that for the given run the timing numbers are not meaningful/correct for those events.
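The inner-vs-outer tradeoff Barry describes (and Jed's ~10% number) can be illustrated with a toy calculation. The numbers below are made up for illustration only; they assume a fixed synchronization cost per timed inner event:

```python
# Toy illustration (made-up numbers): synchronizing the GPU to time each
# inner event accurately adds a fixed cost per event, inflating the wall
# time of the outer event (e.g. KSPSolve) that contains them.

def outer_time(inner_times, sync_cost=0.0):
    """Wall time of an outer event wrapping the given inner events,
    with an optional per-event synchronization cost."""
    return sum(t + sync_cost for t in inner_times)

inner = [1.0e-3] * 100                           # 100 inner kernels, 1 ms each
no_sync = outer_time(inner)                      # outer time, GPU timers off
with_sync = outer_time(inner, sync_cost=1.0e-4)  # each sync adds 0.1 ms

# The outer event appears ~10% slower once inner events are timed.
print(round(with_sync / no_sync - 1.0, 3))  # 0.1
```

This is why one run cannot give the most accurate times for both levels at once: the measurement of the inner events is itself part of the outer event's wall time.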
>> Just a note: just about every event is NaN for me. My GAMG setup, which is all CPU, is NaN. High-level functions like PtAP as well.
>> That said, adding -log_view_gpu_time is fine. Not worth the churn.
>>
>>> This is to obtain the best meaningful results for the outer events, per Jed, since timing the inner events accurately introduces extra time into the outer events. That is, it is not possible to have the most accurate times for both inner events and outer events in the same run. So if you want to compare KSPSolve timings, for example, you run as-is; if you want to examine low-level vector operations, also run with -log_view_gpu_time, but know that the KSP times are then higher than they need to be.
>>>
>>> Sorry for the confusion.
>>>
>>> On Apr 26, 2022, at 3:49 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> Well, NaNs are a clear sign that something is very wrong.
>>>
>>> Barry chose them so that they could not be mistaken for an actual number.
>>>
>>>    Matt
>>>
>>>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>
>>>>> There is an automatic warning that shows when you do run with `-log_view_gpu_time`, but perhaps there should also be an automatic warning when *not* running with it. It is unfortunate that NaN is the value printed, as this implies a bug, but AFAIK it is unavoidable (Barry can say more on this, though).
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Jacob Faibussowitsch
>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>
>>>>> > On Apr 26, 2022, at 09:48, Jose E. Roman <jro...@dsic.upv.es> wrote:
>>>>> >
>>>>> > You have to add -log_view_gpu_time
>>>>> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
>>>>> >
>>>>> > Jose
>>>>> >
>>>>> >> On Apr 26, 2022, at 16:39, Mark Adams <mfad...@lbl.gov> wrote:
>>>>> >>
>>>>> >> I'm seeing this on Perlmutter with Kokkos-CUDA: NaNs in most log timing data except the two 'Solve' lines. Just cg/jacobi on snes/ex56.
>>>>> >>
>>>>> >> Any ideas?
>>>>> >>
>>>>> >> VecTDot          2 1.0 nan nan 1.20e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   0   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00 100
>>>>> >> VecNorm          2 1.0 nan nan 1.00e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   0   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00 100
>>>>> >> VecCopy          2 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   0   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00   0
>>>>> >> VecSet           5 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   0   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00   0
>>>>> >> VecAXPY          4 1.0 nan nan 2.40e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   1   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00 100
>>>>> >> VecPointwiseMult 1 1.0 nan nan 3.00e+00 1.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   0   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00 100
>>>>> >> KSPSetUp         1 1.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0 0 0 0   0   0 0 0 0 -nan -nan 0 0.00e+00 0 0.00e+00   0
>>>>> >> KSPSolve         1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 0.0e+00  1  0 0 0 0   2   0 0 0 0    0 -nan 0 0.00e+00 0 0.00e+00 100
>>>>> >> SNESSolve        1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 0.0e+00 72 56 0 0 0 100 100 0 0 0   25 -nan 0 0.00e+00 0 0.00e+00   0
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
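As a practical footnote to this thread: the untimed events in the -log_view output quoted above can be picked out mechanically, since their time columns literally read nan. A small sketch in plain Python (not part of PETSc; the column positions are an assumption based on the sample output in this thread: name, count, ratio, max time, ...):

```python
# Sketch: scan -log_view event lines and report events whose max-time
# column is "nan", i.e. events that were not timed because
# -log_view_gpu_time was not given. The fourth whitespace-separated
# field being the max time is assumed from the sample in this thread.

def untimed_events(log_text):
    """Return the names of events whose max-time column reads 'nan'."""
    names = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) > 3 and parts[3].lower() == "nan":
            names.append(parts[0])
    return names

sample = """\
VecTDot  2 1.0 nan nan 1.20e+01 1.0
KSPSolve 1 1.0 4.0514e-04 1.0 5.50e+01 1.0"""

print(untimed_events(sample))  # ['VecTDot']
```

Something like this could also drive the warning Jacob suggests, emitted when a run without -log_view_gpu_time leaves most events untimed.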