Re: [petsc-dev] odd log behavior

2022-05-17 Thread Zhang, Hong via petsc-dev
Python users, myself included, would love NaN, since NaN is the default 
missing-value marker for reasons of computational speed and convenience. For 
example, if you load these values into pandas, no extra code is needed to 
handle them; other choices such as N/A would require extra text-replacement work.

The -nan looks a bit weird. There should be a way to get rid of the - sign.
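On the '-' sign: a small standalone C illustration (not PETSc source) of where
"-nan" comes from and one way a viewer could drop it when printing:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
    double t = NAN;               /* a quiet NaN; the sign bit is unspecified */
    double s = copysign(t, -1.0); /* force the sign bit set: glibc printf shows "-nan" */

    printf("%g %g\n", t, s);      /* typically prints: nan -nan */

    /* Printing a fixed token for any NaN removes the stray '-' (and the bug-like look). */
    if (isnan(s)) printf("%10s\n", "N/A");
    else          printf("%10.4e\n", s);
    return 0;
  }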

Hong (Mr.)

> On Apr 26, 2022, at 10:52 AM, Jacob Faibussowitsch  
> wrote:
> 
> There is an automatic warning that shows when you do run with 
> `-log_view_gpu_time`, but perhaps there should also be an automatic warning 
> when *not* running with it. It is unfortunate that NaN is the value printed 
> as this implies a bug but AFAIK it is unavoidable (Barry can say more on this 
> though).
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
>> On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
>> 
>> You have to add -log_view_gpu_time
>> See https://gitlab.com/petsc/petsc/-/merge_requests/5056
>> 
>> Jose
>> 
>> 
>>> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
>>> 
>>> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing 
>>> data except the two 'Solve' lines.
>>> Just cg/jacobi on snes/ex56.
>>> 
>>> Any ideas?
>>> 
>>> VecTDot                2 1.0      nan     nan 1.20e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
>>> VecNorm                2 1.0      nan     nan 1.00e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
>>> VecCopy                2 1.0      nan     nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00   0
>>> VecSet                 5 1.0      nan     nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00   0
>>> VecAXPY                4 1.0      nan     nan 2.40e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
>>> VecPointwiseMult       1 1.0      nan     nan 3.00e+00 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
>>> KSPSetUp               1 1.0      nan     nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00   0
>>> KSPSolve               1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0    -nan      0 0.00e+00    0 0.00e+00 100
>>> SNESSolve              1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  0    25    -nan      0 0.00e+00    0 0.00e+00   0
>> 
> 



Re: [petsc-dev] odd log behavior

2022-05-17 Thread Mark Adams
Oh, this is not hard:

  /* /\* Put NaN into the time for all events that may not be timed
     accurately since they may happen asynchronously on the GPU *\/ */
  /* #if defined(PETSC_HAVE_DEVICE) */
  /* if (!PetscLogGpuTimeFlag && petsc_gflops > 0) { */
  /*   memcpy(,,sizeof(PetscLogDouble)); */
  /*   PetscCall(PetscEventRegLogGetEvent(stageLog->eventLog, name, )); */
  /*   if (eventid != SNES_Solve && eventid != KSP_Solve && eventid != TS_Step && eventid != TAO_Solve) { */
  /*     memcpy(,,sizeof(PetscLogDouble)); */
  /*     memcpy(,,sizeof(PetscLogDouble)); */
  /*   } */
  /* } */
  /* #endif */
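[Archive note: the arguments of the memcpy calls above were stripped by the
mailing-list sanitizer, so the snippet is not compilable as shown. Below is a
self-contained sketch of the same idea with hypothetical types and names (it
is not the PETSc source): when GPU timing is off, poison the stored time of
every event except a whitelist of outer solves with NaN.]

  #include <math.h>
  #include <string.h>

  typedef double LogDouble; /* stand-in for PetscLogDouble */

  typedef struct {
    const char *name; /* event name as it appears in -log_view */
    LogDouble   time; /* accumulated time for the event */
  } EventInfo;

  /* Overwrite the time of every event that is not one of the whitelisted outer
     solves with NaN, so asynchronous (untrustworthy) host timings are never
     reported as if they were measurements. Hypothetical helper, not PETSc code. */
  void PoisonAsyncEventTimes(EventInfo *events, int nevents, int gpu_timing_enabled)
  {
    static const char *keep[] = {"SNESSolve", "KSPSolve", "TSStep", "TaoSolve"};
    const LogDouble    poison = NAN;

    if (gpu_timing_enabled) return; /* GPU-synchronized timings are fine; keep them */
    for (int i = 0; i < nevents; i++) {
      int whitelisted = 0;
      for (size_t k = 0; k < sizeof(keep) / sizeof(keep[0]); k++) {
        if (!strcmp(events[i].name, keep[k])) whitelisted = 1;
      }
      if (!whitelisted) memcpy(&events[i].time, &poison, sizeof(LogDouble));
    }
  }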


On Tue, May 17, 2022 at 7:17 AM Mark Adams  wrote:

> Can I get some advice on how to add/hack an event now into the reported
> events?
>
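[Editorial aside: for readers following along, the standard way a user event
such as "Landau Operator" gets its own row in the -log_view report is the
register/begin/end pattern below; a minimal sketch using the documented
PetscLogEventRegister API, with illustrative class and event names.]

  #include <petscsys.h>

  int main(int argc, char **argv)
  {
    PetscClassId  classid;
    PetscLogEvent my_event;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

    /* Register a class and an event once; the event then gets its own row in -log_view. */
    PetscCall(PetscClassIdRegister("MyClass", &classid));
    PetscCall(PetscLogEventRegister("MyEvent", classid, &my_event));

    PetscCall(PetscLogEventBegin(my_event, 0, 0, 0, 0));
    /* ... work to be timed ... */
    PetscCall(PetscLogEventEnd(my_event, 0, 0, 0, 0));

    PetscCall(PetscFinalize());
    return 0;
  }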
> I am noticing that all of the difference in my clean (warmed up) stage
> comes for KSPSolve:
>
> KSPSolve             144 1.0 *2.1745e+00* 1.0 7.50e+10 1.0 0.0e+00 0.0e+00 0.0e+00 42 89  0  0  0  68 98  0  0  0 34511   52718      0 0.00e+00    0 0.00e+00 100
> KSPSolve             144 1.0 *1.2275e+00* 1.0 7.50e+10 1.0 0.0e+00 0.0e+00 0.0e+00 29 89  0  0  0  56 98  0  0  0 61138   *-nan*      0 0.00e+00    0 0.00e+00 100
>
> Maybe it is easier for me to just run it twice and get my Landau numbers
> from the full report and get KSPSolve numbers from the nan report.
> (Not great for a reproducibility doc)
>
> Thanks,
> Mark
>
>
>
> On Mon, May 16, 2022 at 10:38 PM Mark Adams  wrote:
>
>> Ok thanks,
>>
>> And looking at the actual tests that I am doing more carefully I am not
>> seeing much difference in the numbers that I care about with or without the
>> GPU timers.
>>
>> And FWIW, from a user experience point of view, having -log_view present
>> you with a sea of nans by default is a bit disheartening for a first time
>> user. And they have to send an email and we have to respond 
>>
>>
>> On Mon, May 16, 2022 at 9:36 PM Barry Smith  wrote:
>>
>>>
>>>   Mark,
>>>
>>> Yes, Jed has already asked for ways to select certain operations at
>>> runtime that would get timed via the GPU timer. I am not sure of the ideal
>>> way to handle it. One possibility is PetscLogGpuTimeEvent(event number)
>>> which indicates events you want the time logged for. Another is
>>> PetscLogGpuTimeObjectID(object ID) which indicates all events for a
>>> particular object class would be logged. In either or both cases the
>>> PetscLogGPUTime() calls would need to become smarter to determine when they
>>> are turned on and off depending on the current state of the logging. Since
>>> the events are nested and there can be multiple objects related to the
>>> timings I don't have a clear plan in my mind.
>>>
>>>Barry
>>>
>>>
>>> On May 16, 2022, at 7:31 PM, Mark Adams  wrote:
>>>
>>> I am not sure I understand the logic, we print the ratio of max/min.
>>> I report max and look at the ratio to see if I might be catching some
>>> load imbalance or whatever. Is there a problem with that workflow?
>>>
>>> I assume there is or you would not have done this, so can I add a method
>>> that I think also has legitimate values?
>>>
>>> I did observe a 15% hit on my whole stage with the GPU timers but I
>>> would still like to take my chances with:
>>>
>>> Landau Operator   29 1.0 1.1200e-02 1.0 4.54e+06 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  3  0  0  0  0  40  4  0  0  0   405 509  0 0.00e+000
>>> 0.00e+00 100
>>> Landau Jacobian   15 1.0 7.8189e-03 1.0 4.54e+06 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  2  0  0  0  0  28  4  0  0  0   580 731  0 0.00e+000
>>> 0.00e+00 100
>>> Landau Mass   14 1.0 3.3759e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  1  0  0  0  0  12  0  0  0  0 0   0  0 0.00e+000
>>> 0.00e+00  0
>>>
>>> These methods require barriers at the end before they
>>> call MatSetValuesCOO, which is like < 1% of these total matrix times, so I
>>> trust the data w/o GPU times set.
>>>
>>> (The rest of the stage is the rest of the time step, which is mostly the
>>> linear solver, and I guess that is where the 15% comes from but I have not
>>> dug into it)
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>> On Wed, Apr 27, 2022 at 11:06 AM Barry Smith  wrote:
>>>

   Only KSPSolve, SNESSolve, TSSolve will have legitimate values.
 "High-level" functions like PtAP can be asynchronous (meaning the GPU
 returns to the CPU to do more stuff) before they are complete, hence their
 timings would be incorrect and so they must be labeled with a special
 marker and not have numbers for time.

   Jed reports a possible 10% smaller time for KSPSolve() in this mode
 in some of his runs, compared to timing correctly the inner operations, and
 thus feels it is the best default.

   Barry


 On Apr 27, 2022, at 10:08 AM, Mark Adams  wrote:



 On Tue, Apr 26, 2022 at 8:00 PM Barry Smith  wrote:

>
>   The current nan output has to be replaced to 

Re: [petsc-dev] odd log behavior

2022-05-16 Thread Mark Adams
Ok thanks,

And looking more carefully at the actual tests that I am doing, I am not
seeing much difference in the numbers that I care about with or without the
GPU timers.

And FWIW, from a user-experience point of view, having -log_view present
you with a sea of NaNs by default is a bit disheartening for a first-time
user. And they have to send an email and we have to respond ...


On Mon, May 16, 2022 at 9:36 PM Barry Smith  wrote:

>
>   Mark,
>
> Yes, Jed has already asked for ways to select certain operations at
> runtime that would get timed via the GPU timer. I am not sure of the ideal
> way to handle it. One possibility is PetscLogGpuTimeEvent(event number)
> which indicates events you want the time logged for. Another is
> PetscLogGpuTimeObjectID(object ID) which indicates all events for a
> particular object class would be logged. In either or both cases the
> PetscLogGPUTime() calls would need to become smarter to determine when they
> are turned on and off depending on the current state of the logging. Since
> the events are nested and there can be multiple objects related to the
> timings I don't have a clear plan in my mind.
>
>Barry
>
>
> On May 16, 2022, at 7:31 PM, Mark Adams  wrote:
>
> I am not sure I understand the logic, we print the ratio of max/min.
> I report max and look at the ratio to see if I might be catching some load
> imbalance or whatever. Is there a problem with that workflow?
>
> I assume there is or you would not have done this, so can I add a method
> that I think also has legitimate values?
>
> I did observe a 15% hit on my whole stage with the GPU timers but I would
> still like to take my chances with:
>
> Landau Operator   29 1.0 1.1200e-02 1.0 4.54e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00  3  0  0  0  0  40  4  0  0  0   405 509  0 0.00e+000
> 0.00e+00 100
> Landau Jacobian   15 1.0 7.8189e-03 1.0 4.54e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0  28  4  0  0  0   580 731  0 0.00e+000
> 0.00e+00 100
> Landau Mass   14 1.0 3.3759e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0  12  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>
> These methods require barriers at the end before they
> call MatSetValuesCOO, which is like < 1% of these total matrix times, so I
> trust the data w/o GPU times set.
>
> (The rest of the stage is the rest of the time step, which is mostly the
> linear solver, and I guess that is where the 15% comes from but I have not
> dug into it)
>
> Thanks,
> Mark
>
>
> On Wed, Apr 27, 2022 at 11:06 AM Barry Smith  wrote:
>
>>
>>   Only KSPSolve, SNESSolve, TSSolve will have legitimate values.
>> "High-level" functions like PtAP can be asynchronous (meaning the GPU
>> returns to the CPU to do more stuff) before they are complete, hence their
>> timings would be incorrect and so they must be labeled with a special
>> marker and not have numbers for time.
>>
>>   Jed reports a possible 10% smaller time for KSPSolve() in this mode in
>> some of his runs, compared to timing correctly the inner operations, and
>> thus feels it is the best default.
>>
>>   Barry
>>
>>
>> On Apr 27, 2022, at 10:08 AM, Mark Adams  wrote:
>>
>>
>>
>> On Tue, Apr 26, 2022 at 8:00 PM Barry Smith  wrote:
>>
>>>
>>>   The current nan output has to be replaced to get the column alignment
>>> correct, I just didn't feel like making that change also in the same MR.
>>>
>>>   Something like Unknown or anything that fits in the column space would
>>> be fine. It just means for the given run the timing numbers are not
>>> meaningful/correct for those events.
>>>
>>
>> Just a note, just about every event is NAN for me. My GAMG setup that is
>> all CPU is NAN. High level functions like PtAP as well.
>> That said, adding -log_view_gpu_time is fine. Not worth the churn.
>>
>>
>>> This is to obtain the best meaningful results for the outer events per
>>> Jed since timing the inner events accurately introduces extra time in the
>>> outer events. That is it is not possible to have the best accurate times
>>> for both inner events and outer events in the same run. So if you want to
>>> compare KSPSolve timings, for example, you run as-is, it you want to
>>> examine, low-level vector operations run also with -log_view_gpu_time but
>>> know that the KSP times are higher than need be.
>>>
>>>   Sorry for the confusion.
>>>
>>>
>>>
>>>
>>> On Apr 26, 2022, at 3:49 PM, Matthew Knepley  wrote:
>>>
>>> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  wrote:
>>>
 Well, Nans are a clear sign that something is very wrong.

>>>
>>> Barry chose them so that it could not be mistaken for an actual number.
>>>
>>>Matt
>>>
>>>
 On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch <
 jacob@gmail.com> wrote:

> There is an automatic warning that shows when you do run with
> `-log_view_gpu_time`, but perhaps there should also be an automatic 
> warning
> when *not* running with it. It is unfortunate 

Re: [petsc-dev] odd log behavior

2022-05-16 Thread Barry Smith

  Mark,

Yes, Jed has already asked for ways to select, at runtime, certain operations 
that would get timed via the GPU timer. I am not sure of the ideal way to 
handle it. One possibility is PetscLogGpuTimeEvent(event number), which 
indicates events you want the time logged for. Another is 
PetscLogGpuTimeObjectID(object ID), which indicates that all events for a 
particular object class would be logged. In either or both cases the 
PetscLogGPUTime() calls would need to become smarter, to determine when they 
are turned on and off depending on the current state of the logging. Since the 
events are nested and there can be multiple objects related to the timings, I 
don't have a clear plan in my mind.
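[Editorial sketch: a rough idea of what the event-based option could look
like. The function names below are the ones proposed above; everything else
is a hypothetical illustration, not existing PETSc code.]

  #include <petscsys.h>

  /* Hypothetical illustration only. */
  #define MAX_GPU_TIMED_EVENTS 32
  static PetscLogEvent gpu_timed_events[MAX_GPU_TIMED_EVENTS];
  static PetscInt      n_gpu_timed_events = 0;

  /* User opts a specific event into GPU-synchronized timing. */
  PetscErrorCode PetscLogGpuTimeEvent(PetscLogEvent event)
  {
    PetscFunctionBegin;
    PetscCheck(n_gpu_timed_events < MAX_GPU_TIMED_EVENTS, PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE, "Too many GPU-timed events");
    gpu_timed_events[n_gpu_timed_events++] = event;
    PetscFunctionReturn(0);
  }

  /* Query the logging code would use to decide whether to pay for the GPU
     synchronization/timing when this event begins and ends. */
  PetscBool PetscLogGpuTimeWanted(PetscLogEvent event)
  {
    for (PetscInt i = 0; i < n_gpu_timed_events; i++) {
      if (gpu_timed_events[i] == event) return PETSC_TRUE;
    }
    return PETSC_FALSE;
  }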

   Barry


> On May 16, 2022, at 7:31 PM, Mark Adams  wrote:
> 
> I am not sure I understand the logic, we print the ratio of max/min. 
> I report max and look at the ratio to see if I might be catching some load 
> imbalance or whatever. Is there a problem with that workflow?
> 
> I assume there is or you would not have done this, so can I add a method that 
> I think also has legitimate values?
> 
> I did observe a 15% hit on my whole stage with the GPU timers but I would 
> still like to take my chances with:
> 
> Landau Operator   29 1.0 1.1200e-02 1.0 4.54e+06 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  0  0  0  0  40  4  0  0  0   405 509  0 0.00e+000 
> 0.00e+00 100
> Landau Jacobian   15 1.0 7.8189e-03 1.0 4.54e+06 1.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  28  4  0  0  0   580 731  0 0.00e+000 
> 0.00e+00 100
> Landau Mass   14 1.0 3.3759e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0  12  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> 
> These methods require barriers at the end before they call MatSetValuesCOO, 
> which is like < 1% of these total matrix times, so I trust the data w/o GPU 
> times set.
> 
> (The rest of the stage is the rest of the time step, which is mostly the 
> linear solver, and I guess that is where the 15% comes from but I have not 
> dug into it)
> 
> Thanks,
> Mark
> 
> 
> On Wed, Apr 27, 2022 at 11:06 AM Barry Smith  > wrote:
> 
>   Only KSPSolve, SNESSolve, TSSolve will have legitimate values. "High-level" 
> functions like PtAP can be asynchronous (meaning the GPU returns to the CPU 
> to do more stuff) before they are complete, hence their timings would be 
> incorrect and so they must be labeled with a special marker and not have 
> numbers for time.
> 
>   Jed reports a possible 10% smaller time for KSPSolve() in this mode in some 
> of his runs, compared to timing correctly the inner operations, and thus 
> feels it is the best default.
> 
>   Barry
> 
> 
>> On Apr 27, 2022, at 10:08 AM, Mark Adams > > wrote:
>> 
>> 
>> 
>> On Tue, Apr 26, 2022 at 8:00 PM Barry Smith > > wrote:
>> 
>>   The current nan output has to be replaced to get the column alignment 
>> correct, I just didn't feel like making that change also in the same MR. 
>> 
>>   Something like Unknown or anything that fits in the column space would be 
>> fine. It just means for the given run the timing numbers are not 
>> meaningful/correct for those events.
>> 
>> Just a note, just about every event is NAN for me. My GAMG setup that is all 
>> CPU is NAN. High level functions like PtAP as well. 
>> That said, adding -log_view_gpu_time is fine. Not worth the churn.
>>  
>> This is to obtain the best meaningful results for the outer events per Jed 
>> since timing the inner events accurately introduces extra time in the outer 
>> events. That is it is not possible to have the best accurate times for both 
>> inner events and outer events in the same run. So if you want to compare 
>> KSPSolve timings, for example, you run as-is, it you want to examine, 
>> low-level vector operations run also with -log_view_gpu_time but know that 
>> the KSP times are higher than need be.
>> 
>>   Sorry for the confusion.
>> 
>> 
>> 
>> 
>>> On Apr 26, 2022, at 3:49 PM, Matthew Knepley >> > wrote:
>>> 
>>> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams >> > wrote:
>>> Well, Nans are a clear sign that something is very wrong.
>>> 
>>> Barry chose them so that it could not be mistaken for an actual number.
>>> 
>>>Matt
>>>  
>>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch >> > wrote:
>>> There is an automatic warning that shows when you do run with 
>>> `-log_view_gpu_time`, but perhaps there should also be an automatic warning 
>>> when *not* running with it. It is unfortunate that NaN is the value printed 
>>> as this implies a bug but AFAIK it is unavoidable (Barry can say more on 
>>> this though).
>>> 
>>> Best regards,
>>> 
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>> 
>>> > On Apr 26, 2022, at 09:48, Jose E. Roman >> > > wrote:
>>> > 
>>> > You 

Re: [petsc-dev] odd log behavior

2022-05-16 Thread Mark Adams
I am not sure I understand the logic; we print the ratio of max/min.
I report the max and look at the ratio to see if I might be catching some load
imbalance or whatever. Is there a problem with that workflow?

I assume there is or you would not have done this, so can I add a method
that I think also has legitimate values?

I did observe a 15% hit on my whole stage with the GPU timers but I would
still like to take my chances with:

Landau Operator       29 1.0 1.1200e-02 1.0 4.54e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  40  4  0  0  0   405     509      0 0.00e+00    0 0.00e+00 100
Landau Jacobian       15 1.0 7.8189e-03 1.0 4.54e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  28  4  0  0  0   580     731      0 0.00e+00    0 0.00e+00 100
Landau Mass           14 1.0 3.3759e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  12  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0

These methods require barriers at the end before they call MatSetValuesCOO,
which is like < 1% of these total matrix times, so I trust the data w/o GPU
times set.
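[Editorial sketch of the assembly pattern being described, with hypothetical
kernel/fence helpers standing in for the real Kokkos/CUDA calls; only
MatSetValuesCOO and the logging calls are PETSc API, and it assumes
MatSetPreallocationCOO was called earlier. The point is that the device fence
sits before MatSetValuesCOO, so the host timer around the whole event covers
the GPU work even without -log_view_gpu_time.]

  #include <petscmat.h>

  /* Stubs standing in for the real Kokkos/CUDA kernel launch and device fence. */
  static void compute_coo_values_on_gpu(PetscScalar *coo_v, PetscInt ncoo) { (void)coo_v; (void)ncoo; }
  static void device_barrier(void) { /* e.g. Kokkos::fence() / cudaDeviceSynchronize() */ }

  PetscErrorCode AssembleJacobianCOO(Mat J, PetscScalar *coo_v, PetscInt ncoo, PetscLogEvent jac_event)
  {
    PetscFunctionBeginUser;
    PetscCall(PetscLogEventBegin(jac_event, 0, 0, 0, 0));
    compute_coo_values_on_gpu(coo_v, ncoo); /* asynchronous GPU work */
    device_barrier();                       /* the barrier: all GPU work for this event is done here */
    PetscCall(MatSetValuesCOO(J, coo_v, INSERT_VALUES));
    PetscCall(PetscLogEventEnd(jac_event, 0, 0, 0, 0));
    /* The wall time between Begin and End includes the GPU work because of the
       barrier, so it is meaningful even when per-event GPU timing is disabled. */
    PetscFunctionReturn(0);
  }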

(The rest of the stage is the rest of the time step, which is mostly the
linear solver, and I guess that is where the 15% comes from but I have not
dug into it)

Thanks,
Mark


On Wed, Apr 27, 2022 at 11:06 AM Barry Smith  wrote:

>
>   Only KSPSolve, SNESSolve, TSSolve will have legitimate values.
> "High-level" functions like PtAP can be asynchronous (meaning the GPU
> returns to the CPU to do more stuff) before they are complete, hence their
> timings would be incorrect and so they must be labeled with a special
> marker and not have numbers for time.
>
>   Jed reports a possible 10% smaller time for KSPSolve() in this mode in
> some of his runs, compared to timing correctly the inner operations, and
> thus feels it is the best default.
>
>   Barry
>
>
> On Apr 27, 2022, at 10:08 AM, Mark Adams  wrote:
>
>
>
> On Tue, Apr 26, 2022 at 8:00 PM Barry Smith  wrote:
>
>>
>>   The current nan output has to be replaced to get the column alignment
>> correct, I just didn't feel like making that change also in the same MR.
>>
>>   Something like Unknown or anything that fits in the column space would
>> be fine. It just means for the given run the timing numbers are not
>> meaningful/correct for those events.
>>
>
> Just a note, just about every event is NAN for me. My GAMG setup that is
> all CPU is NAN. High level functions like PtAP as well.
> That said, adding -log_view_gpu_time is fine. Not worth the churn.
>
>
>> This is to obtain the best meaningful results for the outer events per
>> Jed since timing the inner events accurately introduces extra time in the
>> outer events. That is it is not possible to have the best accurate times
>> for both inner events and outer events in the same run. So if you want to
>> compare KSPSolve timings, for example, you run as-is, it you want to
>> examine, low-level vector operations run also with -log_view_gpu_time but
>> know that the KSP times are higher than need be.
>>
>>   Sorry for the confusion.
>>
>>
>>
>>
>> On Apr 26, 2022, at 3:49 PM, Matthew Knepley  wrote:
>>
>> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  wrote:
>>
>>> Well, Nans are a clear sign that something is very wrong.
>>>
>>
>> Barry chose them so that it could not be mistaken for an actual number.
>>
>>Matt
>>
>>
>>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch <
>>> jacob@gmail.com> wrote:
>>>
 There is an automatic warning that shows when you do run with
 `-log_view_gpu_time`, but perhaps there should also be an automatic warning
 when *not* running with it. It is unfortunate that NaN is the value printed
 as this implies a bug but AFAIK it is unavoidable (Barry can say more on
 this though).

 Best regards,

 Jacob Faibussowitsch
 (Jacob Fai - booss - oh - vitch)

 > On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
 >
 > You have to add -log_view_gpu_time
 > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
 >
 > Jose
 >
 >
 >> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
 >>
 >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log
 timing data except the two 'Solve' lines.
 >> Just cg/jacobi on snes/ex56.
 >>
 >> Any ideas?
 >>
 >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00 100
 >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00 100
 >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00  0
 >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000

Re: [petsc-dev] odd log behavior

2022-04-27 Thread Barry Smith

  Only KSPSolve, SNESSolve, and TSSolve will have legitimate values. "High-level" 
functions like PtAP can be asynchronous (meaning control returns to the CPU to do 
more work before the GPU has finished), hence their timings would be incorrect, so 
they must be labeled with a special marker rather than given numbers for time.

  Jed reports a possible 10% smaller time for KSPSolve() in this mode in some of 
his runs, compared to timing the inner operations correctly, and thus feels it is 
the best default.
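[Editorial illustration of the underlying issue, in plain C with hypothetical
stand-ins for the GPU calls (nothing here is PETSc or CUDA API): a host timer
around an asynchronous launch measures only the enqueue cost, and the
synchronization that would make it correct is exactly what inflates the outer
events.]

  #include <stdio.h>
  #include <time.h>

  /* Hypothetical stand-ins: a real version would enqueue asynchronous GPU work
     (returning immediately) and block until the queued work has finished. */
  static void enqueue_gpu_work(void) {}
  static void wait_for_gpu(void)     {}

  static double now_sec(void)
  {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
  }

  int main(void)
  {
    double t0, t_launch_only, t_synced;

    t0 = now_sec();
    enqueue_gpu_work();
    t_launch_only = now_sec() - t0; /* like PtAP's host time: only the enqueue cost */

    t0 = now_sec();
    enqueue_gpu_work();
    wait_for_gpu();                 /* what per-event GPU timing effectively adds */
    t_synced = now_sec() - t0;      /* correct for this event, but the added syncs
                                       slow down the surrounding outer events */

    printf("launch-only: %g s, synchronized: %g s\n", t_launch_only, t_synced);
    return 0;
  }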

  Barry


> On Apr 27, 2022, at 10:08 AM, Mark Adams  wrote:
> 
> 
> 
> On Tue, Apr 26, 2022 at 8:00 PM Barry Smith  > wrote:
> 
>   The current nan output has to be replaced to get the column alignment 
> correct, I just didn't feel like making that change also in the same MR. 
> 
>   Something like Unknown or anything that fits in the column space would be 
> fine. It just means for the given run the timing numbers are not 
> meaningful/correct for those events.
> 
> Just a note, just about every event is NAN for me. My GAMG setup that is all 
> CPU is NAN. High level functions like PtAP as well. 
> That said, adding -log_view_gpu_time is fine. Not worth the churn.
>  
> This is to obtain the best meaningful results for the outer events per Jed 
> since timing the inner events accurately introduces extra time in the outer 
> events. That is it is not possible to have the best accurate times for both 
> inner events and outer events in the same run. So if you want to compare 
> KSPSolve timings, for example, you run as-is, it you want to examine, 
> low-level vector operations run also with -log_view_gpu_time but know that 
> the KSP times are higher than need be.
> 
>   Sorry for the confusion.
> 
> 
> 
> 
>> On Apr 26, 2022, at 3:49 PM, Matthew Knepley > > wrote:
>> 
>> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams > > wrote:
>> Well, Nans are a clear sign that something is very wrong.
>> 
>> Barry chose them so that it could not be mistaken for an actual number.
>> 
>>Matt
>>  
>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch > > wrote:
>> There is an automatic warning that shows when you do run with 
>> `-log_view_gpu_time`, but perhaps there should also be an automatic warning 
>> when *not* running with it. It is unfortunate that NaN is the value printed 
>> as this implies a bug but AFAIK it is unavoidable (Barry can say more on 
>> this though).
>> 
>> Best regards,
>> 
>> Jacob Faibussowitsch
>> (Jacob Fai - booss - oh - vitch)
>> 
>> > On Apr 26, 2022, at 09:48, Jose E. Roman > > > wrote:
>> > 
>> > You have to add -log_view_gpu_time
>> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056 
>> > 
>> > 
>> > Jose
>> > 
>> > 
>> >> El 26 abr 2022, a las 16:39, Mark Adams > >> > escribió:
>> >> 
>> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing 
>> >> data except the two 'Solve' lines.
>> >> Just cg/jacobi on snes/ex56.
>> >> 
>> >> Any ideas?
>> >> 
>> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00 100
>> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00 100
>> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00  0
>> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00  0
>> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00 100
>> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00 100
>> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
>> >> 0.00e+00  0
>> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 
>> >> 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0 0.00e+000 
>> >> 0.00e+00 100
>> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 
>> >> 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0 0.00e+000 
>> >> 0.00e+00  0
>> > 
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> 

Re: [petsc-dev] odd log behavior

2022-04-27 Thread Mark Adams
On Tue, Apr 26, 2022 at 8:00 PM Barry Smith  wrote:

>
>   The current nan output has to be replaced to get the column alignment
> correct, I just didn't feel like making that change also in the same MR.
>
>   Something like Unknown or anything that fits in the column space would
> be fine. It just means for the given run the timing numbers are not
> meaningful/correct for those events.
>

Just a note, just about every event is NAN for me. My GAMG setup that is
all CPU is NAN. High level functions like PtAP as well.
That said, adding -log_view_gpu_time is fine. Not worth the churn.


> This is to obtain the best meaningful results for the outer events per Jed
> since timing the inner events accurately introduces extra time in the outer
> events. That is it is not possible to have the best accurate times for both
> inner events and outer events in the same run. So if you want to compare
> KSPSolve timings, for example, you run as-is, it you want to examine,
> low-level vector operations run also with -log_view_gpu_time but know that
> the KSP times are higher than need be.
>
>   Sorry for the confusion.
>
>
>
>
> On Apr 26, 2022, at 3:49 PM, Matthew Knepley  wrote:
>
> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  wrote:
>
>> Well, Nans are a clear sign that something is very wrong.
>>
>
> Barry chose them so that it could not be mistaken for an actual number.
>
>Matt
>
>
>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch <
>> jacob@gmail.com> wrote:
>>
>>> There is an automatic warning that shows when you do run with
>>> `-log_view_gpu_time`, but perhaps there should also be an automatic warning
>>> when *not* running with it. It is unfortunate that NaN is the value printed
>>> as this implies a bug but AFAIK it is unavoidable (Barry can say more on
>>> this though).
>>>
>>> Best regards,
>>>
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>>
>>> > On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
>>> >
>>> > You have to add -log_view_gpu_time
>>> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
>>> >
>>> > Jose
>>> >
>>> >
>>> >> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
>>> >>
>>> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log
>>> timing data except the two 'Solve' lines.
>>> >> Just cg/jacobi on snes/ex56.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00  0
>>> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00  0
>>> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00  0
>>> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00
>>> 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0
>>> 0.00e+000 0.00e+00 100
>>> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00
>>> 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0
>>> 0.00e+000 0.00e+00  0
>>> >
>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>
>
>


Re: [petsc-dev] odd log behavior

2022-04-26 Thread Barry Smith


> On Apr 26, 2022, at 6:40 PM, Matthew Knepley  wrote:
> 
> On Tue, Apr 26, 2022 at 6:12 PM Justin Chang  > wrote:
> I think N/A (not applicable) would be a better message than NaN. Prior to 
> this mail, even I thought I broke something with these NaN's
> 
> We would need some logic to check for NaN and change the format.

  Yes, that should go in another MR.

> 
>Matt
>  
> On Tue, Apr 26, 2022 at 2:49 PM Matthew Knepley  > wrote:
> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  > wrote:
> Well, Nans are a clear sign that something is very wrong.
> 
> Barry chose them so that it could not be mistaken for an actual number.
> 
>Matt
>  
> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch  > wrote:
> There is an automatic warning that shows when you do run with 
> `-log_view_gpu_time`, but perhaps there should also be an automatic warning 
> when *not* running with it. It is unfortunate that NaN is the value printed 
> as this implies a bug but AFAIK it is unavoidable (Barry can say more on this 
> though).
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
> > On Apr 26, 2022, at 09:48, Jose E. Roman  > > wrote:
> > 
> > You have to add -log_view_gpu_time
> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056 
> > 
> > 
> > Jose
> > 
> > 
> >> El 26 abr 2022, a las 16:39, Mark Adams  >> > escribió:
> >> 
> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing 
> >> data except the two 'Solve' lines.
> >> Just cg/jacobi on snes/ex56.
> >> 
> >> Any ideas?
> >> 
> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00  0
> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00  0
> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00  0
> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0 0.00e+000 
> >> 0.00e+00  0
> > 
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 



Re: [petsc-dev] odd log behavior

2022-04-26 Thread Barry Smith

  The current nan output has to be replaced to get the column alignment 
correct; I just didn't feel like making that change also in the same MR. 

  Something like Unknown, or anything that fits in the column space, would be 
fine. It just means that for the given run the timing numbers are not 
meaningful/correct for those events. This is to obtain the best meaningful 
results for the outer events, per Jed, since timing the inner events accurately 
introduces extra time in the outer events. That is, it is not possible to have 
the best accurate times for both inner events and outer events in the same run. 
So if you want to compare KSPSolve timings, for example, you run as-is; if you 
want to examine low-level vector operations, run also with -log_view_gpu_time, 
but know that the KSP times are higher than need be.

  Sorry for the confusion.




> On Apr 26, 2022, at 3:49 PM, Matthew Knepley  wrote:
> 
> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  > wrote:
> Well, Nans are a clear sign that something is very wrong.
> 
> Barry chose them so that it could not be mistaken for an actual number.
> 
>Matt
>  
> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch  > wrote:
> There is an automatic warning that shows when you do run with 
> `-log_view_gpu_time`, but perhaps there should also be an automatic warning 
> when *not* running with it. It is unfortunate that NaN is the value printed 
> as this implies a bug but AFAIK it is unavoidable (Barry can say more on this 
> though).
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
> > On Apr 26, 2022, at 09:48, Jose E. Roman  > > wrote:
> > 
> > You have to add -log_view_gpu_time
> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056 
> > 
> > 
> > Jose
> > 
> > 
> >> El 26 abr 2022, a las 16:39, Mark Adams  >> > escribió:
> >> 
> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing 
> >> data except the two 'Solve' lines.
> >> Just cg/jacobi on snes/ex56.
> >> 
> >> Any ideas?
> >> 
> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00  0
> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00  0
> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 
> >> 0.00e+00  0
> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0 0.00e+000 
> >> 0.00e+00 100
> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 
> >> 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0 0.00e+000 
> >> 0.00e+00  0
> > 
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 



Re: [petsc-dev] odd log behavior

2022-04-26 Thread Matthew Knepley
On Tue, Apr 26, 2022 at 6:12 PM Justin Chang  wrote:

> I think N/A (not applicable) would be a better message than NaN. Prior to
> this mail, even I thought I broke something with these NaN's
>

We would need some logic to check for NaN and change the format.
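A sketch of what that logic could look like (plain C, not the actual
PetscLogView code; the column width and the "N/A" token are illustrative):

  #include <math.h>
  #include <stdio.h>

  /* Print one timing column: a fixed-width token for NaN, the usual scientific
     format otherwise, so columns stay aligned either way. */
  static void PrintTimeColumn(FILE *fp, double t)
  {
    if (isnan(t)) fprintf(fp, " %10s", "N/A");
    else          fprintf(fp, " %10.4e", t);
  }

  int main(void)
  {
    PrintTimeColumn(stdout, 4.0514e-04); /* a real measurement */
    PrintTimeColumn(stdout, NAN);        /* an event whose time was not collected */
    printf("\n");
    return 0;
  }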

   Matt


> On Tue, Apr 26, 2022 at 2:49 PM Matthew Knepley  wrote:
>
>> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  wrote:
>>
>>> Well, Nans are a clear sign that something is very wrong.
>>>
>>
>> Barry chose them so that it could not be mistaken for an actual number.
>>
>>Matt
>>
>>
>>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch <
>>> jacob@gmail.com> wrote:
>>>
 There is an automatic warning that shows when you do run with
 `-log_view_gpu_time`, but perhaps there should also be an automatic warning
 when *not* running with it. It is unfortunate that NaN is the value printed
 as this implies a bug but AFAIK it is unavoidable (Barry can say more on
 this though).

 Best regards,

 Jacob Faibussowitsch
 (Jacob Fai - booss - oh - vitch)

 > On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
 >
 > You have to add -log_view_gpu_time
 > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
 >
 > Jose
 >
 >
 >> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
 >>
 >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log
 timing data except the two 'Solve' lines.
 >> Just cg/jacobi on snes/ex56.
 >>
 >> Any ideas?
 >>
 >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00 100
 >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00 100
 >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00  0
 >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00  0
 >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00 100
 >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00 100
 >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
 0.00e+00  0
 >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00
 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0
 0.00e+000 0.00e+00 100
 >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00
 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0
 0.00e+000 0.00e+00  0
 >


>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> 
>>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-dev] odd log behavior

2022-04-26 Thread Justin Chang
I think N/A (not applicable) would be a better message than NaN. Prior to
this mail, even I thought I had broken something with these NaNs.

On Tue, Apr 26, 2022 at 2:49 PM Matthew Knepley  wrote:

> On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  wrote:
>
>> Well, Nans are a clear sign that something is very wrong.
>>
>
> Barry chose them so that it could not be mistaken for an actual number.
>
>Matt
>
>
>> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch <
>> jacob@gmail.com> wrote:
>>
>>> There is an automatic warning that shows when you do run with
>>> `-log_view_gpu_time`, but perhaps there should also be an automatic warning
>>> when *not* running with it. It is unfortunate that NaN is the value printed
>>> as this implies a bug but AFAIK it is unavoidable (Barry can say more on
>>> this though).
>>>
>>> Best regards,
>>>
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>>
>>> > On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
>>> >
>>> > You have to add -log_view_gpu_time
>>> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
>>> >
>>> > Jose
>>> >
>>> >
>>> >> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
>>> >>
>>> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log
>>> timing data except the two 'Solve' lines.
>>> >> Just cg/jacobi on snes/ex56.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00  0
>>> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00  0
>>> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00 100
>>> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>>> 0.00e+00  0
>>> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00
>>> 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0
>>> 0.00e+000 0.00e+00 100
>>> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00
>>> 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0
>>> 0.00e+000 0.00e+00  0
>>> >
>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>


Re: [petsc-dev] odd log behavior

2022-04-26 Thread Matthew Knepley
On Tue, Apr 26, 2022 at 12:03 PM Mark Adams  wrote:

> Well, Nans are a clear sign that something is very wrong.
>

Barry chose them so that they could not be mistaken for actual numbers.

   Matt


> On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch 
> wrote:
>
>> There is an automatic warning that shows when you do run with
>> `-log_view_gpu_time`, but perhaps there should also be an automatic warning
>> when *not* running with it. It is unfortunate that NaN is the value printed
>> as this implies a bug but AFAIK it is unavoidable (Barry can say more on
>> this though).
>>
>> Best regards,
>>
>> Jacob Faibussowitsch
>> (Jacob Fai - booss - oh - vitch)
>>
>> > On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
>> >
>> > You have to add -log_view_gpu_time
>> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
>> >
>> > Jose
>> >
>> >
>> >> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
>> >>
>> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log
>> timing data except the two 'Solve' lines.
>> >> Just cg/jacobi on snes/ex56.
>> >>
>> >> Any ideas?
>> >>
>> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00 100
>> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00 100
>> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00  0
>> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00  0
>> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00 100
>> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00 100
>> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
>> 0.00e+00  0
>> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00
>> 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0
>> 0.00e+000 0.00e+00 100
>> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00
>> 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0
>> 0.00e+000 0.00e+00  0
>> >
>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-dev] odd log behavior

2022-04-26 Thread Mark Adams
Well, NaNs are a clear sign that something is very wrong.

On Tue, Apr 26, 2022 at 11:52 AM Jacob Faibussowitsch 
wrote:

> There is an automatic warning that shows when you do run with
> `-log_view_gpu_time`, but perhaps there should also be an automatic warning
> when *not* running with it. It is unfortunate that NaN is the value printed
> as this implies a bug but AFAIK it is unavoidable (Barry can say more on
> this though).
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> > On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
> >
> > You have to add -log_view_gpu_time
> > See https://gitlab.com/petsc/petsc/-/merge_requests/5056
> >
> > Jose
> >
> >
> >> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
> >>
> >> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing
> data except the two 'Solve' lines.
> >> Just cg/jacobi on snes/ex56.
> >>
> >> Any ideas?
> >>
> >> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00 100
> >> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00 100
> >> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00  0
> >> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00  0
> >> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00 100
> >> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00 100
> >> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000
> 0.00e+00  0
> >> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00
> 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0
> 0.00e+000 0.00e+00 100
> >> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00
> 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0
> 0.00e+000 0.00e+00  0
> >
>
>


Re: [petsc-dev] odd log behavior

2022-04-26 Thread Jacob Faibussowitsch
There is an automatic warning that shows when you do run with 
`-log_view_gpu_time`, but perhaps there should also be an automatic warning 
when *not* running with it. It is unfortunate that NaN is the value printed, 
as it implies a bug, but AFAIK that is unavoidable (Barry can say more on 
this, though).

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
> 
> You have to add -log_view_gpu_time
> See https://gitlab.com/petsc/petsc/-/merge_requests/5056
> 
> Jose
> 
> 
>> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
>> 
>> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing data 
>> except the two 'Solve' lines.
>> Just cg/jacobi on snes/ex56.
>> 
>> Any ideas?
>> 
>> VecTDot2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00 100
>> VecNorm2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00 100
>> VecCopy2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00  0
>> VecSet 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00  0
>> VecAXPY4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   1  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00 100
>> VecPointwiseMult   1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00 100
>> KSPSetUp   1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  
>> 0  0  0  0  0   0  0  0  0  0  -nan-nan  0 0.00e+000 0.00e+00  0
>> KSPSolve   1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   2  0  0  0  0 0-nan  0 0.00e+000 
>> 0.00e+00 100
>> SNESSolve  1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 72 56  0  0  0 100100  0  0  025-nan  0 0.00e+000 
>> 0.00e+00  0
> 



Re: [petsc-dev] odd log behavior

2022-04-26 Thread Jose E. Roman
You have to add -log_view_gpu_time
See https://gitlab.com/petsc/petsc/-/merge_requests/5056

Jose


> On 26 Apr 2022, at 16:39, Mark Adams wrote:
> 
> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing data 
> except the two 'Solve' lines.
> Just cg/jacobi on snes/ex56.
> 
> Any ideas?
>  
> VecTDot                2 1.0      nan     nan 1.20e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
> VecNorm                2 1.0      nan     nan 1.00e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
> VecCopy                2 1.0      nan     nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00   0
> VecSet                 5 1.0      nan     nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00   0
> VecAXPY                4 1.0      nan     nan 2.40e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
> VecPointwiseMult       1 1.0      nan     nan 3.00e+00 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00 100
> KSPSetUp               1 1.0      nan     nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00   0
> KSPSolve               1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0    -nan      0 0.00e+00    0 0.00e+00 100
> SNESSolve              1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 0.0e+00 72 56  0  0  0 100100  0  0  0    25    -nan      0 0.00e+00    0 0.00e+00   0