> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> Sorry, I deleted it. I submitted a 128 node job with a debug version to try to 
> reproduce it, but it finished w/o this error, with the same bad performance 
> as my original log.
> I think this message comes from a (time out) signal.

  Sure, but we would need to know whether there are multiple different places in 
the code where the time-out is happening. 



> 
> What I am seeing, though, is just really bad performance on large jobs. It is 
> sudden: good scaling up to about 32 nodes, and then a 100x slowdown in the 
> solver (the vec norms) on 64 and 128 nodes.
> 
> Thanks,
> Mark
> 
> 
> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsm...@petsc.dev 
> <mailto:bsm...@petsc.dev>> wrote:
>> 
>>   Rather cryptic use of that integer variable for anyone who did not write 
>> the code, but I guess one was short on bytes when it was written :-).
>> 
>>   Ok, based on what Jed said, if we are certain that either all or none of 
>> the nroots are -1, then since you are getting hangs in PetscCommDuplicate() 
>> this might indicate that some ranks have called a constructor on some OTHER 
>> object that not all ranks are calling. Of course, it could be an MPI bug.
>> 
>>   Mark, can you send the file containing all the output from the Signal 
>> Terminate run? Maybe there will be a hint in there of a different 
>> constructor being called on some ranks.
>> 
>>   Barry
>> 
>> 
>> 
>> 
>>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfad...@lbl.gov 
>>> <mailto:mfad...@lbl.gov>> wrote:
>>> 
>>> 
>>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsm...@petsc.dev 
>>> <mailto:bsm...@petsc.dev>> wrote:
>>>> 
>>>>    There is a bug in the routine DMPlexLabelComplete_Internal()! The code 
>>>> should definitely not route around the body with if (nroots >= 0), because 
>>>> checking the nroots value to decide on the code route is simply nonsense 
>>>> (if one "knows" "by contract" that nroots is >= 0, then the if () test is 
>>>> not needed).
>>>> 
>>>>    The first thing to do is to fix the bug with a PetscCheck(): remove the 
>>>> nonsensical if (nroots >= 0) check and rerun your code to see what happens.
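>>>> 
>>>>    Roughly something like this, in place of the if (nroots >= 0) guard (an 
>>>> untested sketch of the change being suggested, not an actual patch; the 
>>>> error code and message are just placeholders):
>>>> 
>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>   /* error out instead of silently skipping the collective DMLabelGather() path */
>>>>   PetscCheck(nroots >= 0, PetscObjectComm((PetscObject)dm), PETSC_ERR_ARG_WRONGSTATE,
>>>>              "Point SF for this DM has no graph set (nroots = %" PetscInt_FMT ")", nroots);
>>>>   /* ... then run the label-completion code below unconditionally ... */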
>>> 
>>> This does not fix the bug, right? It just fails cleanly, right?
>>> 
>>> I do have lots of empty processors in the first GAMG coarse grid. I just 
>>> saw that the first GAMG coarse grid reduces the active processor count from 
>>> 4K down to 4.
>>> This is one case where repartitioning the coarse grids could actually be 
>>> useful, for once.
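>>> 
>>> For reference, a minimal sketch of turning that on from code (assuming ksp 
>>> is the solver's KSP; this should be the same thing as the pc_gamg_repartition 
>>> runtime option, with whatever prefix the GAMG solver has -- untested here):
>>> 
>>>   PC pc;
>>>   PetscCall(KSPGetPC(ksp, &pc));
>>>   /* have GAMG repartition the coarse grids to reduce load imbalance */
>>>   PetscCall(PCGAMGSetRepartition(pc, PETSC_TRUE));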
>>> 
>>> Do you have a bug fix suggestion for me to try?
>>> 
>>> Thanks
>>> 
>>>> 
>>>>   Barry
>>>> 
>>>> Yes, it is possible that in your run nroots is always >= 0 and some MPI 
>>>> bug is causing the problem, but this doesn't change the fact that the 
>>>> current code is buggy and needs to be fixed before blaming some other bug 
>>>> for the problem.
>>>> 
>>>> 
>>>> 
>>>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov 
>>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com 
>>>>> <mailto:knep...@gmail.com>> wrote:
>>>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev 
>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>> 
>>>>>>>   Mark,
>>>>>>> 
>>>>>>>   Looks like the error checking in PetscCommDuplicate() is doing its 
>>>>>>> job. It is reporting an attempt to use a PETSc object constructor on a 
>>>>>>> subset of ranks of an MPI_Comm (which is, of course, fundamentally 
>>>>>>> impossible in the PETSc/MPI model).
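>>>>>>> 
>>>>>>> (For illustration only, not taken from Mark's code: the kind of pattern 
>>>>>>> that check is designed to catch looks like this, with a constructor 
>>>>>>> reached on only some ranks of a communicator.)
>>>>>>> 
>>>>>>>   PetscMPIInt rank;
>>>>>>>   PetscSF     sf;
>>>>>>> 
>>>>>>>   PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
>>>>>>>   if (rank == 0) { /* WRONG: PetscSFCreate() is collective over the communicator */
>>>>>>>     PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
>>>>>>>   }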
>>>>>>> 
>>>>>>> Note that nroots can be negative on a particular rank but 
>>>>>>> DMPlexLabelComplete_Internal() is collective on sf based on the comment 
>>>>>>> in the code below
>>>>>>> 
>>>>>>> 
>>>>>>> struct _p_PetscSF {
>>>>>>> ....
>>>>>>>   PetscInt     nroots;  /* Number of root vertices on current process 
>>>>>>> (candidates for incoming edges) */
>>>>>>> 
>>>>>>> But the next routine calls a collective only when nroots >= 0 
>>>>>>> 
>>>>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel 
>>>>>>> label, PetscBool completeCells){
>>>>>>> ...
>>>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>   if (nroots >= 0) {
>>>>>>>     DMLabel         lblRoots, lblLeaves;
>>>>>>>     IS              valueIS, pointIS;
>>>>>>>     const PetscInt *values;
>>>>>>>     PetscInt        numValues, v;
>>>>>>> 
>>>>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>>> 
>>>>>>> 
>>>>>>> The code is four years old? How come this problem of calling the 
>>>>>>> constructor on a subset of ranks hasn't come up since day 1? 
>>>>>> 
>>>>>> The contract here is that it should be impossible to have nroots < 0 
>>>>>> (meaning the SF is not set up) on a subset of processes. Do we know that 
>>>>>> this is happening?
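>>>>>> 
>>>>>> (A quick, untested sketch of how one could check that, assuming dm is the 
>>>>>> relevant DM, e.g. the one going through DMCopyDisc() in the trace below; 
>>>>>> drop it in just before that call:)
>>>>>> 
>>>>>>   PetscSF     sf;
>>>>>>   PetscInt    nroots;
>>>>>>   PetscMPIInt has, min, max;
>>>>>> 
>>>>>>   PetscCall(DMGetPointSF(dm, &sf));
>>>>>>   PetscCall(PetscSFGetGraph(sf, &nroots, NULL, NULL, NULL));
>>>>>>   has = (PetscMPIInt)(nroots >= 0);  /* 1 if this rank's point SF has a graph */
>>>>>>   PetscCallMPI(MPI_Allreduce(&has, &min, 1, MPI_INT, MPI_MIN, PetscObjectComm((PetscObject)dm)));
>>>>>>   PetscCallMPI(MPI_Allreduce(&has, &max, 1, MPI_INT, MPI_MAX, PetscObjectComm((PetscObject)dm)));
>>>>>>   PetscCheck(min == max, PetscObjectComm((PetscObject)dm), PETSC_ERR_PLIB,
>>>>>>              "Point SF is set up on only a subset of ranks");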
>>>>> 
>>>>> Can't imagine a code bug here. Very simple code.
>>>>> 
>>>>> This code does use GAMG as the coarse grid solver in a pretty extreme way.
>>>>> GAMG is fairly complicated and not used on such small problems with high 
>>>>> parallelism.
>>>>> It is conceivable that it's a GAMG bug, but that is not what was going on 
>>>>> in my initial email here.
>>>>> 
>>>>> Here is a run that timed out, but it should not have, so I think this is 
>>>>> the same issue. I always have perfectly distributed grids like this.
>>>>> 
>>>>> DM Object: box 2048 MPI processes
>>>>>   type: plex
>>>>> box in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 8385/8580
>>>>>   Min/Max of 1-cells per rank: 24768/24960
>>>>>   Min/Max of 2-cells per rank: 16384/16384
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>>>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>>>>   marker: 1 strata with value/size (1 (385))
>>>>>   Face Sets: 1 strata with value/size (1 (381))
>>>>>   Defined by transform from:
>>>>>   DM_0x84000002_1 in 2 dimensions:
>>>>>     Min/Max of 0-cells per rank:   2145/2244
>>>>>     Min/Max of 1-cells per rank:   6240/6336
>>>>>     Min/Max of 2-cells per rank:   4096/4096
>>>>>   Labels:
>>>>>     celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>>>>     depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>>>>     marker: 1 strata with value/size (1 (193))
>>>>>     Face Sets: 1 strata with value/size (1 (189))
>>>>>     Defined by transform from:
>>>>>     DM_0x84000002_2 in 2 dimensions:
>>>>>       Min/Max of 0-cells per rank:     561/612
>>>>>       Min/Max of 1-cells per rank:     1584/1632
>>>>>       Min/Max of 2-cells per rank:     1024/1024
>>>>>     Labels:
>>>>>       celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>>>>       depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>>>>       marker: 1 strata with value/size (1 (97))
>>>>>       Face Sets: 1 strata with value/size (1 (93))
>>>>>       Defined by transform from:
>>>>>       DM_0x84000002_3 in 2 dimensions:
>>>>>         Min/Max of 0-cells per rank:       153/180
>>>>>         Min/Max of 1-cells per rank:       408/432
>>>>>         Min/Max of 2-cells per rank:       256/256
>>>>>       Labels:
>>>>>         celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>>>>         depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>>>>         marker: 1 strata with value/size (1 (49))
>>>>>         Face Sets: 1 strata with value/size (1 (45))
>>>>>         Defined by transform from:
>>>>>         DM_0x84000002_4 in 2 dimensions:
>>>>>           Min/Max of 0-cells per rank:         45/60
>>>>>           Min/Max of 1-cells per rank:         108/120
>>>>>           Min/Max of 2-cells per rank:         64/64
>>>>>         Labels:
>>>>>           celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>>>>           depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>>>>           marker: 1 strata with value/size (1 (25))
>>>>>           Face Sets: 1 strata with value/size (1 (21))
>>>>>           Defined by transform from:
>>>>>           DM_0x84000002_5 in 2 dimensions:
>>>>>             Min/Max of 0-cells per rank:           15/24
>>>>>             Min/Max of 1-cells per rank:           30/36
>>>>>             Min/Max of 2-cells per rank:           16/16
>>>>>           Labels:
>>>>>             celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>>>>             depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>>>>             marker: 1 strata with value/size (1 (13))
>>>>>             Face Sets: 1 strata with value/size (1 (9))
>>>>>             Defined by transform from:
>>>>>             DM_0x84000002_6 in 2 dimensions:
>>>>>               Min/Max of 0-cells per rank:             6/12
>>>>>               Min/Max of 1-cells per rank:             9/12
>>>>>               Min/Max of 2-cells per rank:             4/4
>>>>>             Labels:
>>>>>               depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>>>>               celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>>>>               marker: 1 strata with value/size (1 (7))
>>>>>               Face Sets: 1 strata with value/size (1 (3))
>>>>> 0 TS dt 0.001 time 0.
>>>>> MHD    0) time =         0, Eergy=  2.3259668003585e+00 (plot ID 0)
>>>>>     0 SNES Function norm 5.415286407365e-03
>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 
>>>>> 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>>> 
>>>>>  
>>>>>> 
>>>>>>   Thanks,
>>>>>> 
>>>>>>     Matt
>>>>>>  
>>>>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov 
>>>>>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>>>>>> 
>>>>>>>> I am running on Crusher, CPU only, 64 cores per node with 
>>>>>>>> Plex/PetscFE. 
>>>>>>>> In going up to 64 nodes, something really catastrophic is happening. 
>>>>>>>> I understand I am not using the machine the way it was intended, but I 
>>>>>>>> just want to see if there are any options that I could try for a quick 
>>>>>>>> fix/help.
>>>>>>>> 
>>>>>>>> In a debug build I get a stack trace on many, but not all, of the 4K 
>>>>>>>> processes. 
>>>>>>>> Alas, I am not sure why this job was terminated, but every process that 
>>>>>>>> I checked that had an "ERROR" had this stack:
>>>>>>>> 
>>>>>>>> 11:57 main *+= 
>>>>>>>> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep 
>>>>>>>> ERROR slurm-245063.out |g 3160
>>>>>>>> [3160]PETSC ERROR: 
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or 
>>>>>>>> the batch system) has told this process to end
>>>>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or 
>>>>>>>> -on_error_attach_debugger
>>>>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
>>>>>>>> https://petsc.org/release/faq/
>>>>>>>> [3160]PETSC ERROR: ---------------------  Stack Frames 
>>>>>>>> ------------------------------------
>>>>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not 
>>>>>>>> always exact.
>>>>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at 
>>>>>>>> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>>>> 
>>>>>>>> Maybe the MPI is just getting overwhelmed. 
>>>>>>>> 
>>>>>>>> And I was able to get one run to work (one TS with beuler), and the 
>>>>>>>> solver performance was horrendous; I see this (attached):
>>>>>>>> 
>>>>>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>>>>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>>>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>>>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>>>>> etc,
>>>>>>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>>>> 
>>>>>>>> Any ideas would be welcome,
>>>>>>>> Thanks,
>>>>>>>> Mark
>>>>>>>> <cushersolve.txt>
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> What most experimenters take for granted before they begin their 
>>>>>> experiments is infinitely more interesting than any results to which 
>>>>>> their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>> 
>>>>>> https://www.cse.buffalo.edu/~knepley/ 
>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>> 
>> 
