> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> Sorry, I deleted it. I put in a 128-node job with a debug version to try to reproduce it, but it finished w/o this error.
> With the same bad performance as my original log.
> I think this message comes from a (time out) signal.
   Sure, but we would need to know if there are multiple different places in the code where the time-out is happening.

> 
> What I am seeing though is just really bad performance on large jobs. It is sudden: good scaling up to about 32 nodes, and then a 100x slowdown in the solver (the vec norms) on 128 nodes (and 64).
> 
> Thanks,
> Mark
> 
> 
> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsm...@petsc.dev> wrote:
>> 
>>    Rather cryptic use of that integer variable for anyone who did not write the code, but I guess one was short on bytes when it was written :-).
>> 
>>    Ok, based on what Jed said, if we are certain that either all or none of the nroots are -1, then since you are getting hangs in PetscCommDuplicate(), this might indicate that some ranks have called a constructor on some OTHER object that not all ranks are calling. Of course, it could be an MPI bug.
>> 
>>    Mark, can you send the file containing all the output from the Signal Terminate run? Maybe there will be a hint in there of a different constructor being called on some ranks.
>> 
>>    Barry
>> 
>>>>>>> DMLabelGather
>> 
>>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> 
>>> 
>>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>> 
>>>>    There is a bug in the routine DMPlexLabelComplete_Internal()! The code should definitely not route around the body based on if (nroots >= 0), because checking the nroots value to decide on the code route is simply nonsense (if one "knows" "by contract" that nroots is >= 0, then the if () test is not needed).
>>>> 
>>>>    The first thing to do is to fix the bug with a PetscCheck(): remove the nonsensical if (nroots >= 0) check and rerun your code to see what happens.
>>> 
>>> This does not fix the bug, right? It just fails cleanly, right?
>>> 
>>> I do have lots of empty processors in the first GAMG coarse grid. I just saw that the first GAMG coarse grid reduces the processor count to 4, from 4K.
>>> This is one case where repartitioning of the coarse grids, for once, could actually be used.
>>> 
>>> Do you have a bug fix suggestion for me to try?
>>> 
>>> Thanks
>>> 
>>>> 
>>>>    Barry
>>>> 
>>>> Yes, it is possible that in your run nroots is always >= 0 and some MPI bug is causing the problem, but this doesn't change the fact that the current code is buggy and needs to be fixed before blaming some other bug for the problem.
>>>> 
>>>> 
>>>> 
>>>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com> wrote:
>>>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>> 
>>>>>>>    Mark,
>>>>>>> 
>>>>>>>    Looks like the error checking in PetscCommDuplicate() is doing its job. It is reporting an attempt to use a PETSc object constructor on a subset of the ranks of an MPI_Comm (which is, of course, fundamentally impossible in the PETSc/MPI model).
>>>>>>> 
>>>>>>>    Note that nroots can be negative on a particular rank, but DMPlexLabelComplete_Internal() is collective on the sf, based on the comment in the code below.
>>>>>>> 
>>>>>>> 
>>>>>>> struct _p_PetscSF {
>>>>>>>   ....
>>>>>>>   PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */
>>>>>>> 
>>>>>>> But the next routine calls a collective only when nroots >= 0:
>>>>>>> 
>>>>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>>>>>>> {
>>>>>>>   ...
>>>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>   if (nroots >= 0) {
>>>>>>>     DMLabel         lblRoots, lblLeaves;
>>>>>>>     IS              valueIS, pointIS;
>>>>>>>     const PetscInt *values;
>>>>>>>     PetscInt        numValues, v;
>>>>>>> 
>>>>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>>> 
>>>>>>> 
>>>>>>>    The code is four years old? How come this problem of calling the constructor on a subset of ranks hasn't come up since day 1?
>>>>>> 
>>>>>> The contract here is that it should be impossible to have nroots < 0 (meaning the SF is not set up) on a subset of processes. Do we know that this is happening?
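   We could also test that contract directly rather than infer it from the hang. Below is an untested sketch (the helper name CheckPointSFContract, the error code, and where it gets called are only illustrative); it does collectively what the PetscCheck() suggested above would do inside DMPlexLabelComplete_Internal(): fail cleanly on every rank if the point SF graph is set on only a subset of ranks, instead of letting the ranks diverge and hang in PetscCommDuplicate().

   #include <petscdm.h>
   #include <petscsf.h>

   /* Sketch: verify, collectively on the DM's communicator, that the point SF graph is
      either set (nroots >= 0) on every rank or unset (nroots < 0) on every rank. */
   static PetscErrorCode CheckPointSFContract(DM dm)
   {
     PetscSF     sfPoint;
     PetscInt    nroots;
     PetscMPIInt unset, anyUnset, allUnset;

     PetscFunctionBeginUser;
     PetscCall(DMGetPointSF(dm, &sfPoint));
     PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
     unset = (nroots < 0) ? 1 : 0; /* nroots < 0 means the SF graph was never set */
     PetscCallMPI(MPI_Allreduce(&unset, &anyUnset, 1, MPI_INT, MPI_MAX, PetscObjectComm((PetscObject)dm)));
     PetscCallMPI(MPI_Allreduce(&unset, &allUnset, 1, MPI_INT, MPI_MIN, PetscObjectComm((PetscObject)dm)));
     /* Every rank computes identical anyUnset/allUnset, so this check fails (or passes) on all ranks together */
     PetscCheck(anyUnset == allUnset, PetscObjectComm((PetscObject)dm), PETSC_ERR_PLIB,
                "Point SF graph is set on some ranks but not others (local nroots = %" PetscInt_FMT ")", nroots);
     PetscFunctionReturn(0);
   }

   Calling something like this in the 64- and 128-node runs, just before the SetupDiscretization()/DMCopyDisc() path in the stack trace below, would tell us whether the all-or-none assumption actually holds there.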
>>>>> Can't imagine a code bug here. Very simple code.
>>>>> 
>>>>> This code does use GAMG as the coarse grid solver in a pretty extreme way. GAMG is fairly complicated and is not used on such small problems with this much parallelism.
>>>>> It is conceivable that it's a GAMG bug, but that is not what was going on in my initial email here.
>>>>> 
>>>>> Here is a run that timed out, but it should not have, so I think this is the same issue. I always have perfectly distributed grids like this.
>>>>> 
>>>>> DM Object: box 2048 MPI processes
>>>>>   type: plex
>>>>> box in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 8385/8580
>>>>>   Min/Max of 1-cells per rank: 24768/24960
>>>>>   Min/Max of 2-cells per rank: 16384/16384
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>>>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>>>>   marker: 1 strata with value/size (1 (385))
>>>>>   Face Sets: 1 strata with value/size (1 (381))
>>>>> Defined by transform from:
>>>>> DM_0x84000002_1 in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 2145/2244
>>>>>   Min/Max of 1-cells per rank: 6240/6336
>>>>>   Min/Max of 2-cells per rank: 4096/4096
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>>>>   depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>>>>   marker: 1 strata with value/size (1 (193))
>>>>>   Face Sets: 1 strata with value/size (1 (189))
>>>>> Defined by transform from:
>>>>> DM_0x84000002_2 in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 561/612
>>>>>   Min/Max of 1-cells per rank: 1584/1632
>>>>>   Min/Max of 2-cells per rank: 1024/1024
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>>>>   depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>>>>   marker: 1 strata with value/size (1 (97))
>>>>>   Face Sets: 1 strata with value/size (1 (93))
>>>>> Defined by transform from:
>>>>> DM_0x84000002_3 in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 153/180
>>>>>   Min/Max of 1-cells per rank: 408/432
>>>>>   Min/Max of 2-cells per rank: 256/256
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>>>>   depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>>>>   marker: 1 strata with value/size (1 (49))
>>>>>   Face Sets: 1 strata with value/size (1 (45))
>>>>> Defined by transform from:
>>>>> DM_0x84000002_4 in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 45/60
>>>>>   Min/Max of 1-cells per rank: 108/120
>>>>>   Min/Max of 2-cells per rank: 64/64
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>>>>   depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>>>>   marker: 1 strata with value/size (1 (25))
>>>>>   Face Sets: 1 strata with value/size (1 (21))
>>>>> Defined by transform from:
>>>>> DM_0x84000002_5 in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 15/24
>>>>>   Min/Max of 1-cells per rank: 30/36
>>>>>   Min/Max of 2-cells per rank: 16/16
>>>>> Labels:
>>>>>   celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>>>>   depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>>>>   marker: 1 strata with value/size (1 (13))
>>>>>   Face Sets: 1 strata with value/size (1 (9))
>>>>> Defined by transform from:
>>>>> DM_0x84000002_6 in 2 dimensions:
>>>>>   Min/Max of 0-cells per rank: 6/12
>>>>>   Min/Max of 1-cells per rank: 9/12
>>>>>   Min/Max of 2-cells per rank: 4/4
>>>>> Labels:
>>>>>   depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>>>>   celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>>>>   marker: 1 strata with value/size (1 (7))
>>>>>   Face Sets: 1 strata with value/size (1 (3))
>>>>> 0 TS dt 0.001 time 0.
>>>>>     MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
>>>>>     0 SNES Function norm 5.415286407365e-03
>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>>> 
>>>>> 
>>>>>> 
>>>>>>   Thanks,
>>>>>> 
>>>>>>      Matt
>>>>>> 
>>>>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>> 
>>>>>>>> I am running on Crusher, CPU only, 64 cores per node, with Plex/PetscFE.
>>>>>>>> In going up to 64 nodes, something really catastrophic is happening. I understand I am not using the machine the way it was intended, but I just want to see if there are any options that I could try for a quick fix/help.
>>>>>>>> 
>>>>>>>> In a debug build I get a stack trace on many, but not all, of the 4K processes.
>>>>>>>> Alas, I am not sure why this job was terminated, but every process that I checked that had an "ERROR" had this stack:
>>>>>>>> 
>>>>>>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>>>>>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>>>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>>>>>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>>>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>>>> 
>>>>>>>> Maybe the MPI is just getting overwhelmed.
>>>>>>>> 
>>>>>>>> And I was able to get one run to work (one TS with beuler), and the solver performance was horrendous; I see this (attached):
>>>>>>>> 
>>>>>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>>>>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>>>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>>>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>>>>> etc,
>>>>>>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>>>> 
>>>>>>>> Any ideas would be welcome,
>>>>>>>> Thanks,
>>>>>>>> Mark
>>>>>>>> <cushersolve.txt>
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>> 
>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>> 
>> 