Re: [petsc-dev] Parmetis bug
Nice. Any way to add this exact reproducibility into a PETSc example that runs daily?

The truism "all codes are buggy, even those that haven't been touched in 15 years" is definitely represented here.

Barry

> On Nov 10, 2019, at 7:31 PM, Fande Kong via petsc-dev wrote:
>
> Valgrind info:
>
> ==32155== Invalid read of size 4
> ==32155==    at 0x885F62F: libmetis__CreateCoarseGraphNoMask (coarsen.c:879)
> ==32155==    by 0x885E2B9: libmetis__CreateCoarseGraph (coarsen.c:636)
> ==32155==    by 0x885CA49: libmetis__Match_RM (coarsen.c:262)
> ==32155==    by 0x885BFBD: libmetis__CoarsenGraph (coarsen.c:55)
> ==32155==    by 0x8869A4C: libmetis__MultilevelBisect (pmetis.c:240)
> ==32155==    by 0x88696F6: libmetis__MlevelRecursiveBisection (pmetis.c:183)
> ==32155==    by 0x88698D7: libmetis__MlevelRecursiveBisection (pmetis.c:207)
> ==32155==    by 0x886951E: METIS_PartGraphRecursive (pmetis.c:133)
> ==32155==    by 0x885650B: libmetis__InitKWayPartitioning (kmetis.c:194)
> ==32155==    by 0x8856147: libmetis__MlevelKWayPartitioning (kmetis.c:121)
> ==32155==    by 0x8855FD3: METIS_PartGraphKway (kmetis.c:71)
> ==32155==    by 0x85E7611: libparmetis__PartitionSmallGraph (weird.c:478)
> ==32155==    by 0x85D00F3: ParMETIS_V3_PartKway (kmetis.c:91)
> ==32155==    by 0x5515C19: MatPartitioningApply_Parmetis_Private (pmetis.c:147)
> ==32155==    by 0x5516F7D: MatPartitioningApply_Parmetis (pmetis.c:221)
> ==32155==    by 0x550DD7F: MatPartitioningApply (partition.c:332)
> ==32155==    by 0x6387033: PCGAMGCreateLevel_GAMG (gamg.c:226)
> ==32155==    by 0x638BCA5: PCSetUp_GAMG (gamg.c:593)
> ==32155==    by 0x64746B1: PCSetUp (precon.c:894)
> ==32155==    by 0x65D388A: KSPSetUp (itfunc.c:377)
> ==32155==  Address 0xee2502c is 0 bytes after a block of size 284 alloc'd
> ==32155==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==32155==    by 0x8820179: gk_malloc (memory.c:147)
> ==32155==    by 0x883D92B: libmetis__imalloc (gklib.c:24)
> ==32155==    by 0x885BF6A: libmetis__CoarsenGraph (coarsen.c:46)
> ==32155==    by 0x8869A4C: libmetis__MultilevelBisect (pmetis.c:240)
> ==32155==    by 0x88696F6: libmetis__MlevelRecursiveBisection (pmetis.c:183)
> ==32155==    by 0x88698D7: libmetis__MlevelRecursiveBisection (pmetis.c:207)
> ==32155==    by 0x886951E: METIS_PartGraphRecursive (pmetis.c:133)
> ==32155==    by 0x885650B: libmetis__InitKWayPartitioning (kmetis.c:194)
> ==32155==    by 0x8856147: libmetis__MlevelKWayPartitioning (kmetis.c:121)
> ==32155==    by 0x8855FD3: METIS_PartGraphKway (kmetis.c:71)
> ==32155==    by 0x85E7611: libparmetis__PartitionSmallGraph (weird.c:478)
> ==32155==    by 0x85D00F3: ParMETIS_V3_PartKway (kmetis.c:91)
> ==32155==    by 0x5515C19: MatPartitioningApply_Parmetis_Private (pmetis.c:147)
> ==32155==    by 0x5516F7D: MatPartitioningApply_Parmetis (pmetis.c:221)
> ==32155==    by 0x550DD7F: MatPartitioningApply (partition.c:332)
> ==32155==    by 0x6387033: PCGAMGCreateLevel_GAMG (gamg.c:226)
> ==32155==    by 0x638BCA5: PCSetUp_GAMG (gamg.c:593)
> ==32155==    by 0x64746B1: PCSetUp (precon.c:894)
> ==32155==    by 0x65D388A: KSPSetUp (itfunc.c:377)
> ==32155==
> ==32155== Conditional jump or move depends on uninitialised value(s)
> ==32155==    at 0x885F651: libmetis__CreateCoarseGraphNoMask (coarsen.c:880)
> ==32155==    by 0x885E2B9: libmetis__CreateCoarseGraph (coarsen.c:636)
> ==32155==    by 0x885CA49: libmetis__Match_RM (coarsen.c:262)
> ==32155==    by 0x885BFBD: libmetis__CoarsenGraph (coarsen.c:55)
> ==32155==    by 0x8869A4C: libmetis__MultilevelBisect (pmetis.c:240)
> ==32155==    by 0x88696F6: libmetis__MlevelRecursiveBisection (pmetis.c:183)
> ==32155==    by 0x88698D7: libmetis__MlevelRecursiveBisection (pmetis.c:207)
> ==32155==    by 0x886951E: METIS_PartGraphRecursive (pmetis.c:133)
> ==32155==    by 0x885650B: libmetis__InitKWayPartitioning (kmetis.c:194)
> ==32155==    by 0x8856147: libmetis__MlevelKWayPartitioning (kmetis.c:121)
> ==32155==    by 0x8855FD3: METIS_PartGraphKway (kmetis.c:71)
> ==32155==    by 0x85E7611: libparmetis__PartitionSmallGraph (weird.c:478)
> ==32155==    by 0x85D00F3: ParMETIS_V3_PartKway (kmetis.c:91)
> ==32155==    by 0x5515C19: MatPartitioningApply_Parmetis_Private (pmetis.c:147)
> ==32155==    by 0x5516F7D: MatPartitioningApply_Parmetis (pmetis.c:221)
> ==32155==    by 0x550DD7F: MatPartitioningApply (partition.c:332)
> ==32155==    by 0x6387033: PCGAMGCreateLevel_GAMG (gamg.c:226)
> ==32155==    by 0x638BCA5: PCSetUp_GAMG (gamg.c:593)
> ==32155==    by 0x64746B1: PCSetUp (precon.c:894)
> ==32155==    by 0x65D388A: KSPSetUp (itfunc.c:377)
> ==32155==
> ==32155== Use of uninitialised value of size 8
> ==32155==    at 0x885F6F2: libmetis__CreateCoarseGraphNoMask (coarsen.c:886)
> ==32155==    by 0x885E2B9: libmetis__CreateCoarseGraph (coarsen.c:636)
> ==32155==    by 0x885CA49: libmetis__Match_RM (coarsen.c:262)
> ==32155==    by 0x885BFBD: libmetis__CoarsenGraph (coa
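
A note on reading the first Valgrind report quoted above: "Invalid read of size 4" at an address "0 bytes after a block of size 284" is the signature of an index that is exactly one past the end of an array. The small sketch below just does that arithmetic. The helper name is hypothetical, and the 4-byte element size is an assumption taken from the size of the read; the trace by itself does not say which array was overrun.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Valgrind, METIS or PETSc).
 * Given a heap block of block_bytes bytes and an access that starts
 * bytes_after bytes past the end of that block, return the element index
 * that was touched, assuming the block is an array of elem_size-byte
 * elements starting at the block base. */
size_t index_past_block(size_t block_bytes, size_t bytes_after, size_t elem_size)
{
  return (block_bytes + bytes_after) / elem_size;
}
```

With the numbers from the report, index_past_block(284, 0, 4) gives 71: the block holds 71 four-byte entries (valid indices 0 to 70), and the read touched entry 71.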
Re: [petsc-dev] Parmetis bug
Fande,

It looks to me like this branch in ParMetis must be taken to trigger this error. First *Match_SHEM* and then CreateCoarseGraphNoMask.

  /* determine which matching scheme you will use */
  switch (ctrl->ctype) {
    case METIS_CTYPE_RM:
      Match_RM(ctrl, graph);
      break;
    case METIS_CTYPE_SHEM:
      if (eqewgts || graph->nedges == 0)
        Match_RM(ctrl, graph);
      else
        *Match_SHEM(ctrl, graph);*
      break;
    default:
      gk_errexit(SIGERR, "Unknown ctype: %d\n", ctrl->ctype);
  }

---

  /* Check if the mask-version of the code is a good choice */
  mask = HTLENGTH;
  if (cnvtxs < 2*mask || graph->nedges/graph->nvtxs > mask/20) {
    CreateCoarseGraphNoMask(ctrl, graph, cnvtxs, match);
    return;
  }

The actual error is in CreateCoarseGraphNoMask: graph->cmap is too small, so this reads garbage.

parmetis coarsen.c:856:

    istart = xadj[v];
    iend   = xadj[v+1];
    for (j=istart; j<iend; j++) {
    [...]

[...] wrote:

> Fande, the problem is k below seems to index beyond the end of htable,
> resulting in a crazy m and a segv on the last line below.
>
> I don't have a clean valgrind machine now, that is what is needed if no
> one has seen anything like this. I could add a test in a MR and get the
> pipeline to do it.
>
> void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
>          idx_t *match)
> {
>   idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
>   idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
>   idx_t *cmap, *htable;
>   idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
>   graph_t *cgraph;
>
>   WCOREPUSH;
>
>   dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);
>
>   IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));
>
>   nvtxs  = graph->nvtxs;
>   ncon   = graph->ncon;
>   xadj   = graph->xadj;
>   vwgt   = graph->vwgt;
>   vsize  = graph->vsize;
>   adjncy = graph->adjncy;
>   adjwgt = graph->adjwgt;
>   cmap   = graph->cmap;
>
>   /* Initialize the coarser graph */
>   cgraph  = SetupCoarseGraph(graph, cnvtxs, dovsize);
>   cxadj   = cgraph->xadj;
>   cvwgt   = cgraph->vwgt;
>   cvsize  = cgraph->vsize;
>   cadjncy = cgraph->adjncy;
>   cadjwgt = cgraph->adjwgt;
>
>   htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));
>
>   cxadj[0] = cnvtxs = cnedges = 0;
>   for (v=0; v<nvtxs; v++) {
>     if ((u = match[v]) < v)
>       continue;
>
>     ASSERT(cmap[v] == cnvtxs);
>     ASSERT(cmap[match[v]] == cnvtxs);
>
>     if (ncon == 1)
>       cvwgt[cnvtxs] = vwgt[v];
>     else
>       icopy(ncon, vwgt+v*ncon, cvwgt+cnvtxs*ncon);
>
>     if (dovsize)
>       cvsize[cnvtxs] = vsize[v];
>
>     nedges = 0;
>
>     istart = xadj[v];
>     iend   = xadj[v+1];
>     for (j=istart; j<iend; j++) {
>       k = cmap[adjncy[j]];
>       if ((m = htable[k]) == -1) {
>         cadjncy[nedges] = k;
>         cadjwgt[nedges] = adjwgt[j];
>         htable[k] = nedges++;
>       }
>       else {
>         cadjwgt[m] += adjwgt[j];
>
> On Sun, Nov 10, 2019 at 1:35 AM Mark Adams wrote:
>
>> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong wrote:
>>
>>> Hi Mark,
>>>
>>> Thanks for reporting this bug. I was surprised because we have
>>> sufficiently heavy tests in moose using partition weights and do not
>>> have any issue so far.
>>>
>> I have been pounding on this code with elasticity and have not seen this
>> issue. I am now looking at Laplacians and I only see it with pretty large
>> problems. The example below is pretty minimal (e.g., it works with 16 cores
>> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
>> my laptop.
>>
>>> I will take a shot on this.
>>>
>> Thanks, I'll try to take a look at it also. I have seen it in DDT, but
>> did not dig further. It looked like a typical segv in ParMetis.
>>
>>> Fande,
>>>
>>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams wrote:
>>>
>>>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>>>> repartitioning. Below shows the branch and how to run it.
>>>>
>>>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>>>> seen this error in DDT but I have not had a chance to dig and try to fix
>>>> it. At least I know it has something to do with weights.
>>>>
>>>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>>>
>>>> The changes use weights and are just a few lines of code (from 1.5 years
>>>> ago):
>>>>
>>>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>>>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>>>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>>>> Author: Fande Kong
>>>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>>>
>>>>     Let parmetis and ptsotch take edge weights and vertex weights
>>>>
>>>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
>>>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> > mpiexec -n 32 ./ex13
[petsc-dev] Place to capture all our work on GPUs (and ECP ...)
Please do not respond to this email: use https://gitlab.com/petsc/petsc/issues/490

Mark Adams has been generating some great information on Summit with GAMG and now AMGx, and other people such as Hannah and Junchao are generating information important to our education about GPUs and, of course, Karl has worked hard on the next generation libaxb. Other people as well, including Stefano.

This work is reflected in email, MRs, issues, and chats on MRs. We need a way to capture all this stuff so it is easy to find. Now it is just fragments of knowledge hanging around.

Any thoughts?

Some common tag that can be used that can be searched for (tricky with mail, ban email?).

Have each GPU discussion/concept/announcement be an Issue with a label (GPU?) that can be searched for (how do we organize all the previous stuff that was just in email?)

Use Microsoft's excellent replacement for Slack (no, that won't work).

Having a website where everyone dumps everything will not work; we'll always forget and it won't have a native format.

All ideas appreciated

Barry
Re: [petsc-dev] Parmetis bug
Fande, the problem is k below seems to index beyond the end of htable, resulting in a crazy m and a segv on the last line below.

I don't have a clean valgrind machine now, that is what is needed if no one has seen anything like this. I could add a test in a MR and get the pipeline to do it.

void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
         idx_t *match)
{
  idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
  idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
  idx_t *cmap, *htable;
  idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
  graph_t *cgraph;

  WCOREPUSH;

  dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);

  IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));

  nvtxs  = graph->nvtxs;
  ncon   = graph->ncon;
  xadj   = graph->xadj;
  vwgt   = graph->vwgt;
  vsize  = graph->vsize;
  adjncy = graph->adjncy;
  adjwgt = graph->adjwgt;
  cmap   = graph->cmap;

  /* Initialize the coarser graph */
  cgraph  = SetupCoarseGraph(graph, cnvtxs, dovsize);
  cxadj   = cgraph->xadj;
  cvwgt   = cgraph->vwgt;
  cvsize  = cgraph->vsize;
  cadjncy = cgraph->adjncy;
  cadjwgt = cgraph->adjwgt;

  htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));

  cxadj[0] = cnvtxs = cnedges = 0;
  for (v=0; v<nvtxs; v++) {
    if ((u = match[v]) < v)
      continue;

    ASSERT(cmap[v] == cnvtxs);
    ASSERT(cmap[match[v]] == cnvtxs);

    if (ncon == 1)
      cvwgt[cnvtxs] = vwgt[v];
    else
      icopy(ncon, vwgt+v*ncon, cvwgt+cnvtxs*ncon);

    if (dovsize)
      cvsize[cnvtxs] = vsize[v];

    nedges = 0;

    istart = xadj[v];
    iend   = xadj[v+1];
    for (j=istart; j<iend; j++) {
      k = cmap[adjncy[j]];
      if ((m = htable[k]) == -1) {
        cadjncy[nedges] = k;
        cadjwgt[nedges] = adjwgt[j];
        htable[k] = nedges++;
      }
      else {
        cadjwgt[m] += adjwgt[j];

On Sun, Nov 10, 2019 at 1:35 AM Mark Adams wrote:

> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong wrote:
>
>> Hi Mark,
>>
>> Thanks for reporting this bug. I was surprised because we have
>> sufficiently heavy tests in moose using partition weights and do not
>> have any issue so far.
>>
> I have been pounding on this code with elasticity and have not seen this
> issue. I am now looking at Laplacians and I only see it with pretty large
> problems. The example below is pretty minimal (e.g., it works with 16 cores
> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
> my laptop.
>
>> I will take a shot on this.
>>
> Thanks, I'll try to take a look at it also. I have seen it in DDT, but did
> not dig further. It looked like a typical segv in ParMetis.
>
>> Fande,
>>
>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams wrote:
>>
>>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>>> repartitioning. Below shows the branch and how to run it.
>>>
>>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>>> seen this error in DDT but I have not had a chance to dig and try to fix
>>> it. At least I know it has something to do with weights.
>>>
>>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>>
>>> The changes use weights and are just a few lines of code (from 1.5 years
>>> ago):
>>>
>>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>>> Author: Fande Kong
>>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>>
>>>     Let parmetis and ptsotch take edge weights and vertex weights
>>>
>>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
>>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> > mpiexec -n 32 ./ex13 -cells 2,4,4, -dm_refine 5 -simplex 0 -dim 3
>>> -potential_petscspace_degree 1 -potential_petscspace_order 1 -pc_type gamg
>>> -petscpartitioner_type simple -pc_gamg_repartition
>>> true -check_pointer_intensity 0
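
To make the suspected failure mode above concrete, here is a minimal standalone sketch of the htable edge-merging step with an explicit range check on k. It is not METIS code: the function name merge_edges_checked is hypothetical, plain int stands in for idx_t, and the self-loop handling and workspace management of the real routine are left out. It only illustrates that htable has cnvtxs entries, so any cmap value outside [0, cnvtxs) makes htable[k] an out-of-bounds read, after which m is garbage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of METIS.
 * Merge the fine-graph edges adjncy[istart..iend) into coarse edges.
 * htable has cnvtxs entries, all -1 on entry, and maps a coarse vertex
 * to its slot in cadjncy/cadjwgt so that duplicate edges are summed.
 * Returns the number of coarse edges written, or -1 if a cmap value
 * falls outside [0, cnvtxs) (the condition suspected in this thread). */
int merge_edges_checked(int istart, int iend,
                        const int *adjncy, const int *adjwgt,
                        const int *cmap, int cnvtxs, int *htable,
                        int *cadjncy, int *cadjwgt)
{
  int j, k, m, nedges = 0, bad = 0;

  for (j = istart; j < iend; j++) {
    k = cmap[adjncy[j]];
    if (k < 0 || k >= cnvtxs) {   /* htable[k] would be out of bounds */
      bad = 1;
      break;
    }
    if ((m = htable[k]) == -1) {  /* first edge to coarse vertex k */
      cadjncy[nedges] = k;
      cadjwgt[nedges] = adjwgt[j];
      htable[k] = nedges++;
    }
    else {                        /* duplicate edge: accumulate weight */
      cadjwgt[m] += adjwgt[j];
    }
  }

  /* Reset the entries that were used so htable can be reused. */
  for (j = 0; j < nedges; j++)
    htable[cadjncy[j]] = -1;

  return bad ? -1 : nedges;
}
```

For example, with adjncy = {0,1,2,3}, adjwgt = {1,2,3,4}, cmap = {0,0,1,1} and cnvtxs = 2, the call returns 2 coarse edges with weights 3 and 7. Changing the last cmap entry to 5 makes it return -1 instead of indexing past the end of htable. A check like this in a debug build would turn the intermittent segv into a deterministic failure at the first bad cmap value.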