Re: [petsc-dev] Parmetis bug

2019-11-10 Thread Smith, Barry F. via petsc-dev


  Nice. Any way to add this exact reproducer to a PETSc example that runs 
daily? 

  The truism that all codes are buggy, even those that haven't been touched in 
15 years, is definitely on display here.

   Barry


> On Nov 10, 2019, at 7:31 PM, Fande Kong via petsc-dev  
> wrote:
> 
> Valgrind info:
> 
> ==32155== Invalid read of size 4
> ==32155==at 0x885F62F: libmetis__CreateCoarseGraphNoMask (coarsen.c:879)
> ==32155==by 0x885E2B9: libmetis__CreateCoarseGraph (coarsen.c:636)
> ==32155==by 0x885CA49: libmetis__Match_RM (coarsen.c:262)
> ==32155==by 0x885BFBD: libmetis__CoarsenGraph (coarsen.c:55)
> ==32155==by 0x8869A4C: libmetis__MultilevelBisect (pmetis.c:240)
> ==32155==by 0x88696F6: libmetis__MlevelRecursiveBisection (pmetis.c:183)
> ==32155==by 0x88698D7: libmetis__MlevelRecursiveBisection (pmetis.c:207)
> ==32155==by 0x886951E: METIS_PartGraphRecursive (pmetis.c:133)
> ==32155==by 0x885650B: libmetis__InitKWayPartitioning (kmetis.c:194)
> ==32155==by 0x8856147: libmetis__MlevelKWayPartitioning (kmetis.c:121)
> ==32155==by 0x8855FD3: METIS_PartGraphKway (kmetis.c:71)
> ==32155==by 0x85E7611: libparmetis__PartitionSmallGraph (weird.c:478)
> ==32155==by 0x85D00F3: ParMETIS_V3_PartKway (kmetis.c:91)
> ==32155==by 0x5515C19: MatPartitioningApply_Parmetis_Private 
> (pmetis.c:147)
> ==32155==by 0x5516F7D: MatPartitioningApply_Parmetis (pmetis.c:221)
> ==32155==by 0x550DD7F: MatPartitioningApply (partition.c:332)
> ==32155==by 0x6387033: PCGAMGCreateLevel_GAMG (gamg.c:226)
> ==32155==by 0x638BCA5: PCSetUp_GAMG (gamg.c:593)
> ==32155==by 0x64746B1: PCSetUp (precon.c:894)
> ==32155==by 0x65D388A: KSPSetUp (itfunc.c:377)
> ==32155==  Address 0xee2502c is 0 bytes after a block of size 284 alloc'd
> ==32155==at 0x4C2AB80: malloc (in 
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==32155==by 0x8820179: gk_malloc (memory.c:147)
> ==32155==by 0x883D92B: libmetis__imalloc (gklib.c:24)
> ==32155==by 0x885BF6A: libmetis__CoarsenGraph (coarsen.c:46)
> ==32155==by 0x8869A4C: libmetis__MultilevelBisect (pmetis.c:240)
> ==32155==by 0x88696F6: libmetis__MlevelRecursiveBisection (pmetis.c:183)
> ==32155==by 0x88698D7: libmetis__MlevelRecursiveBisection (pmetis.c:207)
> ==32155==by 0x886951E: METIS_PartGraphRecursive (pmetis.c:133)
> ==32155==by 0x885650B: libmetis__InitKWayPartitioning (kmetis.c:194)
> ==32155==by 0x8856147: libmetis__MlevelKWayPartitioning (kmetis.c:121)
> ==32155==by 0x8855FD3: METIS_PartGraphKway (kmetis.c:71)
> ==32155==by 0x85E7611: libparmetis__PartitionSmallGraph (weird.c:478)
> ==32155==by 0x85D00F3: ParMETIS_V3_PartKway (kmetis.c:91)
> ==32155==by 0x5515C19: MatPartitioningApply_Parmetis_Private 
> (pmetis.c:147)
> ==32155==by 0x5516F7D: MatPartitioningApply_Parmetis (pmetis.c:221)
> ==32155==by 0x550DD7F: MatPartitioningApply (partition.c:332)
> ==32155==by 0x6387033: PCGAMGCreateLevel_GAMG (gamg.c:226)
> ==32155==by 0x638BCA5: PCSetUp_GAMG (gamg.c:593)
> ==32155==by 0x64746B1: PCSetUp (precon.c:894)
> ==32155==by 0x65D388A: KSPSetUp (itfunc.c:377)
> ==32155== 
> ==32155== Conditional jump or move depends on uninitialised value(s)
> ==32155==at 0x885F651: libmetis__CreateCoarseGraphNoMask (coarsen.c:880)
> ==32155==by 0x885E2B9: libmetis__CreateCoarseGraph (coarsen.c:636)
> ==32155==by 0x885CA49: libmetis__Match_RM (coarsen.c:262)
> ==32155==by 0x885BFBD: libmetis__CoarsenGraph (coarsen.c:55)
> ==32155==by 0x8869A4C: libmetis__MultilevelBisect (pmetis.c:240)
> ==32155==by 0x88696F6: libmetis__MlevelRecursiveBisection (pmetis.c:183)
> ==32155==by 0x88698D7: libmetis__MlevelRecursiveBisection (pmetis.c:207)
> ==32155==by 0x886951E: METIS_PartGraphRecursive (pmetis.c:133)
> ==32155==by 0x885650B: libmetis__InitKWayPartitioning (kmetis.c:194)
> ==32155==by 0x8856147: libmetis__MlevelKWayPartitioning (kmetis.c:121)
> ==32155==by 0x8855FD3: METIS_PartGraphKway (kmetis.c:71)
> ==32155==by 0x85E7611: libparmetis__PartitionSmallGraph (weird.c:478)
> ==32155==by 0x85D00F3: ParMETIS_V3_PartKway (kmetis.c:91)
> ==32155==by 0x5515C19: MatPartitioningApply_Parmetis_Private 
> (pmetis.c:147)
> ==32155==by 0x5516F7D: MatPartitioningApply_Parmetis (pmetis.c:221)
> ==32155==by 0x550DD7F: MatPartitioningApply (partition.c:332)
> ==32155==by 0x6387033: PCGAMGCreateLevel_GAMG (gamg.c:226)
> ==32155==by 0x638BCA5: PCSetUp_GAMG (gamg.c:593)
> ==32155==by 0x64746B1: PCSetUp (precon.c:894)
> ==32155==by 0x65D388A: KSPSetUp (itfunc.c:377)
> ==32155== 
> ==32155== Use of uninitialised value of size 8
> ==32155==at 0x885F6F2: libmetis__CreateCoarseGraphNoMask (coarsen.c:886)
> ==32155==by 0x885E2B9: libmetis__CreateCoarseGraph (coarsen.c:636)
> ==32155==by 0x885CA49: libmetis__Match_RM (coarsen.c:262)
> ==32155==by 0x885BFBD: libmetis__CoarsenGraph (coa

Re: [petsc-dev] Parmetis bug

2019-11-10 Thread Mark Adams via petsc-dev
Fande, It looks to me like this branch in ParMetis must be taken to trigger
this error. First *Match_SHEM* and then CreateCoarseGraphNoMask.

   /* determine which matching scheme you will use */
   switch (ctrl->ctype) {
     case METIS_CTYPE_RM:
       Match_RM(ctrl, graph);
       break;
     case METIS_CTYPE_SHEM:
       if (eqewgts || graph->nedges == 0)
         Match_RM(ctrl, graph);
       else
         *Match_SHEM(ctrl, graph);*
       break;
     default:
       gk_errexit(SIGERR, "Unknown ctype: %d\n", ctrl->ctype);
   }

---

  /* Check if the mask-version of the code is a good choice */
  mask = HTLENGTH;
  if (cnvtxs < 2*mask || graph->nedges/graph->nvtxs > mask/20) {
    CreateCoarseGraphNoMask(ctrl, graph, cnvtxs, match);
    return;
  }



The actual error is in CreateCoarseGraphNoMask: graph->cmap is too small,
so the read from it returns garbage. ParMETIS coarsen.c:856:

istart = xadj[v];
iend   = xadj[v+1];
for (j=istart; j<iend; j++) {
[...]

[...] wrote:

> Fande, the problem is k below seems to index beyond the end of htable,
> resulting in a crazy m and a segv on the last line below.
>
> I don't have a clean valgrind machine now, that is what is needed if no
> one has seen anything like this. I could add a test in a MR and get the
> pipeline to do it.
>
> void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
>                              idx_t *match)
> {
>   idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
>   idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
>   idx_t *cmap, *htable;
>   idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
>   graph_t *cgraph;
>
>   WCOREPUSH;
>
>   dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);
>
>   IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));
>
>   nvtxs   = graph->nvtxs;
>   ncon    = graph->ncon;
>   xadj    = graph->xadj;
>   vwgt    = graph->vwgt;
>   vsize   = graph->vsize;
>   adjncy  = graph->adjncy;
>   adjwgt  = graph->adjwgt;
>   cmap    = graph->cmap;
>
>
>   /* Initialize the coarser graph */
>   cgraph = SetupCoarseGraph(graph, cnvtxs, dovsize);
>   cxadj    = cgraph->xadj;
>   cvwgt    = cgraph->vwgt;
>   cvsize   = cgraph->vsize;
>   cadjncy  = cgraph->adjncy;
>   cadjwgt  = cgraph->adjwgt;
>
>   htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));
>
>   cxadj[0] = cnvtxs = cnedges = 0;
>   for (v=0; v<nvtxs; v++) {
>     if ((u = match[v]) < v)
>       continue;
>
>     ASSERT(cmap[v] == cnvtxs);
>     ASSERT(cmap[match[v]] == cnvtxs);
>
>     if (ncon == 1)
>       cvwgt[cnvtxs] = vwgt[v];
>     else
>       icopy(ncon, vwgt+v*ncon, cvwgt+cnvtxs*ncon);
>
>     if (dovsize)
>       cvsize[cnvtxs] = vsize[v];
>
>     nedges = 0;
>
>     istart = xadj[v];
>     iend   = xadj[v+1];
>     for (j=istart; j<iend; j++) {
>       k = cmap[adjncy[j]];
>       if ((m = htable[k]) == -1) {
>         cadjncy[nedges] = k;
>         cadjwgt[nedges] = adjwgt[j];
>         htable[k] = nedges++;
>       }
>       else {
>         cadjwgt[m] += adjwgt[j];
>
> On Sun, Nov 10, 2019 at 1:35 AM Mark Adams  wrote:
>
>>
>>
>> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong  wrote:
>>
>>> Hi Mark,
>>>
>>> Thanks for reporting this bug. I was surprised because we have
>>> sufficiently heavy tests in MOOSE using partition weights and have not had
>>> any issues so far.
>>>
>>>
>> I have been pounding on this code with elasticity and have not seen this
>> issue. I am now looking at Laplacians and I only see it with pretty large
>> problems. The example below is pretty minimal (eg, it works with 16 cores
>> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
>> my laptop.
>>
>>
>>> I will take a shot at this.
>>>
>>
>> Thanks, I'll try to take a look at it also. I have seen it in DDT, but
>> did not dig further. It looked like a typical segv in ParMetis.
>>
>>
>>>
>>> Fande,
>>>
>>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams  wrote:
>>>
>>>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>>>> repartitioning. Below shows the branch and how to run it.
>>>>
>>>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>>>> seen this error in DDT but I have not had a chance to dig and try to fix
>>>> it. At least I know it has something to do with weights.
>>>>
>>>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>>>
>>>> The changes use weights and are just a few lines of code (from 1.5
>>>> years ago):
>>>>
>>>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>>>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>>>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>>>> Author: Fande Kong 
>>>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>>>
>>>> Let parmetis and ptsotch take edge weights and vertex weights
>>>>
>>>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
>>>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> > mpiexec -n 32 ./ex13 

[petsc-dev] Place to capture all our work on GPUs (and ECP ...)

2019-11-10 Thread Smith, Barry F. via petsc-dev
   Please do not respond to this email: use  
https://gitlab.com/petsc/petsc/issues/490

   Mark Adams has been generating some great information on Summit with GAMG 
and now AMGx; other people such as Hannah and Junchao are generating information 
important to our education about GPUs; and, of course, Karl has worked hard on 
the next generation libaxb. Other people as well, including Stefano. This work 
is reflected in emails, MRs, issues, and chats on MRs. 

   We need a way to capture all this stuff so it is easy to find. Now it is 
just fragments of knowledge hanging around.

   Any thoughts?

   Some common tag that can be searched for (tricky with mail; ban email?). 
Have each GPU discussion/concept/announcement be an Issue with a label (GPU?) 
that can be searched for (how do we organize all the previous stuff that was 
just in email?)

   Use Microsoft's excellent replacement for Slack (no, that won't work).

  Having a website where everyone dumps everything will not work; we'll always 
forget and it won't have a native format. 

   All ideas appreciated

   Barry



Re: [petsc-dev] Parmetis bug

2019-11-10 Thread Mark Adams via petsc-dev
Fande, the problem is k below seems to index beyond the end of htable,
resulting in a crazy m and a segv on the last line below.

I don't have a clean valgrind machine now, that is what is needed if no one
has seen anything like this. I could add a test in a MR and get the
pipeline to do it.

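void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
                             idx_t *match)
{
  idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
  idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
  idx_t *cmap, *htable;
  idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
  graph_t *cgraph;

  WCOREPUSH;

  dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);

  IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));

  nvtxs   = graph->nvtxs;
  ncon    = graph->ncon;
  xadj    = graph->xadj;
  vwgt    = graph->vwgt;
  vsize   = graph->vsize;
  adjncy  = graph->adjncy;
  adjwgt  = graph->adjwgt;
  cmap    = graph->cmap;


  /* Initialize the coarser graph */
  cgraph = SetupCoarseGraph(graph, cnvtxs, dovsize);
  cxadj    = cgraph->xadj;
  cvwgt    = cgraph->vwgt;
  cvsize   = cgraph->vsize;
  cadjncy  = cgraph->adjncy;
  cadjwgt  = cgraph->adjwgt;

  htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));

  cxadj[0] = cnvtxs = cnedges = 0;
  for (v=0; v<nvtxs; v++) {
    if ((u = match[v]) < v)
      continue;

    ASSERT(cmap[v] == cnvtxs);
    ASSERT(cmap[match[v]] == cnvtxs);

    if (ncon == 1)
      cvwgt[cnvtxs] = vwgt[v];
    else
      icopy(ncon, vwgt+v*ncon, cvwgt+cnvtxs*ncon);

    if (dovsize)
      cvsize[cnvtxs] = vsize[v];

    nedges = 0;

    istart = xadj[v];
    iend   = xadj[v+1];
    for (j=istart; j<iend; j++) {
      k = cmap[adjncy[j]];
      if ((m = htable[k]) == -1) {
        cadjncy[nedges] = k;
        cadjwgt[nedges] = adjwgt[j];
        htable[k] = nedges++;
      }
      else {
        cadjwgt[m] += adjwgt[j];

On Sun, Nov 10, 2019 at 1:35 AM Mark Adams wrote: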

>
>
> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong  wrote:
>
>> Hi Mark,
>>
>> Thanks for reporting this bug. I was surprised because we have sufficiently
>> heavy tests in MOOSE using partition weights and have not had any issues so
>> far.
>>
>>
> I have been pounding on this code with elasticity and have not seen this
> issue. I am now looking at Laplacians and I only see it with pretty large
> problems. The example below is pretty minimal (eg, it works with 16 cores
> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
> my laptop.
>
>
>> I will take a shot at this.
>>
>
> Thanks, I'll try to take a look at it also. I have seen it in DDT, but did
> not dig further. It looked like a typical segv in ParMetis.
>
>
>>
>> Fande,
>>
>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams  wrote:
>>
>>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>>> repartitioning. Below shows the branch and how to run it.
>>>
>>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>>> seen this error in DDT but I have not had a chance to dig and try to fix
>>> it. At least I know it has something to do with weights.
>>>
>>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>>
>>> The changes use weights and are just a few lines of code (from 1.5 years
>>> ago):
>>>
>>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>>> Author: Fande Kong 
>>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>>
>>> Let parmetis and ptsotch take edge weights and vertex weights
>>>
>>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
>>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> > mpiexec -n 32 ./ex13 -cells 2,4,4, -dm_refine 5 -simplex 0 -dim 3
>>> -potential_petscspace_degree 1 -potential_petscspace_order 1 -pc_type gamg
>>> -petscpartitioner_type simple -pc_gamg_repartition
>>> true -check_pointer_intensity 0
>>>
>>