Thanks - AmgX will accept unordered column indices.

________________________________
From: Barry Smith <bsm...@petsc.dev>
Sent: Saturday, 25 June 2022, 14:39
To: Mark Adams <mfad...@lbl.gov>
Cc: Matthew Martineau <mmartin...@nvidia.com>; For users of the development 
version of PETSc <petsc-dev@mcs.anl.gov>
Subject: Re: [petsc-dev] MatMPIAIJGetLocalMat problem with GPUs



  Does AMGX require sorted column indices? (Python indentation notation below)

  If not,
     just use MatMPIAIJGetLocalMatMerge() instead of MatMPIAIJGetLocalMat() (a minimal sketch appears after the code below).

  If yes,
     on the first call

     Mat tmplocal;
     PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &tmplocal));
     PetscCall(MatConvert(tmplocal, MATSEQAIJCUSPARSE, MAT_INITIAL_MATRIX, &amgx->localA));
     PetscCall(MatDestroy(&tmplocal));

     leave the later calls as is with

     PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
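
  For the "If not" case, a minimal sketch of how MatMPIAIJGetLocalMatMerge() might be called (the names are illustrative, and it is assumed the IS of global column indices may be passed as NULL when it is not needed):

     Mat localA;
     IS  glob;

     /* first call: build the merged (diagonal + off-diagonal) local matrix */
     PetscCall(MatMPIAIJGetLocalMatMerge(Pmat, MAT_INITIAL_MATRIX, &glob, &localA));
     PetscCall(ISDestroy(&glob)); /* global column indices; destroyed here since only the values are needed */

     /* later calls: refresh the values in the same localA */
     PetscCall(MatMPIAIJGetLocalMatMerge(Pmat, MAT_REUSE_MATRIX, NULL, &localA));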

Eventually, someone will need to buckle down and write 
MatMPIAIJGetLocalMat_SeqAIJCUSPARSE, but that can be done later.

Barry


On Jun 25, 2022, at 9:13 AM, Mark Adams <mfad...@lbl.gov> wrote:



On Fri, Jun 24, 2022 at 1:54 PM Barry Smith <bsm...@petsc.dev> wrote:


On Jun 24, 2022, at 1:38 PM, Mark Adams <mfad...@lbl.gov> wrote:

I am rearranging the code from the repo for clarity, but I have:

  PetscBool is_dev_ptrs;
  PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
  PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
  PetscPrintf(PETSC_COMM_SELF, "checking against mataijcusparse amgx->localA = %d\n", is_dev_ptrs);
  PetscCall(PetscObjectTypeCompareAny((PetscObject)Pmat, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
  PetscPrintf(PETSC_COMM_SELF, "checking against mataijcusparse Pmat = %d\n", is_dev_ptrs);

And it seems to show that MatMPIAIJGetLocalMat takes an mpiaijcusparse Mat and returns a seqaij Mat (see below):

 Yes, this is how it currently behaves, as Stefano has indicated. Thus it is not currently suitable for direct use with GPUs; it has to be revised to handle mpiaijcusparse matrices correctly, in the same way that MatMPIAIJGetLocalMatMerge has been revised for GPUs.


OK, sorry, I did not understand that this is not supported. We need a MatMPIAIJCusparseGetLocalMatMerge (I read Stefano's comment as saying this is supported with the "hstack" format, with unsorted columns?).

What is the best way to proceed? Should we just convert amgx->localA to mpiaijcusparse if Pmat is a cusparse matrix?
If so, should this code go in amgx or in MatMPIAIJGetLocalMat(MAT_INITIAL_MATRIX)?
Or should I add a MatMPIAIJCusparseGetLocalMatMerge that simply wraps these two calls for now?

Thanks,
Mark




AMGX version 2.2.0.132-opensource
Built on Jun 24 2022, 09:21:43
Compiled with CUDA Runtime 11.5, using CUDA driver 11.5
checking against mataijcusparse amgx->localA = 0
checking against mataijcusparse Pmat = 1
localA_name seqaij
Pmat_name mpiaijcusparse

Matt's existing testing code (below) then prints the matrix types, which are consistent with the checks and prints I added.

  // XXX DEBUG REMOVE
  const char* localA_name;
  PetscObjectGetType((PetscObject)amgx->localA, &localA_name);
  PetscPrintf(PETSC_COMM_SELF,"localA_name %s\n", localA_name);
  const char* Pmat_name;
  PetscObjectGetType((PetscObject)Pmat, &Pmat_name);
  PetscPrintf(PETSC_COMM_SELF,"Pmat_name %s\n", Pmat_name);






On Fri, Jun 24, 2022 at 10:00 AM Barry Smith <bsm...@petsc.dev> wrote:


On Jun 24, 2022, at 8:58 AM, Mark Adams <mfad...@lbl.gov> wrote:

And before we move to the MR, I think Matt found a clear problem:

* PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
returns "localA seqaij"

* And, oddly, PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); 
returns:
"it seems to detect that the pointer is a device mapped pointer but that it is 
invalid"

  It does not return a device-mapped pointer; it returns a valid host pointer only. MatSeqAIJGetArrayRead() is intended to return only a host pointer; it cannot return a device pointer. MatSeqAIJCUSPARSEGetArrayRead() returns device pointers and should be used for this purpose.
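
  A minimal sketch of the distinction, assuming amgx->localA is a MATSEQAIJCUSPARSE matrix and the usual Get/Restore pairing (this is not code from the AmgX PC itself):

  const PetscScalar *host_vals, *dev_vals;

  PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &host_vals));          /* host pointer to the values */
  PetscCall(MatSeqAIJRestoreArrayRead(amgx->localA, &host_vals));

  PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &dev_vals));   /* device (GPU) pointer to the values */
  PetscCall(MatSeqAIJCUSPARSERestoreArrayRead(amgx->localA, &dev_vals));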


Matt, let's just comment out the REUSE line and add another INITIAL line (destroying the old Mat, of course), and let's press on.

  Looking at the code, there is no way that simply using INITIAL instead of REUSE will make code that does not work on the GPU run on the GPU. MatMPIAIJGetLocalMat() returns only a MATSEQAIJ matrix, regardless of INITIAL versus REUSE, and one can never get a device pointer from a non-GPU matrix.

   As noted by Stefano, the code either needs to use MatMPIAIJGetLocalMatMerge(), which does return a CUSPARSE matrix (but the columns are not sorted), or MatMPIAIJGetLocalMat() needs to be updated to return a CUSPARSE matrix when the input MPI matrix is a CUSPARSE matrix.
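
   As an interim workaround (not the proper fix described above), a hedged sketch of converting the returned local matrix when the input is a CUSPARSE matrix; the type check and in-place conversion are assumptions about how one could glue this together, not current PETSc behavior:

   Mat       localA;
   PetscBool iscuda;

   PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &localA));
   PetscCall(PetscObjectTypeCompare((PetscObject)Pmat, MATMPIAIJCUSPARSE, &iscuda));
   /* move the sequential local matrix to its CUSPARSE form when the parallel matrix lives on the GPU */
   if (iscuda) PetscCall(MatConvert(localA, MATSEQAIJCUSPARSE, MAT_INPLACE_MATRIX, &localA));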




We can keep the debugging code for now.

We (PETSc) can work on this independently,

Thanks,
Mark

On Fri, Jun 24, 2022 at 8:51 AM Mark Adams 
<mfad...@lbl.gov<mailto:mfad...@lbl.gov>> wrote:
I am not seeing this response; I see my "hstack" comment last.
https://gitlab.com/petsc/petsc/-/merge_requests/4323

On Thu, Jun 23, 2022 at 4:37 PM Barry Smith <bsm...@petsc.dev> wrote:

 I have responded in the MR, which has all the context and the code. Please move this conversation from petsc-dev to the MR. Note you can use the little cartoon cloud symbol (upper right of the sub-window with my text) to reply to my post and keep everything in a thread for clarity.

  We are confused because it seems you are trying a variety of things and we 
don't know how the different things you tried resulted in the multiple errors 
you reported.



On Jun 23, 2022, at 3:59 PM, Matthew Martineau <mmartin...@nvidia.com> wrote:

I checked in the changes and some debugging statements.

PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));

Then the call returns false. If we instead call PetscObjectTypeCompareAny on 
Pmat then it returns true. If you print the type of the matrices:

localA seqaij
Pmat mpiaijcusparse

If you subsequently call MatSeqAIJCUSPARSEGetArrayRead on localA then it errors 
(presumably because of the type mismatch).

If we call MatSeqAIJGetArrayRead on localA and then pass the `values` to AmgX, it seems to detect that the pointer is a device-mapped pointer but that it is invalid.

PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Seems to return invalid pointer, but I'll investigate more

This doesn’t reproduce if we call:

PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Pointer appears to be valid and we converge

Essentially, all I want to achieve is that, when we are parallel, we fetch the local part of A and the device pointer to its matrix values so that we can pass them to AmgX, preferring whichever API calls are the most efficient.
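
A hedged sketch of that intended flow (assuming the local matrix is, or has been converted to, MATSEQAIJCUSPARSE, which is exactly what does not yet happen with MatMPIAIJGetLocalMat today):

if (amgx->nranks > 1) {
  /* local (diagonal + off-diagonal) part of the parallel matrix */
  PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
} else {
  amgx->localA = Pmat;
}
/* device pointer to the values, suitable for handing to AmgX */
PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));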


From: Stefano Zampini <stefano.zamp...@gmail.com>
Sent: 23 June 2022 20:55
To: Mark Adams <mfad...@lbl.gov>
Cc: Barry Smith <bsm...@petsc.dev>; For users of the development version of PETSc <petsc-dev@mcs.anl.gov>; Matthew Martineau <mmartin...@nvidia.com>
Subject: Re: [petsc-dev] MatMPIAIJGetLocalMat problem with GPUs


The logic is wrong. It should check for MATSEQAIJCUSPARSE.

On Thu, Jun 23, 2022, 21:36 Mark Adams <mfad...@lbl.gov> wrote:


On Thu, Jun 23, 2022 at 3:02 PM Barry Smith <bsm...@petsc.dev> wrote:

  It looks like the current code copies the nonzero values to the CPU from the MPI matrix (with the calls PetscCall(MatSeqAIJGetArrayRead(mpimat->A, &aav)); and PetscCall(MatSeqAIJGetArrayRead(mpimat->B, &bav));) and then copies them into the CPU memory of the Seq matrix. When the matrix entries are next accessed on the GPU, they should automatically be copied down to the GPU. So the code looks ok even for GPUs. We'll need to see the full error message with what the "invalid pointer" is.

I showed Matt how to peek into offloadmask and he found that it is in the host state, but this is not the issue. The access method should do the copy to the device.

I am thinking the logic here might be wrong. (Matt fixed "VEC" --> "MAT" in the 
comparison below).

Matt, is the issue that you are calling  MatSeqAIJCUSPARSEGetArrayRead and 
getting a host pointer?

I think the state of amgx->localA after the call to MatSeqAIJCUSPARSEGetArrayRead should be "BOTH", because this copies the data to the device, so both copies are valid and you should have device data.

  PetscBool is_dev_ptrs;
  PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, VECCUDA, VECMPICUDA, VECSEQCUDA, ""));

  if (is_dev_ptrs) {
    PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));
  } else {
    PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
  }
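
A hedged sketch of what the corrected check might look like, per Stefano's comment (comparing against the matrix type rather than the vector types):

  PetscBool is_dev_ptrs;
  PetscCall(PetscObjectTypeCompare((PetscObject)amgx->localA, MATSEQAIJCUSPARSE, &is_dev_ptrs));

  if (is_dev_ptrs) {
    PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));
  } else {
    PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
  }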


  Barry


 Yes, this routine is terribly inefficient for GPU matrices; it needs to be specialized so it does not go through CPU memory, but that is a separate issue from there being bugs in the current code.

  The code also seems to implicitly assume the parallel matrix has the same nonzero pattern on reuse. This should be checked with each use by stashing the nonzero state of the parallel matrix into the sequential matrix and making sure the parallel matrix has that same stashed value each time. Currently, if one changes the nonzero pattern of the parallel matrix, one is likely to get random confusing crashes due to memory corruption. But that is likely not the problem here.
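
A hedged sketch of the kind of check described above (the stashed field amgx->Pmat_nonzerostate and the local reuse variable are hypothetical, for illustration only):

  PetscObjectState nonzerostate;

  PetscCall(MatGetNonzeroState(Pmat, &nonzerostate));
  if (reuse == MAT_REUSE_MATRIX && nonzerostate != amgx->Pmat_nonzerostate) {
    /* the nonzero pattern changed since the local matrix was built; rebuild instead of reusing */
    PetscCall(MatDestroy(&amgx->localA));
    reuse = MAT_INITIAL_MATRIX;
  }
  PetscCall(MatMPIAIJGetLocalMat(Pmat, reuse, &amgx->localA));
  amgx->Pmat_nonzerostate = nonzerostate;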


On Jun 23, 2022, at 2:23 PM, Mark Adams <mfad...@lbl.gov> wrote:

We have a bug in the AMGx test snes_tests-ex13_amgx in parallel.
Matt Martineau found that MatMPIAIJGetLocalMat worked in the first pass in the 
code below, where the local matrix is created (INITIAL), but in the next pass, 
when "REUSE" is used, he sees an invalid pointer.
Matt found that it does have offloadmask == CPU.
Maybe it is missing logic to put the output in the same state as the input?

Any ideas on this or should I just dig into it?

Thanks,

bool partial_setup_allowed = (pc->setupcalled && pc->flag != DIFFERENT_NONZERO_PATTERN);

  if (amgx->nranks > 1) {
    if (partial_setup_allowed) {
      PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA)); // This path doesn't seem to work by the time we reach the AmgX API
    } else {
      PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA)); // This path works
    }
  } else {
    amgx->localA = Pmat;
  }




