Re: [petsc-dev] MatPinToCPU

2019-07-23 Thread Smith, Barry F. via petsc-dev



> On Jul 23, 2019, at 9:12 AM, Mark Adams via petsc-dev  
> wrote:
> 
> I've tried to add pining the matrix and prolongator to the CPU on coarse 
> grids in GAMG with this:
> 
> /* pin reduced coase grid - could do something smarter */
> ierr = MatPinToCPU(*a_Amat_crs,PETSC_TRUE);CHKERRQ(ierr);
> ierr = MatPinToCPU(*a_P_inout,PETSC_TRUE);CHKERRQ(ierr);

  What are the symptoms of it not working? Does it appear to be still copying 
the matrices to the GPU? then running the functions on the GPU?

  I suspect the pinning is incompletely done for CUDA (and MPIOpenCL) matrices. 

We need the equivalent of 

static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
{
  PetscFunctionBegin;
  A->pinnedtocpu = flg;
  if (flg) {
A->ops->mult   = MatMult_SeqAIJ;
A->ops->multadd= MatMultAdd_SeqAIJ;
A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
A->ops->duplicate  = MatDuplicate_SeqAIJ;
  } else {
A->ops->mult   = MatMult_SeqAIJViennaCL;
A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
A->ops->destroy= MatDestroy_SeqAIJViennaCL;
A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
  }
  PetscFunctionReturn(0);
}

for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has been 
written yet. 


> 
> It does not seem to work. It does not look like CUDA has an MatCreateVecs. 
> Should I add one and copy this flag over?

   We do need this function. But I don't see how it relates to pinning. When 
the matrix is pinned to the CPU we want it to create CPU vectors which I assume 
it does.


> 
> Mark



Re: [petsc-dev] MatPinToCPU

2019-07-23 Thread Mark Adams via petsc-dev
>
>
>   What are the symptoms of it not working? Does it appear to be still
> copying the matrices to the GPU? then running the functions on the GPU?
>
>
The object is dispatching the CUDA mat-vec etc.

  I suspect the pinning is incompletely done for CUDA (and MPIOpenCL)
> matrices.
>
>
Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.

I guess I can add something like this below. Do we need to set the device
methods? They are already set when this method is set, right?


> We need the equivalent of
>
> static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> {
>   PetscFunctionBegin;
>   A->pinnedtocpu = flg;
>   if (flg) {
> A->ops->mult   = MatMult_SeqAIJ;
> A->ops->multadd= MatMultAdd_SeqAIJ;
> A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> A->ops->duplicate  = MatDuplicate_SeqAIJ;
>   } else {
> A->ops->mult   = MatMult_SeqAIJViennaCL;
> A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
>   }
>   PetscFunctionReturn(0);
> }
>
> for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has
> been written yet.
>
>
> >
> > It does not seem to work. It does not look like CUDA has an
> MatCreateVecs. Should I add one and copy this flag over?
>
>We do need this function. But I don't see how it relates to pinning.
> When the matrix is pinned to the CPU we want it to create CPU vectors which
> I assume it does.
>
>
> >
> > Mark
>
>


Re: [petsc-dev] MatPinToCPU

2019-07-23 Thread Smith, Barry F. via petsc-dev
 
 Yes, it needs to be able to switch back and forth between the CPU and GPU 
methods so you need to move into it the setting of the methods that is 
currently directly in the create method. See how  
MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr = 
MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods for 
the GPU initially.

  Barry


> On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> 
> 
>   What are the symptoms of it not working? Does it appear to be still copying 
> the matrices to the GPU? then running the functions on the GPU?
> 
> 
> The object is dispatching the CUDA mat-vec etc.
> 
>   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL) 
> matrices. 
> 
> 
> Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> 
> I guess I can add something like this below. Do we need to set the device 
> methods? They are already set when this method is set, right?
>  
> We need the equivalent of 
> 
> static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> {
>   PetscFunctionBegin;
>   A->pinnedtocpu = flg;
>   if (flg) {
> A->ops->mult   = MatMult_SeqAIJ;
> A->ops->multadd= MatMultAdd_SeqAIJ;
> A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> A->ops->duplicate  = MatDuplicate_SeqAIJ;
>   } else {
> A->ops->mult   = MatMult_SeqAIJViennaCL;
> A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
>   }
>   PetscFunctionReturn(0);
> }
> 
> for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has been 
> written yet. 
> 
> 
> > 
> > It does not seem to work. It does not look like CUDA has an MatCreateVecs. 
> > Should I add one and copy this flag over?
> 
>We do need this function. But I don't see how it relates to pinning. When 
> the matrix is pinned to the CPU we want it to create CPU vectors which I 
> assume it does.
> 
> 
> > 
> > Mark
> 



Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
I'm not sure what to do here. The problem is that pinned-to-cpu vectors are
calling *VecCUDACopyFromGPU* here.

Should I set *x->valid_GPU_array *to something else, like
PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not executed?

PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
{
  PetscErrorCode ierr;
#if defined(PETSC_HAVE_VIENNACL)
  PetscBool  is_viennacltype = PETSC_FALSE;
#endif

  PetscFunctionBegin;
  PetscValidHeaderSpecific(x,VEC_CLASSID,1);
  ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
  if (x->petscnative) {
#if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
if (*x->valid_GPU_array* == PETSC_OFFLOAD_GPU) {
#if defined(PETSC_HAVE_VIENNACL)
  ierr =
PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
  if (is_viennacltype) {
ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
  } else
#endif
  {
#if defined(PETSC_HAVE_CUDA)

*ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);*#endif
 }
} else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
#if defined(PETSC_HAVE_VIENNACL)
  ierr =
PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
  if (is_viennacltype) {
ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
  } else
#endif
  {
#if defined(PETSC_HAVE_CUDA)
ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
#endif
  }
}
#endif
*a = *((PetscScalar**)x->data);
  } else {


On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F.  wrote:

>
>  Yes, it needs to be able to switch back and forth between the CPU and GPU
> methods so you need to move into it the setting of the methods that is
> currently directly in the create method. See how
> MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr =
> MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods
> for the GPU initially.
>
>   Barry
>
>
> > On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> >
> >
> >   What are the symptoms of it not working? Does it appear to be still
> copying the matrices to the GPU? then running the functions on the GPU?
> >
> >
> > The object is dispatching the CUDA mat-vec etc.
> >
> >   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL)
> matrices.
> >
> >
> > Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> >
> > I guess I can add something like this below. Do we need to set the
> device methods? They are already set when this method is set, right?
> >
> > We need the equivalent of
> >
> > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> > {
> >   PetscFunctionBegin;
> >   A->pinnedtocpu = flg;
> >   if (flg) {
> > A->ops->mult   = MatMult_SeqAIJ;
> > A->ops->multadd= MatMultAdd_SeqAIJ;
> > A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> > A->ops->duplicate  = MatDuplicate_SeqAIJ;
> >   } else {
> > A->ops->mult   = MatMult_SeqAIJViennaCL;
> > A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> > A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> > A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> > A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
> >   }
> >   PetscFunctionReturn(0);
> > }
> >
> > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has
> been written yet.
> >
> >
> > >
> > > It does not seem to work. It does not look like CUDA has an
> MatCreateVecs. Should I add one and copy this flag over?
> >
> >We do need this function. But I don't see how it relates to pinning.
> When the matrix is pinned to the CPU we want it to create CPU vectors which
> I assume it does.
> >
> >
> > >
> > > Mark
> >
>
>


Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Smith, Barry F. via petsc-dev


  I don't understand the context. Once a vector is pinned to the CPU the flag 
should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is turned off.  Do 
you have a pinned vector that has the value PETSC_OFFLOAD_GPU?  For example 
here it is set to PETSC_OFFLOAD_CPU

PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
{

  if (pin) {
ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will likely 
change values in the vector */


  Is there any way to reproduce the problem?

  Barry




> On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> 
> I'm not sure what to do here. The problem is that pinned-to-cpu vectors are 
> calling VecCUDACopyFromGPU here.
> 
> Should I set x->valid_GPU_array to something else, like PETSC_OFFLOAD_CPU, in 
> PinToCPU so this block of code i s not executed?
> 
> PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> {
>   PetscErrorCode ierr;
> #if defined(PETSC_HAVE_VIENNACL)
>   PetscBool  is_viennacltype = PETSC_FALSE;
> #endif
> 
>   PetscFunctionBegin;
>   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
>   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
>   if (x->petscnative) {
> #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> #if defined(PETSC_HAVE_VIENNACL)
>   ierr = 
> PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
>   if (is_viennacltype) {
> ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
>   } else
> #endif
>   {
> #if defined(PETSC_HAVE_CUDA)
> ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> #endif
>  }
> } else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> #if defined(PETSC_HAVE_VIENNACL)
>   ierr = 
> PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
>   if (is_viennacltype) {
> ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
>   } else
> #endif
>   {
> #if defined(PETSC_HAVE_CUDA)
> ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
> #endif
>   }
> }
> #endif
> *a = *((PetscScalar**)x->data);
>   } else {
> 
> 
> On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F.  wrote:
>  
>  Yes, it needs to be able to switch back and forth between the CPU and GPU 
> methods so you need to move into it the setting of the methods that is 
> currently directly in the create method. See how  
> MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr = 
> MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods 
> for the GPU initially.
> 
>   Barry
> 
> 
> > On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> > 
> > 
> >   What are the symptoms of it not working? Does it appear to be still 
> > copying the matrices to the GPU? then running the functions on the GPU?
> > 
> > 
> > The object is dispatching the CUDA mat-vec etc.
> > 
> >   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL) 
> > matrices. 
> > 
> > 
> > Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> > 
> > I guess I can add something like this below. Do we need to set the device 
> > methods? They are already set when this method is set, right?
> >  
> > We need the equivalent of 
> > 
> > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> > {
> >   PetscFunctionBegin;
> >   A->pinnedtocpu = flg;
> >   if (flg) {
> > A->ops->mult   = MatMult_SeqAIJ;
> > A->ops->multadd= MatMultAdd_SeqAIJ;
> > A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> > A->ops->duplicate  = MatDuplicate_SeqAIJ;
> >   } else {
> > A->ops->mult   = MatMult_SeqAIJViennaCL;
> > A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> > A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> > A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> > A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
> >   }
> >   PetscFunctionReturn(0);
> > }
> > 
> > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has been 
> > written yet. 
> > 
> > 
> > > 
> > > It does not seem to work. It does not look like CUDA has an 
> > > MatCreateVecs. Should I add one and copy this flag over?
> > 
> >We do need this function. But I don't see how it relates to pinning. 
> > When the matrix is pinned to the CPU we want it to create CPU vectors which 
> > I assume it does.
> > 
> > 
> > > 
> > > Mark
> > 
> 



Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call
PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am
testing:

  ierr = VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
  vw   = (Vec_MPI*)(*v)->data;
  ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
_VecOps));CHKERRQ(ierr);
*  ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);*

Thanks,

On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F.  wrote:

>
>   I don't understand the context. Once a vector is pinned to the CPU the
> flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is turned
> off.  Do you have a pinned vector that has the value PETSC_OFFLOAD_GPU?
> For example here it is set to PETSC_OFFLOAD_CPU
>
> PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> {
> 
>   if (pin) {
> ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will
> likely change values in the vector */
>
>
>   Is there any way to reproduce the problem?
>
>   Barry
>
>
>
>
> > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> >
> > I'm not sure what to do here. The problem is that pinned-to-cpu vectors
> are calling VecCUDACopyFromGPU here.
> >
> > Should I set x->valid_GPU_array to something else, like
> PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not executed?
> >
> > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > {
> >   PetscErrorCode ierr;
> > #if defined(PETSC_HAVE_VIENNACL)
> >   PetscBool  is_viennacltype = PETSC_FALSE;
> > #endif
> >
> >   PetscFunctionBegin;
> >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> >   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
> >   if (x->petscnative) {
> > #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> > if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> > #if defined(PETSC_HAVE_VIENNACL)
> >   ierr =
> PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >   if (is_viennacltype) {
> > ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
> >   } else
> > #endif
> >   {
> > #if defined(PETSC_HAVE_CUDA)
> > ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> > #endif
> >  }
> > } else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> > #if defined(PETSC_HAVE_VIENNACL)
> >   ierr =
> PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >   if (is_viennacltype) {
> > ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
> >   } else
> > #endif
> >   {
> > #if defined(PETSC_HAVE_CUDA)
> > ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
> > #endif
> >   }
> > }
> > #endif
> > *a = *((PetscScalar**)x->data);
> >   } else {
> >
> >
> > On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F. 
> wrote:
> >
> >  Yes, it needs to be able to switch back and forth between the CPU and
> GPU methods so you need to move into it the setting of the methods that is
> currently directly in the create method. See how
> MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr =
> MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods
> for the GPU initially.
> >
> >   Barry
> >
> >
> > > On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> > >
> > >
> > >   What are the symptoms of it not working? Does it appear to be still
> copying the matrices to the GPU? then running the functions on the GPU?
> > >
> > >
> > > The object is dispatching the CUDA mat-vec etc.
> > >
> > >   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL)
> matrices.
> > >
> > >
> > > Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> > >
> > > I guess I can add something like this below. Do we need to set the
> device methods? They are already set when this method is set, right?
> > >
> > > We need the equivalent of
> > >
> > > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> > > {
> > >   PetscFunctionBegin;
> > >   A->pinnedtocpu = flg;
> > >   if (flg) {
> > > A->ops->mult   = MatMult_SeqAIJ;
> > > A->ops->multadd= MatMultAdd_SeqAIJ;
> > > A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> > > A->ops->duplicate  = MatDuplicate_SeqAIJ;
> > >   } else {
> > > A->ops->mult   = MatMult_SeqAIJViennaCL;
> > > A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> > > A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> > > A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> > > A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
> > >   }
> > >   PetscFunctionReturn(0);
> > > }
> > >
> > > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has
> been written yet.
> > >
> > >
> > > >
> > > > It does not seem to work. It does not look like CUDA has an
> MatCreateVecs. Should I add one and copy this flag over?
> > >
> > >We do need this functio

Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Smith, Barry F. via petsc-dev


  Good catch. Thanks. Maybe the SeqCUDA has the same problem?

> On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
> 
> Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call 
> PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am 
> testing:
> 
>   ierr = VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
>   vw   = (Vec_MPI*)(*v)->data;
>   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct _VecOps));CHKERRQ(ierr);
>   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
> 
> Thanks,
> 
> On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F.  wrote:
> 
>   I don't understand the context. Once a vector is pinned to the CPU the flag 
> should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is turned off.  
> Do you have a pinned vector that has the value PETSC_OFFLOAD_GPU?  For 
> example here it is set to PETSC_OFFLOAD_CPU
> 
> PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> {
> 
>   if (pin) {
> ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will likely 
> change values in the vector */
> 
> 
>   Is there any way to reproduce the problem?
> 
>   Barry
> 
> 
> 
> 
> > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> > 
> > I'm not sure what to do here. The problem is that pinned-to-cpu vectors are 
> > calling VecCUDACopyFromGPU here.
> > 
> > Should I set x->valid_GPU_array to something else, like PETSC_OFFLOAD_CPU, 
> > in PinToCPU so this block of code i s not executed?
> > 
> > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > {
> >   PetscErrorCode ierr;
> > #if defined(PETSC_HAVE_VIENNACL)
> >   PetscBool  is_viennacltype = PETSC_FALSE;
> > #endif
> > 
> >   PetscFunctionBegin;
> >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> >   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
> >   if (x->petscnative) {
> > #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> > if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> > #if defined(PETSC_HAVE_VIENNACL)
> >   ierr = 
> > PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >   if (is_viennacltype) {
> > ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
> >   } else
> > #endif
> >   {
> > #if defined(PETSC_HAVE_CUDA)
> > ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> > #endif
> >  }
> > } else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> > #if defined(PETSC_HAVE_VIENNACL)
> >   ierr = 
> > PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >   if (is_viennacltype) {
> > ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
> >   } else
> > #endif
> >   {
> > #if defined(PETSC_HAVE_CUDA)
> > ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
> > #endif
> >   }
> > }
> > #endif
> > *a = *((PetscScalar**)x->data);
> >   } else {
> > 
> > 
> > On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F.  wrote:
> >  
> >  Yes, it needs to be able to switch back and forth between the CPU and GPU 
> > methods so you need to move into it the setting of the methods that is 
> > currently directly in the create method. See how  
> > MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr = 
> > MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods 
> > for the GPU initially.
> > 
> >   Barry
> > 
> > 
> > > On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> > > 
> > > 
> > >   What are the symptoms of it not working? Does it appear to be still 
> > > copying the matrices to the GPU? then running the functions on the GPU?
> > > 
> > > 
> > > The object is dispatching the CUDA mat-vec etc.
> > > 
> > >   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL) 
> > > matrices. 
> > > 
> > > 
> > > Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> > > 
> > > I guess I can add something like this below. Do we need to set the device 
> > > methods? They are already set when this method is set, right?
> > >  
> > > We need the equivalent of 
> > > 
> > > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> > > {
> > >   PetscFunctionBegin;
> > >   A->pinnedtocpu = flg;
> > >   if (flg) {
> > > A->ops->mult   = MatMult_SeqAIJ;
> > > A->ops->multadd= MatMultAdd_SeqAIJ;
> > > A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> > > A->ops->duplicate  = MatDuplicate_SeqAIJ;
> > >   } else {
> > > A->ops->mult   = MatMult_SeqAIJViennaCL;
> > > A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> > > A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> > > A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> > > A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
> > >   }
> > >   PetscFunctionReturn(0);
> > > }
> > > 
> > > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't 

Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F.  wrote:

>
>   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
>

THis is done  (I may have done it).

Now it seems to me that when you call VecPinToCPU you are setting up and
don't have data, so this copy does not seem necessary. Maybe remove the
copy here:

PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  V->pinnedtocpu = pin;
  if (pin) {

*ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); *
or

Not allocate the GPU if it is pinned by added in *a check *here:

PetscErrorCode VecCUDAAllocateCheck(Vec v)
{
  PetscErrorCode ierr;
  cudaError_terr;
  cudaStream_t   stream;
  Vec_CUDA   *veccuda;

  PetscFunctionBegin;
  if (!v->spptr) {
ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
veccuda = (Vec_CUDA*)v->spptr;
*if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {*
err =
cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
veccuda->GPUarray = veccuda->GPUarray_allocated;
err = cudaStreamCreate(&stream);CHKERRCUDA(err);
veccuda->stream = stream;
veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
  if (v->data && ((Vec_Seq*)v->data)->array) {
v->valid_GPU_array = PETSC_OFFLOAD_CPU;
  } else {
v->valid_GPU_array = PETSC_OFFLOAD_GPU;
  }
}
*}*
  }
  PetscFunctionReturn(0);
}





>
> > On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
> >
> > Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call
> PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am
> testing:
> >
> >   ierr =
> VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
> >   vw   = (Vec_MPI*)(*v)->data;
> >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
> _VecOps));CHKERRQ(ierr);
> >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
> >
> > Thanks,
> >
> > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F. 
> wrote:
> >
> >   I don't understand the context. Once a vector is pinned to the CPU the
> flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is turned
> off.  Do you have a pinned vector that has the value PETSC_OFFLOAD_GPU?
> For example here it is set to PETSC_OFFLOAD_CPU
> >
> > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> > {
> > 
> >   if (pin) {
> > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> > V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will
> likely change values in the vector */
> >
> >
> >   Is there any way to reproduce the problem?
> >
> >   Barry
> >
> >
> >
> >
> > > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> > >
> > > I'm not sure what to do here. The problem is that pinned-to-cpu
> vectors are calling VecCUDACopyFromGPU here.
> > >
> > > Should I set x->valid_GPU_array to something else, like
> PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not executed?
> > >
> > > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > > {
> > >   PetscErrorCode ierr;
> > > #if defined(PETSC_HAVE_VIENNACL)
> > >   PetscBool  is_viennacltype = PETSC_FALSE;
> > > #endif
> > >
> > >   PetscFunctionBegin;
> > >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> > >   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
> > >   if (x->petscnative) {
> > > #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> > > if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> > > #if defined(PETSC_HAVE_VIENNACL)
> > >   ierr =
> PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> > >   if (is_viennacltype) {
> > > ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
> > >   } else
> > > #endif
> > >   {
> > > #if defined(PETSC_HAVE_CUDA)
> > > ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> > > #endif
> > >  }
> > > } else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> > > #if defined(PETSC_HAVE_VIENNACL)
> > >   ierr =
> PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> > >   if (is_viennacltype) {
> > > ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
> > >   } else
> > > #endif
> > >   {
> > > #if defined(PETSC_HAVE_CUDA)
> > > ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
> > > #endif
> > >   }
> > > }
> > > #endif
> > > *a = *((PetscScalar**)x->data);
> > >   } else {
> > >
> > >
> > > On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F. 
> wrote:
> > >
> > >  Yes, it needs to be able to switch back and forth between the CPU and
> GPU methods so you need to move into it the setting of the methods that is
> currently directly in the create method. See how
> MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr =
> MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the me

Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Smith, Barry F. via petsc-dev



> On Jul 27, 2019, at 11:53 AM, Mark Adams  wrote:
> 
> 
> On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F.  wrote:
> 
>   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
> 
> THis is done  (I may have done it).
> 
> Now it seems to me that when you call VecPinToCPU you are setting up and 
> don't have data, so this copy does not seem necessary. Maybe remove the copy 
> here:
> 
> PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> {
>   PetscErrorCode ierr;
> 
>   PetscFunctionBegin;
>   V->pinnedtocpu = pin;
>   if (pin) {
> ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 

   The copy from GPU should actually only do anything if the GPU already has 
data and PETSC_OFFLOAD_GPU. If the GPU does not have data 
the copy doesn't do anything. When one calls VecPinToCPU() one doesn't know 
where the data is so the call must be made, but it may do nothing

  Note that VecCUDACopyFromGPU() calls VecCUDAAllocateCheckHost() not 
VecCUDAAllocateCheck() so the GPU will not allocate space, 
VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().

   Yes, perhaps the naming could be more consistent: 

1) in one place it is Host in an other place it is nothing
2) some places it is Host, Device, some places GPU,CPU

   Perhaps Karl can make these all consistent and simpler in his refactorization


  Barry


> 
> or
> 
> Not allocate the GPU if it is pinned by added in a check here:
> 
> PetscErrorCode VecCUDAAllocateCheck(Vec v)
> {
>   PetscErrorCode ierr;
>   cudaError_terr;
>   cudaStream_t   stream;
>   Vec_CUDA   *veccuda;
> 
>   PetscFunctionBegin;
>   if (!v->spptr) {
> ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
> veccuda = (Vec_CUDA*)v->spptr;
> if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
> err = 
> cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
> veccuda->GPUarray = veccuda->GPUarray_allocated;
> err = cudaStreamCreate(&stream);CHKERRCUDA(err);
> veccuda->stream = stream;
> veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
> if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
>   if (v->data && ((Vec_Seq*)v->data)->array) {
> v->valid_GPU_array = PETSC_OFFLOAD_CPU;
>   } else {
> v->valid_GPU_array = PETSC_OFFLOAD_GPU;
>   }
> }
> }
>   }
>   PetscFunctionReturn(0);
> }
> 
> 
> 
>  
> 
> > On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
> > 
> > Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call 
> > PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am 
> > testing:
> > 
> >   ierr = VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
> >   vw   = (Vec_MPI*)(*v)->data;
> >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct 
> > _VecOps));CHKERRQ(ierr);
> >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
> > 
> > Thanks,
> > 
> > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F.  wrote:
> > 
> >   I don't understand the context. Once a vector is pinned to the CPU the 
> > flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is turned 
> > off.  Do you have a pinned vector that has the value PETSC_OFFLOAD_GPU?  
> > For example here it is set to PETSC_OFFLOAD_CPU
> > 
> > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> > {
> > 
> >   if (pin) {
> > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> > V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will 
> > likely change values in the vector */
> > 
> > 
> >   Is there any way to reproduce the problem?
> > 
> >   Barry
> > 
> > 
> > 
> > 
> > > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> > > 
> > > I'm not sure what to do here. The problem is that pinned-to-cpu vectors 
> > > are calling VecCUDACopyFromGPU here.
> > > 
> > > Should I set x->valid_GPU_array to something else, like 
> > > PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not executed?
> > > 
> > > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > > {
> > >   PetscErrorCode ierr;
> > > #if defined(PETSC_HAVE_VIENNACL)
> > >   PetscBool  is_viennacltype = PETSC_FALSE;
> > > #endif
> > > 
> > >   PetscFunctionBegin;
> > >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> > >   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
> > >   if (x->petscnative) {
> > > #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> > > if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> > > #if defined(PETSC_HAVE_VIENNACL)
> > >   ierr = 
> > > PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> > >   if (is_viennacltype) {
> > > ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
> > >   } else
> > > #endif
> > >   {
> > > #if defined(PETSC_HAVE_CUDA)
> > > ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> > > #endif
> > >  }
> > > } else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {

Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least. There are
some hacks here that we can work on.

I will start testing it tomorrow, but I am pretty sure that I have not
regressed. I am hoping that this will fix the numerical problems, which
seem to be associated with empty processors.

I did need to touch code outside of GAMG and CUDA. It might be nice to test
this in a next.

GAMG now puts all reduced processorg grids on the CPU. This could be looked
at in the future.


On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F.  wrote:

>
>
> > On Jul 27, 2019, at 11:53 AM, Mark Adams  wrote:
> >
> >
> > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F. 
> wrote:
> >
> >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
> >
> > THis is done  (I may have done it).
> >
> > Now it seems to me that when you call VecPinToCPU you are setting up and
> don't have data, so this copy does not seem necessary. Maybe remove the
> copy here:
> >
> > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> > {
> >   PetscErrorCode ierr;
> >
> >   PetscFunctionBegin;
> >   V->pinnedtocpu = pin;
> >   if (pin) {
> > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 
>
>The copy from GPU should actually only do anything if the GPU already
> has data and PETSC_OFFLOAD_GPU. If the GPU does not have data
> the copy doesn't do anything. When one calls VecPinToCPU() one doesn't
> know where the data is so the call must be made, but it may do nothing
>
>   Note that VecCUDACopyFromGPU() calls VecCUDAAllocateCheckHost() not
> VecCUDAAllocateCheck() so the GPU will not allocate space,
> VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().
>
>Yes, perhaps the naming could be more consistent:
>
> 1) in one place it is Host in an other place it is nothing
> 2) some places it is Host, Device, some places GPU,CPU
>
>Perhaps Karl can make these all consistent and simpler in his
> refactorization
>
>
>   Barry
>
>
> >
> > or
> >
> > Not allocate the GPU if it is pinned by added in a check here:
> >
> > PetscErrorCode VecCUDAAllocateCheck(Vec v)
> > {
> >   PetscErrorCode ierr;
> >   cudaError_terr;
> >   cudaStream_t   stream;
> >   Vec_CUDA   *veccuda;
> >
> >   PetscFunctionBegin;
> >   if (!v->spptr) {
> > ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
> > veccuda = (Vec_CUDA*)v->spptr;
> > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
> > err =
> cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
> > veccuda->GPUarray = veccuda->GPUarray_allocated;
> > err = cudaStreamCreate(&stream);CHKERRCUDA(err);
> > veccuda->stream = stream;
> > veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
> > if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> >   if (v->data && ((Vec_Seq*)v->data)->array) {
> > v->valid_GPU_array = PETSC_OFFLOAD_CPU;
> >   } else {
> > v->valid_GPU_array = PETSC_OFFLOAD_GPU;
> >   }
> > }
> > }
> >   }
> >   PetscFunctionReturn(0);
> > }
> >
> >
> >
> >
> >
> > > On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
> > >
> > > Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call
> PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am
> testing:
> > >
> > >   ierr =
> VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
> > >   vw   = (Vec_MPI*)(*v)->data;
> > >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
> _VecOps));CHKERRQ(ierr);
> > >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
> > >
> > > Thanks,
> > >
> > > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F. 
> wrote:
> > >
> > >   I don't understand the context. Once a vector is pinned to the CPU
> the flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is
> turned off.  Do you have a pinned vector that has the value
> PETSC_OFFLOAD_GPU?  For example here it is set to PETSC_OFFLOAD_CPU
> > >
> > > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> > > {
> > > 
> > >   if (pin) {
> > > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> > > V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will
> likely change values in the vector */
> > >
> > >
> > >   Is there any way to reproduce the problem?
> > >
> > >   Barry
> > >
> > >
> > >
> > >
> > > > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> > > >
> > > > I'm not sure what to do here. The problem is that pinned-to-cpu
> vectors are calling VecCUDACopyFromGPU here.
> > > >
> > > > Should I set x->valid_GPU_array to something else, like
> PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not executed?
> > > >
> > > > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > > > {
> > > >   PetscErrorCode ierr;
> > > > #if defined(PETSC_HAVE_VIENNACL)
> > > >   PetscBool  is_viennacltype = PETSC_FALSE;
> > > > #endif
> > > >
> > > >   PetscFunctionBegin;
> > > >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> > > > 

Re: [petsc-dev] MatPinToCPU

2019-07-28 Thread Mark Adams via petsc-dev
This is looking good. I'm not seeing the numerical problems, but I've just
hid them by avoiding the GPU on coarse grids.

Should I submit a pull request now or test more or wait for Karl?

On Sat, Jul 27, 2019 at 7:37 PM Mark Adams  wrote:

> Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least. There are
> some hacks here that we can work on.
>
> I will start testing it tomorrow, but I am pretty sure that I have not
> regressed. I am hoping that this will fix the numerical problems, which
> seem to be associated with empty processors.
>
> I did need to touch code outside of GAMG and CUDA. It might be nice to
> test this in a next.
>
> GAMG now puts all reduced processorg grids on the CPU. This could be
> looked at in the future.
>
>
> On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F. 
> wrote:
>
>>
>>
>> > On Jul 27, 2019, at 11:53 AM, Mark Adams  wrote:
>> >
>> >
>> > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F. 
>> wrote:
>> >
>> >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
>> >
>> > THis is done  (I may have done it).
>> >
>> > Now it seems to me that when you call VecPinToCPU you are setting up
>> and don't have data, so this copy does not seem necessary. Maybe remove the
>> copy here:
>> >
>> > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
>> > {
>> >   PetscErrorCode ierr;
>> >
>> >   PetscFunctionBegin;
>> >   V->pinnedtocpu = pin;
>> >   if (pin) {
>> > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 
>>
>>The copy from GPU should actually only do anything if the GPU already
>> has data and PETSC_OFFLOAD_GPU. If the GPU does not have data
>> the copy doesn't do anything. When one calls VecPinToCPU() one doesn't
>> know where the data is so the call must be made, but it may do nothing
>>
>>   Note that VecCUDACopyFromGPU() calls VecCUDAAllocateCheckHost() not
>> VecCUDAAllocateCheck() so the GPU will not allocate space,
>> VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().
>>
>>Yes, perhaps the naming could be more consistent:
>>
>> 1) in one place it is Host in an other place it is nothing
>> 2) some places it is Host, Device, some places GPU,CPU
>>
>>Perhaps Karl can make these all consistent and simpler in his
>> refactorization
>>
>>
>>   Barry
>>
>>
>> >
>> > or
>> >
>> > Not allocate the GPU if it is pinned by added in a check here:
>> >
>> > PetscErrorCode VecCUDAAllocateCheck(Vec v)
>> > {
>> >   PetscErrorCode ierr;
>> >   cudaError_terr;
>> >   cudaStream_t   stream;
>> >   Vec_CUDA   *veccuda;
>> >
>> >   PetscFunctionBegin;
>> >   if (!v->spptr) {
>> > ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
>> > veccuda = (Vec_CUDA*)v->spptr;
>> > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
>> > err =
>> cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
>> > veccuda->GPUarray = veccuda->GPUarray_allocated;
>> > err = cudaStreamCreate(&stream);CHKERRCUDA(err);
>> > veccuda->stream = stream;
>> > veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
>> > if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
>> >   if (v->data && ((Vec_Seq*)v->data)->array) {
>> > v->valid_GPU_array = PETSC_OFFLOAD_CPU;
>> >   } else {
>> > v->valid_GPU_array = PETSC_OFFLOAD_GPU;
>> >   }
>> > }
>> > }
>> >   }
>> >   PetscFunctionReturn(0);
>> > }
>> >
>> >
>> >
>> >
>> >
>> > > On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
>> > >
>> > > Yea, I just figured out the problem. VecDuplicate_MPICUDA did not
>> call PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and
>> am testing:
>> > >
>> > >   ierr =
>> VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
>> > >   vw   = (Vec_MPI*)(*v)->data;
>> > >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
>> _VecOps));CHKERRQ(ierr);
>> > >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
>> > >
>> > > Thanks,
>> > >
>> > > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F. 
>> wrote:
>> > >
>> > >   I don't understand the context. Once a vector is pinned to the CPU
>> the flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is
>> turned off.  Do you have a pinned vector that has the value
>> PETSC_OFFLOAD_GPU?  For example here it is set to PETSC_OFFLOAD_CPU
>> > >
>> > > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
>> > > {
>> > > 
>> > >   if (pin) {
>> > > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
>> > > V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code
>> will likely change values in the vector */
>> > >
>> > >
>> > >   Is there any way to reproduce the problem?
>> > >
>> > >   Barry
>> > >
>> > >
>> > >
>> > >
>> > > > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
>> > > >
>> > > > I'm not sure what to do here. The problem is that pinned-to-cpu
>> vectors are calling VecCUDACopyFromGPU here.
>> > > >
>> > > > Should I set x->valid_GPU_array to something els

Re: [petsc-dev] MatPinToCPU

2019-07-28 Thread Karl Rupp via petsc-dev

Hi Mark,

feel free to submit a fresh pull request now. I looked at your latest 
commit in the repository in order to cherry-pick it, but it looked like 
it had a few other bits in it as well.


Best regards,
Karli


On 7/28/19 6:27 PM, Mark Adams via petsc-dev wrote:
This is looking good. I'm not seeing the numerical problems, but I've 
just hid them by avoiding the GPU on coarse grids.


Should I submit a pull request now or test more or wait for Karl?

On Sat, Jul 27, 2019 at 7:37 PM Mark Adams > wrote:


Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least.
There are some hacks here that we can work on.

I will start testing it tomorrow, but I am pretty sure that I have
not regressed. I am hoping that this will fix the numerical
problems, which seem to be associated with empty processors.

I did need to touch code outside of GAMG and CUDA. It might be nice
to test this in a next.

GAMG now puts all reduced processorg grids on the CPU. This could be
looked at in the future.


On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F. mailto:bsm...@mcs.anl.gov>> wrote:



 > On Jul 27, 2019, at 11:53 AM, Mark Adams mailto:mfad...@lbl.gov>> wrote:
 >
 >
 > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F.
mailto:bsm...@mcs.anl.gov>> wrote:
 >
 >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
 >
 > THis is done  (I may have done it).
 >
 > Now it seems to me that when you call VecPinToCPU you are
setting up and don't have data, so this copy does not seem
necessary. Maybe remove the copy here:
 >
 > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
 > {
 >   PetscErrorCode ierr;
 >
 >   PetscFunctionBegin;
 >   V->pinnedtocpu = pin;
 >   if (pin) {
 >     ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 

    The copy from GPU should actually only do anything if the
GPU already has data and PETSC_OFFLOAD_GPU. If the GPU does not
have data
the copy doesn't do anything. When one calls VecPinToCPU() one
doesn't know where the data is so the call must be made, but it
may do nothing

   Note that VecCUDACopyFromGPU() calls
VecCUDAAllocateCheckHost() not VecCUDAAllocateCheck() so the GPU
will not allocate space,
VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().

    Yes, perhaps the naming could be more consistent:

1) in one place it is Host in an other place it is nothing
2) some places it is Host, Device, some places GPU,CPU

    Perhaps Karl can make these all consistent and simpler in
his refactorization


   Barry


 >
 > or
 >
 > Not allocate the GPU if it is pinned by added in a check here:
 >
 > PetscErrorCode VecCUDAAllocateCheck(Vec v)
 > {
 >   PetscErrorCode ierr;
 >   cudaError_t    err;
 >   cudaStream_t   stream;
 >   Vec_CUDA       *veccuda;
 >
 >   PetscFunctionBegin;
 >   if (!v->spptr) {
 >     ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
 >     veccuda = (Vec_CUDA*)v->spptr;
 > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
 >     err =

cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
 >     veccuda->GPUarray = veccuda->GPUarray_allocated;
 >     err = cudaStreamCreate(&stream);CHKERRCUDA(err);
 >     veccuda->stream = stream;
 >     veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
 >     if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
 >       if (v->data && ((Vec_Seq*)v->data)->array) {
 >         v->valid_GPU_array = PETSC_OFFLOAD_CPU;
 >       } else {
 >         v->valid_GPU_array = PETSC_OFFLOAD_GPU;
 >       }
 >     }
 > }
 >   }
 >   PetscFunctionReturn(0);
 > }
 >
 >
 >
 >
 >
 > > On Jul 27, 2019, at 10:40 AM, Mark Adams mailto:mfad...@lbl.gov>> wrote:
 > >
 > > Yea, I just figured out the problem. VecDuplicate_MPICUDA
did not call PinToCPU or even copy pinnedtocpu. It just copied
ops, so I added and am testing:
 > >
 > >   ierr =
VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
 > >   vw   = (Vec_MPI*)(*v)->data;
 > >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
_VecOps));CHKERRQ(ierr);
 > >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
 > >
 > > Thanks,
 > >
 > > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F.
mailto:bsm..

Re: [petsc-dev] MatPinToCPU

2019-07-29 Thread Smith, Barry F. via petsc-dev


  I don't understand the notation in the legend on the second page

12,288 cpus and no GPUs ?

24 GPUs?  or 6 GPUs

192 GPUs?

1536 GPUs?

12,288 GPUs?  or 12288/4 = 3072  GPUs?

So on the largest run using GPUs or not takes pretty much exactly the same 
amount of  time?

What about 6 GPUs vs 24 CPUs ? Same equal amount of time. 

Can you send some log summaries



> On Jul 29, 2019, at 4:01 PM, Mark Adams  wrote:
> 
> FYI, CUDA is running and here is some preliminary data on up to 1/8 of 
> SUMMIT. This run with 4 cores/processes per GPU, so the GPU is virtualized 
> into 4 GPUs.
> 
> On Sun, Jul 28, 2019 at 2:34 PM Karl Rupp  wrote:
> Hi Mark,
> 
> feel free to submit a fresh pull request now. I looked at your latest 
> commit in the repository in order to cherry-pick it, but it looked like 
> it had a few other bits in it as well.
> 
> Best regards,
> Karli
> 
> 
> On 7/28/19 6:27 PM, Mark Adams via petsc-dev wrote:
> > This is looking good. I'm not seeing the numerical problems, but I've 
> > just hid them by avoiding the GPU on coarse grids.
> > 
> > Should I submit a pull request now or test more or wait for Karl?
> > 
> > On Sat, Jul 27, 2019 at 7:37 PM Mark Adams  > > wrote:
> > 
> > Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least.
> > There are some hacks here that we can work on.
> > 
> > I will start testing it tomorrow, but I am pretty sure that I have
> > not regressed. I am hoping that this will fix the numerical
> > problems, which seem to be associated with empty processors.
> > 
> > I did need to touch code outside of GAMG and CUDA. It might be nice
> > to test this in a next.
> > 
> > GAMG now puts all reduced processorg grids on the CPU. This could be
> > looked at in the future.
> > 
> > 
> > On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F.  > > wrote:
> > 
> > 
> > 
> >  > On Jul 27, 2019, at 11:53 AM, Mark Adams  > > wrote:
> >  >
> >  >
> >  > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F.
> > mailto:bsm...@mcs.anl.gov>> wrote:
> >  >
> >  >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
> >  >
> >  > THis is done  (I may have done it).
> >  >
> >  > Now it seems to me that when you call VecPinToCPU you are
> > setting up and don't have data, so this copy does not seem
> > necessary. Maybe remove the copy here:
> >  >
> >  > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> >  > {
> >  >   PetscErrorCode ierr;
> >  >
> >  >   PetscFunctionBegin;
> >  >   V->pinnedtocpu = pin;
> >  >   if (pin) {
> >  > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 
> > 
> > The copy from GPU should actually only do anything if the
> > GPU already has data and PETSC_OFFLOAD_GPU. If the GPU does not
> > have data
> > the copy doesn't do anything. When one calls VecPinToCPU() one
> > doesn't know where the data is so the call must be made, but it
> > may do nothing
> > 
> >Note that VecCUDACopyFromGPU() calls
> > VecCUDAAllocateCheckHost() not VecCUDAAllocateCheck() so the GPU
> > will not allocate space,
> > VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().
> > 
> > Yes, perhaps the naming could be more consistent:
> > 
> > 1) in one place it is Host in an other place it is nothing
> > 2) some places it is Host, Device, some places GPU,CPU
> > 
> > Perhaps Karl can make these all consistent and simpler in
> > his refactorization
> > 
> > 
> >Barry
> > 
> > 
> >  >
> >  > or
> >  >
> >  > Not allocate the GPU if it is pinned by added in a check here:
> >  >
> >  > PetscErrorCode VecCUDAAllocateCheck(Vec v)
> >  > {
> >  >   PetscErrorCode ierr;
> >  >   cudaError_terr;
> >  >   cudaStream_t   stream;
> >  >   Vec_CUDA   *veccuda;
> >  >
> >  >   PetscFunctionBegin;
> >  >   if (!v->spptr) {
> >  > ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
> >  > veccuda = (Vec_CUDA*)v->spptr;
> >  > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
> >  > err =
> > 
> > cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
> >  > veccuda->GPUarray = veccuda->GPUarray_allocated;
> >  > err = cudaStreamCreate(&stream);CHKERRCUDA(err);
> >  > veccuda->stream = stream;
> >  > veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
> >  > if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> >  >   if (v->data && 

Re: [petsc-dev] MatPinToCPU

2019-07-29 Thread Smith, Barry F. via petsc-dev


  Thanks. Could you please send the 24 processors with the GPU? 

   Note the final column of the table gives you the percentage of flops (not 
rates, actual operations) on the GPU. For you biggest run it is

   For the MatMult it is 18 percent and for KSP solve it is 23 percent. I think 
this is much too low, we'd like to see well over 90 percent of the flops on the 
GPU; or 95 or more. Is this because you are forced to put very large matrices 
only the CPU? 

   For the MatMult if we assume the flop rate for the GPU is 25 times as fast 
as the CPU and 18 percent of the flops are done on the GPU then the ratio of 
time for the GPU should be 82.7 percent of the time for the CPU but  it is .90; 
so where is the extra time? Seems too much than just for the communication. 

   There is so much information and so much happening in the final stage that 
it is hard to discern what is killing the performance in the GPU case for the 
KSP solve. Anyway you can just have a stage at the end with several KSP solves 
and nothing else? 

   Barry


> On Jul 29, 2019, at 5:26 PM, Mark Adams  wrote:
> 
> 
> 
> On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F.  wrote:
> 
>   I don't understand the notation in the legend on the second page
> 
> 12,288 cpus and no GPUs ?
> 
> Yes
>  
> 
> 24 GPUs?  or 6 GPUs
> 
> 24 virtual, 6 real GPUs per node. The first case is one node, 24 cores/vGPUs
>  
> 
> 192 GPUs?
> 
> 1536 GPUs?
> 
> 12,288 GPUs?  or 12288/4 = 3072  GPUs?
> 
> All "GPUs" are one core/process/vGPU. So 12288 virtual GPUs and 3072 physical 
> GPUs.
> 
> Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU)
>  
> 
> So on the largest run using GPUs or not takes pretty much exactly the same 
> amount of  time?
> 
> yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've 
> attached the data.
>  
> 
> What about 6 GPUs vs 24 CPUs ? Same equal amount of time. 
> 
> Can you send some log summaries
> 
> 



Re: [petsc-dev] MatPinToCPU

2019-07-30 Thread Mark Adams via petsc-dev
On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F.  wrote:

>
>   Thanks. Could you please send the 24 processors with the GPU?
>

That is in  out_cuda_24


>Note the final column of the table gives you the percentage of flops
> (not rates, actual operations) on the GPU. For you biggest run it is
>
>For the MatMult it is 18 percent and for KSP solve it is 23 percent. I
> think this is much too low, we'd like to see well over 90 percent of the
> flops on the GPU; or 95 or more. Is this because you are forced to put very
> large matrices only the CPU?
>

Humm, that is strange. BLAS1 stuff is 100% GPU but the coarse grids are on
the CPU. This could be because it is > 99.5%. And there is this in the last
solve phase:

MatMult  679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04
0.0e+00  1 39 14  8  0   3 74 79 60  0 16438647   438720307578 1.99e+02
 519 2.55e+02 18
MatMultAdd   150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03
0.0e+00  0  2  2  0  0   1  3 10  1  0 3409019   191195194120 2.48e+01
  60 2.25e+00 21
MatMultTranspose 150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03
0.0e+00  0  2  2  0  0   0  3 10  1  0 6867795   2539317196 38 1.02e+02
 150 3.22e+00 92

I have added print statements to MatMult_[CUDA,CPU] and it looks fine. Well
over 90% should be on the GPU. I am puzzled. I'll keep digging but the log
statements look OK.


>For the MatMult if we assume the flop rate for the GPU is 25 times as
> fast as the CPU and 18 percent of the flops are done on the GPU then the
> ratio of time for the GPU should be 82.7 percent of the time for the CPU
> but  it is .90; so where is the extra time? Seems too much than just for
> the communication.
>

I don't follow this analysis but the there is something funny about the
logging ...


>
>There is so much information and so much happening in the final stage
> that it is hard to discern what is killing the performance in the GPU case
> for the KSP solve. Anyway you can just have a stage at the end with several
> KSP solves and nothing else?
>

I added this, eg,

--- Event Stage 7: KSP only

SFBcastOpBegin   263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03
0.0e+00  0  0 15  7  0   1  0 91 98  0 0   0  0 0.00e+000
0.00e+00  0
SFBcastOpEnd 263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   8  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
SFReduceBegin 48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02
0.0e+00  0  0  2  0  0   0  0  9  2  0 0   0  0 0.00e+000
0.00e+00  0
SFReduceEnd   48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
MatMult  215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03
0.0e+00  1 24 14  7  0  83 89 81 95  0 33405   177859430 1.75e+01  358
2.23e+01 17
MatMultAdd48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02
0.0e+00  0  1  2  0  0   7  5  9  2  0 20318   106989 48 2.33e+00   48
2.24e-01 20
MatMultTranspose  48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02
0.0e+00  0  1  2  0  0   2  4  9  2  0 55325   781863  0 0.00e+00   72
3.23e-01 93
MatSolve  24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0  2810   0  0 0.00e+000
0.00e+00  0
MatResidual   48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03
0.0e+00  0  5  3  1  0  17 19 18 20  0 33284   136803 96 3.62e+00   72
4.50e+00 19
VecTDot   46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00
4.6e+01  0  0  0  0  2   1  0  0  0 66  41096814  0 0.00e+000
0.00e+00 100
VecNorm   24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00
2.4e+01  0  0  0  0  1   1  0  0  0 34  25075050  0 0.00e+000
0.00e+00 100
VecCopy  146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+00   24
9.87e-02  0
VecSet   169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
VecAXPY   46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 15870   23070  0 0.00e+000
0.00e+00 100
VecAYPX  310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   3  1  0  0  0  7273   12000 48 1.97e-010
0.00e+00 100
VecAXPBYCZ96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   1  1  0  0  0 20134   46381  0 0.00e+000
0.00e+00 100
VecPointwiseMult 192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   1  0  0  0  0  38864184 24 9.87e-020
0.00e+00 100
VecScatterBegin  311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03
0.0e+00  0  0 17  7  0   2  0100100  0 0   0  0 0.00e+00   72
3.50e-

Re: [petsc-dev] MatPinToCPU

2019-07-30 Thread Smith, Barry F. via petsc-dev


  Sorry, I meant 24 CPU only


> On Jul 30, 2019, at 9:19 AM, Mark Adams  wrote:
> 
> 
> 
> On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F.  wrote:
> 
>   Thanks. Could you please send the 24 processors with the GPU? 
> 
> That is in  out_cuda_24
> 
> 
>Note the final column of the table gives you the percentage of flops (not 
> rates, actual operations) on the GPU. For you biggest run it is
> 
>For the MatMult it is 18 percent and for KSP solve it is 23 percent. I 
> think this is much too low, we'd like to see well over 90 percent of the 
> flops on the GPU; or 95 or more. Is this because you are forced to put very 
> large matrices only the CPU? 
> 
> Humm, that is strange. BLAS1 stuff is 100% GPU but the coarse grids are on 
> the CPU. This could be because it is > 99.5%. And there is this in the last 
> solve phase:
> 
> MatMult  679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 
> 0.0e+00  1 39 14  8  0   3 74 79 60  0 16438647   438720307578 1.99e+02  
> 519 2.55e+02 18
> MatMultAdd   150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 
> 0.0e+00  0  2  2  0  0   1  3 10  1  0 3409019   191195194120 2.48e+01   
> 60 2.25e+00 21
> MatMultTranspose 150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 
> 0.0e+00  0  2  2  0  0   0  3 10  1  0 6867795   2539317196 38 1.02e+02  
> 150 3.22e+00 92
>  
> I have added print statements to MatMult_[CUDA,CPU] and it looks fine. Well 
> over 90% should be on the GPU. I am puzzled. I'll keep digging but the log 
> statements look OK.
> 
> 
>For the MatMult if we assume the flop rate for the GPU is 25 times as fast 
> as the CPU and 18 percent of the flops are done on the GPU then the ratio of 
> time for the GPU should be 82.7 percent of the time for the CPU but  it is 
> .90; so where is the extra time? Seems too much than just for the 
> communication. 
> 
> I don't follow this analysis but the there is something funny about the 
> logging ...
>  
> 
>There is so much information and so much happening in the final stage that 
> it is hard to discern what is killing the performance in the GPU case for the 
> KSP solve. Anyway you can just have a stage at the end with several KSP 
> solves and nothing else? 
> 
> I added this, eg, 
> 
> --- Event Stage 7: KSP only
> 
> SFBcastOpBegin   263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 
> 0.0e+00  0  0 15  7  0   1  0 91 98  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFBcastOpEnd 263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   8  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFReduceBegin 48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 
> 0.0e+00  0  0  2  0  0   0  0  9  2  0 0   0  0 0.00e+000 
> 0.00e+00  0
> SFReduceEnd   48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> MatMult  215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 
> 0.0e+00  1 24 14  7  0  83 89 81 95  0 33405   177859430 1.75e+01  358 
> 2.23e+01 17
> MatMultAdd48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 
> 0.0e+00  0  1  2  0  0   7  5  9  2  0 20318   106989 48 2.33e+00   48 
> 2.24e-01 20
> MatMultTranspose  48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 
> 0.0e+00  0  1  2  0  0   2  4  9  2  0 55325   781863  0 0.00e+00   72 
> 3.23e-01 93
> MatSolve  24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  2810   0  0 0.00e+000 
> 0.00e+00  0
> MatResidual   48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 
> 0.0e+00  0  5  3  1  0  17 19 18 20  0 33284   136803 96 3.62e+00   72 
> 4.50e+00 19
> VecTDot   46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 
> 4.6e+01  0  0  0  0  2   1  0  0  0 66  41096814  0 0.00e+000 
> 0.00e+00 100
> VecNorm   24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 
> 2.4e+01  0  0  0  0  1   1  0  0  0 34  25075050  0 0.00e+000 
> 0.00e+00 100
> VecCopy  146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+00   24 
> 9.87e-02  0
> VecSet   169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecAXPY   46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 15870   23070  0 0.00e+000 
> 0.00e+00 100
> VecAYPX  310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   3  1  0  0  0  7273   12000 48 1.97e-010 
> 0.00e+00 100
> VecAXPBYCZ96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   1  1  0  0  0 20134   46381  0 0.00e+000 
> 0.00e+00 100
> V