Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-02 Thread Jed Brown via petsc-dev
Do you have any experience with nsparse?

https://github.com/EBD-CREST/nsparse

I've seen claims that it is much faster than cuSPARSE for sparse
matrix-matrix products.

Karl Rupp via petsc-dev  writes:

> Hi Richard,
>
> CPU spGEMM is about twice as fast even in the GPU-friendly case of a
> single rank: http://viennacl.sourceforge.net/viennacl-benchmarks-spmm.html
>
> I agree that it would be good to have a GPU MatMatMult for the sake of
> experiments. Given these performance constraints, though, it's not a top
> priority.
>
> Best regards,
> Karli
>
>
> On 10/3/19 12:00 AM, Mills, Richard Tran via petsc-dev wrote:
>> Fellow PETSc developers,
>> 
>> I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
>> support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
>> in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
>> respectively. Is there a good reason that I shouldn't add those? My
>> guess is that support was not added because SpGEMM is hard to do well on
>> a GPU compared to many CPUs (it is hard to compete with, say, Intel Xeon
>> CPUs with their huge caches), so one has generally been better off doing
>> these operations on the CPU. Since the trend at the big supercomputing
>> centers seems to be to put more and more of the computational power into
>> GPUs, though, I'm thinking that I should add the option to use the GPU
>> library routines for SpGEMM. Is there some good reason *not* to do this
>> that I am not aware of? (Maybe the CPUs are better for this even on a
>> machine like Summit, but I think we're at the point where we should at
>> least be able to verify this experimentally.)
>> 
>> --Richard


Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-02 Thread Karl Rupp via petsc-dev

Hi Richard,

CPU spGEMM is about twice as fast even in the GPU-friendly case of a
single rank: http://viennacl.sourceforge.net/viennacl-benchmarks-spmm.html


I agree that it would be good to have a GPU MatMatMult for the sake of
experiments. Given these performance constraints, though, it's not a top
priority.


Best regards,
Karli


On 10/3/19 12:00 AM, Mills, Richard Tran via petsc-dev wrote:

Fellow PETSc developers,

I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
respectively. Is there a good reason that I shouldn't add those? My
guess is that support was not added because SpGEMM is hard to do well on
a GPU compared to many CPUs (it is hard to compete with, say, Intel Xeon
CPUs with their huge caches), so one has generally been better off doing
these operations on the CPU. Since the trend at the big supercomputing
centers seems to be to put more and more of the computational power into
GPUs, though, I'm thinking that I should add the option to use the GPU
library routines for SpGEMM. Is there some good reason *not* to do this
that I am not aware of? (Maybe the CPUs are better for this even on a
machine like Summit, but I think we're at the point where we should at
least be able to verify this experimentally.)


--Richard


Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-02 Thread Mark Adams via petsc-dev
FWIW, I've heard that CUSPARSE is going to provide integer matrix-matrix
products for indexing applications, and that it should be easy to extend
that to double, etc.
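
(For reference, the double-precision SpGEMM that cuSPARSE already exposes
is the legacy two-phase csrgemm interface. A rough, uncompiled sketch for
C = A*B, assuming the cusparseHandle_t, matrix descriptors, and
device-resident CSR arrays of A and B are already set up; error checking
omitted:)

  int nnzC = 0;
  cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
  cudaMalloc((void**)&csrRowPtrC, (m+1)*sizeof(int));
  /* Phase 1: compute the row pointer and nonzero count of C */
  cusparseXcsrgemmNnz(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, k,
                      descrA, nnzA, csrRowPtrA, csrColIndA,
                      descrB, nnzB, csrRowPtrB, csrColIndB,
                      descrC, csrRowPtrC, &nnzC);
  cudaMalloc((void**)&csrColIndC, nnzC*sizeof(int));
  cudaMalloc((void**)&csrValC, nnzC*sizeof(double));
  /* Phase 2: fill in the values and column indices of C */
  cusparseDcsrgemm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, k,
                   descrA, nnzA, csrValA, csrRowPtrA, csrColIndA,
                   descrB, nnzB, csrValB, csrRowPtrB, csrColIndB,
                   descrC, csrValC, csrRowPtrC, csrColIndC);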

On Wed, Oct 2, 2019 at 6:00 PM Mills, Richard Tran via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Fellow PETSc developers,
>
> I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
> support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
> in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
> respectively. Is there a good reason that I shouldn't add those? My
> guess is that support was not added because SpGEMM is hard to do well on
> a GPU compared to many CPUs (it is hard to compete with, say, Intel Xeon
> CPUs with their huge caches), so one has generally been better off doing
> these operations on the CPU. Since the trend at the big supercomputing
> centers seems to be to put more and more of the computational power into
> GPUs, though, I'm thinking that I should add the option to use the GPU
> library routines for SpGEMM. Is there some good reason *not* to do this
> that I am not aware of? (Maybe the CPUs are better for this even on a
> machine like Summit, but I think we're at the point where we should at
> least be able to verify this experimentally.)
>
> --Richard
>


[petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-02 Thread Mills, Richard Tran via petsc-dev
Fellow PETSc developers,

I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
respectively. Is there a good reason that I shouldn't add those? My
guess is that support was not added because SpGEMM is hard to do well on
a GPU compared to many CPUs (it is hard to compete with, say, Intel Xeon
CPUs with their huge caches), so one has generally been better off doing
these operations on the CPU. Since the trend at the big supercomputing
centers seems to be to put more and more of the computational power into
GPUs, though, I'm thinking that I should add the option to use the GPU
library routines for SpGEMM. Is there some good reason *not* to do this
that I am not aware of? (Maybe the CPUs are better for this even on a
machine like Summit, but I think we're at the point where we should at
least be able to verify this experimentally.)
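
For concreteness, user code would presumably be unchanged from the CPU
path apart from the matrix type. A minimal sketch, assuming the
AIJCUSPARSE implementation would dispatch to the cuSPARSE SpGEMM
internally:

  Mat            A, B, C;
  PetscErrorCode ierr;
  /* A and B created with MatCreate() beforehand */
  ierr = MatSetType(A, MATAIJCUSPARSE);CHKERRQ(ierr);
  ierr = MatSetType(B, MATAIJCUSPARSE);CHKERRQ(ierr);
  /* ... set sizes, preallocate, and assemble A and B as usual ... */
  /* Same call as on the CPU; the matrix type would select the GPU path */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);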

--Richard


Re: [petsc-dev] CUDA STREAMS

2019-10-02 Thread Mills, Richard Tran via petsc-dev
Mark,

It looks like you are missing some critical CUDA library (or libraries) in your 
link line. I know you will at least need the CUDA runtime "-lcudart". Look at 
something like PETSC_WITH_EXTERNAL_LIB for one of your CUDA-enabled PETSc 
builds in $PETSC_ARCH/lib/petsc/conf/petscvariables to see what else you might 
need.
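
For example, the fix would probably look something like this (the CUDA
directory below is illustrative; on Summit use whatever the cuda module
sets, e.g. $CUDA_DIR):

mpicc -g -fast -o CUDAVersion CUDAVersion.o \
  -Wl,-rpath,$CUDA_DIR/lib64 -L$CUDA_DIR/lib64 -lcudart \
  [rest of the link line as before]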

--Richard

On 10/2/19 7:20 AM, Mark Adams via petsc-dev wrote:

I found a CUDA version of STREAMS (CUDAVersion.cu) and tried to build it. I got
it to compile manually with:

nvcc -o CUDAVersion.o -ccbin pgc++ 
-I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/include
 -Wno-deprecated-gpu-targets -c --compiler-options="-g 
-I/ccs/home/adams/petsc/include 
-I/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/include   " 
`pwd`/CUDAVersion.cu
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu(22): warning: conversion from 
a string literal to "char *" is deprecated
 

And this did produce a .o file. But I get this when I try to link.

make -f makestreams CUDAVersion
mpicc -g -fast  -o CUDAVersion CUDAVersion.o 
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/pgi.ld
 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
 
-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
 
-L/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
 -Wl,-rpath,/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 
-L/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -lpetsc -llapack -lblas -lparmetis 
-lmetis -lstdc++ -ldl -lpthread -lmpiprofilesupport -lmpi_ibm_usempi 
-lmpi_ibm_mpifh -lmpi_ibm -lpgf90rtl -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgftnrtl 
-latomic -lpgkomp -lomp -lomptarget -lpgmath -lpgc -lrt -lmass_simdp9 -lmassvp9 
-lmassp9 -lm -lgcc_s -lstdc++ -ldl
CUDAVersion.o: In function `setupStream(long, PetscBool, PetscBool)':
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:394: undefined reference to 
`cudaGetDeviceCount'
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:406: undefined reference to 
`cudaSetDevice'
 

I have compared this link line with working examples and it looks the same. 
There is no .c file here -- main is in the .cu file. I assume that is the 
difference.

Any ideas?
Thanks,
Mark



Re: [petsc-dev] test harness: output of actually executed command for V=1 gone?

2019-10-02 Thread Scott Kruger via petsc-dev

In MR !2138 I named this target show-fail, which I think is more
descriptive.


config/report_tests.py -f is what's done directly.

I made it such that one can copy and paste, but it might be too verbose.
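
For example, the idea is that a failed test can be rerun through the
harness with something like (test name illustrative):

  make -f gmakefile test search='ksp_ksp_tutorials-ex12_1'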

Scott


On 9/20/19 8:53 PM, Jed Brown wrote:

"Smith, Barry F."  writes:


Satish and Barry: Do we need the error codes, or can I revert to the previous 
functionality?


   I think it is important to display the error codes.

   How about displaying at the bottom how to run the broken tests? You already 
show how to run them with the test harness; you could also print how to run 
them directly. That would be better than mixing it up with the TAP output.


How about a target for it?

make -f gmakefile show-test search=abcd

We already have print-test, which might more accurately be named ls-test.



--
Tech-X Corporation           kru...@txcorp.com
5621 Arapahoe Ave, Suite A   Phone: (720) 974-1841
Boulder, CO 80303            Fax:   (303) 448-7756


Re: [petsc-dev] Mixing separate and shared ouputs

2019-10-02 Thread Scott Kruger via petsc-dev

Fixed in MR !2138:
https://gitlab.com/petsc/petsc/merge_requests/2138

Thanks for the report.

Scott


On 9/28/19 3:44 AM, Pierre Jolivet via petsc-dev wrote:

Hello,
If I put something like this in src/ksp/ksp/examples/tutorials/ex12.c
   args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}separate output} -pc_type {{jacobi bjacobi lu}separate output}
I get
# success 9/13 tests (69.2%)

Now
   args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}shared output} -pc_type {{jacobi bjacobi lu}shared output}
Still gives me
# success 9/13 tests (69.2%)

But
   args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}shared output} -pc_type {{jacobi bjacobi lu}separate output}
Gives me
# success 6/7 tests (85.7%)

Is this the expected behavior?
Any easy way to get 13 tests as well?

Thanks,
Pierre



--
Tech-X Corporation           kru...@txcorp.com
5621 Arapahoe Ave, Suite A   Phone: (720) 974-1841
Boulder, CO 80303            Fax:   (303) 448-7756


[petsc-dev] CUDA STREAMS

2019-10-02 Thread Mark Adams via petsc-dev
I found a CUDA version of STREAMS (CUDAVersion.cu) and tried to build it. I got
it to compile manually with:

nvcc -o CUDAVersion.o -ccbin pgc++
-I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/include
-Wno-deprecated-gpu-targets -c --compiler-options="-g
-I/ccs/home/adams/petsc/include
-I/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/include   "
`pwd`/CUDAVersion.cu
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu(22): warning: conversion
from a string literal to "char *" is deprecated
 

And this did produce a .o file. But I get this when I try to link.

make -f makestreams CUDAVersion
mpicc -g -fast  -o CUDAVersion CUDAVersion.o
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/pgi.ld
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
-L/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
-Wl,-rpath,/usr/lib/gcc/ppc64le-redhat-linux/4.8.5
-L/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -lpetsc -llapack -lblas
-lparmetis -lmetis -lstdc++ -ldl -lpthread -lmpiprofilesupport
-lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lpgf90rtl -lpgf90 -lpgf90_rpm1
-lpgf902 -lpgftnrtl -latomic -lpgkomp -lomp -lomptarget -lpgmath -lpgc -lrt
-lmass_simdp9 -lmassvp9 -lmassp9 -lm -lgcc_s -lstdc++ -ldl
CUDAVersion.o: In function `setupStream(long, PetscBool, PetscBool)':
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:394: undefined reference
to `cudaGetDeviceCount'
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:406: undefined reference
to `cudaSetDevice'
 

I have compared this link line with working examples and it looks the same.
There is no .c file here -- main is in the .cu file. I assume that is the
difference.

Any ideas?
Thanks,
Mark


Re: [petsc-dev] Should v->valid_GPU_array be a bitmask?

2019-10-02 Thread Zhang, Junchao via petsc-dev
Yes, the name valid_GPU_array is very confusing. I read it as valid_places.
--Junchao Zhang


On Wed, Oct 2, 2019 at 1:12 AM Karl Rupp <r...@iue.tuwien.ac.at> wrote:
Hi Junchao,

I recall that Jed already suggested making this a bitmask ~7 years ago ;-)

On the other hand: If we touch valid_GPU_array, then we should also use
a better name or refactor completely. Code like

  (V->valid_GPU_array & PETSC_OFFLOAD_GPU)

simply isn't intuitive (nor does it make sense) when read aloud.

Best regards,
Karli


On 10/2/19 5:24 AM, Zhang, Junchao via petsc-dev wrote:
> Stefano recently modified the following code,
>
> PetscErrorCode VecCreate_SeqCUDA(Vec V)
> {
>   PetscErrorCode ierr;
>
>   PetscFunctionBegin;
>   ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
>   ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
>   ierr = VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
>   ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
>   ierr = VecSet(V,0.0);CHKERRQ(ierr);
>   ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
>   V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
>   PetscFunctionReturn(0);
> }
>
> That means if one creates an SEQCUDA vector V and then immediately tests
> if (V->valid_GPU_array == PETSC_OFFLOAD_GPU), the test will fail. That
> is counterintuitive.  I think we should have
>
> enum {PETSC_OFFLOAD_UNALLOCATED=0x0, PETSC_OFFLOAD_GPU=0x1,
>       PETSC_OFFLOAD_CPU=0x2, PETSC_OFFLOAD_BOTH=0x3}
>
>
> and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you think?
>
> --Junchao Zhang


Re: [petsc-dev] Should v->valid_GPU_array be a bitmask?

2019-10-02 Thread Karl Rupp via petsc-dev

Hi Junchao,

I recall that Jed already suggested making this a bitmask ~7 years ago ;-)

On the other hand: If we touch valid_GPU_array, then we should also use 
a better name or refactor completely. Code like


 (V->valid_GPU_array & PETSC_OFFLOAD_GPU)

simply isn't intuitive (nor does it make sense) when read aloud.
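
Something like the following would read naturally. This is just a sketch
of a possible renaming, reusing Junchao's bit values; 'offloadmask' is a
made-up field name:

typedef enum {
  PETSC_OFFLOAD_UNALLOCATED = 0x0,
  PETSC_OFFLOAD_GPU         = 0x1,
  PETSC_OFFLOAD_CPU         = 0x2,
  PETSC_OFFLOAD_BOTH        = 0x3  /* PETSC_OFFLOAD_GPU | PETSC_OFFLOAD_CPU */
} PetscOffloadMask;

if (V->offloadmask & PETSC_OFFLOAD_GPU) { /* the GPU copy is valid */ }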

Best regards,
Karli


On 10/2/19 5:24 AM, Zhang, Junchao via petsc-dev wrote:

Stefano recently modified the following code,

PetscErrorCode VecCreate_SeqCUDA(Vec V)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
  ierr = VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
  ierr = VecSet(V,0.0);CHKERRQ(ierr);
  ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
  V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
  PetscFunctionReturn(0);
}

That means if one creates an SEQCUDA vector V and then immediately tests 
if (V->valid_GPU_array == PETSC_OFFLOAD_GPU), the test will fail. That 
is counterintuitive.  I think we should have


enum {PETSC_OFFLOAD_UNALLOCATED=0x0, PETSC_OFFLOAD_GPU=0x1,
      PETSC_OFFLOAD_CPU=0x2, PETSC_OFFLOAD_BOTH=0x3}


and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you think?

--Junchao Zhang