Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?
Do you have any experience with nsparse? https://github.com/EBD-CREST/nsparse
I've seen claims that it is much faster than cuSPARSE for sparse
matrix-matrix products.

Karl Rupp via petsc-dev writes:

> Hi Richard,
>
> CPU spGEMM is about twice as fast even on the GPU-friendly case of a
> single rank: http://viennacl.sourceforge.net/viennacl-benchmarks-spmm.html
>
> I agree that it would be good to have a GPU-MatMatMult for the sake of
> experiments. Under these performance constraints it's not top priority,
> though.
>
> Best regards,
> Karli
>
>
> On 10/3/19 12:00 AM, Mills, Richard Tran via petsc-dev wrote:
>> Fellow PETSc developers,
>>
>> I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
>> support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
>> in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
>> respectively. Is there a good reason that I shouldn't add those? My
>> guess is that support was not added because SpGEMM is hard to do well on
>> a GPU compared to many CPUs (it is hard to compete with, say, Intel Xeon
>> CPUs with their huge caches) and it has been the case that one would
>> generally be better off doing these operations on the CPU. Since the
>> trend at the big supercomputing centers seems to be to put more and more
>> of the computational power into GPUs, I'm thinking that I should add the
>> option to use the GPU library routines for SpGEMM, though. Is there some
>> good reason to *not* do this that I am not aware of? (Maybe the CPUs are
>> better for this even on a machine like Summit, but I think we're at the
>> point that we should at least be able to experimentally verify this.)
>>
>> --Richard
Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?
Hi Richard,

CPU spGEMM is about twice as fast even on the GPU-friendly case of a single
rank: http://viennacl.sourceforge.net/viennacl-benchmarks-spmm.html

I agree that it would be good to have a GPU-MatMatMult for the sake of
experiments. Under these performance constraints it's not top priority,
though.

Best regards,
Karli

On 10/3/19 12:00 AM, Mills, Richard Tran via petsc-dev wrote:
> Fellow PETSc developers,
>
> I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
> support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
> in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
> respectively. Is there a good reason that I shouldn't add those? My guess
> is that support was not added because SpGEMM is hard to do well on a GPU
> compared to many CPUs (it is hard to compete with, say, Intel Xeon CPUs
> with their huge caches) and it has been the case that one would generally
> be better off doing these operations on the CPU. Since the trend at the
> big supercomputing centers seems to be to put more and more of the
> computational power into GPUs, I'm thinking that I should add the option
> to use the GPU library routines for SpGEMM, though. Is there some good
> reason to *not* do this that I am not aware of? (Maybe the CPUs are
> better for this even on a machine like Summit, but I think we're at the
> point that we should at least be able to experimentally verify this.)
>
> --Richard
Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?
FWIW, I've heard that CUSPARSE is going to provide integer matrix-matrix
products for indexing applications, and that it should be easy to extend
that to double, etc.

On Wed, Oct 2, 2019 at 6:00 PM Mills, Richard Tran via petsc-dev
<petsc-dev@mcs.anl.gov> wrote:

> Fellow PETSc developers,
>
> I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
> support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
> in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
> respectively. Is there a good reason that I shouldn't add those? My guess
> is that support was not added because SpGEMM is hard to do well on a GPU
> compared to many CPUs (it is hard to compete with, say, Intel Xeon CPUs
> with their huge caches) and it has been the case that one would generally
> be better off doing these operations on the CPU. Since the trend at the
> big supercomputing centers seems to be to put more and more of the
> computational power into GPUs, I'm thinking that I should add the option
> to use the GPU library routines for SpGEMM, though. Is there some good
> reason to *not* do this that I am not aware of? (Maybe the CPUs are
> better for this even on a machine like Summit, but I think we're at the
> point that we should at least be able to experimentally verify this.)
>
> --Richard
>
[petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?
Fellow PETSc developers,

I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult() in
PETSc parlance) routines provided by cuSPARSE and ViennaCL, respectively. Is
there a good reason that I shouldn't add those? My guess is that support was
not added because SpGEMM is hard to do well on a GPU compared to many CPUs
(it is hard to compete with, say, Intel Xeon CPUs with their huge caches)
and it has been the case that one would generally be better off doing these
operations on the CPU. Since the trend at the big supercomputing centers
seems to be to put more and more of the computational power into GPUs, I'm
thinking that I should add the option to use the GPU library routines for
SpGEMM, though. Is there some good reason to *not* do this that I am not
aware of? (Maybe the CPUs are better for this even on a machine like Summit,
but I think we're at the point that we should at least be able to
experimentally verify this.)

--Richard
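[For concreteness, a minimal sketch of the call sequence under discussion.
All routine names below are standard PETSc API; whether the MatMatMult()
call runs on the CPU or dispatches to a cuSPARSE/ViennaCL SpGEMM backend is
precisely the question of this thread.]

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscInt       i, Istart, Iend, n = 100;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Build a simple diagonal test matrix with cuSPARSE (GPU) storage */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJCUSPARSE);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatDuplicate(A, MAT_COPY_VALUES, &B);CHKERRQ(ierr);

  /* C = A*B: the user-facing call stays the same either way; only the
     matrix type would select a GPU SpGEMM implementation */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);

  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}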
Re: [petsc-dev] CUDA STREAMS
Mark,

It looks like you are missing some critical CUDA library (or libraries) in
your link line. I know you will at least need the CUDA runtime "-lcudart".
Look at something like PETSC_WITH_EXTERNAL_LIB for one of your CUDA-enabled
PETSc builds in $PETSC_ARCH/lib/petsc/conf/petscvariables to see what else
you might need.

--Richard

On 10/2/19 7:20 AM, Mark Adams via petsc-dev wrote:

I found a CUDAVersion.cu of STREAMS and tried to build it. I got it to
compile manually with:

nvcc -o CUDAVersion.o -ccbin pgc++ -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/include -Wno-deprecated-gpu-targets -c --compiler-options="-g -I/ccs/home/adams/petsc/include -I/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/include " `pwd`/CUDAVersion.cu

/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu(22): warning: conversion
from a string literal to "char *" is deprecated

And this did produce a .o file. But I get this when I try to link.

make -f makestreams CUDAVersion
mpicc -g -fast -o CUDAVersion CUDAVersion.o -Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib -L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib -Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib -L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/pgi.ld -Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib -L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib -Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib -L/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib -Wl,-rpath,/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -L/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -lpetsc -llapack -lblas -lparmetis -lmetis -lstdc++ -ldl -lpthread -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lpgf90rtl -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgftnrtl -latomic -lpgkomp -lomp -lomptarget -lpgmath -lpgc -lrt -lmass_simdp9 -lmassvp9 -lmassp9 -lm -lgcc_s -lstdc++ -ldl

CUDAVersion.o: In function `setupStream(long, PetscBool, PetscBool)':
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:394: undefined reference to `cudaGetDeviceCount'
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:406: undefined reference to `cudaSetDevice'

I have compared this link line with working examples and it looks the same.
There is no .c file here -- main is in the .cu file. I assume that is the
difference. Any ideas?

Thanks,
Mark
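[A sketch of the fix Richard describes, for readers hitting the same
undefined-reference errors. The CUDA library path below is a placeholder
assumption; the authoritative flags are whatever configure recorded in
petscvariables on your system:]

# see which external link flags PETSc's configure recorded
grep PETSC_WITH_EXTERNAL_LIB $PETSC_DIR/$PETSC_ARCH/lib/petsc/conf/petscvariables

# then append the CUDA runtime (and any other missing CUDA libraries)
# to the existing link line, e.g.:
mpicc -g -fast -o CUDAVersion CUDAVersion.o <existing flags> -L/usr/local/cuda/lib64 -lcudart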
Re: [petsc-dev] test harness: output of actually executed command for V=1 gone?
In MR !2138 I have this target as show-fail, which I think is more
descriptive. config/report_tests.py -f is what's done directly. I made it
such that one can copy and paste, but it might be too verbose.

Scott

On 9/20/19 8:53 PM, Jed Brown wrote:
> "Smith, Barry F." writes:
>
>>> Satish and Barry: Do we need the Error codes or can I revert to
>>> previous functionality?
>>
>> I think it is important to display the error codes. How about displaying
>> at the bottom how to run the broken tests? You already show how to run
>> them with the test harness, you could also print how to run them
>> directly? Better than mixing it up with the TAP output?
>
> How about a target for it?
>
>   make -f gmakefile show-test search=abcd
>
> We already have print-test, which might more accurately be named ls-test.

--
Tech-X Corporation            kru...@txcorp.com
5621 Arapahoe Ave, Suite A    Phone: (720) 974-1841
Boulder, CO 80303             Fax: (303) 448-7756
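[For reference, the commands under discussion in this thread, collected in
one place. show-fail comes from MR !2138 and its final name and behavior
may differ once merged:]

config/report_tests.py -f                  # print the failed tests directly
make -f gmakefile show-fail                # proposed wrapper: copy-pasteable re-run commands
make -f gmakefile print-test search=abcd   # existing target: list matching tests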
Re: [petsc-dev] Mixing separate and shared outputs
Fixed in MR !2138: https://gitlab.com/petsc/petsc/merge_requests/2138
Thanks for the report.

Scott

On 9/28/19 3:44 AM, Pierre Jolivet via petsc-dev wrote:

Hello,
If I put something like this in src/ksp/ksp/examples/tutorials/ex12.c

args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}separate output} -pc_type {{jacobi bjacobi lu}separate output}

I get

# success 9/13 tests (69.2%)

Now

args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}shared output} -pc_type {{jacobi bjacobi lu}shared output}

still gives me

# success 9/13 tests (69.2%)

But

args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}shared output} -pc_type {{jacobi bjacobi lu}separate output}

gives me

# success 6/7 tests (85.7%)

Is this the expected behavior? Any easy way to get 13 tests as well?
Thanks,
Pierre

--
Tech-X Corporation            kru...@txcorp.com
5621 Arapahoe Ave, Suite A    Phone: (720) 974-1841
Boulder, CO 80303             Fax: (303) 448-7756
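[For readers unfamiliar with the {{...}} loop syntax: a sketch of how such
a test block sits in the example source, in the standard PETSc test-harness
format (the suffix name here is made up). Each {{a b}} entry loops the test
over those values; "separate output" gives each combination its own
reference output file, while "shared output" folds combinations into one,
which is why the separate/shared mix changes the test counts Pierre
reports:]

/*TEST
   test:
      suffix: cartesian
      args: -ksp_gmres_cgs_refinement_type refine_always -ksp_type {{cg gmres}separate output} -pc_type {{jacobi bjacobi lu}separate output}
TEST*/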
[petsc-dev] CUDA STREAMS
I found a CUDAVersion.cu of STREAMS and tried to build it. I got it to
compile manually with:

nvcc -o CUDAVersion.o -ccbin pgc++ -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/include -Wno-deprecated-gpu-targets -c --compiler-options="-g -I/ccs/home/adams/petsc/include -I/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/include " `pwd`/CUDAVersion.cu

/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu(22): warning: conversion
from a string literal to "char *" is deprecated

And this did produce a .o file. But I get this when I try to link.

make -f makestreams CUDAVersion
mpicc -g -fast -o CUDAVersion CUDAVersion.o -Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib -L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib -Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib -L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/pgi.ld -Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib -L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib -Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib -L/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib -Wl,-rpath,/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -L/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -lpetsc -llapack -lblas -lparmetis -lmetis -lstdc++ -ldl -lpthread -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lpgf90rtl -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgftnrtl -latomic -lpgkomp -lomp -lomptarget -lpgmath -lpgc -lrt -lmass_simdp9 -lmassvp9 -lmassp9 -lm -lgcc_s -lstdc++ -ldl

CUDAVersion.o: In function `setupStream(long, PetscBool, PetscBool)':
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:394: undefined reference to `cudaGetDeviceCount'
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:406: undefined reference to `cudaSetDevice'

I have compared this link line with working examples and it looks the same.
There is no .c file here -- main is in the .cu file. I assume that is the
difference. Any ideas?

Thanks,
Mark
Re: [petsc-dev] Should v->valid_GPU_array be a bitmask?
Yes, the name valid_GPU_array is very confusing. I read it as valid_places.

--Junchao Zhang

On Wed, Oct 2, 2019 at 1:12 AM Karl Rupp <r...@iue.tuwien.ac.at> wrote:

> Hi Junchao,
>
> I recall that Jed already suggested to make this a bitmask ~7 years ago ;-)
>
> On the other hand: If we touch valid_GPU_array, then we should also use a
> better name or refactor completely. Code like
> (V->valid_GPU_array & PETSC_OFFLOAD_GPU) simply isn't intuitive (nor does
> it make sense) when read aloud.
>
> Best regards,
> Karli
>
> On 10/2/19 5:24 AM, Zhang, Junchao via petsc-dev wrote:
>> Stefano recently modified the following code,
>>
>> PetscErrorCode VecCreate_SeqCUDA(Vec V)
>> {
>>   PetscErrorCode ierr;
>>
>>   PetscFunctionBegin;
>>   ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
>>   ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
>>   ierr = VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
>>   ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
>>   ierr = VecSet(V,0.0);CHKERRQ(ierr);
>>   ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
>>   V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
>>   PetscFunctionReturn(0);
>> }
>>
>> That means if one creates an SEQCUDA vector V and then immediately tests
>> if (V->valid_GPU_array == PETSC_OFFLOAD_GPU), the test will fail. That
>> is counterintuitive. I think we should have
>>
>> enum {PETSC_OFFLOAD_UNALLOCATED=0x0,PETSC_OFFLOAD_GPU=0x1,PETSC_OFFLOAD_CPU=0x2,PETSC_OFFLOAD_BOTH=0x3}
>>
>> and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you
>> think?
>>
>> --Junchao Zhang
Re: [petsc-dev] Should v->valid_GPU_array be a bitmask?
Hi Junchao,

I recall that Jed already suggested to make this a bitmask ~7 years ago ;-)

On the other hand: If we touch valid_GPU_array, then we should also use a
better name or refactor completely. Code like
(V->valid_GPU_array & PETSC_OFFLOAD_GPU) simply isn't intuitive (nor does it
make sense) when read aloud.

Best regards,
Karli

On 10/2/19 5:24 AM, Zhang, Junchao via petsc-dev wrote:

Stefano recently modified the following code,

PetscErrorCode VecCreate_SeqCUDA(Vec V)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
  ierr = VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
  ierr = VecSet(V,0.0);CHKERRQ(ierr);
  ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
  V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
  PetscFunctionReturn(0);
}

That means if one creates an SEQCUDA vector V and then immediately tests
if (V->valid_GPU_array == PETSC_OFFLOAD_GPU), the test will fail. That is
counterintuitive. I think we should have

enum {PETSC_OFFLOAD_UNALLOCATED=0x0,PETSC_OFFLOAD_GPU=0x1,PETSC_OFFLOAD_CPU=0x2,PETSC_OFFLOAD_BOTH=0x3}

and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you think?

--Junchao Zhang
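[To make the semantics concrete: a small self-contained sketch of the
bitmask proposal. The enum mirrors Junchao's suggestion above; it is the
proposal under discussion, not the then-current PETSc definition:]

#include <stdio.h>

/* proposed values: GPU and CPU are independent bits, BOTH is their union */
typedef enum {
  OFFLOAD_UNALLOCATED = 0x0,
  OFFLOAD_GPU         = 0x1,
  OFFLOAD_CPU         = 0x2,
  OFFLOAD_BOTH        = 0x3   /* OFFLOAD_GPU | OFFLOAD_CPU */
} OffloadMask;

int main(void)
{
  OffloadMask mask = OFFLOAD_BOTH;  /* e.g. right after VecCreate_SeqCUDA() */

  /* bitwise test: true for GPU-only and for BOTH -- the behavior Junchao wants */
  if (mask & OFFLOAD_GPU) printf("data valid on GPU\n");
  if (mask & OFFLOAD_CPU) printf("data valid on CPU\n");

  /* equality test: false for BOTH -- the counterintuitive case above */
  if (mask == OFFLOAD_GPU) printf("this line is not reached\n");
  return 0;
}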