On Thu, May 27, 2021 at 11:50 PM Barry Smith <[email protected]> wrote:
>
>   Mark,
>
>   Where did you run the little test program I sent you
>
>   1) when it produced the 1120 and negative number (was this on the compile
>      server or on a compute node?)
>

This is fine now. Look at my last email. I was not using srun.

>   2) when it produced the correct answer? (compile server or compute node?)
>
>   Do you run configure on a compile server (that has no GPUs) or a compute
> server that has GPUs?
>

You have to do everything on the compute nodes on Cori/gpu.

>   Don't spend your time bisecting PETSc; we know exactly where the problem
> is, we just don't see how it happens.
>
>   cuda.py, if it cannot find deviceQuery and if you did not provide a
> generation arch with -with-cuda-gencodearch=70,
>

I thought I was not supposed to use that anymore. It sounds like it is optional.

> runs a version of the little code I sent you to get the number but it is
> ??apparently?? producing garbage or not running on the compiler server and
> gives the wrong number 1120.
>

Does PETSc use MPIEXEC to run this?

Note, I have not been able to get 'make check' to work on Cori/gpu. I use
'-with-mpiexec=srun -G1 [-c 20]' and it fails to execute the tests.

OK, putting -with-cuda-gencodearch=70 back in has fixed this problem. It is
running now.
Thanks,

>
>   Just use the option -with-cuda-gencodearch=70 (you do not need to pass
> this information to any flags any more, just give this option and it will
> use it).
>
>   Barry
>
> Ideally we want it to figure it out automatically and this little test
> program in configure is supposed to do this, but since that is not always
> working yet you should just use -with-cuda-gencodearch=70
>
>
> On May 27, 2021, at 5:45 AM, Mark Adams <[email protected]> wrote:
>
> FYI, I was running the test incorrectly:
>
> 03:38 cgpu12 ~/petsc_install$ srun -n 1 -G 1 ./a.out
> 70
> 70
>
> On Wed, May 26, 2021 at 10:21 PM Mark Adams <[email protected]> wrote:
>
>> I had git bisect working and was 4 steps away when I got a new crash.
>> configure.log is empty.
>>
>> 19:15 1 cgpu02 (a531cba26b...)|BISECTING ~/petsc$ git bisect bad
>> Bisecting: 19 revisions left to test after this (roughly 4 steps)
>> [149e269f455574fbe8ce3ebaf42121ae7fdf0635] Merge branch 'tisaac/feature-spqr' into 'main'
>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>> ===============================================================================
>>              Configuring PETSc to compile on your system
>> ===============================================================================
>> *******************************************************************************
>>         CONFIGURATION CRASH  (Please send configure.log to [email protected])
>> *******************************************************************************
>>
>> EOL while scanning string literal (cuda.py, line 176)
>>   File "/global/u2/m/madams/petsc/config/configure.py", line 455, in petsc_configure
>>     framework = config.framework.Framework(['--configModules=PETSc.Configure','--optionsModule=config.compilerOptions']+sys.argv[1:], loadArgDB = 0)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 107, in __init__
>>     self.createChildren()
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 344, in createChildren
>>     self.getChild(moduleName)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/PETSc/Configure.py", line 80, in setupDependencies
>>     self.blasLapack = framework.require('config.packages.BlasLapack',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/BlasLapack.py", line 21, in setupDependencies
>>     config.package.Package.setupDependencies(self, framework)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/package.py", line 151, in setupDependencies
>>     self.mpi = framework.require('config.packages.MPI',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPI.py", line 73, in setupDependencies
>>     self.mpich = framework.require('config.packages.MPICH', self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPICH.py", line 16, in setupDependencies
>>     self.cuda = framework.require('config.packages.cuda',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 302, in getChild
>>     type = __import__(moduleName, globals(), locals(), ['Configure']).Configure
>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>
>> On Wed, May 26, 2021 at 10:10 PM Junchao Zhang <[email protected]> wrote:
>>
>>> On Wed, May 26, 2021 at 6:13 PM Barry Smith <[email protected]> wrote:
>>>
>>>>   What is HOST=cori09 Does it have GPUs?
>>>>
>>>> https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_164490976c8e07e028a8f1ce1f5cd42d6
>>>>
>>>> Seems to clearly state
>>>>
>>>> int cudaDeviceProp::major [inherited]
>>>>
>>>> Major compute capability
>>>>
>>>>
>>>> Mark, please compile and run this program on the machine you are
>>>> running configure on
>>>>
>>>> #include <stdio.h>
>>>> #include <cuda.h>
>>>> #include <cuda_runtime.h>
>>>> #include <cuda_runtime_api.h>
>>>> #include <cuda_device_runtime_api.h>
>>>> int main(int arg,char **args)
>>>> {
>>>>   struct cudaDeviceProp dp;
>>>>   cudaGetDeviceProperties(&dp, 0);
>>>>   printf("%d\n",10*dp.major+dp.minor);
>>>>
>>>>   int major,minor;
>>>>   cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
>>>>   cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, 0);
>>>>   printf("%d\n",10*major+minor);
>>>>   return(0);
>>>>
>>> Probably, you need to check the return code of these two function calls
>>> to make sure they are correct.
>>>
>>>> }
>>>>
>>>> This is what I get
>>>>
>>>> $ nvcc mytest.c -lcuda
>>>> ~/petsc (main=) arch-main
>>>> $ ./a.out
>>>> 70
>>>> 70
>>>>
>>>> Which is exactly what it is supposed to do.
>>>>
>>>>   Barry
>>>>
>>>> On May 26, 2021, at 5:31 PM, Barry Smith <[email protected]> wrote:
>>>>
>>>>   Yes, this code which I guess never got hit before
>>>>
>>>>        cudaDeviceProp dp; cudaGetDeviceProperties(&dp, 0);
>>>>        printf("%d\n",10*dp.major+dp.minor);
>>>>        return(0);;
>>>>
>>>> is using the wrong property for the generation.
>>>>
>>>> Back to the CUDA documentation for the correct information.
>>>>
>>>> On May 26, 2021, at 3:47 PM, Jacob Faibussowitsch <[email protected]> wrote:
>>>>
>>>> 1120 sounds suspiciously like some CUDA version rather than
>>>> architecture or compute capability…
>>>>
>>>> Best regards,
>>>>
>>>> Jacob Faibussowitsch
>>>> (Jacob Fai - booss - oh - vitch)
>>>> Cell: +1 (312) 694-3391
>>>>
>>>> On May 26, 2021, at 22:29, Mark Adams <[email protected]> wrote:
>>>>
>>>> I started to get this error today on Cori.
>>>>
>>>> nvcc fatal   : Unsupported gpu architecture 'compute_1120'
>>>>
>>>> I am pretty sure I had a clean build but I can redo it if you don't
>>>> know where this is from.
>>>>
>>>> Thanks,
>>>> Mark
>>>> <configure.log>
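
The failure in this thread came from a garbage detection result (1120) being turned directly into the architecture string 'compute_1120' and handed to nvcc. As a minimal sketch, and not PETSc's actual cuda.py logic, a sanity check on the detected number could reject such values before they reach the compiler, so that configure can fall back to asking for -with-cuda-gencodearch. The function names and the accepted range (10 through 999, i.e. a major version below 100) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of PETSc): returns 1 if cc looks like a
 * compute capability encoded as 10*major + minor, and 0 otherwise.
 * The range 10..999 is an assumed bound chosen for this sketch. */
int plausible_compute_capability(int cc)
{
  return (cc >= 10 && cc <= 999) ? 1 : 0;
}

/* Hypothetical helper: writes "compute_<cc>" into buf.
 * Returns 0 on success, -1 if cc is implausible or buf is too small. */
int format_gencode_arch(int cc, char *buf, size_t len)
{
  int n;

  assert(buf != NULL);
  if (!plausible_compute_capability(cc)) return -1;
  n = snprintf(buf, len, "compute_%d", cc);
  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With these helpers, format_gencode_arch(70, buf, sizeof buf) fills buf with compute_70, while 1120 or a negative value is rejected with -1 instead of producing an architecture string.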
