I sent this yesterday but am having some strange mailing issues.

On 2021-04-03 22:42, Barry Smith did write:
>
> It would be very nice to NOT require PETSc users to provide this flag, how
> the heck will they know what it should be when we cannot automate it
> ourselves?
>
> Any ideas of how this can be determined based on the current system? NVIDIA
> does not help since these "advertising" names don't seem to trivially map to
> information you can get from a particular GPU when you logged into it. For
> example nvidia-smi doesn't use these names directly. Is there some mapping
> from nvidia-smi to these names we could use? If we are serious about having
> a non-trivial number of users utilizing GPUs, which we need to be for future,
> we cannot have this absurd demands in our installation process.

The mapping of the Nvidia card to the gencodes and cuda arch is one of
those annoyances that is so ridiculous it is hard to believe. The best
reference I have found is this:
https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

To this end, the fact that Kokkos provides a mapping from colloquial card
name to gencode/arch is a real benefit. The problem is that this mapping
is buried in their build system and lacks introspection.

>
>   Barry
>
> Does spack have some magic for this we could use?
>

spack developed the archspec repo to abstract all of these issues:
https://github.com/archspec/archspec

This is a *great* idea and eventually BuildSystem should incorporate it as
the standard way of doing things; however, it has been focused mostly on
the CPU issues, and is still under active development (my understanding is
that pulling it out of spack and getting those interop issues sorted out
is tangled up in how spack handles dependencies and compilers).

It'd be nice if someone would go in and port the Kokkos GPU mappings to
archspec, as there is some great knowledge on these mappings buried in the
Kokkos build system (not volunteering); i.e., translating that webpage to
some real code (even if it is in make) is valuable.

TL;DR: It's a known problem with currently no good solution AFAIK.
Waiting until archspec gets further along seems like the best solution.

Scott

P.S. ROCm has rocminfo which also doesn't solve the problem but is at
least sane.
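
For illustration only, here is a minimal sketch of what "translating that webpage to some real code" might look like for the easy half of the problem: turning a numeric compute capability (e.g. 8.0) into nvcc gencode flags. The function names are hypothetical and are not part of PETSc, Kokkos, or archspec. The `query_compute_caps` helper assumes an nvidia-smi recent enough to support the `compute_cap` query field, which older drivers do not; it does nothing about the harder mapping from marketing names to architectures.

```python
import re
import subprocess
from typing import List


def gencode_flags(compute_cap: str, embed_ptx: bool = True) -> List[str]:
    """Turn a compute capability string such as '8.0' into nvcc flags.

    Hypothetical helper: '8.0' -> sm_80 / compute_80.
    """
    m = re.fullmatch(r"\s*(\d+)\.(\d+)\s*", compute_cap)
    if m is None:
        raise ValueError("not a compute capability: %r" % compute_cap)
    cc = m.group(1) + m.group(2)
    # Native code (SASS) for exactly this architecture.
    flags = ["-gencode=arch=compute_%s,code=sm_%s" % (cc, cc)]
    if embed_ptx:
        # Also embed PTX so newer GPUs can JIT-compile the kernels.
        flags.append("-gencode=arch=compute_%s,code=compute_%s" % (cc, cc))
    return flags


def query_compute_caps() -> List[str]:
    """Ask nvidia-smi for the compute capability of each visible GPU.

    Assumption: the installed nvidia-smi supports the 'compute_cap'
    query field; on older drivers this call fails.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

A configure script could call `query_compute_caps()` and fall back to asking the user for the flag when the query fails, which keeps the manual option as the exception rather than the rule.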