Re: PyTorch with ROCm
Hello,

Ludovic Courtès writes:

> Yeah, we could think about a transformation option. Maybe
> ‘--with-configure-flags=python-pytorch=-DAMDGPU_TARGETS=xyz’ would work,
> and if not, we can come up with a specific transformation and/or a
> procedure that takes a list of architectures and returns a package.

I think that would work for python-pytorch itself, but it would need to
be set for all ROCm dependencies as well. It would be good to make sure
that the targets for a package are a subset of the intersection of the
targets specified for its dependencies.

>>>> - Many tests assume a GPU to be present, so they need to be disabled.
>>>
>>> Yes. I/we’d like to eventually support that. (There’d need to be some
>>> annotation in derivations or packages specifying what hardware is
>>> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
>>> to honor that.)
>>
>> That sounds like a good idea, could this also include CPU ISA
>> extensions, such as AVX2 and AVX-512?
>
> That’d be great, yes. Don’t hold your breath though as I/we haven’t
> scheduled work on this yet. If you’re interested in working on it, we
> can discuss it of course.

I am definitely interested, but am not familiar with Cuirass. Would this
also require support by the build daemon to determine which hardware is
available?

>> I think the issue is simply that elf-file? just checks the magic bytes
>> and has-elf-header? checks for the entire header. If the former returns
>> #t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
>> It seems some ROCm (or Tensile?) ELF files have another header format.
>
> Uh, never came across such a situation. What’s so special about those
> ELF files? How are they created?

After checking again, I noticed that the error actually only occurs for
rocblas. :) Here, the problematic ELF files are generated by Tensile [1]
and installed in lib/rocblas/library (by library/src/CMakeLists.txt,
which calls a CMake function from the Tensile package). They are shared
object libraries for the GPU architecture(s) [2]. Tensile uses
`clang-offload-bundler` (from rocm-toolchain) to create the files, and
it seems to me that the "ELF" header comes from there, but I don't know
why it is special.

Thanks,
David

[1] https://github.com/ROCm/Tensile/blob/17df881bde80fc20f997dfb290f4bb4b0e05a7e9/Tensile/TensileCreateLibrary.py#L283
[2] https://github.com/ROCm/Tensile/wiki/TensileCreateLibrary#code-object-libraries
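[Editor's note: the elf-file?/has-elf-header? mismatch discussed in this
thread can be illustrated outside Guix. The Python sketch below is
hypothetical (the real checks are Scheme code in guix/build/utils.scm
and guix/elf.scm, and the thread does not establish which e_ident fields
the ROCm code objects actually differ in); it only shows how a shallow
magic-byte test can accept a header that a stricter e_ident validation
rejects.]

```python
ELF_MAGIC = b"\x7fELF"

def looks_like_elf(header: bytes) -> bool:
    """Shallow check in the spirit of elf-file?: magic bytes only."""
    return header[:4] == ELF_MAGIC

def valid_elf_header(header: bytes) -> bool:
    """Stricter check in the spirit of has-elf-header?: also require
    known values in the e_ident fields (class, data encoding, version)."""
    if len(header) < 16 or header[:4] != ELF_MAGIC:
        return False
    ei_class, ei_data, ei_version = header[4], header[5], header[6]
    return ei_class in (1, 2) and ei_data in (1, 2) and ei_version == 1

# An ordinary 64-bit little-endian ELF header passes both checks.
ordinary = ELF_MAGIC + bytes([2, 1, 1]) + bytes(9)
assert looks_like_elf(ordinary) and valid_elf_header(ordinary)

# A file with the right magic but unexpected e_ident fields passes the
# shallow check only -- the mismatch that makes parse-elf raise an error.
odd = ELF_MAGIC + bytes([9, 9, 9]) + bytes(9)
assert looks_like_elf(odd) and not valid_elf_header(odd)
```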
Re: PyTorch with ROCm
Hello!

(Cc’ing my colleague Romain, who may work on related things soon.)

David Elsing skribis:

> It is the same as for other HIP/ROCm libraries, so the GPU architectures
> chosen at build time are all available at runtime and automatically
> picked. For reference, the Arch Linux package for PyTorch [1] enables
> 12 architectures. I think the architectures which can be chosen at
> compile time also depend on the ROCm version.

Nice. We’d have to check what the size and build time tradeoff is, but
it makes sense to enable a bunch of architectures.

>>> I'm not sure they can be combined however, as the GPU code is included
>>> in the shared libraries. Thus all dependent packages like
>>> python-pytorch-rocm would need to be built for each architecture as
>>> well, which is a large duplication for the non-GPU parts.
>>
>> Yeah, but maybe that’s OK if we keep the number of supported GPU
>> architectures to a minimum?
>
> If it's no issue for the build farm, it would probably be good to
> include a set of default architectures (the officially supported ones?)
> like you suggested, and make it easy to recompile all dependent packages
> for other architectures. Maybe this can be done with a package
> transformation like for '--tune'? IIRC, building composable-kernel for
> the default architectures with 16 threads exceeded 32 GB of memory
> before I cancelled the build and set it to only one architecture.

Yeah, we could think about a transformation option. Maybe
‘--with-configure-flags=python-pytorch=-DAMDGPU_TARGETS=xyz’ would work,
and if not, we can come up with a specific transformation and/or a
procedure that takes a list of architectures and returns a package.

>>> - Many tests assume a GPU to be present, so they need to be disabled.
>>
>> Yes. I/we’d like to eventually support that. (There’d need to be some
>> annotation in derivations or packages specifying what hardware is
>> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
>> to honor that.)
>
> That sounds like a good idea, could this also include CPU ISA
> extensions, such as AVX2 and AVX-512?

That’d be great, yes. Don’t hold your breath though, as I/we haven’t
scheduled work on this yet. If you’re interested in working on it, we
can discuss it of course.

> I think the issue is simply that elf-file? just checks the magic bytes
> and has-elf-header? checks for the entire header. If the former returns
> #t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
> It seems some ROCm (or Tensile?) ELF files have another header format.

Uh, never came across such a situation. What’s so special about those
ELF files? How are they created?

>> Oh, just noticed your patch brings a lot of things beyond PyTorch
>> itself! I think there’s some overlap with
>> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
>> should synchronize.
>
> Ah, I did not see this before; the overlap seems to be tensile,
> roctracer and rocblas. For rocblas, I saw that they set
> "-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?

Could be, we’ll see.

Thanks,
Ludo’.
Re: PyTorch with ROCm
Hi!

Ludovic Courtès writes:

> I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
> being (I can create you an account there if you wish, so you can create
> merge requests etc.). Let me know!

Ok sure, that sounds good! I made the packages only for ROCm 6.0.2 so
far though.

> I agree with Ricardo that this should be merged into Guix proper
> eventually. This is still in flux and we’d need to check what Kjetil
> and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

Yes, I agree; the ROCm packages are not ready to be merged yet.

> Is PyTorch able to build code for several GPU architectures and pick the
> right one at run time? If it does, that would seem like the better
> option for me, unless that is indeed so computationally expensive that
> it’s not affordable.

It is the same as for other HIP/ROCm libraries, so the GPU architectures
chosen at build time are all available at runtime and automatically
picked. For reference, the Arch Linux package for PyTorch [1] enables
12 architectures. I think the architectures which can be chosen at
compile time also depend on the ROCm version.

>> I'm not sure they can be combined however, as the GPU code is included
>> in the shared libraries. Thus all dependent packages like
>> python-pytorch-rocm would need to be built for each architecture as
>> well, which is a large duplication for the non-GPU parts.
>
> Yeah, but maybe that’s OK if we keep the number of supported GPU
> architectures to a minimum?

If it's no issue for the build farm, it would probably be good to
include a set of default architectures (the officially supported ones?)
like you suggested, and make it easy to recompile all dependent packages
for other architectures. Maybe this can be done with a package
transformation like for '--tune'? IIRC, building composable-kernel for
the default architectures with 16 threads exceeded 32 GB of memory
before I cancelled the build and set it to only one architecture.

>> - Many tests assume a GPU to be present, so they need to be disabled.
>
> Yes. I/we’d like to eventually support that. (There’d need to be some
> annotation in derivations or packages specifying what hardware is
> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
> to honor that.)

That sounds like a good idea; could this also include CPU ISA
extensions, such as AVX2 and AVX-512?

>> - For several packages (e.g. rocfft), I had to disable the
>>   validate-runpath? phase, as there was an error when reading ELF
>>   files. It is however possible that I also disabled it for packages
>>   where it was not necessary, but it was the case for rocblas at
>>   least. Here, the generated kernels are contained in ELF files, which
>>   are detected by elf-file? in guix/build/utils.scm, but rejected by
>>   has-elf-header? in guix/elf.scm, which leads to an error.
>
> Weird. We’d need to look more closely into the errors you got.

I think the issue is simply that elf-file? just checks the magic bytes
and has-elf-header? checks for the entire header. If the former returns
#t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
It seems some ROCm (or Tensile?) ELF files have another header format.

> Oh, just noticed your patch brings a lot of things beyond PyTorch
> itself! I think there’s some overlap with
> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
> should synchronize.

Ah, I did not see this before; the overlap seems to be tensile,
roctracer and rocblas. For rocblas, I saw that they set
"-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?

Thank you!
David

[1] https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/ae90c1e8bdb99af458ca0a545c5736950a747690/PKGBUILD
Re: PyTorch with ROCm
Hello!

David Elsing skribis:

> after seeing that ROCm packages [1] are available in the Guix-HPC
> channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

Nice!

> The changes for the ROCm packages are here [4] as a modification of
> Guix-HPC. There, the python-pytorch-rocm package in
> amd/machine-learning.scm depends on the python-pytorch-avx package in
> [2,3]. Both python-pytorch and python-pytorch-avx support AVX2 /
> AVX-512 instructions, but the latter also has support for fbgemm and
> nnpack. I used it over python-pytorch because AVX2 or AVX-512
> instructions should be available on a CPU with PCIe atomics anyway,
> which ROCm requires.

I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
being (I can create you an account there if you wish, so you can create
merge requests etc.). Let me know!

I agree with Ricardo that this should be merged into Guix proper
eventually. This is still in flux and we’d need to check what Kjetil
and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

> For some packages, such as composable-kernel, the build time and
> memory requirement is already very high when building only for one GPU
> architecture, so maybe it would be best to make a separate package for
> each architecture?

Is PyTorch able to build code for several GPU architectures and pick the
right one at run time? If it does, that would seem like the better
option for me, unless that is indeed so computationally expensive that
it’s not affordable.

> I'm not sure they can be combined however, as the GPU code is included
> in the shared libraries. Thus all dependent packages like
> python-pytorch-rocm would need to be built for each architecture as
> well, which is a large duplication for the non-GPU parts.

Yeah, but maybe that’s OK if we keep the number of supported GPU
architectures to a minimum?

> There were a few other issues as well; some of them should probably be
> addressed upstream:
>
> - Many tests assume a GPU to be present, so they need to be disabled.

Yes. I/we’d like to eventually support that. (There’d need to be some
annotation in derivations or packages specifying what hardware is
required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
to honor that.)

> - For several packages (e.g. rocfft), I had to disable the
>   validate-runpath? phase, as there was an error when reading ELF
>   files. It is however possible that I also disabled it for packages
>   where it was not necessary, but it was the case for rocblas at
>   least. Here, the generated kernels are contained in ELF files, which
>   are detected by elf-file? in guix/build/utils.scm, but rejected by
>   has-elf-header? in guix/elf.scm, which leads to an error.

Weird. We’d need to look more closely into the errors you got.

[...]

> - There were a few errors due to using the GCC 11 system headers with
>   rocm-toolchain (which is based on Clang+LLVM). For roctracer,
>   replacing std::experimental::filesystem by std::filesystem suffices,
>   but for rocthrust, the placement new operator is not found. I
>   applied the patch from Gentoo [5], where it is replaced by a simple
>   assignment. It looks like UB to me though, even if it happens to
>   work. The question is whether this is a bug in libstdc++, Clang or
>   amdclang++...
> - rocMLIR also contains a fork of the LLVM source tree, and it is not
>   clear at first glance how exactly it differs from the main ROCm fork
>   of LLVM or upstream LLVM.

Oh, just noticed your patch brings a lot of things beyond PyTorch
itself! I think there’s some overlap with
<https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
should synchronize.

Thanks!
Ludo’.
Re: PyTorch with ROCm
Hi Ricardo,

thanks for the information!

Ricardo Wurmus writes:

> Oh, commit 8429f25ecd83594e80676a67ad9c54f0d6cf3f16 added
> python-pytorch2 at version 2.2.1. Do you think you could adjust your
> patches to modify that one instead?

I already adjusted the patches yesterday to remove the python-pytorch2
package you added, as the patch series updates the main python-pytorch
package to version 2. The new inputs of your package were already
included, with the exception of python-opt-einsum. I had overlooked it
before and have included it now. :)

Is there a reason to keep version 1 around? Then I could adjust the
patches again. Otherwise, it makes sense to me to move the
python-pytorch package to version 2.2.1 and have a package variant with
2.0.1 for r-torch (which I kept and adjusted).

Due to problems when building dependencies, the new package only builds
successfully for x86_64. As I explained in the patch series, asmjit
fails on armhf because GCC runs out of memory (it reaches 4 GB, I
think, and more is of course not possible), and cpuinfo has a known bug
on aarch64 [1], which causes the tests to fail and AFAICT also breaks
PyTorch at runtime.

Through python-pytorch -> python-expecttest -> poetry -> python-keyring
-> python-secretstorage -> python-cryptography, the python-pytorch
package now depends on Rust, which currently requires too much memory
to build on 32-bit systems, so i686 is not supported either. What do
you think should be done here?

I added all packages required for the core tests to native-inputs, but
decided to disable them as they require a long time to run. If I remove
the test inputs (in particular python-expecttest), the package could
probably also be built for i686. Would it be acceptable to keep them as
a comment for reference?

> I think it is sufficient to only have the current version of ROCm;
> other versions could be added if there is reasonable demand.

That sounds good to me.

Best,
David

[1] https://github.com/pytorch/cpuinfo/issues/14
Re: PyTorch with ROCm
Hi David,

> after seeing that ROCm packages [1] are available in the Guix-HPC
> channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

Excellent initiative!

> For this, I first unbundled the (many) remaining dependencies of the
> python-pytorch package and updated it to 2.2.1, the patch series for
> which can be found here [2,3].

Oh, commit 8429f25ecd83594e80676a67ad9c54f0d6cf3f16 added
python-pytorch2 at version 2.2.1. Do you think you could adjust your
patches to modify that one instead?

> It would be really great to have these packages in Guix proper, but
> first of course the base ROCm packages need to be added after deciding
> how to deal with the different architectures. Also, are several ROCm
> versions necessary, or would only one (the current latest) version
> suffice?

As for the ROCm-specific work, I'm not qualified to comment. I do
support a move of the ROCm packages from Guix HPC to Guix proper. I
think it is sufficient to only have the current version of ROCm; other
versions could be added if there is reasonable demand.

--
Ricardo
PyTorch with ROCm
Hello,

after seeing that ROCm packages [1] are available in the Guix-HPC
channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

For this, I first unbundled the (many) remaining dependencies of the
python-pytorch package and updated it to 2.2.1, the patch series for
which can be found here [2,3].

For building ROCm and the remaining packages, I did not apply the same
quality standard as for python-pytorch and just tried to get it working
at all with ROCm 6.0.2. To reduce the build time, I also only tested
them for gfx1101, as set in the %amdgpu-targets variable in
amd/rocm-base.scm (which needs to be adjusted for other GPUs). Here, it
seemed to work fine on my GPU.

The changes for the ROCm packages are here [4] as a modification of
Guix-HPC. There, the python-pytorch-rocm package in
amd/machine-learning.scm depends on the python-pytorch-avx package in
[2,3]. Both python-pytorch and python-pytorch-avx support AVX2 /
AVX-512 instructions, but the latter also has support for fbgemm and
nnpack. I used it over python-pytorch because AVX2 or AVX-512
instructions should be available on a CPU with PCIe atomics anyway,
which ROCm requires.

For some packages, such as composable-kernel, the build time and memory
requirement is already very high when building only for one GPU
architecture, so maybe it would be best to make a separate package for
each architecture? I'm not sure they can be combined however, as the
GPU code is included in the shared libraries. Thus all dependent
packages like python-pytorch-rocm would need to be built for each
architecture as well, which is a large duplication for the non-GPU
parts.

There were a few other issues as well; some of them should probably be
addressed upstream:

- Many tests assume a GPU to be present, so they need to be disabled.

- For several packages (e.g. rocfft), I had to disable the
  validate-runpath? phase, as there was an error when reading ELF
  files. It is however possible that I also disabled it for packages
  where it was not necessary, but it was the case for rocblas at
  least. Here, the generated kernels are contained in ELF files, which
  are detected by elf-file? in guix/build/utils.scm, but rejected by
  has-elf-header? in guix/elf.scm, which leads to an error.

- Dependencies of python-tensile copy source files and later copy them
  again with shutil.copy, sometimes twice. This leads to permission
  errors, as the read-only permissions of files in the store are
  preserved, so I patched it to use shutil.copyfile instead.

- There were a few errors due to using the GCC 11 system headers with
  rocm-toolchain (which is based on Clang+LLVM). For roctracer,
  replacing std::experimental::filesystem by std::filesystem suffices,
  but for rocthrust, the placement new operator is not found. I applied
  the patch from Gentoo [5], where it is replaced by a simple
  assignment. It looks like UB to me though, even if it happens to
  work. The question is whether this is a bug in libstdc++, Clang or
  amdclang++...

- rocMLIR also contains a fork of the LLVM source tree, and it is not
  clear at first glance how exactly it differs from the main ROCm fork
  of LLVM or upstream LLVM.

It would be really great to have these packages in Guix proper, but
first of course the base ROCm packages need to be added, after deciding
how to deal with the different architectures. Also, are several ROCm
versions necessary, or would only one (the current latest) version
suffice?

Cheers,
David

[1] https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/
[2] https://issues.guix.gnu.org/69591
[3] https://codeberg.org/dtelsing/Guix/src/branch/pytorch
[4] https://codeberg.org/dtelsing/Guix-HPC/src/branch/pytorch-rocm
[5] https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/rocThrust/files/rocThrust-4.0-operator_new.patch
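[Editor's note: the shutil.copy permission issue described in this
message can be reproduced in plain Python. shutil.copy copies contents
*and* mode bits, so a file copied out of the read-only store stays
read-only and a later copy onto it fails for a non-root user, whereas
shutil.copyfile copies only the contents. A minimal sketch, using a
throwaway temporary directory rather than actual store paths:]

```python
import os
import shutil
import stat
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "src.txt")
dst = os.path.join(tmp, "dst.txt")
with open(src, "w") as f:
    f.write("data")
os.chmod(src, 0o444)  # read-only, like a file in /gnu/store

# shutil.copy propagates the mode bits: dst ends up read-only too, so a
# second shutil.copy onto it raises PermissionError (for non-root users).
shutil.copy(src, dst)
assert stat.S_IMODE(os.stat(dst).st_mode) == 0o444

# shutil.copyfile copies only the contents; dst gets a default (normally
# writable) mode, so repeated copies keep working.
os.remove(dst)
shutil.copyfile(src, dst)
shutil.copyfile(src, dst)
assert stat.S_IMODE(os.stat(dst).st_mode) != 0o444
```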