Note added: When running strace:

$ strace /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch

the openat() call on /dev/kfd fails with EACCES:

openat(AT_FDCWD, "/dev/kfd", O_RDWR|O_CLOEXEC) = -1 EACCES (Permission denied)

$ ls -l /dev/kfd
crw-rw----. 1 root video 241, 0 Nov  1 12:50 /dev/kfd
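
For anyone repeating this diagnosis on a longer trace, a small filter along these lines can pick out just the failed open calls. This is only a sketch: failed_opens is a helper name made up for this note, not part of strace or ROCm, and it assumes strace's usual "= -1 ERRNO (message)" result format.

```shell
#!/bin/sh
# failed_opens: hypothetical helper, not part of strace or ROCm.
# Reads strace output on stdin and prints only the open()/openat()
# calls that returned -1, with or without a "[pid N]" prefix.
failed_opens() {
    grep -E '^(\[pid +[0-9]+\] )?open(at)?\(.*\) = -1 E[A-Z]+'
}

# Example use (strace writes its trace to stderr, hence 2>&1):
#   strace -f -e trace=open,openat /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch 2>&1 | failed_opens
```

Run against the trace above, this would leave only the /dev/kfd line with EACCES.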

The device is readable and writable only by root and the "video" group, so I had to add my EB user "modules" to the "video" UNIX group, as described in https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/prerequisites.html#setting-permissions-for-groups
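
A minimal sketch of the check and the fix, assuming a standard Linux system with id and usermod available. in_group is a helper name invented here; the user "modules" and the group "video" are the ones from this node, so substitute your own.

```shell
#!/bin/sh
# in_group USER GROUP: hypothetical helper for this note.
# Succeeds if USER is a member of GROUP according to the group database.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}

if in_group modules video; then
    echo "modules is already in the video group"
else
    echo "modules is not in the video group; as root, run:"
    echo "  usermod -a -G video modules"
fi
```

The script only prints the usermod command rather than running it. The new group membership applies to login sessions started after the change, so the build user has to log in again before retrying amdgpu-arch.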

Now amdgpu-arch works correctly:

# /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
gfx90a
gfx90a

and the UCC-1.3.0-GCCcore-13.3.0.eb module has been built without errors!

Best regards,
Ole


On 11/1/24 14:12, Ole Holm Nielsen wrote:
Hi EasyBuilders,

We have an AMD EPYC 7313 node (running AlmaLinux 8.10) with two AMD Instinct MI210 GPUs:

# rocm-smi --showhw
===================================== ROCm System Management Interface =====================================
========================================== Concise Hardware Info ===========================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS            BUS           PARTITION ID
0    8     0x740f  63484  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:23:00.0  0
1    9     0x740f  36740  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:83:00.0  0
============================================================================================================
=========================================== End of ROCm SMI Log ============================================

We have installed ROCm 6.2.2 libraries, and now we need to build our application which requires OpenMPI.

We have found out that ROCm 6.2 requires UCC >= 1.3.0 and UCX >= 1.15.0, see https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html

Therefore we need to build OpenMPI-5.0.3-GCC-13.3.0.eb to get supported UCX and UCC versions.  Unfortunately the prerequisite UCC-1.3.0-GCCcore-13.3.0.eb fails to build because this command fails:

$ /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
Failed to get device count

Question: Does anyone know how to fix the amdgpu-arch command so that it recognizes the AMD MI210 GPU (gfx version gfx9010)?

FYI the UCC build log says:

Making all in kernel
make[4]: Entering directory '/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm/kernel'
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool" ec_rocm_executor_kernel.lo /opt/rocm/bin/amdclang -c ec_rocm_executor_kernel.cu   -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool" ec_rocm_reduce.lo /opt/rocm/bin/amdclang -c  ec_rocm_reduce.cu   -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=native ec_rocm_reduce.cu -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3 -o ./.libs/ec_rocm_reduce.o
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=native ec_rocm_executor_kernel.cu -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3 -o ./.libs/ec_rocm_executor_kernel.o
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
