Note added: When running strace:
$ strace /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
an error occurred on /dev/kfd:
openat(AT_FDCWD, "/dev/kfd", O_RDWR|O_CLOEXEC) = -1 EACCES (Permission denied)
$ ls -l /dev/kfd
crw-rw----. 1 root video 241, 0 Nov 1 12:50 /dev/kfd
It turned out that I had to add my EB user "modules" to the "video" UNIX
group as described in
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/prerequisites.html#setting-permissions-for-groups
Now amdgpu-arch works correctly:
# /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
gfx90a
gfx90a
and the UCC-1.3.0-GCCcore-13.3.0.eb module has been built without errors!
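For anyone hitting the same EACCES error, here is a minimal sketch of a check that can be run as the build user before starting the build. The function name check_dev_access is my own invention for illustration, and the sketch assumes GNU stat (as shipped with AlmaLinux) and that read-write access to /dev/kfd is what amdgpu-arch needs, as the strace output above suggests.

```shell
# Hypothetical helper: report whether the current user can open a
# device node read-write (the mode in which /dev/kfd was opened above).
check_dev_access() {
    dev="$1"
    if [ ! -e "$dev" ]; then
        echo "missing: $dev"
        return 2
    fi
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "ok: $dev"
        return 0
    fi
    # stat -c %G is GNU coreutils syntax; it prints the owning group.
    echo "denied: $dev (group $(stat -c %G "$dev")); your groups: $(id -nG)"
    return 1
}

# Example call, run as the build user:
#   check_dev_access /dev/kfd
# If access is denied, the fix in my case was (as root):
#   usermod -a -G video modules
# followed by a fresh login so the new group membership takes effect.
```

On our node the owning group is "video"; other distributions may also use a "render" group for GPU device nodes, so the group printed by the check is the one to add the user to.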
Best regards,
Ole
On 11/1/24 14:12, Ole Holm Nielsen wrote:
Hi EasyBuilders,
We have an AMD EPYC 7313 node (running AlmaLinux 8.10) with two AMD
Instinct MI210 GPUs:
# rocm-smi --showhw
===================================== ROCm System Management Interface =====================================
========================================== Concise Hardware Info ===========================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS            BUS           PARTITION ID
0    8     0x740f  63484  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:23:00.0  0
1    9     0x740f  36740  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:83:00.0  0
============================================================================================================
=========================================== End of ROCm SMI Log ============================================
We have installed ROCm 6.2.2 libraries, and now we need to build our
application which requires OpenMPI.
We have found out that ROCm 6.2 requires UCC >= 1.3.0 and UCX >= 1.15.0,
see
https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html
Therefore we need to build OpenMPI-5.0.3-GCC-13.3.0.eb in order to get
supported UCX and UCC versions. Unfortunately the prerequisite
UCC-1.3.0-GCCcore-13.3.0.eb fails to build because this command fails:
$ /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
Failed to get device count
Question: Does anyone know how to fix the amdgpu-arch command so that it
recognizes the AMD MI210 GPU (gfx version gfx9010)?
FYI the UCC build log says:
Making all in kernel
make[4]: Entering directory '/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm/kernel'
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool" ec_rocm_executor_kernel.lo /opt/rocm/bin/amdclang -c ec_rocm_executor_kernel.cu -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool" ec_rocm_reduce.lo /opt/rocm/bin/amdclang -c ec_rocm_reduce.cu -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=native ec_rocm_reduce.cu -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3 -o ./.libs/ec_rocm_reduce.o
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940 --offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=native ec_rocm_executor_kernel.cu -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3 -o ./.libs/ec_rocm_executor_kernel.o
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
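The clang errors appear to come from the --offload-arch=native flag in the compile commands, which makes clang run amdgpu-arch to detect the local GPU. A pre-build check along the following lines would surface the problem before the UCC build starts. This is only a sketch: the function name detect_gpu_archs is hypothetical, and the default tool path is simply the one from this installation.

```shell
# Hypothetical helper: print the unique GPU architectures reported by an
# amdgpu-arch-like tool, or fail with a message if it reports none.
detect_gpu_archs() {
    tool="${1:-/opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch}"
    # Keep only lines that look like gfx targets and drop duplicates
    # (one line is printed per GPU, so two MI210s give two lines).
    # "|| true" keeps the assignment from aborting under "set -e".
    archs=$("$tool" 2>/dev/null | grep '^gfx' | sort -u) || true
    if [ -z "$archs" ]; then
        echo "no GPU architecture detected by $tool" >&2
        return 1
    fi
    echo "$archs"
}

# Example call before starting the build:
#   detect_gpu_archs || echo "fix GPU access before building UCC"
```

With the permission problem present, the check fails right away instead of partway through the build; once the build user can open /dev/kfd it should print the detected architecture once.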