Hi EasyBuilders,

We have an AMD EPYC 7313 node (running AlmaLinux 8.10) with two AMD Instinct MI210 GPUs:

# rocm-smi --showhw
===================================== ROCm System Management Interface =====================================
========================================== Concise Hardware Info ===========================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS            BUS           PARTITION ID
0    8     0x740f  63484  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:23:00.0  0
1    9     0x740f  36740  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:83:00.0  0
============================================================================================================
=========================================== End of ROCm SMI Log ============================================

We have installed the ROCm 6.2.2 libraries, and now we need to build our application, which requires OpenMPI.

We have found that ROCm 6.2 requires UCC >= 1.3.0 and UCX >= 1.15.0; see https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html

Therefore, we need to build OpenMPI-5.0.3-GCC-13.3.0.eb in order to get supported UCX and UCC versions. Unfortunately, the prerequisite UCC-1.3.0-GCCcore-13.3.0.eb fails to build because the following command fails:

$ /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
Failed to get device count

Question: Does anyone know how to fix the amdgpu-arch command so that it recognizes the AMD MI210 GPU (gfx version gfx9010)?
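
A minimal sketch of a check that might help narrow this down. It rests on an assumption that is not confirmed anywhere in this post: that "Failed to get device count" can occur when the user running the build cannot open the GPU device nodes (/dev/kfd and /dev/dri/renderD*). The helper name check_dev_access is hypothetical and only for illustration; the group names mentioned in the comments are likewise an assumption.

```shell
#!/bin/sh
# Report whether the current user can open a device node read/write.
# Prints exactly one of: ok, missing, no-access.
# (check_dev_access is a hypothetical helper, not part of ROCm.)
check_dev_access() {
    dev="$1"
    if [ ! -e "$dev" ]; then
        echo "missing"
    elif [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "ok"
    else
        echo "no-access"
    fi
}

# /dev/kfd is the ROCm compute device node; /dev/dri/renderD* are the
# render nodes. If the glob matches nothing, the literal pattern is
# reported as "missing".
for dev in /dev/kfd /dev/dri/renderD*; do
    printf '%s: %s\n' "$dev" "$(check_dev_access "$dev")"
done

# Groups of the current user. Assumption: access to the nodes above is
# often granted through groups such as "render" or "video".
id -nG
```

Running this as the same user that runs the EasyBuild job, and again as root, would show whether the two see the device nodes differently.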

FYI the UCC build log says:

Making all in kernel
make[4]: Entering directory '/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm/kernel'
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool" 
ec_rocm_executor_kernel.lo /opt/rocm/bin/amdclang -c  ec_rocm_executor_kernel.cu   
-D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include 
-I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool" 
ec_rocm_reduce.lo /opt/rocm/bin/amdclang -c  ec_rocm_reduce.cu   -D__HIP_PLATFORM_AMD__ 
-I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include 
-I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu 
--offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940 
--offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030 
--offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 
--offload-arch=native ec_rocm_reduce.cu -D__HIP_PLATFORM_AMD__ 
-I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include 
-I/opt/rocm/include/hsa -I/opt/rocm/include 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3 
-o ./.libs/ec_rocm_reduce.o
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu 
--offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940 
--offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030 
--offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 
--offload-arch=native ec_rocm_executor_kernel.cu -D__HIP_PLATFORM_AMD__ 
-I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include 
-I/opt/rocm/include/hsa -I/opt/rocm/include 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src 
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3 
-o ./.libs/ec_rocm_executor_kernel.o
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'

Thanks a lot,
Ole


--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
