Hey Cory,

On 2023-11-21 21:01, Cordell Bloor wrote:
> On 2023-11-18 00:39, Cordell Bloor wrote:
>> Each time a HIP application is executed, the rocr-runtime prints the message:
>>
>> KFD does not support xnack mode query.
>> ROCr must assume xnack is disabled.
>>
>> It is unclear to me whether something is actually wrong or not. This
>> message is emitted from a debug_print statement in amd_topology.cpp. An
>> example of this message can be found in the CI logs [1].
>
> This is a debug message. It is guarded by NDEBUG, so it would not be
> printed if rocr were built in Release mode. There is a bit of discussion
> upstream as to whether the debug_print should instead be guarded by an
> environment variable rather than a preprocessor definition.
>
> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
> the KConfig for "Enable HMM-based shared virtual memory manager", which
> is required for xnack+ operation. The xnack feature allows some AMD GPUs
> to retry memory accesses that fail due to a page fault, which is used as
> a mechanism for migrating managed memory automatically from host to
> device. With xnack disabled, page faults in device code are not
> recoverable [1].

I've rebuilt our kernel with this option enabled, and the message indeed
went away. Great!

This also required DEVICE_PRIVATE (and that one also suggests
HMM_MIRROR). I don't see any downside to these; should we request them
from the Kernel Team?

That did remind me of another message I've seen in dmesg, repeated a few
dozen times, when some (but not all) tests are run:

amdgpu: init_user_pages: Failed to get user pages: -1

rocrand is a good example where these occur. Despite the failure, I did
not observe any negative side effects, but the above change also did not
solve this. Have you seen this message in dmesg as well?

Best,
Christian
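
P.S. For anyone who wants to check whether a given kernel has the three
options discussed above, here is a minimal Python sketch. It assumes the
kernel config is available as plain text at /boot/config-<release>, which
is the usual Debian location but may differ elsewhere. The helper names
(parse_kconfig, report) are hypothetical and only for illustration.

```python
import platform
import re
from pathlib import Path

# Kconfig symbols mentioned in this thread.
OPTIONS = ("HSA_AMD_SVM", "DEVICE_PRIVATE", "HMM_MIRROR")


def parse_kconfig(text):
    """Return {symbol: value} for the CONFIG_ lines of a kernel config file."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Enabled options look like "CONFIG_FOO=y" (or =m, or a literal value).
        m = re.fullmatch(r"CONFIG_(\w+)=(.*)", line)
        if m:
            values[m.group(1)] = m.group(2)
            continue
        # Disabled options are recorded as a comment: "# CONFIG_FOO is not set".
        m = re.fullmatch(r"# CONFIG_(\w+) is not set", line)
        if m:
            values[m.group(1)] = "n"
    return values


def report(text, options=OPTIONS):
    """Map each option to its value, or 'missing' if the symbol never appears."""
    values = parse_kconfig(text)
    return {opt: values.get(opt, "missing") for opt in options}


if __name__ == "__main__":
    # Assumed location of the running kernel's config (Debian convention).
    path = Path("/boot") / f"config-{platform.release()}"
    if path.exists():
        for opt, val in report(path.read_text()).items():
            print(f"CONFIG_{opt}: {val}")
    else:
        print(f"{path} not found")
```

A symbol reported as "missing" rather than "n" usually means one of its
dependencies is not enabled, so it never shows up in the config at all.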