Hi Xiaogang, Alex,

gentle ping on the tested candidate fix for this regression.

The candidate change fixed the reproducer here (10/10 clean), and regzbot
now tracks it as "fix incoming".

Do you plan to send the formal patch, or would it help if I send a patch
based on the public candidate fix?

Thanks,
Gerhard

On 05/06/26 at 20:46, Chen, Xiaogang wrote:
> Thank you for the testing/confirming.
>
> Xiaogang
>
> On 6/5/2026 1:41 PM, Gerhard Schwanzer wrote:
>> Hi Xiaogang,
>>
>> Thanks. I tested your attached patch on my RX 7600 XT system.
>>
>> Test setup:
>> - kernel 7.0.11 with 448ee453/bf2084a7 active
>> - local revert not applied
>> - your attached candidate fix applied
>> - same self-contained v2 reproducer source as before, unchanged sha256:
>>   33347b5a1915f7452417f776c85527e55f825078c146163470bfe3eacabe3b27
>>
>> Command:
>>
>>     ./kfd_svm_split_hsa_copy --upstream-ab
>>
>> Result:
>> - 10/10 runs completed successfully
>> - all HSA/SDMA D2H copies completed
>> - no ROCr memory access fault
>> - no new GCVM_L2_PROTECTION_FAULT_STATUS
>> - no SDMA0 permission fault
>> - no GPU page fault in the kernel log
>>
>> So your patch fixes the reproducer on my system with the original
>> reproducer unchanged.
>>
>> Please feel free to add:
>>
>>     Tested-by: Gerhard Schwanzer [email protected]
>>
>> Thanks,
>> Gerhard
>>
>> On 05/06/26 at 19:59, Chen, Xiaogang wrote:
>>> Hi Gerhard:
>>>
>>> I think the cause is checking the last byte address of svm range for
>>> 2MB alignment when decide possible huge page mapping. Your test case
>>> has vm range that ends just one byte before alignment.
>>>
>>> I tested your app with the attachment, no page fault during sdma
>>> operation. Please verify it.
>>>
>>> Thanks
>>>
>>> Xiaogang
>>>
>>> *From:* Chen, Xiaogang
>>> *Sent:* Wednesday, June 3, 2026 5:51 PM
>>> *To:* Gerhard Schwanzer <[email protected]>; [email protected]
>>> *Cc:* [email protected]; [email protected]; Deucher,
>>> Alexander <[email protected]>; Yang, Philip <[email protected]>
>>> *Subject:* Re: [REGRESSION] drm/amdkfd: SVM split-tail remap
>>> regression causes SDMA0 permission fault on RX 7600 XT
>>>
>>> Hi Gerhard:
>>>
>>> Thanks. I can build the app now. And I saw the regression. I am
>>> triaging it.
>>>
>>> The purpose of this patch is to remap split svm ranges(head/tail) that
>>> were mapped with huge page mapping(pmd), but cannot be mapped in huge
>>> page mapping after split due to new svm ranges are not 2MB aligned. It
>>> seems the remap decision misses case that both head and tail ranges
>>> are from original range with huge page mappings were used. Will check....
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>> On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:
>>>
>>> Hi Xiaogang,
>>>
>>> Sorry, you are right. The source I uploaded was not self-contained,
>>> it still referenced trace_history_replay.inc from an older local
>>> replay mode.
>>>
>>> I uploaded a self-contained v2 source to the GitLab report:
>>>
>>> https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c
>>>
>>> The --upstream-ab path does not use that replay table, but the missing
>>> include obviously broke fresh builds. The v2 source embeds the table
>>> and otherwise preserves the same source.
>>>
>>> I re-tested this v2 source before uploading:
>>>
>>> - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
>>> - ./kfd_svm_split_hsa_copy --help: OK
>>> - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
>>>   GCVM/SDMA0/protection-fault messages in the test window
>>> - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
>>>   the first kernel fault address matched the planned split-tail page
>>>
>>> Validation summaries:
>>>
>>> https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
>>> https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md
>>>
>>> The intended triage command remains:
>>>
>>>     ./kfd_svm_split_hsa_copy --upstream-ab
>>>
>>> Generic build shape is:
>>>
>>>     cc -O2 -g -Wall -Wextra -pthread \
>>>         -I/path/to/rocm/include -L/path/to/rocm/lib \
>>>         -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
>>>         -lhsa-runtime64
>>>
>>> If you still prefer a binary, please tell me the target runtime/distro.
>>> A binary built on my NixOS system is Nix-store linked and likely not
>>> portable to your test system.
>>>
>>> One more thing that would help me test any replacement fix: do you
>>> know what specific failure or workload 448ee453 was intended to fix?
>>> I would like to avoid validating only the revert side while
>>> accidentally losing the original fix.
>>>
>>> Thanks for catching this, and thanks for taking a look.
>>>
>>> Regards,
>>>
>>> Gerhard
>>>
>>> On 06/03/2026 Chen, Xiaogang wrote:
>>>
>>> I cannot compile kfd_svm_split_hsa_copy.c, there is no
>>> "trace_history_replay.inc".
>>>
>>> Or can you send the test binary? That should be enough to triage the
>>> issue since it is a regression as you mentioned.
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>> On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
>>>
>>> Hi,
>>>
>>> I would like to make sure this AMDKFD SVM regression is tracked by the
>>> Linux regression process.
>>>
>>> GitLab report:
>>>
>>> https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>>
>>> The regression was originally reported on 2026-01-27. It was bisected
>>> to the same functional change that Alex Deucher's revert patch later
>>> targeted:
>>>
>>>     448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>>>     drm/amdkfd: Use huge page size to check split svm range alignment
>>>
>>> The affected kernel line I tested identifies the same change as:
>>>
>>>     bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
>>>
>>> Alex's revert patch:
>>>
>>> https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
>>>
>>> A small C/HSA reproducer is now available in the GitLab report. It
>>> does not require PyTorch, ComfyUI, Docker, model files, or the original
>>> workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
>>> explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
>>>
>>> Single reproducer command, same binary on both kernels:
>>>
>>>     ./kfd_svm_split_hsa_copy --upstream-ab
>>>
>>> Same-machine A/B result on an RX 7600 XT:
>>>
>>>     448ee453/bf2084a7 active:
>>>         1/1 run faults with SDMA0 permission fault
>>>         GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
>>>
>>>     448ee453/bf2084a7 locally reverted:
>>>         10/10 runs complete
>>>         no ROCr memory access fault
>>>         no new GCVM/SDMA0 permission fault in dmesg
>>>
>>> The bad fault page is inside the split tail and inside the SDMA copy
>>> range:
>>>
>>>     critical tail: [0x722429d61..0x722429dff]
>>>     copy pages:    [0x722429b30..0x722429d70]
>>>     fault page:    0x722429d65
>>>
>>> A full ftrace/PTE run with the same C reproducer/SVM sequence also
>>> shows:
>>>
>>>     split_tail ...
current_remap=0 old_remap=1 missed=1
>>>     MISSED_REMAP_CANDIDATE split=tail
>>>     no amdgpu_vm_update_ptes covering the fault page after the marker
>>>     before the fault-side GET_ATTR
>>>
>>> The suspected code issue is that the split-tail/head remap predicate
>>> introduced by 448ee453/bf2084a7 can miss tails inside the final
>>> 512-page block. Since prange->last is inclusive,
>>> ALIGN_DOWN(prange->last, 512) is the start of the final block, not an
>>> exclusive upper bound.
>>>
>>> I also sent a short follow-up to amd-gfx with the reproducer/A-B
>>> summary and asked what original failure or workload 448ee453/bf2084a7
>>> was intended to fix:
>>>
>>> https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
>>>
>>> I can resend the reproducer source and summaries directly on-list if
>>> preferred.
>>>
>>> #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>>> #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>>
>>> Thanks,
>>>
>>> Gerhard Schwanzer
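
For reference, the inclusive-bound arithmetic described in the original
report can be sketched in plain userspace C. This is only an illustration
under stated assumptions, not the kernel code and not the candidate patch:
align_down(), final_block_start(), aligned_exclusive_end() and below_bound()
are hypothetical helpers made up for this note, and 512 assumes 4 KiB pages
under a 2 MiB huge page. The page numbers used in the note after the block
are the ones quoted in the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a 2 MiB huge page spans 512 pages of 4 KiB each. */
#define HUGE_PAGE_NPAGES 512ULL

/* Round x down to a multiple of a; a must be a power of two. */
static inline uint64_t align_down(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return x & ~(a - 1);
}

/*
 * Aligning the INCLUSIVE last page down yields the start of the final
 * 512-page block, so pages inside that block are not below the result.
 */
static inline uint64_t final_block_start(uint64_t last_inclusive)
{
    return align_down(last_inclusive, HUGE_PAGE_NPAGES);
}

/*
 * Aligning the EXCLUSIVE end (last + 1) down yields a usable exclusive
 * upper bound when the range ends exactly on a block boundary.
 */
static inline uint64_t aligned_exclusive_end(uint64_t last_inclusive)
{
    return align_down(last_inclusive + 1, HUGE_PAGE_NPAGES);
}

/* Hypothetical stand-in for an "is this page below the bound" test. */
static inline bool below_bound(uint64_t page, uint64_t bound)
{
    return page < bound;
}
```

With the tail from the report (last page 0x722429dff), final_block_start()
gives 0x722429c00 while aligned_exclusive_end() gives 0x722429e00, so the
fault page 0x722429d65 is below only the second bound.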
