Hi Xiaogang, Alex,

Gentle ping on the tested candidate fix for this regression. The candidate
change fixed the reproducer here (10/10 clean), and regzbot now tracks it as
"fix incoming".

Do you plan to send the formal patch, or would it help if I send a patch
based on the public candidate fix?

Thanks,
Gerhard

On 05/06/26 at 20:46, Chen, Xiaogang wrote:

> Thank you for the testing/confirming.
>
> Xiaogang
>
> On 6/5/2026 1:41 PM, Gerhard Schwanzer wrote:
>> Hi Xiaogang,
>>
>> Thanks. I tested your attached patch on my RX 7600 XT system.
>>
>> Test setup:
>> - kernel 7.0.11 with 448ee453/bf2084a7 active
>> - local revert not applied
>> - your attached candidate fix applied
>> - same self-contained v2 reproducer source as before, unchanged sha256:
>>   33347b5a1915f7452417f776c85527e55f825078c146163470bfe3eacabe3b27
>>
>> Command:
>>   ./kfd_svm_split_hsa_copy --upstream-ab
>>
>> Result:
>> - 10/10 runs completed successfully
>> - all HSA/SDMA D2H copies completed
>> - no ROCr memory access fault
>> - no new GCVM_L2_PROTECTION_FAULT_STATUS
>> - no SDMA0 permission fault
>> - no GPU page fault in the kernel log
>>
>> So your patch fixes the reproducer on my system with the original
>> reproducer unchanged.
>>
>> Please feel free to add:
>>   Tested-by: Gerhard Schwanzer [email protected]
>>
>> Thanks,
>> Gerhard
>>
>>
>> On 05/06/26 at 19:59, Chen, Xiaogang wrote:
>>>
>>> Hi Gerhard:
>>>
>>> I think the cause is that the last byte address of the svm range is
>>> checked for 2MB alignment when deciding whether a huge page mapping is
>>> possible. Your test case has a vm range that ends just one byte before
>>> the alignment boundary.
>>>
>>> I tested your app with the attached patch and saw no page fault during
>>> the sdma operation. Please verify it.
>>>
>>> Thanks
>>>
>>> Xiaogang
>>>
>>> From: Chen, Xiaogang
>>> Sent: Wednesday, June 3, 2026 5:51 PM
>>> To: Gerhard Schwanzer <[email protected]>; [email protected]
>>> Cc: [email protected]; [email protected]; Deucher,
>>> Alexander <[email protected]>; Yang, Philip <[email protected]>
>>> Subject: Re: [REGRESSION] drm/amdkfd: SVM split-tail remap
>>> regression causes SDMA0 permission fault on RX 7600 XT
>>>
>>> Hi Gerhard:
>>>
>>> Thanks. I can build the app now, and I see the regression. I am
>>> triaging it.
>>>
>>> The purpose of this patch is to remap split svm ranges (head/tail) that
>>> were mapped with huge page mappings (pmd) but can no longer be mapped
>>> that way after the split because the new svm ranges are not 2MB aligned.
>>> It seems the remap decision misses the case where both the head and tail
>>> ranges come from an original range that used huge page mappings. Will
>>> check.
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>> On 6/3/2026 12:54 AM, Gerhard Schwanzer wrote:
>>>
>>>       Hi Xiaogang,
>>>
>>>       Sorry, you are right. The source I uploaded was not self-contained,
>>>       it still referenced trace_history_replay.inc from an older local
>>>       replay mode.
>>>
>>>       I uploaded a self-contained v2 source to the GitLab report:
>>>
>>>       https://gitlab.freedesktop.org/-/project/4522/uploads/7395b8985ecd7c54183a7615d479c02c/kfd_svm_split_hsa_copy-v2.c
>>>
>>>       The --upstream-ab path does not use that replay table, but the missing
>>>       include obviously broke fresh builds. The v2 source embeds the table
>>>       and otherwise preserves the same source.
>>>
>>>       I re-tested this v2 source before uploading:
>>>
>>>           - clean build from only kfd_svm_split_hsa_copy-v2.c: OK
>>>           - ./kfd_svm_split_hsa_copy --help: OK
>>>           - good/workaround kernel: --upstream-ab completed 10/10 runs, no new
>>>             GCVM/SDMA0/protection-fault messages in the test window
>>>           - broken kernel: --upstream-ab reproduced the SDMA0 permission fault;
>>>             the first kernel fault address matched the planned split-tail page
>>>
>>>       Validation summaries:
>>>
>>>       https://gitlab.freedesktop.org/-/project/4522/uploads/e6d0f31c0fda0df2c999439411f29dca/good-kernel-validation-summary.md
>>>       https://gitlab.freedesktop.org/-/project/4522/uploads/bdf8a3ac6786ddb88dd426b59edb32a9/broken-kernel-validation-summary.md
>>>
>>>       The intended triage command remains:
>>>
>>>           ./kfd_svm_split_hsa_copy --upstream-ab
>>>
>>>       Generic build shape is:
>>>
>>>           cc -O2 -g -Wall -Wextra -pthread \
>>>             -I/path/to/rocm/include -L/path/to/rocm/lib \
>>>             -o kfd_svm_split_hsa_copy kfd_svm_split_hsa_copy-v2.c \
>>>             -lhsa-runtime64
>>>
>>>       If you still prefer a binary, please tell me the target runtime/distro. A
>>>       binary built on my NixOS system is Nix-store linked and likely not
>>>       portable to your test system.
>>>
>>>       One more thing that would help me test any replacement fix: do you know
>>>       what specific failure or workload 448ee453 was intended to fix? I would
>>>       like to avoid validating only the revert side while accidentally losing
>>>       the original fix.
>>>
>>>       Thanks for catching this, and thanks for taking a look.
>>>
>>>       Regards,
>>>
>>>       Gerhard
>>>
>>>       On 06/03/2026 Chen, Xiaogang wrote:
>>>
>>>           I cannot compile kfd_svm_split_hsa_copy.c; there is no
>>>           "trace_history_replay.inc".
>>>
>>>           Or can you send the test binary? That should be enough to triage
>>>           the issue since it is a regression as you mentioned.
>>>
>>>           Regards
>>>
>>>           Xiaogang
>>>
>>>           On 6/2/2026 5:04 AM, Gerhard Schwanzer wrote:
>>>
>>>               Hi,
>>>
>>>               I would like to make sure this AMDKFD SVM regression is tracked by the
>>>               Linux regression process.
>>>
>>>               GitLab report:
>>>
>>>                   https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>>
>>>               The regression was originally reported on 2026-01-27. It was bisected
>>>               to the same functional change that Alex Deucher's revert patch later
>>>               targeted:
>>>
>>>                   448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>>>                   drm/amdkfd: Use huge page size to check split svm range alignment
>>>
>>>               The affected kernel line I tested identifies the same change as:
>>>
>>>                   bf2084a7b1d75d093b6a79df4c10142d49fbaa0e
>>>
>>>               Alex's revert patch:
>>>
>>>               https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html
>>>
>>>               A small C/HSA reproducer is now available in the GitLab report. It
>>>               does not require PyTorch, ComfyUI, Docker, model files, or the original
>>>               workload. It uses ROCr/HSA, an anonymous THP-advised host mapping,
>>>               explicit KFD SVM SET_ATTR ioctls, and an HSA SDMA D2H copy.
>>>
>>>               Single reproducer command, same binary on both kernels:
>>>
>>>                   ./kfd_svm_split_hsa_copy --upstream-ab
>>>
>>>               Same-machine A/B result on an RX 7600 XT:
>>>
>>>                   448ee453/bf2084a7 active:
>>>                     1/1 run faults with SDMA0 permission fault
>>>                     GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51
>>>
>>>                   448ee453/bf2084a7 locally reverted:
>>>                     10/10 runs complete
>>>                     no ROCr memory access fault
>>>                     no new GCVM/SDMA0 permission fault in dmesg
>>>
>>>               The bad fault page is inside the split tail and inside the SDMA copy
>>>               range:
>>>
>>>                   critical tail: [0x722429d61..0x722429dff]
>>>                   copy pages:    [0x722429b30..0x722429d70]
>>>                   fault page:    0x722429d65
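
The containment claim above can be checked mechanically. This is a minimal sketch, assuming the bracketed values are inclusive page numbers (the report does not say so explicitly). The names `page_range`, `in_range`, and `fault_in_tail_and_copy` are hypothetical and are not KFD identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: an inclusive range of page numbers. */
struct page_range {
    uint64_t first;
    uint64_t last; /* inclusive (assumption) */
};

/* Values copied from the report above. */
static const struct page_range critical_tail = { 0x722429d61ULL, 0x722429dffULL };
static const struct page_range copy_pages    = { 0x722429b30ULL, 0x722429d70ULL };
static const uint64_t fault_page = 0x722429d65ULL;

/* True if page lies inside the inclusive range r. */
static bool in_range(struct page_range r, uint64_t page)
{
    assert(r.first <= r.last);
    return page >= r.first && page <= r.last;
}

/* True if the faulting page is in both the split tail and the copy range. */
static bool fault_in_tail_and_copy(void)
{
    return in_range(critical_tail, fault_page) &&
           in_range(copy_pages, fault_page);
}
```

With the reported values, `fault_in_tail_and_copy()` returns true: the fault page lies in the overlap of the two ranges.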
>>>
>>>               A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:
>>>
>>>                   split_tail ... current_remap=0 old_remap=1 missed=1
>>>                   MISSED_REMAP_CANDIDATE split=tail
>>>                   no amdgpu_vm_update_ptes covering the fault page after the marker
>>>                   before the fault-side GET_ATTR
>>>
>>>               The suspected code issue is that the split-tail/head remap predicate
>>>               introduced by 448ee453/bf2084a7 can miss tails inside the final 512-page
>>>               block. Since prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is
>>>               the start of the final block, not an exclusive upper bound.
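
The arithmetic behind that observation can be sketched in userspace C. This is not the kernel's predicate. `align_down`, `bound_from_inclusive_last`, `bound_from_exclusive_end`, and `tail_below` are hypothetical helpers, and the plain `tail_start < bound` comparison is only a simplified stand-in for whatever check the driver actually performs. The sketch also assumes the original range's last page equals the reported tail's last page, 0x722429dff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_HUGE 512ULL /* 2 MiB huge page / 4 KiB base page */

/* Round x down to a multiple of a; a must be a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return x & ~(a - 1);
}

/* Bound taken from the inclusive last page: start of the final block. */
static uint64_t bound_from_inclusive_last(uint64_t last)
{
    return align_down(last, PAGES_PER_HUGE);
}

/* Bound taken from the exclusive end (last + 1): a real upper bound. */
static uint64_t bound_from_exclusive_end(uint64_t last)
{
    return align_down(last + 1, PAGES_PER_HUGE);
}

/* Simplified stand-in (assumption): tail starts below the aligned bound. */
static bool tail_below(uint64_t tail_start, uint64_t bound)
{
    return tail_start < bound;
}
```

For last = 0x722429dff, the inclusive form yields 0x722429c00 and the exclusive form yields 0x722429e00. The reported tail start 0x722429d61 is below only the second bound, which matches the description of a tail inside the final 512-page block being missed.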
>>>
>>>               I also sent a short follow-up to amd-gfx with the reproducer/A-B
>>>               summary and asked what original failure or workload 448ee453/bf2084a7
>>>               was intended to fix:
>>>
>>>               https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html
>>>
>>>               I can resend the reproducer source and summaries directly on-list if
>>>               preferred.
>>>
>>>               #regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
>>>               #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914
>>>
>>>               Thanks,
>>>
>>>               Gerhard Schwanzer
>>>
