https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=289813

            Bug ID: 289813
           Summary: Vulkan: running and inferencing with "koboldcpp" or
                    "llama.cpp" using the Vulkan backend locks up the
                    GPU...
           Product: Base System
           Version: 15.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: [email protected]
          Reporter: [email protected]

Hi,

Running inference with "koboldcpp" or "llama.cpp" on the Vulkan backend
eventually locks up my iGPU (Radeon 780M). The same happened with my old
Radeon RX 460. Running the integrated benchmarks of either LLM engine a few
times in a row triggers the lock-up faster.

------------------------------- SNIP -------------------------------
Sep 24 15:16:35 asbach kernel: [drm ERROR :amdgpu_job_timedout] ring gfx_0.0.0
timeout, signaled seq=31038, emitted seq=31040
Sep 24 15:16:35 asbach kernel: [drm ERROR :amdgpu_job_timedout] Process
information: process  pid 101072 thread  pid 101072
Sep 24 15:16:35 asbach kernel: drmn0: GPU reset begin!
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:36 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:36 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:37 asbach kernel: [drm ERROR
:mes_v11_0_submit_pkt_and_poll_completion] MES failed to response msg=3
Sep 24 15:16:37 asbach kernel: [drm ERROR :amdgpu_mes_unmap_legacy_queue]
failed to unmap legacy queue
Sep 24 15:16:37 asbach kernel: drmn0: MODE2 reset
Sep 24 15:16:37 asbach kernel: drmn0: GPU reset succeeded, trying to resume
Sep 24 15:16:37 asbach kernel: [drm] PCIE GART of 512M enabled (table at
0x0000008000300000).
Sep 24 15:16:37 asbach kernel: drmn0: SMU is resuming...
Sep 24 15:16:37 asbach kernel: drmn0: SMU is resumed successfully!
Sep 24 15:16:37 asbach kernel: [drm] DMUB hardware initialized:
version=0x08001B00
Sep 24 15:16:37 asbach kernel: WARNING !(0) failed at
/usr/ports/graphics/drm-66-kmod/work/drm-kmod-drm_v6.6.25_6/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_capability.c:1530
Sep 24 15:16:37 asbach kernel: [drm] kiq ring mec 3 pipe 1 q 0
Sep 24 15:16:37 asbach kernel: [drm] VCN decode and encode initialized
successfully(under DPG Mode).
Sep 24 15:16:37 asbach kernel: drmn0: [drm] jpeg_v4_0_hw_initdrmn0: ring
gfx_0.0.0 uses VM inv eng 0 on hub 0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.0.0 uses VM inv eng 1 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.1.0 uses VM inv eng 4 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.2.0 uses VM inv eng 6 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.3.0 uses VM inv eng 7 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.0.1 uses VM inv eng 8 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.1.1 uses VM inv eng 9 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.2.1 uses VM inv eng 10 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring comp_1.3.1 uses VM inv eng 11 on hub
0
Sep 24 15:16:37 asbach kernel: drmn0: ring sdma0 uses VM inv eng 12 on hub 0
Sep 24 15:16:37 asbach kernel: drmn0: ring vcn_unified_0 uses VM inv eng 0 on
hub 8
Sep 24 15:16:37 asbach kernel: drmn0: ring jpeg_dec uses VM inv eng 1 on hub 8
Sep 24 15:16:37 asbach kernel: drmn0: ring mes_kiq_3.1.0 uses VM inv eng 13 on
hub 0
Sep 24 15:16:37 asbach kernel: drmn0: recover vram bo from shadow start
Sep 24 15:16:37 asbach kernel: drmn0: recover vram bo from shadow done
Sep 24 15:16:37 asbach kernel: [drm ERROR :amdgpu_cs_ioctl] Failed to
initialize parser -85!
------------------------------- SNIP -------------------------------
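
In case it helps with triage, here is a minimal sketch for condensing an
excerpt like the one above into the ring timeouts and a count of each
"[drm ERROR :...]" source. The function name summarize() is hypothetical, and
the regular expressions are assumptions derived only from the lines in this
excerpt, not from the driver source, so other message variants may be missed.

```python
import re
from collections import Counter

# Assumed shape of the amdgpu job-timeout message, taken from this excerpt:
# "[drm ERROR :amdgpu_job_timedout] ring <name> timeout, signaled seq=N, emitted seq=M"
TIMEOUT_RE = re.compile(
    r"amdgpu_job_timedout\]\s+ring\s+(?P<ring>\S+)\s+timeout,\s+"
    r"signaled seq=(?P<signaled>\d+),\s+emitted seq=(?P<emitted>\d+)"
)

# Function name inside a "[drm ERROR :func]" prefix.
ERROR_RE = re.compile(r"\[drm ERROR\s*:(?P<func>\w+)\]")


def summarize(text):
    """Return (timeouts, error_counts) for a chunk of kernel log text.

    timeouts is a list of (ring, signaled_seq, emitted_seq) tuples;
    error_counts maps each reporting function to its number of occurrences.
    """
    # Mail clients hard-wrap long log lines, so collapse all whitespace
    # before matching; this also works on unwrapped log files.
    flat = " ".join(text.split())
    timeouts = [
        (m["ring"], int(m["signaled"]), int(m["emitted"]))
        for m in TIMEOUT_RE.finditer(flat)
    ]
    errors = Counter(m["func"] for m in ERROR_RE.finditer(flat))
    return timeouts, errors
```

Calling summarize() on the contents of the system log (assumed here to be
/var/log/messages) gives a short overview that can be attached instead of the
full repeated output.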

The mouse cursor can still be moved, but everything else on the screens is
frozen. VT switches are no longer possible. Only a hardware reset recovers
the GPU.

If you need more information or want me to run specific debugging steps, let
me know. Would "truss" output help?



Thanks in advance and regards,
Nils
