Public bug reported:

Summary
- Running Ollama with CUDA on a hybrid Intel + NVIDIA laptop causes the NVIDIA 
GPU to enter a fatal state.
- After failure, `nvidia-smi` reports no usable device until full reboot.
- The failure is reproducible under an Ollama workload and is not fixed by
restarting Ollama alone.

Environment (sanitized)
- OS: Ubuntu 24.04.4 LTS
- Kernel: 6.17.0-14-generic
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU (AD107M)
- PRIME mode: on-demand
- Driver stack: nvidia-driver-570-open 570.211.01
- Relevant module setting: NVreg_DynamicPowerManagement=0x02 (set by the
default runtime PM configuration)
- Ollama client/server: 0.15.2

Observed behavior
1. Ollama uses CUDA normally at first.
2. During inference, Ollama logs CUDA launch failure.
3. Kernel logs Xid 79 and "GPU has fallen off the bus".
4. Driver escalates to Xid 154 and reports reboot required.
5. Afterwards:
   - `nvidia-smi` => "Unable to determine the device handle ... Unknown Error / 
No devices were found"
   - Live reset attempts do not recover GPU
   - Only full reboot restores operation
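
The post-failure state in step 5 can be detected from a script. The following is
a minimal Python sketch, assuming `nvidia-smi` is on PATH. The helper names
`classify_nvidia_smi` and `gpu_usable` are hypothetical, and the matched strings
are only the ones quoted in this report, so other failure messages would be
classified as a generic error.

```python
import subprocess

# Substrings quoted in this report for the failed state.
FAILURE_MARKERS = (
    "Unable to determine the device handle",
    "No devices were found",
)


def classify_nvidia_smi(returncode: int, output: str) -> str:
    """Classify one nvidia-smi run as 'ok', 'gpu-lost' or 'error'."""
    if any(marker in output for marker in FAILURE_MARKERS):
        return "gpu-lost"
    return "ok" if returncode == 0 else "error"


def gpu_usable(timeout_s: float = 30.0) -> bool:
    """Run nvidia-smi once and report whether the GPU looks usable."""
    try:
        proc = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # A missing binary or a hung call is treated as "not usable".
        return False
    return classify_nvidia_smi(proc.returncode, proc.stdout + proc.stderr) == "ok"
```

Calling `gpu_usable()` before and after each workload run gives a simple
pass/fail signal that can be logged alongside the kernel messages.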

Key evidence
- Ollama log:
  - "CUDA error: unspecified launch failure"
  - "ggml_cuda_init: failed to initialize CUDA: unknown error" (after crash)

- Kernel log:
  - "NVRM: Xid (PCI:0000:01:00): 79 ... GPU has fallen off the bus."
  - "NVRM: Xid (PCI:0000:01:00): 154 ... Node Reboot Required"
  - "NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, 
requiring os reboot to recover."

- PCI state after failure:
  - `lspci -vv -s 01:00.0` shows "Unknown header type 7f"
  - GPU remains present on bus ID but is not manageable by NVML/CUDA
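
The kernel and PCI evidence above can be extracted mechanically. The following
is a rough Python sketch, assuming the kernel log is available as text (for
example from `dmesg` or `journalctl -k`). The function names are hypothetical,
and the regular expression is based only on the line format quoted in this
report.

```python
import re

# Matches lines such as: NVRM: Xid (PCI:0000:01:00): 79 ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

# Only the two Xid codes discussed in this report.
XID_NOTES = {
    79: "GPU has fallen off the bus",
    154: "reboot required",
}


def find_xids(log_text: str) -> list[tuple[str, int]]:
    """Return (pci_address, xid) pairs in the order they appear."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


def pci_header_unreadable(lspci_output: str) -> bool:
    """True if lspci reports the 'Unknown header type 7f' seen after failure."""
    return "Unknown header type 7f" in lspci_output
```

A sequence of 79 followed by 154 for the same PCI address, together with
`pci_header_unreadable(...)` returning true, corresponds to the failure
described here.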

Additional signal
- Repeated ACPI error logged around resume events:
  - "ACPI Error: No handler or method for GPE 6B, disabling event"
- Hybrid + runtime power management path is active; failure may be in 
driver/firmware power-state handling under CUDA load.
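
To correlate the failure with runtime power management, the kernel's runtime PM
attributes for the GPU can be sampled before and during a run. The following is
a minimal Python sketch, assuming the standard sysfs layout and the PCI address
0000:01:00.0 taken from the log lines in this report. `read_runtime_pm` is a
hypothetical helper name.

```python
from pathlib import Path

# PCI address taken from the log lines in this report; adjust for other systems.
GPU_SYSFS = Path("/sys/bus/pci/devices/0000:01:00.0")


def read_runtime_pm(device_dir: Path = GPU_SYSFS) -> dict[str, str]:
    """Read the kernel's runtime PM attributes for one PCI device."""
    state = {}
    for name in ("control", "runtime_status"):
        path = device_dir / "power" / name
        try:
            state[name] = path.read_text().strip()
        except OSError:
            # The attribute is missing or unreadable (for example, device gone).
            state[name] = "unavailable"
    return state
```

A `control` value of `auto` means the kernel may runtime-suspend the device;
`runtime_status` shows whether it is currently active or suspended.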

Reproduction (minimal)
1. Boot system normally with nvidia-driver-570-open and PRIME on-demand.
2. Start Ollama and run a CUDA-backed inference request.
3. Repeat requests until failure occurs.
4. Observe Xid 79 / Xid 154 in kernel log and loss of GPU usability until 
reboot.
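
Steps 2 and 3 can be automated. The following is a sketch in Python using only
the standard library, assuming Ollama listens on its default local endpoint and
that a model is already pulled. The model name `llama3`, the prompt text, and
the function names are placeholders, not part of the original report.

```python
import json
import urllib.request

# Assumptions: default local Ollama endpoint and a placeholder model name.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # hypothetical; use any model already pulled locally


def send_request(prompt: str) -> str:
    """Send one non-streaming generate request and return the response text."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]


def repeat_until_failure(send=send_request, max_requests: int = 1000) -> int:
    """Return the 1-based index of the first failing request, or 0 if none failed."""
    for i in range(1, max_requests + 1):
        try:
            send(f"Request {i}: write one sentence about PCIe.")
        except Exception as exc:  # any transport or server error counts as failure
            print(f"request {i} failed: {exc}")
            return i
    return 0
```

A failed request is only a proxy for the GPU fault, so the kernel log should
still be checked for Xid 79 / Xid 154 after the loop stops.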

Expected
- CUDA workload should either succeed or fail gracefully without losing 
PCIe/NVML access to GPU.

Actual
- Driver loses GPU from bus (Xid 79), enters non-recoverable state (Xid 154), 
requires reboot.

Current root-cause statement
- Immediate root cause (confirmed): NVIDIA kernel driver enters GPU-lost 
condition (`Xid 79`, bus loss) during Ollama CUDA activity, then marks GPU 
unrecoverable (`Xid 154`, reboot required).
- Most likely underlying cause: bug/regression in NVIDIA open-kernel 
hybrid-runtime-power-management path (PCIe/power-state transition + CUDA load), 
potentially aggravated by resume/ACPI event issues.
- Note: user-space Ollama triggers the path, but the fatal condition is in the 
GPU driver/kernel stack.

What has already been tried
- Restarting Ollama service: does not recover GPU.
- Runtime reset attempts: no recovery.
- Reboot: consistently recovers the GPU (the only remedy found so far).

** Affects: nvidia-graphics-drivers-570 (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2142940

Title:
  NVIDIA open driver crash under Ollama CUDA workload (Xid 79 -> GPU
  fallen off bus -> reboot required)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-570/+bug/2142940/+subscriptions

