Bug#1002978: Another GPU hang

2022-01-17 CJ Fearnley
OK, now we know that the i915 problem in my environment is NOT fixed by
the microcode alone. I get GPU hangs whether the microcode is up-to-date
or not.

This boot had NO kernel command line options (besides Debian defaults):
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

I don't remember if I added pcie_aspm=force or if Debian did. I've had that
forever. I will try removing it next, but I probably need it for my Realtek
Ethernet that uses the 8168 driver. Currently, I have:
$ cat  /sys/module/pcie_aspm/parameters/policy
[default] performance powersave powersupersave
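
The active ASPM policy is the bracketed entry in that file. A minimal
Python sketch for pulling it out, assuming the sysfs file keeps the
one-line format shown above (the function name is only illustrative and
not part of any existing tool):

```python
import re


def active_aspm_policy(text):
    """Return the policy name enclosed in [brackets], or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_active_aspm_policy(path="/sys/module/pcie_aspm/parameters/policy"):
    """Read the sysfs file (path taken from the command above) and parse it."""
    with open(path) as handle:
        return active_aspm_policy(handle.read())
```

Calling active_aspm_policy() on the line quoted above yields "default",
meaning the kernel is using the ASPM settings left in place by the
firmware rather than one of the named power policies.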

Here is the output of
/sys/class/drm/card0/error:
GPU HANG: ecode 6:2:bb86, in systemd-logind [662]
Kernel: 5.10.0-10-amd64 x86_64
Driver: 20200917
Time: 1642463910 s 885852 us
Boottime: 7195 s 975866 us
Uptime: 7189 s 236627 us
Capture: 4296691264 jiffies; 543240 ms ago
Active process (on ring bcs0): systemd-logind [662]
Reset count: 0
Suspend count: 0
Platform: SANDYBRIDGE
Subplatform: 0x0
PCI ID: 0x0102
PCI Revision: 0x09
PCI Subsystem: 1565:110d
IOMMU enabled?: 0
RPM wakelock: yes
PM suspended: no
GT awake: yes
EIR: 0x
IER: 0x82bc8585
GTIER[0]: 0x00401001
PGTBL_ER: 0x
FORCEWAKE: 0x
DERRMR: 0x
  fence[0] = fde03b007f6001
  fence[1] = 122600701127003
  fence[2] = 
  fence[3] = 
  fence[4] = 
  fence[5] = 
  fence[6] = 
  fence[7] = 
  fence[8] = 
  fence[9] = 
  fence[10] = 
  fence[11] = 
  fence[12] = 
  fence[13] = 
  fence[14] = 
  fence[15] = 
ERROR: 0x
DONE_REG: 0x
bcs0 command stream:
  CCID:  0x
  START: 0x5000
  HEAD:  0x0fe034d0 [0x3458]
  TAIL:  0x36f8 [0x34d0, 0x34e0]
  CTL:   0x3001
  MODE:  0x
  HWS:   0x7fffb000
  ACTHD: 0x 7fff3ee8
  IPEIR: 0x0008
  IPEHR: 0x4477
  ESR:   0x
  INSTDONE: 0xfff1
  batch: [0x_7fff3000, 0x_7fff7000]
  BBADDR: 0x_7fff3ee3
  BB_STATE: 0x
  INSTPS: 0x
  INSTPM: 0x
  FADDR: 0x 7fff3f80
  RC PSMI: 0x0010
  FAULT_REG: 0x
  GFX_MODE: 0x
  PP_DIR_BASE: 0x
  engine reset count: 0
  Active context: systemd-logind[662] prio 0, guilty 0 active 0, runtime total 0ns, avg 0ns
bcs0 --- HW Status = 0x 7fffb000
:cL%-H!!!Tc0;,4aK$Fk6JP^IKP"K!4)Z.
bcs0 --- batch = 0x 7fff3000
:=m_`(+@[H9*1>F@9kb_ZL,tUu]nJ#WN6g$ROq$#UUsrA'--e$d_R^0)luEdKiJD?%4+2TScP/afS7MKb-!')6*QmkM"Blk<@4s1<;EjkJ#1.f"Jfpej!qVZa!XN-KA[pR`*rg0HQ5T%GpPsj+--CS"%=$jQ6kk;s1f^q.Lo'!^l]kiSj8M4Fpl!m3?ri7jfXZH[J]1gY4)\@FT%>S]p_nQEd7.q"nNW"9#LIO!8f2Y>V`Jhe43XiCr++i\Q2=(BsA+ZA/pcS]Z=Ym?gVZd/E`ZT&_2+3=(L'/[WV<0?.F3NSo2Oj9Cg([C?<5#tprrmU]Zng5G_:V_--7.X#hjfBhpDhh/bZ]+CY_C'XBK6%TZe2aK!L\UF%0cWmV_Fo48LFmI8%_9+8])9De%WAP:P4Etm-\s!sB7#I,r0L5G?uhuUQ1(uT"5Vg6^mt.DBZb?qq+mUQhiIBR$%i:B(*7-N4s<^:N!dt0cQ0s5s+k0m$jtCU4,i)BB41=kiWo?S!jlZkY92ou0D;o2B[Qju>l!LcIsr@W&,T%9=u7*=HVitpV_`;n'2/baJ85fK^o\Lm<[=F;%./##ZtX0k1t5G*nI%E`?ri7fg%PrgQW.=8TNbZO07Bbl9n`NK0q=S1U%Y;lIt9ZX9tOb`e^%N(2--N-[ISbgeH9pu:cDKWggP-@,8@f"=jhqQW,C/LBgIZZp]`dKMO3QiYh(dsMIgI==2lY6)lu$H#3R`]N#9.[!KO-CR/!RC[1CM\'Y1J_6+Y3C$t)7T!";cr0l`'F2P-:3]"o[p[CE\'(:/^WTT/o<:#Zq.%`RH[RiaNd@g)P\C0=^al!DC.Tegj)\_2A5Q2>o#$MBn1WqElTkF:]'JXff4gP1@1Rqn/%ab**p1GI$cQBcCm.^G<:MT5?jpBSX-i4pr>C=C`c+;r@Gf%a0Ni$0"(k-nHI#(>;*/+ikG<4+CA?5HoRCs2gHkbUs9YN?IYr:r_1qS^AOT_#[We7Gck_IH?BItbqE?EGdslc-UZ?!b'QhsmeU^0mIp\1d1Pf=lSh.;`C,H'B8O3fC_35KF7"[MU?QUqa\BP3fDM^cJ,)RZ?h8iTN.:D`2\Z4Y+Pm#^s'd'&^ZP\m?JjigR=ECTngA4o7D+F;+qia#f9%M#6f$flnfY'qh0BW6OUkbdXlN.,'%ZUshF)d;6Z4/peY]C_rF*AR+s%n:F[h"t(NW%Zjm,.anGi74SYLL99tc"!j?pWAC#l>.12;=cB!o;^S(9!eslWhor%+O,@u2Mhl#AEp\C./^\d!up5K%C@?n2CK.hNi?'FjCEpV$$BlWg]$97gC]_!VMK700nM4@Bk02k50MGt:;(0Zj=hC@OU-KN/4*M[2a05D<0NV:u0Xc^Wkd>4@EYr\)ti8,hH$KiZPPn")qI/UMkf0Uk.g*XU*saY>akni+,Nq:?^us*f`5O7u4VE=B\->WQCt9WJY1d8Ra!\r\?]BcKI8`&^C"fPpqSLeNgOaPmRP+EBGr,VFE'o>s+U%S1u-'ORXS*J^dgLdI/#'QF/k,VTH/o9%;<3E_6^2QpUPMYbKW*l3NZb>]"iH.2*'i"n*Cg[KE=YpbN>t2?`"PWPZuHN%&?Qp;IUjBfiT,aY%aj?gOoqK$8!4[D"jt79*(+gf01QR:E:`!!',7
available engines: 7
slice total: 0, mask=
subslice total: 0
EU total: 0
EU per subslice: 0
has slice power gating: no
has subslice power gating: no
has EU power gating: no
Unavailable
Num Pipes: 2
Pipe [0]:
  Power: on
  SRC: 077f0437
  STAT: 
Plane [0]:
  CNTR: d8004400
  STRIDE: 1e00
  ADDR: 
  SURF: 007f6000
  TILEOFF: 
Cursor [0]:
  CNTR: 04004027
  POS: 0277047e
  BASE: 00fdf000
Pipe [1]:
  Power: on
  SRC: 
  STAT: 
Plane [1]:
  CNTR: 4000
  STRIDE: 
  ADDR: 
  SURF: 
  TILEOFF: 
Cursor [1]:
  CNTR: 
  POS: 
  BASE: 
CPU transcoder: A
  Power: on
  CONF: c000
  HTOTAL: 0897077f
  HBLANK: 0897077f
  HSYNC: 080307d7
  VTOTAL: 04640437
  VBLANK: 04640437
  VSYNC: 0440043b
CPU transcoder: B
  Power: on
  CONF: 
  HTOTAL: 
  HBLANK: 
  HSYNC: 
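
For comparing several of these dumps, the first few header lines carry
most of the identifying information. The following Python sketch parses
the "GPU HANG" line and the "Time" line. It assumes the line layout seen
in the dump above. The field names (gen, engines, hash) are my reading
of the three colon-separated ecode values and should be treated as an
assumption, not as documented i915 terminology.

```python
import re
from datetime import datetime, timezone

# Layout assumed from the dump above: "GPU HANG: ecode A:B:C, in NAME [PID]".
# The ", in NAME [PID]" part is optional because one dump lacks it.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-fA-F]*):(?P<hash>[0-9a-fA-F]*)"
    r"(?:, in (?P<comm>.+?) \[(?P<pid>\d+)\])?"
)
TIME_RE = re.compile(r"^Time: (?P<sec>\d+) s (?P<usec>\d+) us")


def parse_error_header(lines):
    """Collect the hang code fields and the capture time from an error dump."""
    info = {}
    for line in lines:
        hang = HANG_RE.match(line)
        if hang:
            # Keep only the groups that actually matched.
            info.update({k: v for k, v in hang.groupdict().items() if v is not None})
            continue
        stamp = TIME_RE.match(line)
        if stamp:
            # "Time" is seconds since the Unix epoch; convert to UTC.
            info["time_utc"] = datetime.fromtimestamp(int(stamp["sec"]), tz=timezone.utc)
    return info
```

Feeding it the lines of /sys/class/drm/card0/error returns a small dict,
which makes it easy to see whether two hangs share the same ecode and
which process was active on the ring.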

Bug#1002978: Another GPU hang

2022-01-17 CJ Fearnley
I rebooted with these grub config options:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force intel_idle.max_cstate=1"

In addition to an almost immediate GPU hang, my Ethernet card (r8169
driver) failed to load its firmware, causing a full network outage.

I rebooted to see if it was a fluke, but no: evidently, with the two
boot options (i915.disable_power_well=1 i915.enable_dc=0) removed and
the new microcode installed, intel_idle.max_cstate=1 is incompatible
with my r8169 Ethernet card. I know that because I fixed it by removing
intel_idle.max_cstate=1 in grub.

My strategy is to explore whether the microcode is enough to fix the problem.
But because intel_idle.max_cstate=1 doesn't cause a tainted kernel
warning, I wrongly thought it was an innocuous option. Hence the learning
experience described above.
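
Since the tests in this thread differ only in a few boot options, it
helps to list exactly which options changed between two boots. A small
Python sketch, assuming the options are whitespace-separated as in
GRUB_CMDLINE_LINUX_DEFAULT (the function name is hypothetical):

```python
def cmdline_diff(before, after):
    """Return (removed, added) option lists between two kernel command lines."""
    old, new = before.split(), after.split()
    removed = [opt for opt in old if opt not in new]
    added = [opt for opt in new if opt not in old]
    return removed, added


# Options from the 2022-01-02 boot versus the boot described in this message.
previous = ("quiet pcie_aspm=force intel_idle.max_cstate=1 "
            "i915.disable_power_well=1 i915.enable_dc=0")
current = "quiet pcie_aspm=force intel_idle.max_cstate=1"
removed, added = cmdline_diff(previous, current)
print("removed:", removed)
print("added:", added)
```

The comparison treats each option as an opaque string, so a changed
value (for example max_cstate=1 versus max_cstate=2) would show up as
one removal plus one addition.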

So far this configuration is stable. The Ethernet is working and there
have been no GPU hangs, though I have experienced a few small hiccups
that might be GPU performance related. These were not too bad and
nothing has appeared in
the error logs. But I only have 45 minutes of uptime. I'll report back
when I have more confidence.

After that GPU hang without network, here is the output of
/sys/class/drm/card0/error:
GPU HANG: ecode 6:0:
Kernel: 5.10.0-10-amd64 x86_64
Driver: 20200917
Time: 1642455365 s 522631 us
Boottime: 40 s 7750 us
Uptime: 33 s 557210 us
Capture: 4294902272 jiffies; 149072 ms ago
Reset count: 0
Suspend count: 0
Platform: SANDYBRIDGE
Subplatform: 0x0
PCI ID: 0x0102
PCI Revision: 0x09
PCI Subsystem: 1565:110d
IOMMU enabled?: 0
RPM wakelock: yes
PM suspended: no
GT awake: yes
EIR: 0x
IER: 0x82bc8585
GTIER[0]: 0x00401001
PGTBL_ER: 0x
FORCEWAKE: 0x
DERRMR: 0x
  fence[0] = fde03b007f6001
  fence[1] = 11f8007010f9003
  fence[2] = 
  fence[3] = 
  fence[4] = 
  fence[5] = 
  fence[6] = 
  fence[7] = 
  fence[8] = 
  fence[9] = 
  fence[10] = 
  fence[11] = 
  fence[12] = 
  fence[13] = 
  fence[14] = 
  fence[15] = 
ERROR: 0x
DONE_REG: 0x
available engines: 7
slice total: 0, mask=
subslice total: 0
EU total: 0
EU per subslice: 0
has slice power gating: no
has subslice power gating: no
has EU power gating: no
Unavailable
Num Pipes: 2
Pipe [0]:
  Power: on
  SRC: 077f0437
  STAT: 
Plane [0]:
  CNTR: d8004400
  STRIDE: 1e00
  ADDR: 
  SURF: 007f6000
  TILEOFF: 
Cursor [0]:
  CNTR: 04004027
  POS: 02db0452
  BASE: 00fdf000
Pipe [1]:
  Power: on
  SRC: 
  STAT: 
Plane [1]:
  CNTR: 4000
  STRIDE: 
  ADDR: 
  SURF: 
  TILEOFF: 
Cursor [1]:
  CNTR: 
  POS: 
  BASE: 
CPU transcoder: A
  Power: on
  CONF: c000
  HTOTAL: 0897077f
  HBLANK: 0897077f
  HSYNC: 080307d7
  VTOTAL: 04640437
  VBLANK: 04640437
  VSYNC: 0440043b
CPU transcoder: B
  Power: on
  CONF: 
  HTOTAL: 
  HBLANK: 
  HSYNC: 
  VTOTAL: 
  VBLANK: 
  VSYNC: 
gen: 6
gt: 1
iommu: disabled
memory-regions: 5
page-sizes: 1000
platform: SANDYBRIDGE
ppgtt-size: 31
ppgtt-type: 1
dma_mask_size: 40
is_mobile: no
is_lp: no
require_force_probe: no
is_dgfx: no
has_64bit_reloc: no
gpu_reset_clobbers_display: no
has_reset_engine: no
has_fpga_dbg: no
has_global_mocs: no
has_gt_uc: no
has_l3_dpf: no
has_llc: yes
has_logical_ring_contexts: no
has_logical_ring_elsq: no
has_logical_ring_preemption: no
has_master_unit_irq: no
has_pooled_eu: no
has_rc6: yes
has_rc6p: yes
has_rps: yes
has_runtime_pm: no
has_snoop: no
has_coherent_ggtt: yes
unfenced_needs_alignment: no
hws_needs_physical: no
cursor_needs_physical: no
has_csr: no
has_ddi: no
has_dp_mst: no
has_dsb: no
has_dsc: no
has_fbc: yes
has_gmch: no
has_hdcp: no
has_hotplug: yes
has_hti: no
has_ipc: no
has_modular_fia: no
has_overlay: no
has_psr: no
has_psr_hw_tracking: no
overlay_needs_physical: no
supports_tv: no
rawclk rate: 125000 kHz
CS timestamp frequency: 1250 Hz
Has logical contexts? yes
scheduler: 0
i915.vbt_firmware=(null)
i915.modeset=-1
i915.lvds_channel_mode=0
i915.panel_use_ssc=-1
i915.vbt_sdvo_panel_type=-1
i915.enable_dc=-1
i915.enable_fbc=0
i915.enable_psr=-1
i915.psr_safest_params=no
i915.enable_psr2_sel_fetch=no
i915.disable_power_well=1
i915.enable_ips=1
i915.invert_brightness=0
i915.enable_guc=0
i915.guc_log_level=-1
i915.guc_firmware_path=(null)
i915.huc_firmware_path=(null)
i915.dmc_firmware_path=(null)
i915.mmio_debug=0
i915.edp_vswing=0
i915.reset=3
i915.inject_probe_failure=0
i915.fastboot=-1
i915.enable_dpcd_backlight=-1
i915.force_probe=
i915.fake_lmem_start=0
i915.enable_hangcheck=yes
i915.load_detect_test=no
i915.force_reset_modeset_test=no
i915.error_capture=yes
i915.disable_display=no
i915.verbose_state_checks=yes
i915.nuclear_pageflip=no
i915.enable_dp_mst=yes
i915.enable_gvt=no
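
The i915.* lines at the end of the dump record the module parameters in
effect when the hang was captured, which is a useful cross-check against
the boot options. In the list above, i915.enable_dc reads -1, so the
i915.enable_dc=0 option was not in effect for that boot. A minimal
Python sketch for collecting these lines into a dict, assuming the
"i915.name=value" layout shown above (the function name is illustrative):

```python
def parse_i915_params(lines):
    """Map parameter name to its value string for every 'i915.name=value' line."""
    params = {}
    for line in lines:
        line = line.strip()
        if line.startswith("i915.") and "=" in line:
            key, _, value = line[len("i915."):].partition("=")
            params[key] = value
    return params
```

Values are kept as strings because the dump mixes integers, yes/no flags,
"(null)" and empty values.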

Here are the full logs from dmesg:
[0.00] microcode: microcode updated early to revision 0x2f, 

Bug#1002978: Another GPU hang

2022-01-02 CJ Fearnley
I rebooted with the GRUB options intel_idle.max_cstate=1
i915.disable_power_well=1 i915.enable_dc=0 which my research shows
sometimes help with these kinds of GPU hangs.

For quite some time performance was acceptable, but less than 20 hours
later I triggered a GPU hang by switching desktops rapidly in fvwm.
I was executing my keyboard equivalents for fvwm's 'Desk 0 9' followed by
'Scroll +0   +100' followed by 'Scroll +0   -100' followed by 'Desk 0 11'
in repeated cycling when the system hung.

Before the hang, the dmesg output reports three lines with 'perf:
interrupt took too long'. These were minor hiccups/freezes that failed to
trigger a crash/CPU hang error in the logs. But I consider them related to
the problem. I was again changing desktops and moving around my desktops
using my keyboard shortcuts mentioned above and the i915 driver could
not cope. Back in Jessie this kind of work was routine for me with no
fear of triggering kernel stack dumps. The i915 driver appears to have
developed major bugs, at least with my chipset.
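
To find these warnings and the hang itself in a long dmesg capture, a
short filter is enough. This Python sketch reports every line containing
one of two marker strings. The first marker is the message quoted above.
The second, "GPU HANG", is the text from the error dumps, and I am
assuming it also appears in the kernel log.

```python
MARKERS = ("perf: interrupt took too long", "GPU HANG")


def find_markers(lines, markers=MARKERS):
    """Return (line_number, marker, text) for each line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for marker in markers:
            if marker in line:
                hits.append((number, marker, line.rstrip("\n")))
                break  # report each line once, under the first marker it matches
    return hits
```

Running it over a saved dmesg file (for example with
find_markers(open("dmesg.txt")), where dmesg.txt is a placeholder name)
shows how far ahead of the hang the perf warnings occurred.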

I have attached dmesg output capturing boot messages and
the crash/hang. The other file contains the output of 'cat
/sys/class/drm/card0/error'.

I am planning to reboot soon to try another test, since daily GPU hangs
will slow my productivity unacceptably. I need to keep debugging in the
hopes that I can find a solution.

-- 
CJ Fearnley |   LinuxForce Inc.
c...@linuxforce.net  |   Hosting and Linux Consulting
https://www.LinuxForce.net  |   https://blog.LinuxForce.net
[0.00] Linux version 5.10.0-10-amd64 (debian-ker...@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.84-1 (2021-12-08)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-10-amd64 root=/dev/mapper/precession-root ro quiet pcie_aspm=force intel_idle.max_cstate=1 i915.disable_power_well=1 i915.enable_dc=0
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Enabled xstate features 0x3, context size is 576 bytes, using 'standard' format.
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1fff] usable
[0.00] BIOS-e820: [mem 0x2000-0x201f] reserved
[0.00] BIOS-e820: [mem 0x2020-0x3fff] usable
[0.00] BIOS-e820: [mem 0x4000-0x401f] reserved
[0.00] BIOS-e820: [mem 0x4020-0xbad92fff] usable
[0.00] BIOS-e820: [mem 0xbad93000-0xbadd9fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xbadda000-0xbade0fff] ACPI data
[0.00] BIOS-e820: [mem 0xbade1000-0xbade1fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xbade2000-0xbae04fff] reserved
[0.00] BIOS-e820: [mem 0xbae05000-0xbae05fff] usable
[0.00] BIOS-e820: [mem 0xbae06000-0xbae17fff] reserved
[0.00] BIOS-e820: [mem 0xbae18000-0xbae24fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xbae25000-0xbae48fff] reserved
[0.00] BIOS-e820: [mem 0xbae49000-0xbae8bfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xbae8c000-0xbaff] usable
[0.00] BIOS-e820: [mem 0xbb80-0xbf9f] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00023fdf] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.7 present.
[0.00] DMI: BIOSTAR Group H61MU3/H61MU3, BIOS 4.6.4 04/07/2011
[0.00] tsc: Fast TSC calibration using PIT
[0.00] tsc: Detected 2394.344 MHz processor
[0.000831] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.000834] e820: remove [mem 0x000a-0x000f] usable
[0.000843] last_pfn = 0x23fe00 max_arch_pfn = 0x4
[0.000847] MTRR default type: uncachable
[0.000848] MTRR fixed ranges enabled:
[0.000850]   0-9 write-back
[0.000851]   A-B uncachable
[0.000852]   C-C write-protect
[0.000853]   D-E7FFF uncachable
[0.000854]   E8000-F write-protect
[0.000854] MTRR variable ranges enabled:
[0.000856]   0 base 0 mask E write-back
[0.000857]   1 base 2 mask FC000 write-back
[0.000859]   2 base 0BB80 mask FFF80 uncachable
[0.000860]   3 base