Hello,

For some reason, over the past week or so this bug has been freezing my
machine every couple of days (I’m surprised that AMD hasn’t been able
to reproduce the problem yet¹). You can imagine how “pleasant” that makes
using this computer.

Today I got an interesting error in dmesg, which may provide a
clue:

[38454.299445] ------------[ cut here ]------------
[38454.299449] refcount_t: underflow; use-after-free.
[38454.299457] WARNING: CPU: 5 PID: 17577 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
[38454.299465] Modules linked in: overlay ccm rfcomm xt_CHECKSUM xt_MASQUERADE 
xt_conntrack ipt_REJECT xt_tcpudp nft_compat nft_counter nft_objref 
nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject 
nft_ct bridge stp llc nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set nf_tables nfnetlink cmac algif_hash algif_skcipher af_alg 
bnep binfmt_misc nls_iso8859_1 snd_hda_codec_generic ledtrig_audio 
snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel 
soundwire_generic_allocation soundwire_cadence snd_hda_codec snd_hda_core 
snd_hwdep soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine 
intel_rapl_msr intel_rapl_common joydev snd_pcm edac_mce_amd snd_seq_midi 
ath10k_pci ath10k_core snd_seq_midi_event kvm_amd snd_rawmidi ath mac80211 kvm 
uvcvideo snd_seq btusb videobuf2_vmalloc rapl videobuf2_memops videobuf2_v4l2 
videobuf2_common btrtl input_leds
[38454.299510]  serio_raw btbcm videodev btintel wmi_bmof snd_seq_device 
efi_pstore bluetooth snd_timer mc cfg80211 k10temp ecdh_generic snd ecc 
ideapad_laptop ccp libarc4 sparse_keymap soundcore elan_i2c mac_hid 
sch_fq_codel msr parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs 
blake2b_generic xor raid6_pq libcrc32c dm_crypt zstd zram z3fold amdgpu 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iommu_v2 gpu_sched 
aesni_intel i2c_algo_bit drm_ttm_helper ttm crypto_simd cryptd glue_helper 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm 
i2c_piix4 nvme xhci_pci i2c_hid xhci_pci_renesas nvme_core wmi video hid
[38454.299550] CPU: 5 PID: 17577 Comm: kworker/u32:18 Not tainted 5.11.0-25-generic #27-Ubuntu
[38454.299552] Hardware name: LENOVO 81V7/LNVNB161216, BIOS BUCN23WW 11/05/2019
[38454.299554] Workqueue: events_unbound async_run_entry_fn
[38454.299559] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[38454.299562] Code: f8 1c 96 01 01 e8 9f f1 62 00 0f 0b 5d c3 80 3d e5 1c 96 01 00 75 91 48 c7 c7 e8 c7 60 b9 c6 05 d5 1c 96 01 01 e8 7f f1 62 00 <0f> 0b 5d c3 80 3d c3 1c 96 01 00 0f 85 6d ff ff ff 48 c7 c7 40 c8
[38454.299564] RSP: 0018:ffffb60383537b58 EFLAGS: 00010282
[38454.299566] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8d4578b58ac8
[38454.299567] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8d4578b58ac0
[38454.299568] RBP: ffffb60383537b58 R08: ffffffffb9c73540 R09: ffffb60383537af0
[38454.299569] R10: 000000002d2d2d2d R11: ffffb603835379e8 R12: ffff8d44cf64d000
[38454.299570] R13: 0000000000000000 R14: ffffb6038b8cd000 R15: 0000000000000004
[38454.299571] FS:  0000000000000000(0000) GS:ffff8d4578b40000(0000) knlGS:0000000000000000
[38454.299572] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38454.299574] CR2: 0000000000000000 CR3: 000000016ae10000 CR4: 00000000003506e0
[38454.299575] Call Trace:
[38454.299578]  dc_stream_release+0x78/0x80 [amdgpu]
[38454.299751]  dc_resource_state_destruct+0x58/0x80 [amdgpu]
[38454.299904]  dc_release_state+0x2f/0x60 [amdgpu]
[38454.300055]  dm_atomic_destroy_state+0x21/0x30 [amdgpu]
[38454.300211]  drm_atomic_state_default_clear+0x23d/0x2f0 [drm]
[38454.300236]  __drm_atomic_state_free+0x5e/0xa0 [drm]
[38454.300257]  drm_atomic_helper_resume+0x12b/0x150 [drm_kms_helper]
[38454.300271]  dm_resume+0x2bd/0x540 [amdgpu]
[38454.300427]  amdgpu_device_ip_resume_phase2+0x58/0xc0 [amdgpu]
[38454.300531]  amdgpu_device_resume+0x8d/0x370 [amdgpu]
[38454.300635]  ? native_queued_spin_lock_slowpath+0x2b/0x30
[38454.300638]  ? _raw_spin_lock_irq+0x26/0x2a
[38454.300642]  ? __wait_for_common+0xfb/0x150
[38454.300644]  amdgpu_pmops_resume+0x17/0x20 [amdgpu]
[38454.300748]  pci_pm_resume+0x6b/0xf0
[38454.300751]  ? pci_pm_poweroff_noirq+0x120/0x120
[38454.300752]  dpm_run_callback+0x50/0x110
[38454.300755]  device_resume+0xad/0x200
[38454.300757]  async_resume+0x1e/0x40
[38454.300759]  async_run_entry_fn+0x3c/0x150
[38454.300761]  process_one_work+0x220/0x3c0
[38454.300764]  worker_thread+0x50/0x370
[38454.300765]  kthread+0x12f/0x150
[38454.300767]  ? process_one_work+0x3c0/0x3c0
[38454.300768]  ? __kthread_bind_mask+0x70/0x70
[38454.300770]  ret_from_fork+0x22/0x30
[38454.300775] ---[ end trace 1f54ad57671def2f ]---

Note that immediately before it there’s a page allocation failure during
wake-up from suspend, so I suspect a refcounting bug in an error path
somewhere.

Much later there’s the familiar “retry page fault” problem, which is what
made me look in dmesg.
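
In case it helps anyone going through the attached log, here is a minimal
sketch that lists the three events mentioned above in the order they occur.
The marker strings are taken from this report and from the bug title; the
labels, the function name and the script name are my own inventions, and the
script assumes the usual "[seconds.micros]" dmesg timestamp prefix.

```python
import re
import sys

# Substrings as they appear in this report and in the bug title.
# The dictionary keys are illustrative labels, not kernel terminology.
MARKERS = {
    "alloc_failure": "page allocation failure",
    "refcount_underflow": "refcount_t: underflow",
    "retry_page_fault": "retry page fault",
}

# dmesg prefixes each entry with "[seconds.microseconds]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_events(log_text):
    """Return (timestamp, label) pairs in the order they appear in the log."""
    events = []
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if not match:
            continue  # wrapped continuation lines carry no timestamp
        for label, needle in MARKERS.items():
            if needle in line:
                events.append((float(match.group(1)), label))
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name):
    #   python3 find_events.py dmesg-refcount-underflow.log
    with open(sys.argv[1], errors="replace") as log:
        for stamp, label in find_events(log.read()):
            print(f"{stamp:14.6f}  {label}")
```

An allocation failure a few milliseconds before the underflow, with the
retry page faults only much later, would fit the sequence described above.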

The amdgpu/picasso* files are the same as in the linux-firmware ‘main’ branch
as of today: commit 168452ee695b ("Merge tag 'iwlwifi-fw-2021-07-19' of
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/linux-firmware into
main"). The kernel is Ubuntu’s 5.11.0-25-generic.

The full log is attached. Hope this helps.

Regards,
Thiago.


¹ I suggest trying to reproduce with the following steps:

1. Grab a laptop with a Picasso integrated GPU.
2. Install Kubuntu 21.04 on it.
3. Install and use Firefox from Flathub:
   https://www.flathub.org/apps/details/org.mozilla.firefox
4. Log into KDE and run the above Firefox.
5. With the laptop on battery power (I have the impression that the bug is
   easier to trigger that way), suspend and resume the machine a few times.
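
Step 5 could be automated roughly as follows. This is only a sketch under
some assumptions: it needs root, it relies on rtcwake from util-linux to
suspend to RAM and wake the machine via the RTC alarm, and the 30-second
sleep, the function names and the script name are arbitrary choices of mine.
I have not verified that automated suspends trigger the bug as readily as
closing the lid does.

```python
import subprocess
import sys

# Messages seen in this report; either one counts as a reproduction.
NEEDLES = ("refcount_t: underflow", "retry page fault")


def build_rtcwake_command(seconds):
    """Suspend to RAM and let the RTC alarm wake the machine after `seconds`."""
    return ["rtcwake", "-m", "mem", "-s", str(seconds)]


def log_has_problem(dmesg_text):
    """True if the kernel log contains any of the known messages."""
    return any(needle in dmesg_text for needle in NEEDLES)


def run_cycles(cycles, seconds=30):
    """Suspend/resume up to `cycles` times; return the cycle that hit the bug.

    Start from a fresh boot, otherwise an old message in the kernel log
    makes the first cycle look like a reproduction.
    """
    for cycle in range(1, cycles + 1):
        subprocess.run(build_rtcwake_command(seconds), check=True)
        dmesg = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        ).stdout
        if log_has_problem(dmesg):
            return cycle
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): sudo python3 suspend_loop.py 20
    hit = run_cycles(int(sys.argv[1]))
    print(f"reproduced on cycle {hit}" if hit else "not reproduced")
```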


** Attachment added: "dmesg-refcount-underflow.log"
   https://bugs.launchpad.net/bugs/1928393/+attachment/5515846/+files/dmesg-refcount-underflow.log

-- 
You received this bug notification because you are a member of Ubuntu-X,
which is subscribed to mesa in Ubuntu.
https://bugs.launchpad.net/bugs/1928393

Title:
  linux-firmware 1.197 causes kernel to report error "amdgpu: [gfxhub0]
  retry page fault"

To manage notifications about this bug go to:
https://bugs.launchpad.net/amd/+bug/1928393/+subscriptions


