Re: TTM refcount problem.
On 16.10.19 at 12:09, Bas Nieuwenhuizen wrote:
> On Mon, Jul 29, 2019 at 11:32 AM Christian König wrote:
> > > Is this a known issue?
> > No, that looks like a new one to me.
> >
> > Is that somehow reproducible?
>
> I tried finding a reliable reproducer (only Vulkan CTS runs uncommonly
> caught it), but could not find anything better.
>
> However this issue seems to be fixed with one of the following patches
> from drm-misc-fixes:
>
> "drm/ttm: fix handling in ttm_bo_add_mem_to_lru"
> "drm/ttm: fix busy reference in ttm_mem_evict_first"
>
> I haven't seen the issue in 100 CTS runs.

Thanks for the information.

I'm currently completely reworking the handling and trying to get rid of all the reference dropping which just results in a BUG().

Issues like that one will then hopefully completely disappear.

Regards,
Christian.

> Thanks,
> Bas
>
> > Christian.
> >
> > On 29.07.19 at 10:14, Bas Nieuwenhuizen wrote:
> > > Hi all,
> > >
> > > I have a TTM refcount issue:
> > >
> > > [173774.309970] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:202!
> > > [...]
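Editorial sketch: Christian's remark about reference dropping that "just results in a BUG()", together with the `ttm_bo_ref_bug` frame in the reported RIP, suggests a put operation whose release callback is a deliberate trap. The block below is a minimal userspace C model of that pattern, not the TTM code itself. The names `struct ref`, `ref_put`, `ref_bug` and `ref_release` are hypothetical and introduced only for illustration, and the assumption that the kernel path behaves this way is not confirmed anywhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace model of a reference counter (not the TTM API). */
struct ref {
    int count;
};

typedef void (*release_fn)(struct ref *r);

/*
 * Drop one reference. If the count reaches zero, call the release
 * callback supplied by this particular call site.
 * Returns true if this call dropped the last reference.
 */
static bool ref_put(struct ref *r, release_fn release)
{
    assert(r->count > 0);
    if (--r->count == 0) {
        release(r);
        return true;
    }
    return false;
}

/*
 * Trap callback for call sites that assume someone else still holds a
 * reference. If the count is already too low, the put ends here.
 */
static void ref_bug(struct ref *r)
{
    (void)r;
    fprintf(stderr, "BUG: last reference dropped at an unexpected site\n");
    abort();
}

/* Ordinary release callback for the site that is allowed to free. */
static int released;

static void ref_release(struct ref *r)
{
    (void)r;
    released++;
}
```

Starting from a count of 2, `ref_put(&r, ref_bug)` returns false and leaves the count at 1. A second `ref_put(&r, ref_bug)` would abort, which is the failure mode modelled here; passing `ref_release` instead lets the final drop complete normally.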
Re: TTM refcount problem.
On Mon, Jul 29, 2019 at 11:32 AM Christian König wrote:
> > Is this a known issue?
> No, that looks like a new one to me.
>
> Is that somehow reproducible?

I tried finding a reliable reproducer (only Vulkan CTS runs uncommonly caught it), but could not find anything better.

However this issue seems to be fixed with one of the following patches from drm-misc-fixes:

"drm/ttm: fix handling in ttm_bo_add_mem_to_lru"
"drm/ttm: fix busy reference in ttm_mem_evict_first"

I haven't seen the issue in 100 CTS runs.

Thanks,
Bas

> Christian.
>
> On 29.07.19 at 10:14, Bas Nieuwenhuizen wrote:
> > Hi all,
> >
> > I have a TTM refcount issue:
> >
> > [173774.309970] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:202!
> > [...]
Re: TTM refcount problem.
> Is this a known issue?

No, that looks like a new one to me.

Is that somehow reproducible?

Christian.

On 29.07.19 at 10:14, Bas Nieuwenhuizen wrote:
> Hi all,
>
> I have a TTM refcount issue:
>
> [173774.309968] [ cut here ]
> [173774.309970] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:202!
> [173774.309982] invalid opcode: [#1] PREEMPT SMP NOPTI
> [173774.309985] CPU: 13 PID: 128214 Comm: kworker/13:2 Not tainted 5.2.0-rc1-g3f2e519b0974 #10
> [173774.309986] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P1.50 09/05/2017
> [173774.309995] Workqueue: events ttm_bo_delayed_workqueue [ttm]
> [173774.31] RIP: 0010:ttm_bo_ref_bug+0x5/0x10 [ttm]
> [173774.310002] Code: c0 c3 b8 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 f0 ff 8f a4 00 00 00 c3 0f 1f 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48 8b 07 48 89
> [173774.310003] RSP: 0018:b42e5589bde8 EFLAGS: 00010246
> [173774.310005] RAX: b42e5589be40 RBX: 9395fd0cd908 RCX: 9395fd0cd8f8
> [173774.310006] RDX: b42e5589be40 RSI: 939b59b64f18 RDI: 9395fd0cd87c
> [173774.310007] RBP: c0930f40 R08: 0014 R09: c091f100
> [173774.310008] R10: 9399f69b0800 R11: 0001 R12:
> [173774.310009] R13: 9395fd0cd850 R14: 0001 R15: 0001
> [173774.310010] FS: () GS:939b7d34() knlGS:
> [173774.310011] CS: 0010 DS: ES: CR0: 80050033
> [173774.310012] CR2: 7f4f64008838 CR3: 000643baa000 CR4: 003406e0
> [173774.310013] Call Trace:
> [173774.310019] ttm_bo_cleanup_refs+0x160/0x1e0 [ttm]
> [173774.310025] ttm_bo_delayed_delete+0xa8/0x1e0 [ttm]
> [173774.310029] ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
> [173774.310033] process_one_work+0x1fd/0x430
> [173774.310036] worker_thread+0x2d/0x3d0
> [173774.310038] ? process_one_work+0x430/0x430
> [173774.310040] kthread+0x112/0x130
> [173774.310042] ? kthread_create_on_node+0x60/0x60
> [173774.310045] ret_from_fork+0x22/0x40
> [173774.310048] Modules linked in: fuse nct6775 hwmon_vid nls_iso8859_1 nls_cp437 vfat fat edac_mce_amd kvm_amd kvm irqbypass amdgpu arc4 iwlmvm mac80211 snd_usb_audio uvcvideo snd_usbmidi_lib videobuf2_vmalloc crct10dif_pclmul videobuf2_memops snd_hda_codec_realtek videobuf2_v4l2 btusb gpu_sched snd_rawmidi videobuf2_common snd_hda_codec_generic btrtl videodev crc32_pclmul btbcm snd_seq_device ledtrig_audio ttm btintel ghash_clmulni_intel wmi_bmof mxm_wmi snd_hda_codec_hdmi media bluetooth drm_kms_helper iwlwifi snd_hda_intel drm aesni_intel snd_hda_codec joydev input_leds aes_x86_64 snd_hda_core mousedev evdev crypto_simd cryptd ecdh_generic led_class agpgart snd_hwdep mac_hid cdc_acm glue_helper ecc snd_pcm igb syscopyarea pcspkr cfg80211 sysfillrect snd_timer sysimgblt snd fb_sys_fops ccp ptp soundcore pps_core rng_core k10temp i2c_algo_bit sp5100_tco dca i2c_piix4 rfkill wmi pcc_cpufreq button acpi_cpufreq sch_fq_codel ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sd_mod
> [173774.310085] hid_generic usbhid hid crc32c_intel ahci xhci_pci libahci xhci_hcd libata usbcore scsi_mod usb_common
> [173774.310094] ---[ end trace 1f8d21980c0b3fd5 ]---
> [173774.310097] RIP: 0010:ttm_bo_ref_bug+0x5/0x10 [ttm]
> [173774.310099] Code: c0 c3 b8 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 f0 ff 8f a4 00 00 00 c3 0f 1f 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48 8b 07 48 89
> [173774.310100] RSP: 0018:b42e5589bde8 EFLAGS: 00010246
> [173774.310101] RAX: b42e5589be40 RBX: 9395fd0cd908 RCX: 9395fd0cd8f8
> [173774.310102] RDX: b42e5589be40 RSI: 939b59b64f18 RDI: 9395fd0cd87c
> [173774.310103] RBP: c0930f40 R08: 0014 R09: c091f100
> [173774.310104] R10: 9399f69b0800 R11: 0001 R12:
> [173774.310104] R13: 9395fd0cd850 R14: 0001 R15: 0001
> [173774.310106] FS: () GS:939b7d34() knlGS:
> [173774.310107] CS: 0010 DS: ES: CR0: 80050033
> [173774.310107] CR2: 7f4f64008838 CR3: 000643baa000 CR4: 003406e0
> [173774.310110] note: kworker/13:2[128214] exited with preempt_count 1
>
> With amd-staging-drm-next:
>
> commit 20d6b9c3b7f40ec427af912d140f2be0de098d2d (origin/amd-staging-drm-next)
> Author: Gustavo A. R. Silva
> Date: Mon Jul 22 12:47:16 2019 -0500
>
>     drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning
>
> with a Vega10.
>
> Is this a known issue?
>
> Thanks,
> Bas

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx