Re: TTM refcount problem.

2019-10-16 Thread Christian König

On 16.10.19 at 12:09, Bas Nieuwenhuizen wrote:
> On Mon, Jul 29, 2019 at 11:32 AM Christian König wrote:
> > > Is this a known issue?
> > No, that looks like a new one to me.
> >
> > Is that somehow reproducible?
>
> I tried to find a reliable reproducer (so far only Vulkan CTS runs have
> caught it, and only occasionally), but could not find anything better.
>
> However, this issue seems to be fixed by one of the following patches
> from drm-misc-fixes:
>
> "drm/ttm: fix handling in ttm_bo_add_mem_to_lru"
> "drm/ttm: fix busy reference in ttm_mem_evict_first"
>
> I haven't seen the issue in 100 CTS runs.


Thanks for the information.

I'm currently reworking the handling completely and trying to get rid of
all the reference dropping that just results in a BUG().

Issues like that one should then hopefully disappear completely.

Regards,
Christian.
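
The trace in this thread ends in ttm_bo_ref_bug, reached from ttm_bo_cleanup_refs. As a rough illustration of the kind of reference dropping described above, here is a toy Python model. It is not the TTM code: the class and function names (Kref, ref_bug, drop_list_reference) are hypothetical, and the sketch only assumes a refcount whose release callback is supposed to be unreachable at a given call site.

```python
class RefBug(RuntimeError):
    """Stands in for the kernel's BUG(): the 'impossible' case happened."""


class Kref:
    """Toy model of a kernel-style reference count (starts at 1)."""

    def __init__(self):
        self.count = 1

    def get(self):
        self.count += 1

    def put(self, release):
        """Drop one reference; call release() only if it was the last one."""
        self.count -= 1
        if self.count == 0:
            release(self)
            return True
        return False


def ref_bug(kref):
    # Release callback for call sites that believe they never hold the
    # last reference. Reaching it means the accounting is already wrong.
    raise RefBug("dropped the last reference where another holder was expected")


def drop_list_reference(kref):
    """Hypothetical call site: give up a list reference, assuming the
    owner still holds its own reference."""
    return kref.put(ref_bug)


if __name__ == "__main__":
    balanced = Kref()            # owner's reference
    balanced.get()               # the list takes its own reference
    drop_list_reference(balanced)
    print("balanced, remaining references:", balanced.count)

    unbalanced = Kref()          # owner's reference only, list ref never taken
    try:
        drop_list_reference(unbalanced)
    except RefBug as exc:
        print("unbalanced:", exc)
```

In this model the failure shows up at the drop, not at the place where the matching get() went missing, which is one reason such reports can be hard to trace back to their cause.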




Re: TTM refcount problem.

2019-10-16 Thread Bas Nieuwenhuizen
On Mon, Jul 29, 2019 at 11:32 AM Christian König wrote:
>
> > Is this a known issue?
> No, that looks like a new one to me.
>
> Is that somehow reproducible?

I tried to find a reliable reproducer (so far only Vulkan CTS runs have
caught it, and only occasionally), but could not find anything better.

However, this issue seems to be fixed by one of the following patches
from drm-misc-fixes:

"drm/ttm: fix handling in ttm_bo_add_mem_to_lru"
"drm/ttm: fix busy reference in ttm_mem_evict_first"

I haven't seen the issue in 100 CTS runs.

Thanks,
Bas
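
Since the failure was only caught occasionally, it may help to put a number on what 100 clean runs can and cannot show. The sketch below assumes independent runs with a fixed per-run failure probability; the rates used are hypothetical, as the thread does not state how often the bug fired before the patches.

```python
import math


def prob_all_clean(p_fail, runs):
    """Probability of seeing no failure in `runs` independent runs
    if each run fails with probability p_fail (0 < p_fail < 1)."""
    return (1.0 - p_fail) ** runs


def runs_for_confidence(p_fail, confidence):
    """Smallest number of runs after which a bug with per-run rate p_fail
    would have shown up at least once with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical per-run failure rates, for illustration only.
    for p in (0.01, 0.05, 0.10):
        print(f"p_fail={p:.2f}: P(100 clean runs) = {prob_all_clean(p, 100):.4f}")
```

Under these assumptions, 100 clean runs are strong evidence against a bug that used to fire in a few percent of runs, but only weak evidence if the original rate was closer to one in a hundred.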


Re: TTM refcount problem.

2019-07-29 Thread Christian König

> Is this a known issue?

No, that looks like a new one to me.

Is that somehow reproducible?

Christian.

On 29.07.19 at 10:14, Bas Nieuwenhuizen wrote:

Hi all,

I have a TTM refcount issue:

[173774.309968] ------------[ cut here ]------------
[173774.309970] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:202!
[173774.309982] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[173774.309985] CPU: 13 PID: 128214 Comm: kworker/13:2 Not tainted
5.2.0-rc1-g3f2e519b0974 #10
[173774.309986] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X399 Taichi, BIOS P1.50 09/05/2017
[173774.309995] Workqueue: events ttm_bo_delayed_workqueue [ttm]
[173774.310000] RIP: 0010:ttm_bo_ref_bug+0x5/0x10 [ttm]
[173774.310002] Code: c0 c3 b8 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00
00 00 00 66 90 0f 1f 44 00 00 f0 ff 8f a4 00 00 00 c3 0f 1f 00 0f 1f
44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48 8b 07
48 89
[173774.310003] RSP: 0018:b42e5589bde8 EFLAGS: 00010246
[173774.310005] RAX: b42e5589be40 RBX: 9395fd0cd908 RCX:
9395fd0cd8f8
[173774.310006] RDX: b42e5589be40 RSI: 939b59b64f18 RDI:
9395fd0cd87c
[173774.310007] RBP: c0930f40 R08: 0014 R09:
c091f100
[173774.310008] R10: 9399f69b0800 R11: 0001 R12:

[173774.310009] R13: 9395fd0cd850 R14: 0001 R15:
0001
[173774.310010] FS:  () GS:939b7d34()
knlGS:
[173774.310011] CS:  0010 DS:  ES:  CR0: 80050033
[173774.310012] CR2: 7f4f64008838 CR3: 000643baa000 CR4:
003406e0
[173774.310013] Call Trace:
[173774.310019]  ttm_bo_cleanup_refs+0x160/0x1e0 [ttm]
[173774.310025]  ttm_bo_delayed_delete+0xa8/0x1e0 [ttm]
[173774.310029]  ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
[173774.310033]  process_one_work+0x1fd/0x430
[173774.310036]  worker_thread+0x2d/0x3d0
[173774.310038]  ? process_one_work+0x430/0x430
[173774.310040]  kthread+0x112/0x130
[173774.310042]  ? kthread_create_on_node+0x60/0x60
[173774.310045]  ret_from_fork+0x22/0x40
[173774.310048] Modules linked in: fuse nct6775 hwmon_vid
nls_iso8859_1 nls_cp437 vfat fat edac_mce_amd kvm_amd kvm irqbypass
amdgpu arc4 iwlmvm mac80211 snd_usb_audio uvcvideo snd_usbmidi_lib
videobuf2_vmalloc crct10dif_pclmul videobuf2_memops
snd_hda_codec_realtek videobuf2_v4l2 btusb gpu_sched snd_rawmidi
videobuf2_common snd_hda_codec_generic btrtl videodev crc32_pclmul
btbcm snd_seq_device ledtrig_audio ttm btintel ghash_clmulni_intel
wmi_bmof mxm_wmi snd_hda_codec_hdmi media bluetooth drm_kms_helper
iwlwifi snd_hda_intel drm aesni_intel snd_hda_codec joydev input_leds
aes_x86_64 snd_hda_core mousedev evdev crypto_simd cryptd ecdh_generic
led_class agpgart snd_hwdep mac_hid cdc_acm glue_helper ecc snd_pcm
igb syscopyarea pcspkr cfg80211 sysfillrect snd_timer sysimgblt snd
fb_sys_fops ccp ptp soundcore pps_core rng_core k10temp i2c_algo_bit
sp5100_tco dca i2c_piix4 rfkill wmi pcc_cpufreq button acpi_cpufreq
sch_fq_codel ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
sd_mod
[173774.310085]  hid_generic usbhid hid crc32c_intel ahci xhci_pci
libahci xhci_hcd libata usbcore scsi_mod usb_common
[173774.310094] ---[ end trace 1f8d21980c0b3fd5 ]---
[173774.310097] RIP: 0010:ttm_bo_ref_bug+0x5/0x10 [ttm]
[173774.310099] Code: c0 c3 b8 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00
00 00 00 66 90 0f 1f 44 00 00 f0 ff 8f a4 00 00 00 c3 0f 1f 00 0f 1f
44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48 8b 07
48 89
[173774.310100] RSP: 0018:b42e5589bde8 EFLAGS: 00010246
[173774.310101] RAX: b42e5589be40 RBX: 9395fd0cd908 RCX:
9395fd0cd8f8
[173774.310102] RDX: b42e5589be40 RSI: 939b59b64f18 RDI:
9395fd0cd87c
[173774.310103] RBP: c0930f40 R08: 0014 R09:
c091f100
[173774.310104] R10: 9399f69b0800 R11: 0001 R12:

[173774.310104] R13: 9395fd0cd850 R14: 0001 R15:
0001
[173774.310106] FS:  () GS:939b7d34()
knlGS:
[173774.310107] CS:  0010 DS:  ES:  CR0: 80050033
[173774.310107] CR2: 7f4f64008838 CR3: 000643baa000 CR4:
003406e0
[173774.310110] note: kworker/13:2[128214] exited with preempt_count 1


With amd-staging-drm-next:

commit 20d6b9c3b7f40ec427af912d140f2be0de098d2d (origin/amd-staging-drm-next)
Author: Gustavo A. R. Silva 
Date:   Mon Jul 22 12:47:16 2019 -0500

 drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning

with a Vega10.

Is this a known issue?

Thanks,
Bas
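
For anyone triaging a report like the one above, the two most useful pieces are the "kernel BUG at" location and the call trace. The following is a minimal sketch of pulling both out of a dmesg excerpt; parse_oops and its regular expressions are illustrative helpers written for this log format, not an existing tool. Frames prefixed with "?" are the ones the stack unwinder is not sure about, so the sketch flags them as unreliable.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?(?P<func>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)

# Excerpt of the log from this thread, trimmed to the relevant lines.
SAMPLE = """\
[173774.309970] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:202!
[173774.310013] Call Trace:
[173774.310019]  ttm_bo_cleanup_refs+0x160/0x1e0 [ttm]
[173774.310025]  ttm_bo_delayed_delete+0xa8/0x1e0 [ttm]
[173774.310029]  ttm_bo_delayed_workqueue+0x17/0x40 [ttm]
[173774.310033]  process_one_work+0x1fd/0x430
[173774.310036]  worker_thread+0x2d/0x3d0
[173774.310038]  ? process_one_work+0x430/0x430
[173774.310040]  kthread+0x112/0x130
[173774.310042]  ? kthread_create_on_node+0x60/0x60
[173774.310045]  ret_from_fork+0x22/0x40
[173774.310048] Modules linked in: fuse nct6775 hwmon_vid
"""


def parse_oops(text):
    """Return the BUG location and the call-trace frames found in text."""
    bug = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        match = BUG_RE.search(line)
        if match and bug is None:
            bug = (match.group("file"), int(match.group("line")))
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.match(line)
            if frame is None:
                in_trace = False  # first non-frame line ends the trace
                continue
            frames.append({
                "func": frame.group("func"),
                "module": frame.group("module"),
                "reliable": frame.group("guess") is None,
            })
    return {"bug": bug, "frames": frames}


if __name__ == "__main__":
    result = parse_oops(SAMPLE)
    print("BUG at", result["bug"])
    for f in result["frames"]:
        mark = " " if f["reliable"] else "?"
        print(mark, f["func"], f["module"] or "")
```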
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

