[Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.

This change introduces a regression on Alder Lake that completely
blocks testing. To enable CI and avoid possible circular locking
warning, revert the patch.

kernel log:

======================================================
WARNING: possible circular locking dependency detected
6.0.0-rc5-CI_DRM_12132-g6c93e979e542+ #1 Not tainted
------------------------------------------------------
cpuhp/0/15 is trying to acquire lock:
8881013df278 (&(&priv->bus_notifier)->rwsem){}-{3:3}, at: blocking_notifier_call_chain+0x20/0x50

but task is already holding lock:
826490c0 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x48/0x1f0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (cpuhp_state-up){+.+.}-{0:0}:
       lock_acquire+0xd3/0x310
       cpuhp_thread_fun+0xa6/0x1f0
       smpboot_thread_fn+0x1b5/0x260
       kthread+0xed/0x120
       ret_from_fork+0x1f/0x30

-> #2 (cpu_hotplug_lock){}-{0:0}:
       lock_acquire+0xd3/0x310
       __cpuhp_state_add_instance+0x43/0x1c0
       iova_domain_init_rcaches+0x199/0x1c0
       iommu_setup_dma_ops+0x130/0x440
       bus_iommu_probe+0x26a/0x2d0
       bus_set_iommu+0x82/0xd0
       intel_iommu_init+0xe33/0x1039
       pci_iommu_init+0x9/0x31
       do_one_initcall+0x53/0x2f0
       kernel_init_freeable+0x18f/0x1e1
       kernel_init+0x11/0x120
       ret_from_fork+0x1f/0x30

-> #1 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
       lock_acquire+0xd3/0x310
       __mutex_lock+0x97/0xf10
       iommu_setup_dma_ops+0xd7/0x440
       iommu_probe_device+0xa4/0x180
       iommu_bus_notifier+0x2d/0x40
       notifier_call_chain+0x31/0x90
       blocking_notifier_call_chain+0x3a/0x50
       device_add+0x3c1/0x900
       pci_device_add+0x255/0x580
       pci_scan_single_device+0xa6/0xd0
       pci_scan_slot+0x7a/0x1b0
       pci_scan_child_bus_extend+0x35/0x2a0
       vmd_probe+0x5cd/0x970
       pci_device_probe+0x95/0x110
       really_probe+0xd6/0x350
       __driver_probe_device+0x73/0x170
       driver_probe_device+0x1a/0x90
       __driver_attach+0xbc/0x190
       bus_for_each_dev+0x72/0xc0
       bus_add_driver+0x1bb/0x210
       driver_register+0x66/0xc0
       do_one_initcall+0x53/0x2f0
       kernel_init_freeable+0x18f/0x1e1
       kernel_init+0x11/0x120
       ret_from_fork+0x1f/0x30

-> #0 (&(&priv->bus_notifier)->rwsem){}-{3:3}:
       validate_chain+0xb3f/0x2000
       __lock_acquire+0x5a4/0xb70
       lock_acquire+0xd3/0x310
       down_read+0x39/0x140
       blocking_notifier_call_chain+0x20/0x50
       device_add+0x3c1/0x900
       platform_device_add+0x108/0x240
       coretemp_cpu_online+0xe1/0x15e [coretemp]
       cpuhp_invoke_callback+0x181/0x8a0
       cpuhp_thread_fun+0x188/0x1f0
       smpboot_thread_fn+0x1b5/0x260
       kthread+0xed/0x120
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

Chain exists of:
  &(&priv->bus_notifier)->rwsem --> cpu_hotplug_lock --> cpuhp_state-up

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(cpuhp_state-up);
                               lock(cpu_hotplug_lock);
                               lock(cpuhp_state-up);
  lock(&(&priv->bus_notifier)->rwsem);

 *** DEADLOCK ***

2 locks held by cpuhp/0/15:
 #0: 82648f10 (cpu_hotplug_lock){}-{0:0}, at: cpuhp_thread_fun+0x48/0x1f0
 #1: 826490c0 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x48/0x1f0

stack backtrace:
CPU: 0 PID: 15 Comm: cpuhp/0 Not tainted 6.0.0-rc5-CI_DRM_12132-g6c93e979e542+ #1
Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR4 RVP, BIOS ADLPFWI1.R00.3135.A00.2203251419 03/25/2022
Call Trace:
 dump_stack_lvl+0x56/0x7f
 check_noncircular+0x132/0x150
 validate_chain+0xb3f/0x2000
 __lock_acquire+0x5a4/0xb70
 lock_acquire+0xd3/0x310
 ? blocking_notifier_call_chain+0x20/0x50
 down_read+0x39/0x140
 ? blocking_notifier_call_chain+0x20/0x50
 blocking_notifier_call_chain+0x20/0x50
 device_add+0x3c1/0x900
 ? dev_set_name+0x4e/0x70
 platform_device_add+0x108/0x240
 coretemp_cpu_online+0xe1/0x15e [coretemp]
 ? create_core_data+0x550/0x550 [coretemp]
 cpuhp_invoke_callback+0x181/0x8a0
 cpuhp_thread_fun+0x188/0x1f0
 ? smpboot_thread_fn+0x1e/0x260
 smpboot_thread_fn+0x1b5/0x260
 ? sort_range+0x20/0x20
 kthread+0xed/0x120
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6641
Signed-off-by: Karolina Drobnik
Cc: Lucas De Marchi
---
 drivers/iommu/dma-iommu.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 17dd683b2fce..9616b473e4c7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iom
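For readers less familiar with lockdep output: the "Possible unsafe locking
scenario" section above is the classic AB-BA inversion, just stretched over
three locks. A minimal userspace sketch of the two-lock shape of that problem
is below; the lock names are placeholders standing in for cpuhp_state-up and
the bus-notifier rwsem, not the kernel objects, and the chain is collapsed to
two locks for illustration.

/* lock_inversion.c - minimal illustration of the AB-BA pattern lockdep
 * reports above. Names are placeholders, not the kernel locks.
 * Build: cc -pthread lock_inversion.c -o lock_inversion
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t state_up = PTHREAD_MUTEX_INITIALIZER;  /* "cpuhp_state-up" */
static pthread_mutex_t notifier = PTHREAD_MUTEX_INITIALIZER;  /* "bus_notifier rwsem" */

/* Thread A: holds state_up, then calls into a notifier chain. */
static void *hotplug_thread(void *arg)
{
        pthread_mutex_lock(&state_up);
        usleep(1000);                      /* widen the race window */
        pthread_mutex_lock(&notifier);     /* order: state_up -> notifier */
        printf("hotplug: got both locks\n");
        pthread_mutex_unlock(&notifier);
        pthread_mutex_unlock(&state_up);
        return NULL;
}

/* Thread B: holds the notifier lock, then waits on the hotplug state. */
static void *probe_thread(void *arg)
{
        pthread_mutex_lock(&notifier);
        usleep(1000);
        pthread_mutex_lock(&state_up);     /* order: notifier -> state_up */
        printf("probe: got both locks\n");
        pthread_mutex_unlock(&state_up);
        pthread_mutex_unlock(&notifier);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, hotplug_thread, NULL);
        pthread_create(&b, NULL, probe_thread, NULL);
        /* With unlucky timing each thread blocks forever on the other's
         * second lock - exactly the cycle lockdep warns about, except that
         * lockdep flags the conflicting ordering without needing the
         * deadlock to actually happen. */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}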
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
On Wed, Sep 14, 2022 at 02:40:45PM +0200, Karolina Drobnik wrote:
> This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.
>
> This change introduces a regression on Alder Lake that completely
> blocks testing. To enable CI and avoid possible circular locking
> warning, revert the patch.

We are already on rc5. Are iommu authors involved aware of this issue?
We could do this in our "for CI only" branch, but it's equally
important that this is fixed for 6.0

Cc'ing them.

thanks
Lucas De Marchi

> kernel log:
> [...]
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
On 14.09.2022 17:01, Lucas De Marchi wrote:
> On Wed, Sep 14, 2022 at 02:40:45PM +0200, Karolina Drobnik wrote:
>> This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.
>>
>> This change introduces a regression on Alder Lake that completely
>> blocks testing. To enable CI and avoid possible circular locking
>> warning, revert the patch.
>
> We are already on rc5. Are iommu authors involved aware of this issue?
> We could do this in our "for CI only" branch, but it's equally
> important that this is fixed for 6.0

I planned to reach out to them after merging this revert on the
"CI only" branch (hence the topic tag), with more justification.
And yes, I'm fully aware we're quite late in the cycle, so that's
also why I went with this patch first.

Many thanks,
Karolina

> Cc'ing them.
>
> thanks
> Lucas De Marchi
>
>> kernel log:
>> [...]
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
On 14.09.2022 17:54, Robin Murphy wrote:
> On 2022-09-14 16:01, Lucas De Marchi wrote:
>> On Wed, Sep 14, 2022 at 02:40:45PM +0200, Karolina Drobnik wrote:
>>> This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.
>>>
>>> This change introduces a regression on Alder Lake that completely
>>> blocks testing. To enable CI and avoid possible circular locking
>>> warning, revert the patch.
>>
>> We are already on rc5. Are iommu authors involved aware of this issue?
>> We could do this in our "for CI only" branch, but it's equally
>> important that this is fixed for 6.0
>>
>> Cc'ing them.
>
> The lockdep report doesn't make much sense to me - the deadlock cycle
> it's reporting doesn't even involve the mutex added by that commit,
> and otherwise the lock ordering between the IOMMU bus notifier(s) and
> cpu_hotplug_lock has existed for ages. Has lockdep somehow got
> multiple different and unrelated bus notifiers mixed up, maybe?
>
> FWIW nobody else has reported anything, and that mutex addresses a
> real-world concurrency issue, so I'm not convinced a revert is
> appropriate without at least a much clearer justification.

I'll share more background on this regression.

We've noticed that no tests were run for the Alder Lake platforms.
This may happen when, for example, there is a kernel taint or a
lockdep warning.

Links:
https://intel-gfx-ci.01.org/tree/drm-tip/bat-adlm-1.html
https://intel-gfx-ci.01.org/tree/drm-tip/bat-adlp-6.html

The CI logs (which can be found, for example, here[1], boot0 file)
revealed a lockdep warning. One of the recent changes in the area was
commit ac9a5d522bb8 ("iommu/dma: Fix race condition during iova_domain
initialization"), and I sent a revert patch to test it on CI[2]. This
proved to be effective, as the tests started running on the Alder Lake
platform:
https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_108474v1/index.html?hosts=adlp

To be clear, that revert is just a way of unblocking CI testing; the
problem requires a specific fix.

Lucas, would it be possible to merge this revert to the topic branch
to unblock Alder Lake until this issue is fixed? I'm afraid that some
regressions could slip through the cracks if we don't do it soon
enough.

Thanks,
Karolina

[1] - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12145/bat-adlm-1/igt@run...@aborted.html
[2] - https://patchwork.freedesktop.org/series/108474/

> Robin.
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
On Fri, Sep 16, 2022 at 02:24:00PM +0200, Karolina Drobnik wrote:
> On 14.09.2022 17:54, Robin Murphy wrote:
>> The lockdep report doesn't make much sense to me - the deadlock cycle
>> it's reporting doesn't even involve the mutex added by that commit,
>> and otherwise the lock ordering between the IOMMU bus notifier(s) and
>> cpu_hotplug_lock has existed for ages. [...]
>
> I'll share more background on this regression.
>
> [...]
>
> Lucas, would it be possible to merge this revert to the topic branch
> to unblock Alder Lake until this issue is fixed? I'm afraid that some
> regressions could slip through the cracks if we don't do it soon
> enough.

Yeah. Let's have CI running with the revert so we can see if on the
next runs it will really show it was a regression or if it's something
else. I think it will help us understand why it's failing.

Lucas De Marchi

> Thanks,
> Karolina
>
> [1] - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12145/bat-adlm-1/igt@run...@aborted.html
> [2] - https://patchwork.freedesktop.org/series/108474/
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
On 2022-09-14 16:01, Lucas De Marchi wrote:
> On Wed, Sep 14, 2022 at 02:40:45PM +0200, Karolina Drobnik wrote:
>> This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.
>>
>> This change introduces a regression on Alder Lake that completely
>> blocks testing. To enable CI and avoid possible circular locking
>> warning, revert the patch.
>
> We are already on rc5. Are iommu authors involved aware of this issue?
> We could do this in our "for CI only" branch, but it's equally
> important that this is fixed for 6.0
>
> Cc'ing them.

The lockdep report doesn't make much sense to me - the deadlock cycle
it's reporting doesn't even involve the mutex added by that commit, and
otherwise the lock ordering between the IOMMU bus notifier(s) and
cpu_hotplug_lock has existed for ages. Has lockdep somehow got multiple
different and unrelated bus notifiers mixed up, maybe?

FWIW nobody else has reported anything, and that mutex addresses a
real-world concurrency issue, so I'm not convinced a revert is
appropriate without at least a much clearer justification.

Robin.

> thanks
> Lucas De Marchi
>
>> kernel log:
>> [...]
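For context on the mutex Robin refers to: in the traces it appears as
&domain->iova_cookie->mutex, taken in iommu_setup_dma_ops() around
iova_domain_init_rcaches(), i.e. it serializes the one-time initialization
of a domain's IOVA caches so that two devices probing into the same domain
cannot both run it. A minimal userspace analogue of that "initialize once,
under a mutex" pattern is sketched below; the names and structure are
illustrative only, not the actual dma-iommu.c code.

/* init_once.c - userspace analogue of serialized one-time initialization.
 * Placeholder names; not the kernel implementation.
 * Build: cc -pthread init_once.c -o init_once
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct dma_cookie {
        pthread_mutex_t mutex;   /* stands in for iova_cookie->mutex */
        bool rcaches_ready;      /* "has the rcache init already run?" */
};

static int init_rcaches(struct dma_cookie *c)
{
        /* expensive one-time setup; must not run twice for one cookie */
        printf("initializing rcaches for cookie %p\n", (void *)c);
        return 0;
}

/* Called from every device probe that attaches to the same domain.
 * Without the mutex, two concurrent probes could both observe
 * rcaches_ready == false and both run init_rcaches() - the race the
 * original commit is described as fixing. */
static int setup_dma_ops(struct dma_cookie *c)
{
        int ret = 0;

        pthread_mutex_lock(&c->mutex);
        if (!c->rcaches_ready) {
                ret = init_rcaches(c);
                if (ret == 0)
                        c->rcaches_ready = true;
        }
        pthread_mutex_unlock(&c->mutex);
        return ret;
}

int main(void)
{
        struct dma_cookie cookie = { PTHREAD_MUTEX_INITIALIZER, false };

        /* two "probes" against the same cookie; the init runs exactly once */
        setup_dma_ops(&cookie);
        setup_dma_ops(&cookie);
        return 0;
}

The trade-off of taking such a mutex from the probe path is visible in the
dg2 report later in the thread, where the same cookie mutex shows up as the
lock being acquired while the bus-notifier rwsem is already held, which adds
a new edge to the ordering graph lockdep tracks.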
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
On 16.09.2022 22:32, Lucas De Marchi wrote:
> On Fri, Sep 16, 2022 at 02:24:00PM +0200, Karolina Drobnik wrote:
>> Lucas, would it be possible to merge this revert to the topic branch
>> to unblock Alder Lake until this issue is fixed? I'm afraid that some
>> regressions could slip through the cracks if we don't do it soon
>> enough.
>
> Yeah. Let's have CI running with the revert so we can see if on the
> next runs it will really show it was a regression or if it's something
> else. I think it will help us understand why it's failing.

Thanks for the merge. It looks like all adls are doing better now
(revert went in CI_DRM_12147):

https://intel-gfx-ci.01.org/tree/drm-tip/bat-adlp-6.html
https://intel-gfx-ci.01.org/tree/drm-tip/bat-adln-1.html
https://intel-gfx-ci.01.org/tree/drm-tip/bat-adlm-1.html
(CI_DRM_12149 seems to show a different problem)

All the best,
Karolina

> Lucas De Marchi
Re: [Intel-gfx] [topic/core-for-CI] Revert "iommu/dma: Fix race condition during iova_domain initialization"
Hi Robin,

On Wednesday, 14 September 2022 17:54:36 CEST Robin Murphy wrote:
> On 2022-09-14 16:01, Lucas De Marchi wrote:
>> On Wed, Sep 14, 2022 at 02:40:45PM +0200, Karolina Drobnik wrote:
>>> This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.
>>>
>>> This change introduces a regression on Alder Lake that completely
>>> blocks testing. To enable CI and avoid possible circular locking
>>> warning, revert the patch.
>>
>> We are already on rc5. Are iommu authors involved aware of this issue?
>> We could do this in our "for CI only" branch, but it's equally
>> important that this is fixed for 6.0
>>
>> Cc'ing them.
>
> The lockdep report doesn't make much sense to me - the deadlock cycle
> it's reporting doesn't even involve the mutex added by that commit, and
> otherwise the lock ordering between the IOMMU bus notifier(s) and
> cpu_hotplug_lock has existed for ages.

Indeed, that lockdep report is not quite related, but there are other
lockdep reports in our CI results that better justify the revert.

https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_595/bat-dg2-8/igt@core_hotunp...@unplug-rescan.html

...
<7> [181.565177] [IGT] core_hotunplug: unplugging the device
<7> [181.565372] i915 :03:00.0: [drm:intel_power_well_enable [i915]] enabling DC_off
<7> [181.565708] i915 :03:00.0: [drm:gen9_set_dc_state.part.15 [i915]] Setting DC state from 01 to 00
<7> [181.566060] i915 :03:00.0: [drm:intel_power_well_enable [i915]] enabling PW_2
<7> [181.566216] i915 :03:00.0: [drm:intel_power_well_enable [i915]] enabling PW_A
<7> [181.566447] i915 :03:00.0: [drm:intel_power_well_enable [i915]] enabling PW_B
<7> [181.566607] i915 :03:00.0: [drm:intel_power_well_enable [i915]] enabling PW_C
<7> [181.566765] i915 :03:00.0: [drm:intel_power_well_enable [i915]] enabling PW_D
<7> [181.570683] intel_gt_set_wedged called from intel_gt_set_wedged_on_fini+0x9/0x30 [i915]
<7> [181.659480] i915 :03:00.0: [drm:drm_client_release] drm_fb_helper
<6> [181.773310] pci :03:00.0: Removing from iommu group 1
<7> [181.774416] [IGT] core_hotunplug: rediscovering the device
<6> [181.775800] pci :03:00.0: [8086:56a0] type 00 class 0x03
<6> [181.775833] pci :03:00.0: reg 0x10: [mem 0x9000-0x90ff 64bit]
<6> [181.775852] pci :03:00.0: reg 0x18: [mem 0x40-0x43 64bit pref]
<6> [181.775886] pci :03:00.0: reg 0x30: [mem 0xffe0-0x pref]
<6> [181.776058] pci :03:00.0: ASPM: overriding L1 acceptable latency from 0x0 to 0x7
<6> [181.776073] pci :03:00.0: Video device with shadowed ROM at [mem 0x000c-0x000d]
<6> [181.776209] pci :03:00.0: PME# supported from D0 D3hot
<6> [181.776869] pci :03:00.0: vgaarb: setting as boot VGA device
<6> [181.776878] pci :03:00.0: vgaarb: bridge control possible
<6> [181.776881] pci :03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
<6> [181.777052] pci :03:00.0: Adding to iommu group 1
<4> [181.777164]
<4> [181.777169] ======================================================
<4> [181.777176] WARNING: possible circular locking dependency detected
<4> [181.777182] 6.0.0-rc5-CI_DRM_12145-g2dc9ea03abff+ #1 Not tainted
<4> [181.777189] ------------------------------------------------------
<4> [181.777196] core_hotunplug/1240 is trying to acquire lock:
<4> [181.777202] 8881029b33b0 (&domain->iova_cookie->mutex){+.+.}-{3:3}, at: iommu_setup_dma_ops+0xd7/0x440
<4> [181.777218] but task is already holding lock:
<4> [181.777225] 8881010c9e78 (&(&priv->bus_notifier)->rwsem){}-{3:3}, at: blocking_notifier_call_chain+0x20/0x50
<4> [181.777242] which lock already depends on the new lock.
<4> [181.777250] the existing dependency chain (in reverse order) is:
<4> [181.777258] -> #3 (&(&priv->bus_notifier)->rwsem){}-{3:3}:
<4> [181.777268]        lock_acquire+0xd3/0x310
<4> [181.777274]        down_read+0x39/0x140
<4> [181.777281]        blocking_notifier_call_chain+0x20/0x50
<4> [181.777289]        device_add+0x3c1/0x900
<4> [181.777295]        platform_device_add+0x108/0x240
<4> [181.777302]        coretemp_cpu_online+0xe1/0x15e [coretemp]
<4> [181.777310]        cpuhp_invoke_callback+0x181/0x8a0
<4> [181.777318]        cpuhp_thread_fun+0x188/0x1f0
<4> [181.777325]        smpboot_thread_fn+0x1b5/0x260
<4> [181.777332]        kthread+0xed/0x120
<4> [181.777337]        ret_from_fork+0x1f/0x30
<4> [181.777343] -> #2 (cpuhp_state-up){+.+.}-{0:0}:
<4> [181.777352]        lock_acquire+0xd3/0x310
<4> [181.777358]        cpuhp_thread_fun+0xa6/0x1f0
<4> [181.777364]        smpboot_thread_fn+0x1b5/0x260
<4> [181.777370]        kthread+0xed/0x120
<4> [181.777375]        ret_from_fork+0x1f/0x30
<4> [181.777381] -> #1 (cpu_hotplug_lock){}-{0:0}:
<4> [181.777390]        lock_acquire+0xd3/0x310
<4> [181.777395]        __cpuhp_state_add_instance+0x43/0x1c0
<4> [181.777402]        iova_domain_init_rcaches+0x199/0x1c0
<4> [181.777409]