Re: [Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
Hi Chris,

On Thu, 2020-05-28 at 14:41 +0100, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-05-28 14:34:42)
> > On 28/05/2020 13:10, Janusz Krzysztofik wrote:
> > > Hi Tvrtko,
> > >
> > > On Thu, 2020-05-28 at 11:14 +0100, Tvrtko Ursulin wrote:
> > > > On 18/05/2020 19:17, Janusz Krzysztofik wrote:
> > > > > Contexts associated with open device file descriptors together with
> > > > > their assigned address spaces are now closed on device file close.  On
> > > >
> > > > i915_gem_driver_remove looks like module unload to me, not device file
> > > > close. So..
> > >
> > > Not only module unload ...
> > >
> > > > > address space closure its associated DMA mappings are revoked.  If the
> > > > > device is removed while being open, subsequent attempts to revoke
> > > > > those mappings while closing the device file descriptor may be
> > > > > judged by intel-iommu code as a bug and result in kernel panic.
> > > > >
> > > > > Since user contexts become useless after the device is no longer
> > > > > available, drop them on device removal.
> > > > >
> > > > > <4> [36.900985] [ cut here ]
> > > > > <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
> > > > > [remainder of oops log snipped; the full log is in the original patch below]
> > > > >
> > > > > Signed-off-by: Janusz Krzysztofik
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
> > > > >  drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
> > > > >  drivers/gpu/drm/i915/i915_gem.c             |  2 ++
> > > > >  3 files changed, 41 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > > index 900ea8b7fc8f..0096a69fbfd3 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > > > @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
> > > > >  	rcu_barrier(); /* and flush the left over RCU frees */
> > > > >  }
> > > > >
> > > > > +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
> > > > > +{
> > > > > +	struct i915_gem_context *ctx, *cn;
> > > > > +
> > > > > +	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
> > > > > +		struct drm_i915_file_p
Re: [Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
Quoting Tvrtko Ursulin (2020-05-28 14:34:42)
> On 28/05/2020 13:10, Janusz Krzysztofik wrote:
> > Hi Tvrtko,
> >
> > On Thu, 2020-05-28 at 11:14 +0100, Tvrtko Ursulin wrote:
> >> On 18/05/2020 19:17, Janusz Krzysztofik wrote:
> >>> Contexts associated with open device file descriptors together with
> >>> their assigned address spaces are now closed on device file close.  On
> >>
> >> i915_gem_driver_remove looks like module unload to me, not device file
> >> close. So..
> >
> > Not only module unload ...
> >
> >>> address space closure its associated DMA mappings are revoked.  If the
> >>> device is removed while being open, subsequent attempts to revoke
> >>> those mappings while closing the device file descriptor may be
> >>> judged by intel-iommu code as a bug and result in kernel panic.
> >>>
> >>> Since user contexts become useless after the device is no longer
> >>> available, drop them on device removal.
> >>>
> >>> <4> [36.900985] [ cut here ]
> >>> <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
> >>> [remainder of oops log snipped; the full log is in the original patch below]
> >>>
> >>> Signed-off-by: Janusz Krzysztofik
> >>> ---
> >>>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
> >>>  drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
> >>>  drivers/gpu/drm/i915/i915_gem.c             |  2 ++
> >>>  3 files changed, 41 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> >>> index 900ea8b7fc8f..0096a69fbfd3 100644
> >>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> >>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> >>> @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
> >>>  	rcu_barrier(); /* and flush the left over RCU frees */
> >>>  }
> >>>
> >>> +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
> >>> +{
> >>> +	struct i915_gem_context *ctx, *cn;
> >>> +
> >>> +	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
> >>> +		struct drm_i915_file_private *file_priv = ctx->file_priv;
> >>> +		struct i915_gem_context *entry;
> >>> +		unsigned long int id;
> >>> +
> >>> +		if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
> >>> +			continue;
> >>> +
> >>> +		xa_for_each(&file_priv->context_xa, id, entry) {
> >>
> >> ... how is driver unload possible with open drm file descriptors, or
> >> active contexts?
> >
> > ... but also
Re: [Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
On 28/05/2020 13:10, Janusz Krzysztofik wrote:
> Hi Tvrtko,
>
> On Thu, 2020-05-28 at 11:14 +0100, Tvrtko Ursulin wrote:
>> On 18/05/2020 19:17, Janusz Krzysztofik wrote:
>>> Contexts associated with open device file descriptors together with
>>> their assigned address spaces are now closed on device file close.  On
>>
>> i915_gem_driver_remove looks like module unload to me, not device file
>> close. So..
>
> Not only module unload ...
>
>>> address space closure its associated DMA mappings are revoked.  If the
>>> device is removed while being open, subsequent attempts to revoke
>>> those mappings while closing the device file descriptor may be
>>> judged by intel-iommu code as a bug and result in kernel panic.
>>>
>>> Since user contexts become useless after the device is no longer
>>> available, drop them on device removal.
>>>
>>> <4> [36.900985] [ cut here ]
>>> <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
>>> [remainder of oops log snipped; the full log is in the original patch below]
>>>
>>> Signed-off-by: Janusz Krzysztofik
>>> ---
>>>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
>>>  drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
>>>  drivers/gpu/drm/i915/i915_gem.c             |  2 ++
>>>  3 files changed, 41 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> index 900ea8b7fc8f..0096a69fbfd3 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
>>>  	rcu_barrier(); /* and flush the left over RCU frees */
>>>  }
>>>
>>> +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
>>> +{
>>> +	struct i915_gem_context *ctx, *cn;
>>> +
>>> +	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
>>> +		struct drm_i915_file_private *file_priv = ctx->file_priv;
>>> +		struct i915_gem_context *entry;
>>> +		unsigned long int id;
>>> +
>>> +		if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
>>> +			continue;
>>> +
>>> +		xa_for_each(&file_priv->context_xa, id, entry) {
>>
>> ... how is driver unload possible with open drm file descriptors, or
>> active contexts?
>
> ... but also PCI driver unbind or PCI device remove, with the module
> still loaded.  That may perfectly happen even if a device file
> descriptor is still kept open.

I see.  What do we do, or plan to do, with those left open drm fds after
the driver is unbound from the device?

Is the plan to keep saying -ENODEV (or is a different errno standard for
this case, like ENXIO?) from that point onward for everything done with
that fd, so userspace couldn't do anything more with it, attempt to
create a new context etc.?  Is the DRM core handling this?

Regards,

Tvrtko

>> If something i
Re: [Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
Hi Tvrtko,

On Thu, 2020-05-28 at 11:14 +0100, Tvrtko Ursulin wrote:
> On 18/05/2020 19:17, Janusz Krzysztofik wrote:
> > Contexts associated with open device file descriptors together with
> > their assigned address spaces are now closed on device file close.  On
>
> i915_gem_driver_remove looks like module unload to me, not device file
> close. So..

Not only module unload ...

> > address space closure its associated DMA mappings are revoked.  If the
> > device is removed while being open, subsequent attempts to revoke
> > those mappings while closing the device file descriptor may be
> > judged by intel-iommu code as a bug and result in kernel panic.
> >
> > Since user contexts become useless after the device is no longer
> > available, drop them on device removal.
> >
> > <4> [36.900985] [ cut here ]
> > <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
> > [remainder of oops log snipped; the full log is in the original patch below]
> >
> > Signed-off-by: Janusz Krzysztofik
> > ---
> >  drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
> >  drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
> >  drivers/gpu/drm/i915/i915_gem.c             |  2 ++
> >  3 files changed, 41 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index 900ea8b7fc8f..0096a69fbfd3 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
> >  	rcu_barrier(); /* and flush the left over RCU frees */
> >  }
> >
> > +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
> > +{
> > +	struct i915_gem_context *ctx, *cn;
> > +
> > +	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
> > +		struct drm_i915_file_private *file_priv = ctx->file_priv;
> > +		struct i915_gem_context *entry;
> > +		unsigned long int id;
> > +
> > +		if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
> > +			continue;
> > +
> > +		xa_for_each(&file_priv->context_xa, id, entry) {
>
> ... how is driver unload possible with open drm file descriptors, or
> active contexts?

... but also PCI driver unbind or PCI device remove, with the module
still loaded.  That may perfectly happen even if a device file
descriptor is still kept open.

> If something is going wrong sounds like something else.

I think we might consider that "something" as intel-iommu code, but see
also the last paragraph of my response be
Re: [Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
On 18/05/2020 19:17, Janusz Krzysztofik wrote:
> Contexts associated with open device file descriptors together with
> their assigned address spaces are now closed on device file close.  On

i915_gem_driver_remove looks like module unload to me, not device file
close. So..

> address space closure its associated DMA mappings are revoked.  If the
> device is removed while being open, subsequent attempts to revoke
> those mappings while closing the device file descriptor may be
> judged by intel-iommu code as a bug and result in kernel panic.
>
> Since user contexts become useless after the device is no longer
> available, drop them on device removal.
>
> <4> [36.900985] [ cut here ]
> <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
> [remainder of oops log snipped; the full log is in the original patch below]
>
> Signed-off-by: Janusz Krzysztofik
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
>  drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
>  drivers/gpu/drm/i915/i915_gem.c             |  2 ++
>  3 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 900ea8b7fc8f..0096a69fbfd3 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
>  	rcu_barrier(); /* and flush the left over RCU frees */
>  }
>
> +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_context *ctx, *cn;
> +
> +	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
> +		struct drm_i915_file_private *file_priv = ctx->file_priv;
> +		struct i915_gem_context *entry;
> +		unsigned long int id;
> +
> +		if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
> +			continue;
> +
> +		xa_for_each(&file_priv->context_xa, id, entry) {

... how is driver unload possible with open drm file descriptors, or
active contexts?  If something is going wrong sounds like something
else.

drm postclose -> i915_gem_context_close -> closes all contexts and puts
all vm.  What can remain dangling?  An active context?  But there is
idling via i915_gem_driver_remove -> i915_gem_suspend_late.

Regards,

Tvrtko

> +			struct i915_address_space *vm;
> +			unsigned long int idx;
> +
> +			if (entry != ctx)
> +				continue;
> +
> +			xa_erase(&file_priv->context_xa, id);
> +
> +			if (id)
> +				break;
> +
> +			xa_for_each(&file_priv->vm_xa, idx, vm) {
> +				xa_erase(&file_priv->vm_xa, idx);
> +
Re: [Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
Quoting Janusz Krzysztofik (2020-05-18 20:17:17)
> Contexts associated with open device file descriptors together with
> their assigned address spaces are now closed on device file close.  On
> address space closure its associated DMA mappings are revoked.  If the
> device is removed while being open, subsequent attempts to revoke
> those mappings while closing the device file descriptor may be
> judged by intel-iommu code as a bug and result in kernel panic.
>
> Since user contexts become useless after the device is no longer
> available, drop them on device removal.
>
> <4> [36.900985] [ cut here ]
> <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
> [remainder of oops log snipped; the full log is in the original patch below]
>
> Signed-off-by: Janusz Krzysztofik
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
>  drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
>  drivers/gpu/drm/i915/i915_gem.c             |  2 ++
>  3 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 900ea8b7fc8f..0096a69fbfd3 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
> 	rcu_barrier(); /* and flush the left over RCU frees */
> }
>
> +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_context *ctx, *cn;
> +
> +	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {

You're not removing ctx from gem.contexts.list inside this loop.

> +		struct drm_i915_file_private *file_priv = ctx->file_priv;
> +		struct i915_gem_context *entry;
> +		unsigned long int id;
> +
> +		if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
> +			continue;
> +
> +		xa_for_each(&file_priv->context_xa, id, entry) {

We're iterating over contexts?  I thought we were already doing that by
going over i915->gem.contexts.list.

> +			struct i915_address_space *vm;
> +			unsigned long int idx;
> +
> +			if (entry != ctx)
> +				continue;
> +
> +			xa_erase(&file_priv->context_xa, id);
> +
> +			if (id)
> +				break;

Ok... So we're exiting early for !default contexts?

> +
> +			xa_for_each(&file_priv->vm_xa, idx, vm) {
> +				xa_erase(&file_priv->vm_xa, idx);
> +				i915_vm_put(vm);
>
[Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
Contexts associated with open device file descriptors together with
their assigned address spaces are now closed on device file close.  On
address space closure its associated DMA mappings are revoked.  If the
device is removed while being open, subsequent attempts to revoke
those mappings while closing the device file descriptor may be
judged by intel-iommu code as a bug and result in kernel panic.

Since user contexts become useless after the device is no longer
available, drop them on device removal.

<4> [36.900985] [ cut here ]
<2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
<4> [36.901105] invalid opcode: [#1] PREEMPT SMP NOPTI
<4> [36.901117] CPU: 0 PID: 39 Comm: kworker/u8:1 Tainted: G U W 5.7.0-rc5-CI-CI_DRM_8485+ #1
<4> [36.901133] Hardware name: Intel Corporation Elkhart Lake Embedded Platform/ElkhartLake LPDDR4x T3 CRB, BIOS EHLSFWI1.R00.1484.A00.1911290833 11/29/2019
<4> [36.901250] Workqueue: i915 __i915_vm_release [i915]
<4> [36.901264] RIP: 0010:intel_unmap+0x1f5/0x230
<4> [36.901274] Code: 01 e8 9f bc a9 ff 85 c0 74 09 80 3d df 60 09 01 00 74 19 65 ff 0d 13 12 97 7e 0f 85 fc fe ff ff e8 82 b0 95 ff e9 f2 fe ff ff <0f> 0b e8 d4 bd a9 ff 85 c0 75 de 48 c7 c2 10 84 2c 82 be 54 00 00
<4> [36.901302] RSP: 0018:c91ebdc0 EFLAGS: 00010246
<4> [36.901313] RAX: RBX: 8882561dd000 RCX:
<4> [36.901324] RDX: 1000 RSI: ffd9c000 RDI: 888274c94000
<4> [36.901336] RBP: 888274c940b0 R08: R09: 0001
<4> [36.901348] R10: 0a25d812 R11: 112af2d4 R12: 888252c70200
<4> [36.901360] R13: ffd9c000 R14: 1000 R15: 8882561dd010
<4> [36.901372] FS: () GS:88827800() knlGS:
<4> [36.901386] CS: 0010 DS: ES: CR0: 80050033
<4> [36.901396] CR2: 7f06def54950 CR3: 000255844000 CR4: 00340ef0
<4> [36.901408] Call Trace:
<4> [36.901418] ? process_one_work+0x1de/0x600
<4> [36.901494] cleanup_page_dma+0x37/0x70 [i915]
<4> [36.901573] free_pd+0x9/0x20 [i915]
<4> [36.901644] gen8_ppgtt_cleanup+0x59/0xc0 [i915]
<4> [36.901721] __i915_vm_release+0x14/0x30 [i915]
<4> [36.901733] process_one_work+0x268/0x600
<4> [36.901744] ? __schedule+0x307/0x8d0
<4> [36.901756] worker_thread+0x37/0x380
<4> [36.901766] ? process_one_work+0x600/0x600
<4> [36.901775] kthread+0x140/0x160
<4> [36.901783] ? kthread_park+0x80/0x80
<4> [36.901792] ret_from_fork+0x24/0x50
<4> [36.901804] Modules linked in: mei_hdcp i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ax88179_178a usbnet mii mei_me mei prime_numbers intel_lpss_pci
<4> [36.901857] ---[ end trace 52d1b4d81f8d1ea7 ]---

Signed-off-by: Janusz Krzysztofik
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +
 drivers/gpu/drm/i915/gem/i915_gem_context.h |  1 +
 drivers/gpu/drm/i915/i915_gem.c             |  2 ++
 3 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 900ea8b7fc8f..0096a69fbfd3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
 	rcu_barrier(); /* and flush the left over RCU frees */
 }
 
+void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
+{
+	struct i915_gem_context *ctx, *cn;
+
+	list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
+		struct drm_i915_file_private *file_priv = ctx->file_priv;
+		struct i915_gem_context *entry;
+		unsigned long int id;
+
+		if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
+			continue;
+
+		xa_for_each(&file_priv->context_xa, id, entry) {
+			struct i915_address_space *vm;
+			unsigned long int idx;
+
+			if (entry != ctx)
+				continue;
+
+			xa_erase(&file_priv->context_xa, id);
+
+			if (id)
+				break;
+
+			xa_for_each(&file_priv->vm_xa, idx, vm) {
+				xa_erase(&file_priv->vm_xa, idx);
+				i915_vm_put(vm);
+			}
+
+			break;
+		}
+
+		context_close(ctx);
+	}
+
+	i915_gem_driver_release__contexts(i915);
+}
+
 static int gem_context_register(struct i915_gem_context *ctx,
 				struct drm_i915_file_private *fpriv,
 				u32 *id)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
index 3702b2fb27ab..62808