Device-to-device migration is causing xe_exec_system_allocator --r
*race*no* to intermittently fail with engine resets and a kernel hang on
a page lock. This should work but is clearly buggy somewhere. Disable
device-to-device migration in the interim until the issue can be
root-caused.

The only downside of disabling device-to-device migration is that memory
will bounce through system memory during migration. However, this path
should be rare, as it only occurs when madvise attributes are changed or
atomics are used.

Cc: Thomas Hellström <[email protected]>
Fixes: ec265e1f1cfc ("drm/pagemap: Support source migration over interconnect")
Signed-off-by: Matthew Brost <[email protected]>
---
 drivers/gpu/drm/drm_pagemap.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index aa43a8475100..03ee39a761a4 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -480,8 +480,18 @@ int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
 		.start = start,
 		.end = end,
 		.pgmap_owner = pagemap->owner,
-		.flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
-			MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		/*
+		 * FIXME: MIGRATE_VMA_SELECT_DEVICE_PRIVATE intermittently
+		 * causes 'xe_exec_system_allocator --r *race*no*' to trigger an
+		 * engine reset and a hard hang due to getting stuck on a folio
+		 * lock. This should work and needs to be root-caused. The only
+		 * downside of not selecting MIGRATE_VMA_SELECT_DEVICE_PRIVATE
+		 * is that device-to-device migrations won’t work; instead,
+		 * memory will bounce through system memory. This path should be
+		 * rare and only occur when the madvise attributes of memory are
+		 * changed or atomics are being used.
+		 */
+		.flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT,
 	};
 	unsigned long i, npages = npages_in_range(start, end);
 	unsigned long own_pages = 0, migrated_pages = 0;
-- 
2.34.1
