CRIU is a user space tool that is widely used for container live migration in
datacentres. It can checkpoint a running application, saving its complete
state, memory contents and all system resources to images on disk, which can
then be migrated to another machine and restored later. More information on
CRIU can be found at https://criu.org/Main_Page

CRIU currently does not support checkpoint / restore of applications that have
device files open, so it cannot checkpoint and restore GPU devices, which are
very complex and manage their own VRAM privately. CRIU can, however, support
external devices through its plugin architecture. This patch series adds
initial support for ROCm applications while we work on the remaining features.
We welcome feedback, especially regarding the APIs, before involving a larger
audience.

Our plugin code can be found at 
https://github.com/RadeonOpenCompute/criu/tree/criu-dev/plugins/amdgpu

We have tested the following scenarios:
- Checkpoint / Restore of a Pytorch (BERT) workload
- kfdtests with queues and events
- Gfx9 and Gfx10 based multi GPU test systems
- On baremetal and inside a docker container
- Restoring on a different system

V2: Addressed review comments

David Yat Sin (9):
  drm/amdkfd: CRIU Implement KFD pause ioctl
  drm/amdkfd: CRIU add queues support
  drm/amdkfd: CRIU restore queue ids
  drm/amdkfd: CRIU restore sdma id for queues
  drm/amdkfd: CRIU restore queue doorbell id
  drm/amdkfd: CRIU dump and restore queue mqds
  drm/amdkfd: CRIU dump/restore queue control stack
  drm/amdkfd: CRIU dump and restore events
  drm/amdkfd: CRIU implement gpu_id remapping

Rajneesh Bhardwaj (9):
  x86/configs: CRIU update release defconfig
  x86/configs: CRIU update debug rock defconfig
  drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
  drm/amdkfd: CRIU Implement KFD process_info ioctl
  drm/amdkfd: CRIU Implement KFD dumper ioctl
  drm/amdkfd: CRIU Implement KFD restore ioctl
  drm/amdkfd: CRIU Implement KFD resume ioctl
  Revert "drm/amdgpu: Remove verify_access shortcut for KFD BOs"
  drm/amdkfd: CRIU export kfd bos as prime dmabuf objects

 arch/x86/configs/rock-dbg_defconfig           |   53 +-
 arch/x86/configs/rock-rel_defconfig           |   13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    5 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   51 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   27 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1170 ++++++++++++++---
 drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  187 ++-
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |   14 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  323 ++++-
 drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   11 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   76 ++
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   78 ++
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   86 ++
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   77 ++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  138 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   69 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |  475 ++++++-
 include/uapi/linux/kfd_ioctl.h                |  221 +++-
 20 files changed, 2815 insertions(+), 263 deletions(-)

-- 
2.17.1
