Hi everyone!

My team and I are working on checkpointing a process and restarting it on
another node. We are using the CRIU framework for the checkpoint/restart
part, but we are facing some issues related to migration.

First of all, we found out that some attempts to C/R an OMPI process have
already been made in the past. Is anything related to that still
supported/available/working?

Then, we need to know which network communications are used at any time, in
order to "pause" them during migrations (at least the ones involving the
migrating node). From our code analysis, we believe that:
- The OpenMPI runtime (HNP<->orteds) uses orte/OOB
- Running applications exchange data via ompi/BTL

Is that correct? If not, can someone give us a hint?

We may follow up later with questions on how to update topology info.

Thank you guys!

Gianmario
