Hi everyone! My team and I are looking into checkpointing a process and restarting it on another node. We are using the CRIU framework for the checkpoint/restart part, but we are running into some issues related to migration.
First of all, we found out that some attempts to C/R an OMPI process have already been made in the past. Is anything related to that still supported, available, or working?

Second, we need to know which network communications are in use at any given time, so that we can "pause" them during a migration (at least the ones involving the migrating node). Our code analysis makes us think that:
- The OpenMPI runtime (HNP<->orteds) uses orte/OOB
- Running applications exchange data via ompi/BTL

Is that correct? If not, can someone give us a hint?

Questions on how to update the topology info may follow later.

Thank you guys!
Gianmario
