Hello,

I've noticed that ompi-restart doesn't support the --rankfile option.
It only supports --hostfile/--machinefile. Is there any reason
--rankfile isn't supported?

Suppose you have a cluster without a shared file system. When one node
fails, you transfer its checkpoint to a spare node and invoke
ompi-restart. In 1.5, ompi-restart automagically handles this
situation (if you supply a hostfile) and is able to restart the
process, but I'm afraid it might not always be able to find the
checkpoints this way. If you could specify to ompi-restart where the
ranks are (and thus where the checkpoints are), then maybe restart
would always work (as long as you've specified the location of
the checkpoints correctly), or maybe ompi-restart would be faster.



Regards,
