Jeff Hammond <jeff.scie...@gmail.com> writes:

> When this was written, I was convinced that Dursi was wrong about
> everything because one of the key arguments against MPI was
> fault-intolerance, which I was sure was going to be solved soon. However,
> LLNL has done everything in their power to torpedo MPI fault-tolerance in
> MPI-4 for the past 3+ years and I am no longer optimistic about MPI's
> ability to grow outside of traditional HPC because of the forum's inability
> to take fault-tolerance seriously. It's also unclear that we can get by
> without it in a post-exascale world.

Have you seen any MPI FT proposals that would actually enable a collective library or application to meaningfully "recover"? It seems to me that in-memory checkpointing combined with process-based FT is more practical.

For example, you could have a Spark or Spark-like system that manages distributed in-memory data (perhaps standardizing on Arrow), launches MPI jobs to access that data in place, and coordinates the distributed replication so that a fresh MPI job could restart on a (possibly) different group of nodes after the previous MPI job crashes. In such a system, you would need to implement transactional updates to the in-memory checkpoint data, but you would need no special recovery support from MPI, only that a crash eventually becomes collective (every rank terminates) rather than leaving the job deadlocked.
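
To make the "transactional updates" point concrete, here is a minimal single-process sketch in Python. It is only an illustration under assumptions of mine: CheckpointStore, Transaction, JobCrashed, and run_job are hypothetical names, a plain dict stands in for the replicated in-memory data, and no MPI, Spark, or Arrow calls are involved. The idea it shows is that a job stages its updates and publishes them all at once, so a crash partway through a step leaves the last committed checkpoint intact for a fresh job to resume from.

```python
import copy


class JobCrashed(Exception):
    """Stands in for an MPI job dying partway through a step."""


class CheckpointStore:
    """Hypothetical in-memory checkpoint store with all-or-nothing commits."""

    def __init__(self, initial):
        self._committed = dict(initial)  # last consistent snapshot
        self._version = 0                # incremented once per commit

    def begin(self):
        return Transaction(self)

    def read(self):
        # Hand out a copy so a job cannot modify committed data in place.
        return self._version, copy.deepcopy(self._committed)


class Transaction:
    """Stages updates; nothing becomes visible until commit()."""

    def __init__(self, store):
        self._store = store
        self._staged = {}

    def put(self, key, value):
        self._staged[key] = value

    def commit(self):
        # Build the new snapshot off to the side, then publish it with a
        # single reference assignment and bump the version.
        merged = dict(self._store._committed)
        merged.update(self._staged)
        self._store._committed = merged
        self._store._version += 1


def run_job(store, steps, crash_at=None):
    """Resume from the last committed checkpoint and advance to `steps`."""
    _, state = store.read()
    for step in range(state["step"], steps):
        txn = store.begin()
        txn.put("x", state["x"] + 1)
        if crash_at == step:
            # Crash after staging part of the update but before commit:
            # the staged value is simply discarded.
            raise JobCrashed(f"crashed during step {step}")
        txn.put("step", step + 1)
        txn.commit()
        _, state = store.read()
    return state
```

A driver would call run_job, catch JobCrashed, and call run_job again on the same store; the second call picks up at the last committed step. In a real system the store would be replicated across nodes and the commit would have to be atomic across replicas, which this sketch does not attempt.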