Our checkpoint/restart (c/r) person has moved on to a different career path, 
so we may not have anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a 
losing proposition. It just takes too long and hammers the file system. People 
have been working on extending the capability with things like "burst buffers" 
(basically putting an SSD in front of the file system to absorb the checkpoint 
surge), but that hasn't become very common yet.

Frankly, what people have found to be the "best" solution is for your app to 
periodically write out its intermediate results, and to accept a flag at 
startup that tells it to read the prior results and resume from them. This 
minimizes the amount of data written to disk, since only the state needed to 
resume is saved rather than the full memory image of every process. If done 
correctly, you would only lose whatever work was done since the last 
intermediate result was written - which is about equivalent to losing whatever 
work was done since the last checkpoint.
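
To make that concrete, here is a minimal sketch of the pattern in Python. It is 
only an illustration under assumptions, not anything Open MPI provides: the 
names (run, save_state, load_state, the --restart and --state flags) and the 
JSON state format are all hypothetical, and the "work" is a stand-in loop. In a 
real MPI job each rank would presumably write its own file, or one rank would 
gather and write.

```python
import argparse
import json
import os
import tempfile


def save_state(path, state):
    """Write state to a temp file in the same directory, then rename it.

    The rename is atomic on POSIX file systems, so a crash mid-write
    leaves the previous intermediate result intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_state(path):
    """Read back the last intermediate result."""
    with open(path) as f:
        return json.load(f)


def run(total_steps, save_every, path, restart=False):
    """Do total_steps units of work, saving state every save_every steps."""
    if restart and os.path.exists(path):
        state = load_state(path)          # resume from prior results
    else:
        state = {"step": 0, "total": 0}   # fresh start
    while state["step"] < total_steps:
        state["total"] += state["step"]   # stand-in for the real computation
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_state(path, state)
    save_state(path, state)               # final result
    return state


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--restart", action="store_true",
                        help="read prior results and resume from them")
    parser.add_argument("--state", default="intermediate.json",
                        help="file holding the intermediate results")
    parser.add_argument("--steps", type=int, default=1000)
    parser.add_argument("--save-every", type=int, default=100)
    args = parser.parse_args(argv)
    return run(args.steps, args.save_every, args.state, args.restart)

# To use this as a script, call main() under an
# `if __name__ == "__main__":` guard.
```

If the job dies, relaunching with --restart picks up at the last saved step, so 
at most save_every steps of work are repeated. Only the small state dictionary 
hits the file system, not the full memory image of each process.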

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> Hello,
> I am doing checkpointing tests (with BLCR) with an MPI application compiled 
> with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.
> 
> First, some details about the tests :
> - The only filesystems available on the nodes are 1) one tmpfs, 2) one lustre 
> shared filesystem (tested to be able to provide ~15GB/s for writing and 
> support ~40k IOPs).
> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
> nodes). Each MPI rank was using approximately 200MB of memory.
> - I was doing checkpoints with ompi-checkpoint and restarting with 
> ompi-restart.
> - I was starting with mpirun -am ft-enable-cr
> - The nodes are monitored by ganglia, which allows me to see the number of 
> IOPs and the read/write speed on the filesystem.
> 
> I tried a few different mca settings, but I consistently observed that :
> - The checkpoints lasted ~4-5 minutes each time
> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at 
> ~15MB/s.
> 
> I am worried by the number of IOPs and the very slow writing speed. This was 
> a very small test. We have jobs running with 128 or 256 MPI ranks, each using 
> 1-2 GB of ram per rank. With such jobs, the overall number of IOPs would 
> reach tens of thousands and would completely overload our lustre filesystem. 
> Moreover, with 15MB/s per node, the checkpointing process would take hours.
> 
> How can I improve on that ? Is there an MCA setting that I am missing ?
> 
> Thanks,
> 
> -- 
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

