On 2013-01-28 12:46, Ralph Castain wrote:
On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Hello Ralph,
I agree that ideally, someone would implement checkpointing in the application 
itself, but that is not always possible (commercial applications, use of 
complicated libraries, algorithms with no clear progression points at which you 
can interrupt the algorithm and restart it from there).
Hmmm...well, most apps can be adjusted to support it - we have some very 
complex apps that were updated that way. Commercial apps are another story, but 
we frankly don't find much call for checkpointing those as they typically just 
don't run long enough - especially if you are only running 256 ranks, so your 
cluster is small. Failure rates just don't justify it in such cases, in our 
experience.

Is there some particular reason why you feel you need checkpointing?
In this specific case, the jobs run for days. The risk of a hardware or power failure over that kind of duration becomes too high (we allow no more than 48 hours of run time). While it is true that we could dig through the library's code to make it checkpoint, BLCR checkpointing just seemed easier.

There must certainly be a better way to write the information to disk, 
though. Doing 500 IOPs seems completely broken. Why isn't there any buffering 
involved?
I don't know - that's all done in BLCR, I believe. Either way, it isn't 
something we can address due to the loss of our supporter for c/r.
I suppose I should contact BLCR instead then.

Thank you,

Maxime

Sorry we can't be of more help :-(
Ralph

Thanks,

Maxime


On 2013-01-28 10:58, Ralph Castain wrote:
Our c/r person has moved on to a different career path, so we may not have 
anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a losing 
proposition. It just takes too long and hammers the file system. People have been working 
on extending the capability with things like "burst buffers" (basically putting 
an SSD in front of the file system to absorb the checkpoint surge), but that hasn't 
become very common yet.

Frankly, what people have found to be the "best" solution is for your app to periodically 
write out its intermediate results, and then accept a flag at startup that tells it to "read 
prior results". This minimizes the amount of data being written to the disk. If done 
correctly, you would only lose whatever work was done since the last intermediate result was 
written - which is about equivalent to losing whatever work was done since the last checkpoint.
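
The pattern described above can be sketched as follows. This is a minimal single-process Python illustration, not taken from the thread or from Open MPI: the function names (`save_state`, `load_state`, `run`), the JSON state file, and the `stop_after` parameter used to simulate a failure are all hypothetical. In an MPI job each rank would typically write its own state (or use collective I/O), which is not shown here.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces the old file in one step, so a crash during the
    # write above leaves the previous intermediate result intact.
    os.replace(tmp_path, path)


def load_state(path):
    """Read back the last intermediate result."""
    with open(path) as f:
        return json.load(f)


def run(total_steps, state_path, restart=False, save_every=100, stop_after=None):
    """Toy iterative job that sums the step indices.

    restart    -- the "read prior results" flag
    stop_after -- hypothetical knob that simulates a failure at that step
    """
    if restart and os.path.exists(state_path):
        state = load_state(state_path)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated crash: unsaved work is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_state(state_path, state)

    save_state(state_path, state)
    return state
```

Calling `run(1000, "state.json", stop_after=250)` and then `run(1000, "state.json", restart=True)` resumes from step 200, the last saved state, so at most `save_every` steps are recomputed. Only the small state dictionary is written, rather than the full memory image of the process.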

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled 
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

First, some details about the tests:
- The only filesystems available on the nodes are 1) a tmpfs and 2) a shared 
Lustre filesystem (tested to provide ~15GB/s for writing and to support 
~40k IOPs).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
nodes). Each MPI rank was using approximately 200MB of memory.
- I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart.
- I was starting with mpirun -am ft-enable-cr
- The nodes are monitored by ganglia, which allows me to see the number of IOPs 
and the read/write speed on the filesystem.

I tried a few different MCA settings, but I consistently observed that:
- The checkpoints lasted ~4-5 minutes each time
- During a checkpoint, each node (8 ranks) was doing ~500 IOPs and writing at 
~15MB/s.

I am worried by the number of IOPs and the very slow writing speed. This was a 
very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 
GB of RAM per rank. With such jobs, the overall number of IOPs would reach tens 
of thousands and would completely overload our Lustre filesystem. Moreover, 
at 15MB/s per node, the checkpointing process would take hours.

How can I improve on that? Is there an MCA setting that I am missing?

Thanks,

--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in physics

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

