On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> On 2013-01-28 12:46, Ralph Castain wrote:
>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>
>>> Hello Ralph,
>>> I agree that ideally, someone would implement checkpointing in the
>>> application itself, but that is not always possible (commercial
>>> applications, use of complicated libraries, algorithms with no clear
>>> progression points at which you can interrupt the algorithm and start it
>>> back from there).
>> Hmmm... well, most apps can be adjusted to support it - we have some very
>> complex apps that were updated that way. Commercial apps are another story,
>> but we frankly don't find much call for checkpointing those, as they
>> typically just don't run long enough - especially if you are only running
>> 256 ranks, so your cluster is small. Failure rates just don't justify it in
>> such cases, in our experience.
>>
>> Is there some particular reason why you feel you need checkpointing?
> In this specific case, the jobs run for days. The risk of a hardware or
> power failure over that kind of duration becomes too high (we allow no more
> than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, and for a
small one like you describe, we find the probability that a job will be
interrupted, even during a multi-week run, is vanishingly small. FWIW: I do
work with the financial industry, where we regularly run apps that execute
non-stop for about a month with no reported failures.

Are you actually seeing failures, or are you anticipating them?

> While it is true that we could dig through the code of the library to make
> it checkpoint, BLCR checkpointing just seemed easier.

I see - just be aware that checkpoint support in OMPI will disappear in v1.7,
and there is no clear timetable for restoring it.

>>> There certainly must be a better way to write the information down to
>>> disk, though. Doing 500 IOPs seems to be completely broken.
>>> Why isn't there buffering involved?
>> I don't know - that's all done in BLCR, I believe. Either way, it isn't
>> something we can address, due to the loss of our supporter for c/r.
> I suppose I should contact BLCR instead, then.

For the disk op problem, I think that's the way to go - though, like I said,
I could be wrong, and the disk writes could be something we do inside OMPI.
I'm not familiar enough with the c/r code to state it with certainty.

> Thank you,
>
> Maxime
>> Sorry we can't be of more help :-(
>> Ralph
>>
>>> Thanks,
>>>
>>> Maxime
>>>
>>> On 2013-01-28 10:58, Ralph Castain wrote:
>>>> Our c/r person has moved on to a different career path, so we may not
>>>> have anyone who can answer this question.
>>>>
>>>> What we can say is that checkpointing at any significant scale will
>>>> always be a losing proposition. It just takes too long and hammers the
>>>> file system. People have been working on extending the capability with
>>>> things like "burst buffers" (basically putting an SSD in front of the
>>>> file system to absorb the checkpoint surge), but that hasn't become very
>>>> common yet.
>>>>
>>>> Frankly, what people have found to be the "best" solution is for your
>>>> app to periodically write out its intermediate results, and then take a
>>>> flag that indicates "read prior results" when it starts. This minimizes
>>>> the amount of data being written to disk. If done correctly, you would
>>>> only lose whatever work was done since the last intermediate result was
>>>> written - which is about equivalent to losing whatever work was done
>>>> since the last checkpoint.
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>
>>>>> Hello,
>>>>> I am doing checkpointing tests (with BLCR) on an MPI application
>>>>> compiled with OpenMPI 1.6.3, and I am seeing behavior that is quite
>>>>> strange.
>>>>>
>>>>> First, some details about the tests:
>>>>> - The only filesystems available on the nodes are 1) one tmpfs, and
>>>>> 2) one Lustre shared filesystem (tested to provide ~15 GB/s for writing
>>>>> and to support ~40k IOPs).
>>>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores
>>>>> (1 or 2 nodes). Each MPI rank was using approximately 200 MB of memory.
>>>>> - I was taking checkpoints with ompi-checkpoint and restarting with
>>>>> ompi-restart.
>>>>> - I was starting the job with mpirun -am ft-enable-cr.
>>>>> - The nodes are monitored by Ganglia, which allows me to see the number
>>>>> of IOPs and the read/write speed on the filesystem.
>>>>>
>>>>> I tried a few different MCA settings, but I consistently observed that:
>>>>> - The checkpoints lasted ~4-5 minutes each time.
>>>>> - During a checkpoint, each node (8 ranks) was doing ~500 IOPs and
>>>>> writing at ~15 MB/s.
>>>>>
>>>>> I am worried by the number of IOPs and the very slow writing speed.
>>>>> This was a very small test. We have jobs running with 128 or 256 MPI
>>>>> ranks, each using 1-2 GB of RAM per rank. With such jobs, the overall
>>>>> number of IOPs would reach tens of thousands and would completely
>>>>> overload our Lustre filesystem. Moreover, at 15 MB/s per node, the
>>>>> checkpointing process would take hours.
>>>>>
>>>>> How can I improve on that? Is there an MCA setting that I am missing?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>> Ph. D. en physique
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
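For readers following along, the checkpoint/restart workflow described in the thread (OpenMPI 1.6 with the BLCR checkpointer) looks roughly like this. Only `mpirun -am ft-enable-cr`, `ompi-checkpoint`, and `ompi-restart` come from the thread itself; the application name, rank count, PID placeholder, and default snapshot location are assumptions to verify against the OMPI 1.6 documentation:

```shell
# Launch the MPI job with checkpoint/restart support enabled
# (./my_app and -np 16 are placeholders):
mpirun -am ft-enable-cr -np 16 ./my_app

# From another shell, take a checkpoint of the running job,
# identified by the PID of the mpirun process:
ompi-checkpoint <mpirun_pid>

# Restart from the resulting global snapshot (written, by default
# in OMPI 1.6, under $HOME):
ompi-restart ompi_global_snapshot_<mpirun_pid>.ckpt
```

These commands require a cluster with OMPI 1.6.x built with c/r support and BLCR installed, so they are a sketch of the workflow rather than something to run verbatim.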
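The application-level alternative Ralph recommends - periodically writing intermediate results and resuming from the last one at startup - can be sketched in a few lines. This is a minimal single-process illustration, not code from the thread: the JSON state file, its layout, and the trivial accumulation loop standing in for real work are all invented for the example.

```python
import json
import os

CHECKPOINT = "state.json"  # hypothetical intermediate-results file


def load_state():
    """Resume from the last intermediate result if one exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"iteration": 0, "total": 0}


def save_state(state):
    """Write to a temp file, then rename atomically, so a crash
    mid-write never corrupts the previous intermediate result."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)


def run(max_iterations=100, checkpoint_every=10):
    state = load_state()  # picks up prior results automatically
    while state["iteration"] < max_iterations:
        state["total"] += state["iteration"]  # stand-in for real work
        state["iteration"] += 1
        if state["iteration"] % checkpoint_every == 0:
            save_state(state)
    save_state(state)  # final result
    return state


if __name__ == "__main__":
    print(run())
```

As Ralph notes, a correct implementation loses at most the work done since the last intermediate result - about the same exposure as a BLCR checkpoint, but writing only the application's own state instead of full process images.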