On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
> On 2013-01-28 13:15, Ralph Castain wrote:
>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>> On 2013-01-28 12:46, Ralph Castain wrote:
>>>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>> Hello Ralph,
>>>>> I agree that ideally, someone would implement checkpointing in the application itself, but that is not always possible (commercial applications, use of complicated libraries, algorithms with no clear progression points at which you can interrupt the algorithm and restart it from there).
>>>> Hmmm... well, most apps can be adjusted to support it - we have some very complex apps that were updated that way. Commercial apps are another story, but we frankly don't find much call for checkpointing those, as they typically just don't run long enough - especially if you are only running 256 ranks, so your cluster is small. Failure rates just don't justify it in such cases, in our experience.
>>>>
>>>> Is there some particular reason why you feel you need checkpointing?
>>> In this specific case, the jobs run for days. The risk of a hardware or power failure over that kind of duration becomes too high (we allow no more than 48 hours of run time).
>> I'm surprised by that - we run with UPS support on the clusters, but for a small one like you describe, we find the probability that a job will be interrupted even during a multi-week app is vanishingly small.
>>
>> FWIW: I do work with the financial industry, where we regularly run apps that execute non-stop for about a month with no reported failures. Are you actually seeing failures, or are you anticipating them?
> While our filesystem and management nodes are on UPS, our compute nodes are not. With, on average, one generic (mostly power/cooling) failure every one or two months, running for weeks is just asking for trouble.

Wow, that is high.

> If you add to that typical DIMM/CPU/networking failures (I estimated about 1 node goes down per day because of some sort of hardware failure, for a cluster of 960 nodes)...

That is incredibly high.

> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of failing before it is done.

I've never seen anything like that behavior in practice - a 32-node cluster typically runs for quite a few months without a failure. It is a typical size for the financial sector, so we have a LOT of experience with such clusters. I suspect you won't see anything like that behavior...
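[Editor's note: a rough check of the ~35% figure, using only the failure rates Maxime states above, and assuming the two failure modes are independent and that "every one or two months" means roughly every 45 days:

    P(node hardware failure)  ~ 1 - (1 - 1/960)^(32 nodes x 7 days) ~ 0.21
    P(power/cooling event)    ~ 7 / 45                              ~ 0.16
    P(job interrupted)        ~ 1 - (1 - 0.21)(1 - 0.16)            ~ 0.33

i.e. roughly one chance in three, which is consistent with the figure quoted.]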
> Having 24 GB of RAM per node, even if a 32-node job uses close to 100% of the RAM, that's merely 640 GB of data. Writing that to a Lustre filesystem capable of reaching ~15 GB/s should take no more than a few minutes if done correctly. Right now, I am getting a few minutes for a hundredth of this amount of data!

Agreed - but I don't think you'll get that bandwidth for checkpointing. I suspect you'll find that checkpointing has real trouble scaling, which is why you don't see it used in production (at least, I haven't). It is mostly used for research by a handful of organizations, which is why we haven't been too concerned about its loss.

>>> While it is true we can dig through the code of the library to make it checkpoint, BLCR checkpointing just seemed easier.
>> I see - just be aware that checkpoint support in OMPI will disappear in v1.7 and there is no clear timetable for restoring it.
> That is very good to know. Thanks for the information. It is too bad, though.
>>>>> There certainly must be a better way to write the information to disk, though. Doing 500 IOPs seems to be completely broken. Why isn't there any buffering involved?
>>>> I don't know - that's all done in BLCR, I believe. Either way, it isn't something we can address due to the loss of our supporter for c/r.
>>> I suppose I should contact BLCR instead, then.
>> For the disk op problem, I think that's the way to go - though as I said, I could be wrong and the disk writes could be something we do inside OMPI. I'm not familiar enough with the c/r code to state it with certainty.
>>> Thank you,
>>>
>>> Maxime
>>>> Sorry we can't be of more help :-(
>>>> Ralph
>>>>> Thanks,
>>>>>
>>>>> Maxime
>>>>>
>>>>> On 2013-01-28 10:58, Ralph Castain wrote:
>>>>>> Our c/r person has moved on to a different career path, so we may not have anyone who can answer this question.
>>>>>>
>>>>>> What we can say is that checkpointing at any significant scale will always be a losing proposition. It just takes too long and hammers the file system. People have been working on extending the capability with things like "burst buffers" (basically putting an SSD in front of the file system to absorb the checkpoint surge), but that hasn't become very common yet.
>>>>>>
>>>>>> Frankly, what people have found to be the "best" solution is for your app to periodically write out its intermediate results, and then take a flag that indicates "read prior results" when it starts. This minimizes the amount of data being written to disk. If done correctly, you would only lose whatever work was done since the last intermediate result was written - which is about equivalent to losing whatever work was done since the last checkpoint.
>>>>>>
>>>>>> HTH
>>>>>> Ralph
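[Editor's note: a minimal sketch of the application-level restart pattern Ralph describes above, written for a hypothetical iterative MPI code in C. The snapshot file names, the "--restart" flag, the problem size, and the checkpoint interval are all illustrative assumptions, not Open MPI or BLCR features. Each rank writes its state as one large sequential write per snapshot, which is what keeps the IOP count low compared with the BLCR behaviour reported elsewhere in this thread.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N             (1 << 20)   /* local problem size (example only)   */
#define CKPT_INTERVAL 1000        /* iterations between snapshots        */
#define MAX_ITER      100000      /* total iterations (example only)     */

static void write_snapshot(int rank, int iter, const double *state)
{
    char path[256];
    snprintf(path, sizeof(path), "snapshot_rank%05d.dat", rank);
    FILE *f = fopen(path, "wb");
    if (!f) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(&iter, sizeof iter, 1, f);     /* iteration reached so far     */
    fwrite(state, sizeof(double), N, f);  /* one large sequential write   */
    fclose(f);
}

static int read_snapshot(int rank, double *state)
{
    char path[256];
    int iter = 0;
    snprintf(path, sizeof(path), "snapshot_rank%05d.dat", rank);
    FILE *f = fopen(path, "rb");
    if (!f) return 0;                     /* no prior results: start fresh */
    if (fread(&iter, sizeof iter, 1, f) != 1 ||
        fread(state, sizeof(double), N, f) != (size_t)N) {
        fclose(f);
        return 0;                         /* unreadable snapshot: start fresh */
    }
    fclose(f);
    return iter;
}

int main(int argc, char **argv)
{
    int rank, start_iter = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *state = calloc(N, sizeof(double));   /* normal initialization */

    /* "read prior results" flag, as suggested above (name is illustrative) */
    if (argc > 1 && strcmp(argv[1], "--restart") == 0)
        start_iter = read_snapshot(rank, state);

    for (int iter = start_iter; iter < MAX_ITER; iter++) {
        /* ... one step of the real computation on "state" goes here ... */
        if (iter > 0 && iter % CKPT_INTERVAL == 0)
            write_snapshot(rank, iter, state);   /* intermediate results  */
    }

    write_snapshot(rank, MAX_ITER, state);       /* final results         */
    free(state);
    MPI_Finalize();
    return 0;
}

After an interruption, the job would simply be resubmitted with the "--restart" flag; the sketch assumes the rerun uses the same number of ranks, so each rank finds its own snapshot file. End of editor's note.]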
>>>>>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>>> Hello,
>>>>>>> I am doing checkpointing tests (with BLCR) with an MPI application compiled with Open MPI 1.6.3, and I am seeing behaviors that are quite strange.
>>>>>>>
>>>>>>> First, some details about the tests:
>>>>>>> - The only filesystems available on the nodes are 1) a tmpfs and 2) a shared Lustre filesystem (tested to provide ~15 GB/s for writing and to support ~40k IOPs).
>>>>>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 nodes). Each MPI rank was using approximately 200 MB of memory.
>>>>>>> - I was taking checkpoints with ompi-checkpoint and restarting with ompi-restart.
>>>>>>> - I was starting the job with mpirun -am ft-enable-cr.
>>>>>>> - The nodes are monitored by Ganglia, which allows me to see the number of IOPs and the read/write speed on the filesystem.
>>>>>>>
>>>>>>> I tried a few different MCA settings, but I consistently observed that:
>>>>>>> - The checkpoints lasted ~4-5 minutes each time.
>>>>>>> - During a checkpoint, each node (8 ranks) was doing ~500 IOPs and writing at ~15 MB/s.
>>>>>>>
>>>>>>> I am worried by the number of IOPs and the very slow writing speed. This was a very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 GB of RAM per rank. With such jobs, the overall number of IOPs would reach tens of thousands and would completely overload our Lustre filesystem. Moreover, at 15 MB/s per node, the checkpointing process would take hours.
>>>>>>>
>>>>>>> How can I improve on that? Is there an MCA setting that I am missing?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Maxime

> --
> ---------------------------------
> Maxime Boissonneault
> Computing Analyst - Calcul Québec, Université Laval
> Ph.D. in physics