At the scale you describe, you should have no trouble with C/R if the file system is correctly configured. We get more bandwidth per node than you report out of NFS over 1Gb/s at 32 nodes. Have you run any parallel I/O benchmarks on your cluster?
George.

PS: You can find some MPI I/O benchmarks at http://www.mcs.anl.gov/~thakur/pio-benchmarks.html

On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault
> <maxime.boissonnea...@calculquebec.ca> wrote:
>
>> On 2013-01-28 13:15, Ralph Castain wrote:
>>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault
>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>
>>>> On 2013-01-28 12:46, Ralph Castain wrote:
>>>>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
>>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>
>>>>>> Hello Ralph,
>>>>>> I agree that ideally, someone would implement checkpointing in the
>>>>>> application itself, but that is not always possible (commercial
>>>>>> applications, use of complicated libraries, algorithms with no clear
>>>>>> progression points at which you can interrupt the algorithm and start it
>>>>>> back from there).
>>>>> Hmmm...well, most apps can be adjusted to support it - we have some very
>>>>> complex apps that were updated that way. Commercial apps are another
>>>>> story, but we frankly don't find much call for checkpointing those as
>>>>> they typically just don't run long enough - especially if you are only
>>>>> running 256 ranks, so your cluster is small. Failure rates just don't
>>>>> justify it in such cases, in our experience.
>>>>>
>>>>> Is there some particular reason why you feel you need checkpointing?
>>>> This specific case is that the jobs run for days. The risk of a hardware
>>>> or power failure for that kind of duration goes too high (we allow for no
>>>> more than 48 hours of run time).
>>> I'm surprised by that - we run with UPS support on the clusters, but for a
>>> small one like you describe, we find the probability that a job will be
>>> interrupted even during a multi-week app is vanishingly small.
>>>
>>> FWIW: I do work with the financial industry where we regularly run apps
>>> that execute non-stop for about a month with no reported failures. Are you
>>> actually seeing failures, or are you anticipating them?
>> While our filesystem and management nodes are on UPS, our compute nodes are
>> not. With on average one generic (power/cooling mostly) failure every one or
>> two months, running for weeks is just asking for trouble.
>
> Wow, that is high
>
>> If you add to that typical dimm/cpu/networking failures (I estimated about 1
>> node goes down per day because of some sort of hardware failure, for a
>> cluster of 960 nodes).
>
> That is incredibly high
>
>> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance
>> of failing before it is done.
>
> I've never seen anything like that behavior in practice - a 32 node cluster
> typically runs for quite a few months without a failure. It is a typical size
> for the financial sector, so we have a LOT of experience with such clusters.
>
> I suspect you won't see anything like that behavior...
>
>>
>> Having 24GB of ram per node, even if a 32-node job takes close to 100% of
>> the ram, that's merely 640 GB of data. Writing that on a lustre filesystem
>> capable of reaching ~15GB/s should take no more than a few minutes if
>> written correctly. Right now, I am getting a few minutes for a hundredth of
>> this amount of data!
>
>
> Agreed - but I don't think you'll get that bandwidth for checkpointing. I
> suspect you'll find that checkpointing really has troubles when scaling,
> which is why you don't see it used in production (at least, I haven't).
> Mostly used for research by a handful of organizations, which is why we
> haven't been too concerned about its loss.
>
>
>>
>>>> While it is true we can dig through the code of the library to make it
>>>> checkpoint, BLCR checkpointing just seemed easier.
>>> I see - just be aware that checkpoint support in OMPI will disappear in
>>> v1.7 and there is no clear timetable for restoring it.
>> That is very good to know. Thanks for the information. It is too bad though.
>>>
>>>>>> There certainly must be a better way to write the information down on
>>>>>> disc though. Doing 500 IOPs seems to be completely broken. Why isn't
>>>>>> there buffering involved?
>>>>> I don't know - that's all done in BLCR, I believe. Either way, it isn't
>>>>> something we can address due to the loss of our supporter for c/r.
>>>> I suppose I should contact BLCR instead then.
>>> For the disk op problem, I think that's the way to go - though like I said,
>>> I could be wrong and the disk writes could be something we do inside OMPI.
>>> I'm not familiar enough with the c/r code to state it with certainty.
>>>
>>>> Thank you,
>>>>
>>>> Maxime
>>>>> Sorry we can't be of more help :-(
>>>>> Ralph
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Maxime
>>>>>>
>>>>>>
>>>>>> On 2013-01-28 10:58, Ralph Castain wrote:
>>>>>>> Our c/r person has moved on to a different career path, so we may not
>>>>>>> have anyone who can answer this question.
>>>>>>>
>>>>>>> What we can say is that checkpointing at any significant scale will
>>>>>>> always be a losing proposition. It just takes too long and hammers the
>>>>>>> file system. People have been working on extending the capability with
>>>>>>> things like "burst buffers" (basically putting an SSD in front of the
>>>>>>> file system to absorb the checkpoint surge), but that hasn't become
>>>>>>> very common yet.
>>>>>>>
>>>>>>> Frankly, what people have found to be the "best" solution is for your
>>>>>>> app to periodically write out its intermediate results, and then take a
>>>>>>> flag that indicates "read prior results" when it starts. This minimizes
>>>>>>> the amount of data being written to the disk.
>>>>>>> If done correctly, you would only lose whatever work was done since
>>>>>>> the last intermediate result was written - which is about equivalent
>>>>>>> to losing whatever work was done since the last checkpoint.
>>>>>>>
>>>>>>> HTH
>>>>>>> Ralph
>>>>>>>
>>>>>>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
>>>>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>> I am doing checkpointing tests (with BLCR) with an MPI application
>>>>>>>> compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite
>>>>>>>> strange.
>>>>>>>>
>>>>>>>> First, some details about the tests:
>>>>>>>> - The only filesystems available on the nodes are 1) one tmpfs, 2) one
>>>>>>>> lustre shared filesystem (tested to be able to provide ~15GB/s for
>>>>>>>> writing and support ~40k IOPs).
>>>>>>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1
>>>>>>>> or 2 nodes). Each MPI rank was using approximately 200MB of memory.
>>>>>>>> - I was doing checkpoints with ompi-checkpoint and restarting with
>>>>>>>> ompi-restart.
>>>>>>>> - I was starting with mpirun -am ft-enable-cr
>>>>>>>> - The nodes are monitored by ganglia, which allows me to see the
>>>>>>>> number of IOPs and the read/write speed on the filesystem.
>>>>>>>>
>>>>>>>> I tried a few different mca settings, but I consistently observed that:
>>>>>>>> - The checkpoints lasted ~4-5 minutes each time
>>>>>>>> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and
>>>>>>>> writing at ~15MB/s.
>>>>>>>>
>>>>>>>> I am worried by the number of IOPs and the very slow writing speed.
>>>>>>>> This was a very small test. We have jobs running with 128 or 256 MPI
>>>>>>>> ranks, each using 1-2 GB of ram per rank. With such jobs, the overall
>>>>>>>> number of IOPs would reach tens of thousands and would completely
>>>>>>>> overload our lustre filesystem. Moreover, with 15MB/s per node, the
>>>>>>>> checkpointing process would take hours.
>>>>>>>>
>>>>>>>> How can I improve on that? Is there an MCA setting that I am missing?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> --
>>>>>>>> ---------------------------------
>>>>>>>> Maxime Boissonneault
>>>>>>>> Computing analyst - Calcul Québec, Université Laval
>>>>>>>> Ph.D. in physics
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
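
For reference, the application-level approach suggested earlier in the thread (periodically write out intermediate results, and accept a flag meaning "read prior results" at startup) could look roughly like the sketch below. This is only an illustration under stated assumptions: the names save_state, load_state, run, and the JSON state file are hypothetical, the loop body stands in for real work, and no MPI is involved. In an MPI job each rank would presumably write its own file, or one rank would gather and write, which is not shown here.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state to a temp file, then rename it over the target.

    A crash during the write leaves the previous file intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_state(path):
    """Read back the last intermediate result."""
    with open(path) as handle:
        return json.load(handle)


def run(total_steps, state_path, save_every=100, resume=False):
    """Hypothetical main loop: resume from prior results if asked to."""
    if resume and os.path.exists(state_path):
        state = load_state(state_path)
    else:
        state = {"step": 0, "accumulator": 0}

    while state["step"] < total_steps:
        state["accumulator"] += state["step"]  # placeholder for real work
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_state(state_path, state)  # one buffered write per interval

    save_state(state_path, state)
    return state
```

Only the small dictionary of results is written, in a single buffered write per interval, rather than a full image of each process. After a failure, calling run(..., resume=True) loses at most the work done since the last save.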