Based on the paper you linked, the answer is quite obvious. The proposed CRFS mechanism supports all of the checkpoint-enabled MPI implementations, so you just have to go with the one that provides the services you care about.
George.

On Mon, Jan 28, 2013 at 3:46 PM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
> Hi George,
>
> The problem here is not the bandwidth, but the number of IOPS. I wrote to
> the BLCR list, and they confirmed that:
>
> "While ideally the checkpoint would be written in sizable chunks, the
> current code in BLCR will issue a single write operation for each
> contiguous range of user memory, and many quite small writes for various
> meta-data and non-memory state (registers, signal handlers, etc.). As
> shown in Table 1 of the paper cited above, the writes in the 10s of KB
> range will dominate performance."
>
> (Reference: X. Ouyang, R. Rajachandrasekhar, X. Besseron, H. Wang,
> J. Huang and D. K. Panda, "CRFS: A Lightweight User-Level Filesystem for
> Generic Checkpoint/Restart", Int'l Conference on Parallel Processing
> (ICPP '11), Sept. 2011.)
>
> We did run parallel I/O benchmarks. Our filesystem can reach a speed of
> ~15 GB/s, but only with large I/O operations (at least bigger than 1 MB,
> optimally in the 100 MB - 1 GB range). For small (<1 MB) operations, the
> filesystem is considerably slower. I believe this is precisely what is
> killing the performance here.
>
> Not sure there is anything to be done about it.
>
> Best regards,
>
> Maxime
>
> On 2013-01-28 15:40, George Bosilca wrote:
> At the scale you address you should have no trouble with the C/R if the
> file system is correctly configured. We get more bandwidth per node out
> of an NFS over 1 Gb/s at 32 nodes. Have you run some parallel benchmarks
> on your cluster?
>
> George.
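The small-write behaviour the BLCR list describes can be illustrated in a few lines. This is a hypothetical toy using plain Python file I/O (not BLCR's code): it shows how funneling tens-of-KB writes through a large userspace buffer, which is essentially what CRFS does, collapses the number of I/O operations.

```python
import os
import tempfile

def write_unbuffered(path, chunks):
    """One os.write() syscall per chunk - one I/O op per memory region."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for chunk in chunks:
            os.write(fd, chunk)  # many ~30 KB writes hit the filesystem
    finally:
        os.close(fd)

def write_buffered(path, chunks, bufsize=8 * 1024 * 1024):
    """Coalesce chunks through an 8 MB buffer - a handful of large writes."""
    with open(path, "wb", buffering=bufsize) as f:
        for chunk in chunks:
            f.write(chunk)  # flushed to disk in bufsize-sized blocks

# 2000 fake "memory regions" of 30 KB each (~60 MB total), matching the
# tens-of-KB write sizes the BLCR list mentions.
chunks = [b"\x00" * (30 * 1024)] * 2000
with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "a.ckpt"), os.path.join(d, "b.ckpt")
    write_unbuffered(a, chunks)  # ~2000 write() calls
    write_buffered(b, chunks)    # roughly 8 large write() calls
    size_a, size_b = os.path.getsize(a), os.path.getsize(b)
```

Both files end up byte-identical; only the number of operations reaching the filesystem differs, which is exactly what matters on a Lustre system that is fast for large sequential writes and slow for small ones.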
> PS: You can find some MPI I/O benchmarks at
> http://www.mcs.anl.gov/~thakur/pio-benchmarks.html
>
> On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain <r...@open-mpi.org> wrote:
> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault
> <maxime.boissonnea...@calculquebec.ca> wrote:
> On 2013-01-28 13:15, Ralph Castain wrote:
> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault
> <maxime.boissonnea...@calculquebec.ca> wrote:
> On 2013-01-28 12:46, Ralph Castain wrote:
> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
> <maxime.boissonnea...@calculquebec.ca> wrote:
>
> Hello Ralph,
>
> I agree that ideally, someone would implement checkpointing in the
> application itself, but that is not always possible (commercial
> applications, use of complicated libraries, algorithms with no clear
> progression points at which you can interrupt the algorithm and start it
> back from there).
>
> Hmmm... well, most apps can be adjusted to support it - we have some very
> complex apps that were updated that way. Commercial apps are another
> story, but we frankly don't find much call for checkpointing those, as
> they typically just don't run long enough - especially if you are only
> running 256 ranks, so your cluster is small. Failure rates just don't
> justify it in such cases, in our experience.
>
> Is there some particular reason why you feel you need checkpointing?
>
> The specific case is that the jobs run for days. The risk of a hardware
> or power failure for that kind of duration becomes too high (we allow for
> no more than 48 hours of run time).
>
> I'm surprised by that - we run with UPS support on the clusters, but for
> a small one like you describe, we find the probability that a job will be
> interrupted even during a multi-week run is vanishingly small.
>
> FWIW: I do work with the financial industry, where we regularly run apps
> that execute non-stop for about a month with no reported failures.
> Are you actually seeing failures, or are you anticipating them?
>
> While our filesystem and management nodes are on UPS, our compute nodes
> are not. With on average one generic (mostly power/cooling) failure every
> one or two months, running for weeks is just asking for trouble.
>
> Wow, that is high.
>
> If you add to that typical DIMM/CPU/networking failures (I estimated that
> about 1 node goes down per day because of some sort of hardware failure,
> for a cluster of 960 nodes)...
>
> That is incredibly high.
>
> With these numbers, a job running on 32 nodes for 7 days has a ~35%
> chance of failing before it is done.
>
> I've never seen anything like that behavior in practice - a 32 node
> cluster typically runs for quite a few months without a failure. It is a
> typical size for the financial sector, so we have a LOT of experience
> with such clusters.
>
> I suspect you won't see anything like that behavior...
>
> Having 24 GB of RAM per node, even if a 32 node job takes close to 100%
> of the RAM, that's merely 640 GB of data. Writing that to a Lustre
> filesystem capable of reaching ~15 GB/s should take no more than a few
> minutes if written correctly. Right now, I am getting a few minutes for a
> hundredth of this amount of data!
>
> Agreed - but I don't think you'll get that bandwidth for checkpointing. I
> suspect you'll find that checkpointing really has trouble when scaling,
> which is why you don't see it used in production (at least, I haven't).
> It is mostly used for research by a handful of organizations, which is
> why we haven't been too concerned about its loss.
>
> While it is true that we can dig through the code of the library to make
> it checkpoint, BLCR checkpointing just seemed easier.
>
> I see - just be aware that checkpoint support in OMPI will disappear in
> v1.7 and there is no clear timetable for restoring it.
>
> That is very good to know. Thanks for the information. It is too bad
> though.
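The ~35% figure quoted above is roughly reproducible from the stated failure rates. A back-of-the-envelope sketch, with two labelled assumptions: that "every one or two months" averages to one generic failure per ~45 days, and that a generic (power/cooling) failure takes down any running job.

```python
nodes, days = 32, 7

# Assumption: ~1 of 960 nodes fails per day from hardware (Maxime's estimate).
p_node_day = 1 / 960
p_hw_survive = (1 - p_node_day) ** (nodes * days)

# Assumption: one cluster-wide generic failure every ~45 days, fatal to the job.
p_generic_survive = 1 - days / 45

p_fail = 1 - p_hw_survive * p_generic_survive
print(f"{p_fail:.0%}")  # roughly a third - consistent with the ~35% above
```

The hardware term alone gives about a 21% failure chance over 7 days; the generic-failure term pushes the total into the low thirties, so the ~35% claim is in the right range.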
> There certainly must be a better way to write the information down to
> disk, though. Doing 500 IOPS seems to be completely broken. Why isn't
> there any buffering involved?
>
> I don't know - that's all done in BLCR, I believe. Either way, it isn't
> something we can address due to the loss of our supporter for c/r.
>
> I suppose I should contact BLCR instead then.
>
> For the disk op problem, I think that's the way to go - though like I
> said, I could be wrong and the disk writes could be something we do
> inside OMPI. I'm not familiar enough with the c/r code to state it with
> certainty.
>
> Thank you,
>
> Maxime
>
> Sorry we can't be of more help :-(
>
> Ralph
>
> Thanks,
>
> Maxime
>
> On 2013-01-28 10:58, Ralph Castain wrote:
>
> Our c/r person has moved on to a different career path, so we may not
> have anyone who can answer this question.
>
> What we can say is that checkpointing at any significant scale will
> always be a losing proposition. It just takes too long and hammers the
> file system. People have been working on extending the capability with
> things like "burst buffers" (basically putting an SSD in front of the
> file system to absorb the checkpoint surge), but that hasn't become very
> common yet.
>
> Frankly, what people have found to be the "best" solution is for your
> app to periodically write out its intermediate results, and then take a
> flag that indicates "read prior results" when it starts. This minimizes
> the amount of data being written to the disk. If done correctly, you
> would only lose whatever work was done since the last intermediate
> result was written - which is about equivalent to losing whatever work
> was done since the last checkpoint.
>
> HTH
> Ralph
>
> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
> <maxime.boissonnea...@calculquebec.ca> wrote:
>
> Hello,
>
> I am doing checkpointing tests (with BLCR) with an MPI application
> compiled with Open MPI 1.6.3, and I am seeing behaviors that are quite
> strange.
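The application-level "write intermediate results, restart from them" approach Ralph recommends can be sketched minimally. Everything here is hypothetical (the state.json filename, the counter standing in for real work); in a real MPI code each rank would write its own shard of solver state instead.

```python
import json
import os

STATE_FILE = "state.json"  # illustrative name, not a real convention

def load_state():
    """On startup, resume from the last dump if one exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)  # the "read prior results" path
    return {"step": 0, "total": 0}

def save_state(state):
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)  # atomic rename: never a half-written file

state = load_state()
while state["step"] < 100:
    state["total"] += state["step"]  # one unit of "work"
    state["step"] += 1
    if state["step"] % 10 == 0:      # dump intermediate results periodically
        save_state(state)
```

After a crash, at most the work since the last save_state() call is lost, which is Ralph's point: this is about equivalent to a system checkpoint, but writes only the data the application actually needs, in a few large operations.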
> First, some details about the tests:
>
> - The only filesystems available on the nodes are 1) one tmpfs, and
>   2) one Lustre shared filesystem (tested to be able to provide ~15 GB/s
>   for writing and support ~40k IOPS).
> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or
>   2 nodes). Each MPI rank was using approximately 200 MB of memory.
> - I was doing checkpoints with ompi-checkpoint and restarting with
>   ompi-restart.
> - I was starting with mpirun -am ft-enable-cr.
> - The nodes are monitored by Ganglia, which allows me to see the number
>   of IOPS and the read/write speed on the filesystem.
>
> I tried a few different MCA settings, but I consistently observed that:
>
> - The checkpoints lasted ~4-5 minutes each time.
> - During a checkpoint, each node (8 ranks) was doing ~500 IOPS and
>   writing at ~15 MB/s.
>
> I am worried by the number of IOPS and the very slow writing speed. This
> was a very small test. We have jobs running with 128 or 256 MPI ranks,
> each using 1-2 GB of RAM per rank. With such jobs, the overall number of
> IOPS would reach tens of thousands and would completely overload our
> Lustre filesystem. Moreover, at 15 MB/s per node, the checkpointing
> process would take hours.
>
> How can I improve on that? Is there an MCA setting that I am missing?
>
> Thanks,
>
> --
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
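For what it's worth, the per-node figures in the original report (~500 IOPS at ~15 MB/s) are internally consistent with the BLCR list's "writes in the 10s of KB" explanation, as a quick check shows:

```python
write_speed = 15 * 1024 * 1024  # observed: ~15 MB/s per node
iops = 500                      # observed: ~500 I/O ops per second per node

avg_write = write_speed / iops  # average bytes per operation, ~30 KB:
                                # squarely in the regime the BLCR list
                                # predicted would dominate performance

# At that throughput, one node's worth of state (8 ranks x ~200 MB) takes:
per_node_bytes = 8 * 200 * 1024 * 1024
seconds = per_node_bytes / write_speed  # ~107 s, the same ballpark as the
                                        # observed 4-5 minute checkpoints
```

So the slow checkpoints are not a bandwidth problem at all: the filesystem is simply being driven at its small-write IOPS limit, as Maxime concluded.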