On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> On 2013-01-28 12:46, Ralph Castain wrote:
>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>> 
>>> Hello Ralph,
>>> I agree that ideally, someone would implement checkpointing in the 
>>> application itself, but that is not always possible (commercial 
>>> applications, use of complicated libraries, algorithms with no clear 
>>> progression points at which you can interrupt the algorithm and later 
>>> resume it from there).
>> Hmmm...well, most apps can be adjusted to support it - we have some very 
>> complex apps that were updated that way. Commercial apps are another story, 
>> but we frankly don't find much call for checkpointing those as they 
>> typically just don't run long enough - especially if you are only running 
>> 256 ranks, which suggests your cluster is small. Failure rates just don't 
>> justify it in such cases, in our experience.
>> 
>> Is there some particular reason why you feel you need checkpointing?
> In this specific case, the jobs run for days. The risk of a hardware or 
> power failure over that kind of duration gets too high (we allow no more 
> than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, but for a 
small one like you describe, we find the probability that a job will be 
interrupted even during a multi-week run is vanishingly small.

FWIW: I do work with the financial industry where we regularly run apps that 
execute non-stop for about a month with no reported failures. Are you actually 
seeing failures, or are you anticipating them?

> While it is true we can dig through the code of the library to make it 
> checkpoint, BLCR checkpointing just seemed easier.

I see - just be aware that checkpoint support in OMPI will disappear in v1.7 
and there is no clear timetable for restoring it.

>> 
>>> There certainly must be a better way to write the information to disk, 
>>> though. Doing 500 IOPS seems completely broken. Why isn't there any 
>>> buffering involved?
>> I don't know - that's all done in BLCR, I believe. Either way, it isn't 
>> something we can address, since the person who supported c/r has left.
> I suppose I should contact the BLCR developers instead, then.

For the disk op problem, I think that's the way to go - though like I said, I 
could be wrong and the disk writes could be something we do inside OMPI. I'm 
not familiar enough with the c/r code to state it with certainty.

> 
> Thank you,
> 
> Maxime
>> 
>> Sorry we can't be of more help :-(
>> Ralph
>> 
>>> Thanks,
>>> 
>>> Maxime
>>> 
>>> 
>>> On 2013-01-28 10:58, Ralph Castain wrote:
>>>> Our c/r person has moved on to a different career path, so we may not have 
>>>> anyone who can answer this question.
>>>> 
>>>> What we can say is that checkpointing at any significant scale will always 
>>>> be a losing proposition. It just takes too long and hammers the file 
>>>> system. People have been working on extending the capability with things 
>>>> like "burst buffers" (basically putting an SSD in front of the file system 
>>>> to absorb the checkpoint surge), but that hasn't become very common yet.
>>>> 
>>>> Frankly, what people have found to be the "best" solution is for your app 
>>>> to periodically write out its intermediate results, and then take a flag 
>>>> that indicates "read prior results" when it starts. This minimizes the 
>>>> amount of data being written to the disk. If done correctly, you would 
>>>> only lose whatever work was done since the last intermediate result was 
>>>> written - which is about equivalent to losing whatever work was done 
>>>> since the last checkpoint.
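>>>> 
>>>> The pattern looks something like this - just a sketch, where
>>>> compute_step() and the result-file layout are placeholders for whatever
>>>> your app actually does:
>>>> 
>>>>   #include <stdio.h>
>>>>   #include <stdlib.h>
>>>>   #include <string.h>
>>>>   #include <mpi.h>
>>>> 
>>>>   #define N          1000000  /* state elements per rank (example size) */
>>>>   #define SAVE_EVERY 1000     /* steps between intermediate dumps */
>>>>   #define TOTAL      100000   /* total steps in the run */
>>>> 
>>>>   static void compute_step(double *state) { /* the app's real work */ }
>>>> 
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>       int rank, start = 0;
>>>>       char fname[64];
>>>> 
>>>>       MPI_Init(&argc, &argv);
>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>       double *state = calloc(N, sizeof(double));
>>>>       snprintf(fname, sizeof(fname), "result.%d.dat", rank);
>>>> 
>>>>       /* "--restart" flag: read the last intermediate result instead
>>>>          of starting from scratch */
>>>>       if (argc > 1 && strcmp(argv[1], "--restart") == 0) {
>>>>           FILE *f = fopen(fname, "rb");
>>>>           if (f) {
>>>>               fread(&start, sizeof start, 1, f);
>>>>               fread(state, sizeof(double), N, f);
>>>>               fclose(f);
>>>>           }
>>>>       }
>>>> 
>>>>       for (int step = start; step < TOTAL; step++) {
>>>>           compute_step(state);
>>>>           if ((step + 1) % SAVE_EVERY == 0) {
>>>>               /* one buffered, sequential write per rank; a real app
>>>>                  would write a temp file and rename() it so a crash
>>>>                  mid-write can't clobber the last good result */
>>>>               FILE *f = fopen(fname, "wb");
>>>>               if (f) {
>>>>                   int next = step + 1;
>>>>                   fwrite(&next, sizeof next, 1, f);
>>>>                   fwrite(state, sizeof(double), N, f);
>>>>                   fclose(f);
>>>>               }
>>>>           }
>>>>       }
>>>> 
>>>>       free(state);
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>   }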
>>>> 
>>>> HTH
>>>> Ralph
>>>> 
>>>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>> 
>>>>> Hello,
>>>>> I am doing checkpointing tests (with BLCR) on an MPI application 
>>>>> compiled with Open MPI 1.6.3, and I am seeing some quite strange 
>>>>> behavior.
>>>>> 
>>>>> First, some details about the tests :
>>>>> - The only filesystems available on the nodes are 1) a tmpfs and 2) a 
>>>>> shared Lustre filesystem (tested to sustain ~15 GB/s for writes and 
>>>>> ~40k IOPS).
>>>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 
>>>>> 2 nodes). Each MPI rank was using approximately 200 MB of memory.
>>>>> - I was taking checkpoints with ompi-checkpoint and restarting with 
>>>>> ompi-restart.
>>>>> - I was starting the job with mpirun -am ft-enable-cr (the exact 
>>>>> commands are sketched after this list).
>>>>> - The nodes are monitored by Ganglia, which lets me see the number of 
>>>>> IOPS and the read/write speed on the filesystem.
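>>>>> 
>>>>> The exact command sequence looked roughly like this (the rank count, 
>>>>> binary name and snapshot handle below are illustrative):
>>>>> 
>>>>>   # start the job with checkpoint/restart support enabled
>>>>>   mpirun -np 16 -am ft-enable-cr ./my_app
>>>>> 
>>>>>   # from another shell, checkpoint the job via mpirun's PID
>>>>>   ompi-checkpoint <PID of mpirun>
>>>>> 
>>>>>   # later, restart from the saved global snapshot
>>>>>   ompi-restart ompi_global_snapshot_<PID>.ckpt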
>>>>> 
>>>>> I tried a few different MCA settings, but I consistently observed that:
>>>>> - The checkpoints lasted ~4-5 minutes each time.
>>>>> - During a checkpoint, each node (8 ranks) was doing ~500 IOPS and 
>>>>> writing at ~15 MB/s.
>>>>> 
>>>>> I am worried by the number of IOPS and the very slow write speed. This 
>>>>> was a very small test. We have jobs running with 128 or 256 MPI ranks, 
>>>>> using 1-2 GB of RAM per rank. With such jobs, the overall number of 
>>>>> IOPS would reach tens of thousands and would completely overload our 
>>>>> Lustre filesystem. Moreover, at 15 MB/s per node, the checkpointing 
>>>>> process would take hours.
>>>>> 
>>>>> How can I improve on that? Is there an MCA setting that I am missing?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Maxime
> 
> 
> -- 
> ---------------------------------
> Maxime Boissonneault
> Computing Analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
> 

