On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> On 2013-01-28 13:15, Ralph Castain wrote:
>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>> 
>>> On 2013-01-28 12:46, Ralph Castain wrote:
>>>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>> 
>>>>> Hello Ralph,
>>>>> I agree that ideally, someone would implement checkpointing in the 
>>>>> application itself, but that is not always possible (commercial 
>>>>> applications, use of complicated libraries, algorithms with no clear 
>>>>> progression points at which you can interrupt them and restart from 
>>>>> there).
>>>> Hmmm...well, most apps can be adjusted to support it - we have some very 
>>>> complex apps that were updated that way. Commercial apps are another 
>>>> story, but we frankly don't find much call for checkpointing those as they 
>>>> typically just don't run long enough - especially if you are only running 
>>>> 256 ranks, so your cluster is small. Failure rates just don't justify it 
>>>> in such cases, in our experience.
>>>> 
>>>> Is there some particular reason why you feel you need checkpointing?
>>> In this specific case, the jobs run for days. The risk of a hardware or 
>>> power failure over that kind of duration becomes too high (we allow for no 
>>> more than 48 hours of run time).
>> I'm surprised by that - we run with UPS support on the clusters, but for a 
>> small one like you describe, we find the probability that a job will be 
>> interrupted even during a multi-week run is vanishingly small.
>> 
>> FWIW: I do work with the financial industry where we regularly run apps that 
>> execute non-stop for about a month with no reported failures. Are you 
>> actually seeing failures, or are you anticipating them?
> While our filesystem and management nodes are on UPS, our compute nodes are 
> not. With, on average, one generic failure (mostly power or cooling) every one 
> or two months, running for weeks is just asking for trouble.

Wow, that is high

> Add to that typical DIMM/CPU/networking failures: I estimate that about one 
> node goes down per day because of some sort of hardware failure, for a cluster 
> of 960 nodes.

That is incredibly high

> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of 
> failing before it is done.
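
For reference, those numbers do line up with a simple independent-failure estimate (assuming roughly one node lost per day out of 960, plus one site-wide power/cooling event about every 45 days, the midpoint of "one or two months"):

    P(hardware hits the 32-node job in 7 days)  = 1 - (1 - 1/960)^(32*7)  ~ 0.21
    P(site-wide event in 7 days)                ~ 1 - e^(-7/45)           ~ 0.14
    P(job fails)                                ~ 1 - (1-0.21)*(1-0.14)   ~ 0.32

i.e. roughly one chance in three, close to the ~35% quoted.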

I've never seen anything like that behavior in practice - a 32-node cluster 
typically runs for quite a few months without a failure. It is a typical size 
for the financial sector, so we have a LOT of experience with such clusters.

I suspect you won't see anything like that behavior...

> 
> Having 24 GB of RAM per node, even if a 32-node job takes close to 100% of 
> the RAM, that's merely 640 GB of data. Writing that to a Lustre filesystem 
> capable of reaching ~15 GB/s should take no more than a few minutes if written 
> correctly. Right now, I am getting a few minutes for a hundredth of this 
> amount of data!
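
For the record, the arithmetic behind that expectation, taking the numbers above at face value:

    640 GB / 15 GB/s ~ 43 s

so even at a third of the filesystem's peak bandwidth, the full dump would finish in roughly two minutes.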


Agreed - but I don't think you'll get that bandwidth for checkpointing. I 
suspect you'll find that checkpointing has real trouble scaling, which is why 
you don't see it used in production (at least, I haven't seen it). It is mostly 
used for research by a handful of organizations, which is why we haven't been 
too concerned about its loss.


> 
>>> While it is true we can dig through the code of the library to make it 
>>> checkpoint, BLCR checkpointing just seemed easier.
>> I see - just be aware that checkpoint support in OMPI will disappear in v1.7 
>> and there is no clear timetable for restoring it.
> That is very good to know. Thanks for the information. It is too bad though.
>> 
>>>>> There certainly must be a better way to write the information to disk, 
>>>>> though. Doing 500 IOPS seems completely broken. Why isn't there any 
>>>>> buffering involved?
>>>> I don't know - that's all done in BLCR, I believe. Either way, it isn't 
>>>> something we can address due to the loss of our supporter for c/r.
>>> I suppose I should contact BLCR instead then.
>> For the disk op problem, I think that's the way to go - though like I said, 
>> I could be wrong and the disk writes could be something we do inside OMPI. 
>> I'm not familiar enough with the c/r code to state it with certainty.
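
For what it's worth, the buffering question is easy to illustrate outside of BLCR. The sketch below is not BLCR or Open MPI code; it is just a generic plain-C/POSIX example (the file name is arbitrary) of how a large userspace buffer coalesces many small writes into a few big ones, which is roughly the difference between hammering the filesystem with IOPS and streaming a dump:

    /* Generic illustration only -- not BLCR code.  Shows how a large stdio
     * buffer turns many small fwrite()s into a handful of big write()
     * syscalls. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t chunk   = 32 * 1024;   /* ~30 KB per logical write      */
        const size_t nchunks = 1024;        /* 32 MB of state in total       */
        char *data = malloc(chunk);
        if (!data) return 1;
        memset(data, 0xAB, chunk);

        FILE *f = fopen("state.ckpt", "wb");
        if (!f) { perror("fopen"); return 1; }

        /* 16 MB of userspace buffering: without this, each fwrite() below
         * could become its own small I/O operation on the filesystem. */
        if (setvbuf(f, NULL, _IOFBF, 16 * 1024 * 1024) != 0)
            perror("setvbuf");

        for (size_t i = 0; i < nchunks; i++)
            if (fwrite(data, 1, chunk, f) != chunk) { perror("fwrite"); break; }

        fclose(f);   /* flushes the remaining buffered data in large chunks */
        free(data);
        return 0;
    }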
>> 
>>> Thank you,
>>> 
>>> Maxime
>>>> Sorry we can't be of more help :-(
>>>> Ralph
>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Maxime
>>>>> 
>>>>> 
>>>>> On 2013-01-28 10:58, Ralph Castain wrote:
>>>>>> Our c/r person has moved on to a different career path, so we may not 
>>>>>> have anyone who can answer this question.
>>>>>> 
>>>>>> What we can say is that checkpointing at any significant scale will 
>>>>>> always be a losing proposition. It just takes too long and hammers the 
>>>>>> file system. People have been working on extending the capability with 
>>>>>> things like "burst buffers" (basically putting an SSD in front of the 
>>>>>> file system to absorb the checkpoint surge), but that hasn't become very 
>>>>>> common yet.
>>>>>> 
>>>>>> Frankly, what people have found to be the "best" solution is for your 
>>>>>> app to periodically write out its intermediate results, and then take a 
>>>>>> flag that indicates "read prior results" when it starts. This minimizes 
>>>>>> the amount of data being written to the disk. If done correctly, you 
>>>>>> would only lose whatever work was done since the last intermediate 
>>>>>> result was written - which is about equivalent to losing whatever work 
>>>>>> was done since the last checkpoint.
>>>>>> 
>>>>>> HTH
>>>>>> Ralph
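
To make the suggestion above concrete, here is a minimal sketch of that pattern for a C/MPI code whose state fits in one array per rank. The file layout, the save_state/load_state helpers, and the --restart flag are made up for the example; they are not an Open MPI feature:

    /* Minimal sketch of application-level restart, assuming one state array
     * per rank.  save_state/load_state and --restart are hypothetical. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define N 1000000                       /* doubles of state per rank */
    static double state[N];

    static void save_state(int rank, int step)
    {
        char fname[64];
        snprintf(fname, sizeof(fname), "state.rank%03d.bin", rank);
        FILE *f = fopen(fname, "wb");
        if (!f) { perror("fopen"); return; }
        fwrite(&step, sizeof(step), 1, f);  /* remember where we stopped */
        fwrite(state, sizeof(double), N, f);
        fclose(f);
    }

    static int load_state(int rank)         /* returns the step to resume from */
    {
        char fname[64];
        int step = 0;
        snprintf(fname, sizeof(fname), "state.rank%03d.bin", rank);
        FILE *f = fopen(fname, "rb");
        if (!f) return 0;                   /* no prior results: start fresh */
        if (fread(&step, sizeof(step), 1, f) != 1) step = 0;
        if (fread(state, sizeof(double), N, f) != (size_t)N) step = 0;
        fclose(f);
        return step;
    }

    int main(int argc, char **argv)
    {
        int rank, start = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* the "read prior results" flag described above */
        if (argc > 1 && strcmp(argv[1], "--restart") == 0)
            start = load_state(rank);

        for (int step = start; step < 10000; step++) {
            /* ... one iteration of real work on state[] ... */

            if (step % 500 == 0) {            /* periodic intermediate results */
                MPI_Barrier(MPI_COMM_WORLD);  /* all ranks dump the same step  */
                save_state(rank, step);
            }
        }

        MPI_Finalize();
        return 0;
    }

With this scheme, a restart only loses whatever was computed since the last save_state() call, which is the trade-off described in the quoted message.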
>>>>>> 
>>>>>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
>>>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> I am doing checkpointing tests (with BLCR) with an MPI application 
>>>>>>> compiled with Open MPI 1.6.3, and I am seeing behaviors that are quite 
>>>>>>> strange.
>>>>>>> 
>>>>>>> First, some details about the tests :
>>>>>>> - The only filesystems available on the nodes are 1) a tmpfs and 2) a 
>>>>>>> shared Lustre filesystem (tested to be able to provide ~15 GB/s for 
>>>>>>> writing and to support ~40k IOPS).
>>>>>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 
>>>>>>> or 2 nodes). Each MPI rank was using approximately 200MB of memory.
>>>>>>> - I was doing checkpoints with ompi-checkpoint and restarting with 
>>>>>>> ompi-restart.
>>>>>>> - I was starting with mpirun -am ft-enable-cr
>>>>>>> - The nodes are monitored by Ganglia, which allows me to see the number 
>>>>>>> of IOPS and the read/write speed on the filesystem.
>>>>>>> 
>>>>>>> I tried a few different MCA settings, but I consistently observed that:
>>>>>>> - The checkpoints lasted ~4-5 minutes each time
>>>>>>> - During a checkpoint, each node (8 ranks) was doing ~500 IOPS and 
>>>>>>> writing at ~15 MB/s.
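
Note, for scale: ~15 MB/s spread over ~500 operations per second works out to an average write of only about 30 KB, which points to many small, unbuffered writes rather than a streaming dump.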
>>>>>>> 
>>>>>>> I am worried by the number of IOPS and the very slow writing speed. 
>>>>>>> This was a very small test. We have jobs running with 128 or 256 MPI 
>>>>>>> ranks, each using 1-2 GB of RAM per rank. With such jobs, the overall 
>>>>>>> number of IOPS would reach tens of thousands and would completely 
>>>>>>> overload our Lustre filesystem. Moreover, at 15 MB/s per node, the 
>>>>>>> checkpointing process would take hours.
>>>>>>> 
>>>>>>> How can I improve on that? Is there an MCA setting that I am missing?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -- 
>>>>>>> ---------------------------------
>>>>>>> Maxime Boissonneault
>>>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>>>> Ph. D. en physique
>>>>>>> 
>>>>> -- 
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>> Ph. D. en physique
>>>>> 
>>> 
>>> -- 
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Analyste de calcul - Calcul Québec, Université Laval
>>> Ph. D. en physique
>>> 
> 
> 
> -- 
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
> 

