[caiman-discuss] Adding the ability to restart DC from a checkpoint

Jean McCormack Mon, 28 Jan 2008 14:58:14 -0700

Dave Miner wrote:
> Jean McCormack wrote:
>> In the DC meeting yesterday we discussed the future user experience 
>> for the Distro Constructor. The first thing I'm
>> looking at is the ability to restart the DC build at different 
>> checkpoints or steps in the process.
>>
>> There were 3 ways of specifying the restart that were considered:
>> 1) The user would edit the manifest file to specify they wanted to 
>> start the build at a certain point
>> 2) A command line option
>> 3) Making the command have an interactive option
>>
>> After consulting with Frank Ludolph, #2 (command line option) was 
>> decided upon.
>> His suggestion was this:
>> dist_const -resume [step]
>>
>> dist_const -resume would resume the build from the failed step in the 
>> previous build
>> dist_const -resume step would resume the build from the step specified.
>>
>
> Generally this seems like the right sort of idea.  A couple of things 
> to think about in designing it:
>
> - This doesn't seem much different from different targets in a 
> makefile.  Perhaps think about how to leverage make for this.
Yeah. The nice thing is it would work universally. I'll need to put a 
lot more thought into this though.
>
> - The restarting I'd implemented in the live media kit used ZFS 
> snapshots for recording state.  Think about how that might be 
> leveraged here.
This actually looks very much like what we want to do. A couple of 
questions:


1) This obviously only works if the user specifies a ZFS dataset for 
their proto area. We aren't making ZFS a requirement for DC, are we?
2) Do ZFS snapshots take up a lot of space?
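
To make the snapshot idea concrete, here is a rough sketch of how a 
per-step snapshot and rollback could be expressed. It only builds the 
zfs(1M) command lines and does not run them. The dataset name and the 
"step<N>" snapshot naming are assumptions for illustration, not anything 
the live media kit or DC is known to use.

```python
def snapshot_cmd(dataset, step):
    """Build the command that records the proto area after a step.

    The '@step<N>' snapshot name is a hypothetical convention.
    """
    return ["zfs", "snapshot", f"{dataset}@step{step}"]


def rollback_cmd(dataset, step):
    """Build the command that returns the proto area to an earlier step.

    -r is passed because rolling back past the most recent snapshot
    requires destroying the snapshots taken after the target one.
    """
    return ["zfs", "rollback", "-r", f"{dataset}@step{step}"]


if __name__ == "__main__":
    # 'rpool/dc/proto' is a made-up dataset name.
    print(" ".join(snapshot_cmd("rpool/dc/proto", 3)))
    print(" ".join(rollback_cmd("rpool/dc/proto", 2)))
```

A real implementation would hand these lists to subprocess and would 
still need a fallback for proto areas that are not on ZFS.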

>
>
>> Some technical thoughts behind this new option:
>>
>> - In order to keep the build from having issues because the user 
>> changes the manifest between the two
>>   runs, we would not have them specify a new manifest file.
>> - The build does need to have the manifest information somehow, so my 
>> thought was that during a build
>>    we would copy the current manifest file to .step<step number>. As 
>> the step completes successfully this
>>    file would be deleted. It would then serve as a marker for the 
>> -resume case as to where to restart and
>>    would contain all the information for the restarted build.
>> - dist_const -resume step would check that the step specified is <= 
>> the failed step. Restarting at step+n is not
>>    allowed.
>> - We could do some checking to make sure that the user hasn't 
>> modified .step<number> which has the potential
>>   to cause havoc in the build. Depending upon where you were in the 
>> build process, some modifications would be OK, others not.
>>   I'm not sure the extra complication is worth it. How do others feel 
>> about this?
>
> My use of ZFS snapshots in live media let me do fairly arbitrary 
> things by hand when I wanted to experiment with modifications to parts 
> of the image before bothering to commit them to code (I'd just rename 
> snapshots and so on to get to the state I wanted).  I think that 
> whatever we do should allow for that sort of developer behavior.
So are you suggesting that we let the user specify a manifest that might 
have been modified? Make it their responsibility to make sure
they haven't changed anything critical?  By allowing that we'd also give 
them flexibility to experiment more. I think that would be nice too.

Jean
>
>> - The messaging coming from the DC would be worded such that the user 
>> would know what step failed in the process.
>>    That's the next step in this work.
>> - the .step<number> files would be cleaned up at the start of every 
>> build and the end of every successful build.
>> - dist_const -resume doesn't make sense after a complete successful 
>> build but dist_const -resume step does. If the user
>>   has a build that completes successfully but doesn't work, they 
>> could rerun the build from any step they think is appropriate.
>>
>
> Right.
>
>> Any comments?
>>
>
> Good start.  Thanks for moving on this.
>
> Dave
>
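
For reference, the -resume rules proposed in this thread could be sketched 
roughly as below. This is only an illustration under stated assumptions: 
the function names are hypothetical, the .step<number> marker is assumed 
to live in a single build directory, and if more than one marker is found 
the lowest-numbered one is treated as the failed step.

```python
import re
from pathlib import Path

# Matches marker files such as ".step3" (naming taken from the proposal).
STEP_MARKER = re.compile(r"^\.step(\d+)$")


def failed_step(build_dir):
    """Return the step number of the leftover .step<N> marker, or None."""
    steps = []
    for entry in Path(build_dir).iterdir():
        match = STEP_MARKER.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    # Assumption: with several markers, the earliest one is the failure.
    return min(steps) if steps else None


def resolve_resume_step(build_dir, requested=None):
    """Decide which step 'dist_const -resume [step]' should restart from."""
    failed = failed_step(build_dir)
    if requested is None:
        # Bare -resume only makes sense when a previous build failed.
        if failed is None:
            raise ValueError("no failed step recorded; name a step to resume from")
        return failed
    # Restarting past the failed step is not allowed.
    if failed is not None and requested > failed:
        raise ValueError(f"cannot resume at step {requested}: step {failed} failed")
    # After a fully successful build no marker remains, so any step is accepted.
    return requested
```

With a leftover .step3 in the build directory, a bare -resume would map to 
step 3, -resume 2 would be accepted, and -resume 4 would be rejected.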
