On 03/30/10 03:52 PM, Karen Tung wrote:
> Hi Dave,
>
> Please see my responses inline.
>
> On 03/30/10 07:53, Dave Miner wrote:
>> On 03/26/10 07:21 PM, Karen Tung wrote:
>>> I uploaded the install engine feature highlight slides:
>>> http://hub.opensolaris.org/bin/download/Project+caiman/CUE_docs/install-engine-feature-highlights.pdf
>>>
>>> During the discussion of these slides, we discussed how stop/pause
>>> and resume would work. Keith brought up a suggestion on resuming from
>>> a snapshot of the data cache in /tmp, and I initially thought it
>>> would be OK to add the feature. Upon further consideration, I don't
>>> think it is a good idea to support it. I would like to get your
>>> opinion on the issue.
>>>
>>> In summary, I proposed for stop and resume to work as follows in the
>>> engine:
>>>
>>> - After successful execution of each checkpoint, a snapshot of the
>>>   data cache will be taken.
>>> - If the install target ZFS dataset is available, the data cache
>>>   snapshot will be stored in the ZFS dataset.
>>> - If the install target ZFS dataset is not available, the data cache
>>>   snapshot will be stored in /tmp.
>>> - If the app has not terminated, resumes are allowed from any
>>>   previously successfully executed checkpoint.
>>> - For an application that terminates and resumes upon restarting,
>>>   resumes are only allowed from checkpoints that have a data cache
>>>   snapshot saved in the install target ZFS dataset.
>>>
>>> See slides 10-17 for more details.
>>>
>>> During the discussion, Keith suggested allowing resume to happen even
>>> if the data cache snapshot is not stored in the ZFS dataset, since the
>>> data cache snapshot stored in /tmp can be used. During the meeting, I
>>> thought it would be OK to support that as well. However, after further
>>> consideration, I thought of a couple of reasons for opposing it.
>>>
>>> 1) If we allow a copy of the snapshot to be provided to the engine
>>> for resume, we need to provide an interface for the user/app to
>>> determine which snapshot file belongs to which process. Arguably, one
>>> can guess based on timestamps, knowledge of the engine, etc. However,
>>> all of those are implementation specific, and depending on how things
>>> evolve, they are not "official interfaces" that users/applications
>>> can count on.
>>>
>>
>> I'd argue, though, that you have this same problem with respect to the
>> ZFS snapshots; we've just deliberately ignored it. Or am I missing
>> something about how you're expecting to tag them to record the
>> "ownership"?
>
> IMO, the snapshot files from the data cache and the ZFS snapshots are
> implementation details that should not be exposed to the user. The
> "official interface" for applications to resume, as currently designed,
> is that they will supply the ZFS dataset that is used as the
> installation target. Each application does "own" that, and it is a
> well-defined value. The engine is just taking advantage of that and
> storing its own bookkeeping information there.

I agree that snapshots and such are implementation details; even in DC 
right now we do not expose those as an interface, but resumption is 
specified to the application using a checkpoint identifier.  So I'm 
still not seeing much of a difference.  It really seems to be about the 
naming of what you're storing.
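
To make the comparison concrete, here is roughly the model I'm
picturing. This is a sketch only; InstallEngine, execute_checkpoints,
and resume_from are invented names for illustration, not the actual
engine interfaces:

    import os
    import shutil

    class InstallEngine(object):
        """Hypothetical engine: snapshot the data cache after each checkpoint."""

        def __init__(self, cache_path, snapshot_dir):
            self.cache_path = cache_path      # path to the data cache file
            self.snapshot_dir = snapshot_dir  # ZFS dataset mountpoint, else /tmp
            self.completed = []               # checkpoint names, in execution order

        def execute_checkpoints(self, checkpoints):
            # checkpoints is a list of (name, callable) pairs
            for name, func in checkpoints:
                func()                        # run the checkpoint
                self._snapshot(name)          # snapshot the cache on success
                self.completed.append(name)

        def _snapshot(self, name):
            dest = os.path.join(self.snapshot_dir, "cache.%s" % name)
            shutil.copy2(self.cache_path, dest)

        def resume_from(self, name):
            # The checkpoint name, not the snapshot file, is the interface.
            if name not in self.completed:
                raise ValueError("no snapshot recorded for checkpoint %s" % name)
            src = os.path.join(self.snapshot_dir, "cache.%s" % name)
            shutil.copy2(src, self.cache_path)

Whether the snapshot file lives in the target dataset or in a scratch
directory is then purely the engine's business; the application only
ever names a checkpoint.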

>>
>>> 2) The current design will remove all copies of the snapshot in /tmp
>>> when the application/engine terminates. If we want to allow resume
>>> from a copy of the snapshot in /tmp, we won't be cleaning up those
>>> snapshots, and over time we will clutter up /tmp with a lot of
>>> snapshot files.
>>>
>>
>> Well, this is likely to be true in the applications as well - DC
>> leaves snapshots around deliberately, but the installers probably
>> won't on successful completion. In any event, I think you'll need to
>> make the cleanup behavior controllable in some way, at the very least
>> for debug purposes. So, if they're going to be around...
>
> Yes, they will be around, but since each application will write to its
> own install target, all the snapshots will belong to that application,
> and there will only be a fixed number of snapshots in one install
> target regardless of how many times you run the application.
>
> On the other hand, all processes that use the engine will also use
> /tmp, and to make sure one process does not overwrite the files from
> another, we will probably be naming the files with the pid or
> something. So, every time the program is run, new copies of the files
> are created. If we don't clean them up when each process exits, /tmp
> might get very cluttered.

A nit: /var/run, not /tmp, for privileged processes, which is going to 
be the primary case here.
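
To be concrete about what I mean, something like the following is what
I'd expect. The directory choice and file naming here are my
assumptions for illustration, not the current engine behavior:

    import os

    def snapshot_dir_for_process():
        """Pick a per-process scratch directory for data cache snapshots."""
        # Privileged (root) processes use /var/run; others fall back to /tmp.
        base = "/var/run" if os.geteuid() == 0 else "/tmp"
        # Name the directory after the pid so concurrent engine processes
        # never overwrite each other's snapshot files.
        path = os.path.join(base, "install_engine.%d" % os.getpid())
        if not os.path.isdir(path):
            os.makedirs(path, 0o700)
        return path

With something like that, every run gets its own directory, and cleanup
on exit is just removing one directory per process.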

>>
>>> Based on this, I don't think it is a good idea to "officially" support
>>> resuming from a data cache snapshot in /tmp. I can probably leave a
>>> backdoor in the code to enable it somehow if needed.
>>>
>>> I would like to hear your thoughts on this.
>>>
>>
>> I think I'd consider a little more closely why this would or would not
>> be useful to support, as I'm not sure the issues you've raised are all
>> that unique. For example, I'd think target discovery could be
>> relatively expensive on more complex storage topologies such that it
>> may be convenient to have restart capability post-TD.
>
> I do agree with you on this point; that's why I was considering
> providing a backdoor in the code, such as setting some debugging env
> variable or some such thing which will preserve the snapshot files in
> /tmp. However, by default, everything in /tmp will be cleaned up.
>

I would recommend you make it a formal part of the engine interface and 
leave it to the applications how it might be used/exposed (by 
environment variable or whatever).
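
As a sketch of what "formal part of the engine interface" could look
like (the option name and shutdown method are invented for
illustration, not a proposal for the exact spelling):

    import shutil

    class InstallEngine(object):
        def __init__(self, snapshot_dir, preserve_snapshots=False):
            self.snapshot_dir = snapshot_dir
            # If True, the per-process cache snapshots are left in place
            # when the engine shuts down; by default they are removed.
            # How (or whether) to expose this is up to the application,
            # e.g. via an environment variable or a command-line flag.
            self.preserve_snapshots = preserve_snapshots

        def shutdown(self):
            if not self.preserve_snapshots:
                shutil.rmtree(self.snapshot_dir, ignore_errors=True)

DC's habit of leaving snapshots around and the installers' cleanup on
successful completion could then both be expressed through the same
knob.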

Dave
