On 03/30/10 03:52 PM, Karen Tung wrote:
> Hi Dave,
>
> Please see my responses inline.
>
> On 03/30/10 07:53, Dave Miner wrote:
>> On 03/26/10 07:21 PM, Karen Tung wrote:
>>> I uploaded the install engine feature highlight slides:
>>> http://hub.opensolaris.org/bin/download/Project+caiman/CUE_docs/install-engine-feature-highlights.pdf
>>>
>>> During the discussion of these slides, we discussed how stop/pause
>>> and resume would work. Keith brought up a suggestion about resuming
>>> from a snapshot of the data cache in /tmp, and I initially thought
>>> it would be OK to add the feature. Upon further consideration, I
>>> don't think it is a good idea to support it. I would like to get
>>> your opinion on the issue.
>>>
>>> In summary, I proposed for stop and resume to work as follows in
>>> the engine:
>>>
>>> - After successful execution of each checkpoint, a snapshot of the
>>>   data cache will be taken.
>>> - If the install target ZFS dataset is available, the data cache
>>>   snapshot will be stored in the ZFS dataset.
>>> - If the install target ZFS dataset is not available, the data
>>>   cache snapshot will be stored in /tmp.
>>> - For resumes without terminating the app, resumes are allowed
>>>   from any previously successfully executed checkpoint.
>>> - For applications that terminate and resume upon restarting,
>>>   resumes are only allowed from checkpoints that have a data cache
>>>   snapshot saved in the install target ZFS dataset.
>>>
>>> See slides 10-17 for more details.
>>>
>>> During the discussion, Keith suggested allowing resume to happen
>>> even if the data cache snapshot is not stored in the ZFS dataset,
>>> since the data cache snapshot stored in /tmp can be used. During
>>> the meeting I thought it would be OK to support that as well.
>>> However, after further consideration, I came up with a couple of
>>> reasons to oppose supporting it.
>>>
>>> 1) If we allow a copy of the snapshot to be provided to the engine
>>> for resume, we need to provide an interface for the user/app to
>>> determine which copy of the snapshot file belongs to which process.
>>> Arguably, one can guess based on timestamps, knowledge of the
>>> engine, etc. However, all of those are implementation specific and,
>>> depending on how things evolve, they are not "official interfaces"
>>> that users/applications can count on.
>>>
>>
>> I'd argue, though, that you have this same problem with respect to
>> the ZFS snapshots, we've just deliberately ignored it. Or am I
>> missing something about how you're expecting to tag them to record
>> the "ownership"?
>
> IMO, the snapshot files from the data cache and the ZFS snapshots
> are implementation details that should not be exposed to the user.
> The "official interface" for an application to resume, as currently
> designed, is that it will supply the ZFS dataset that is used as the
> installation target. Each application does "own" that, and it is a
> well-defined value. The engine is just taking advantage of that and
> storing its own bookkeeping information there.
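To make the proposed scheme concrete, here is a minimal sketch in
Python of the storage rule described in the bullets above. It is only
an illustration of the design under discussion; the names
(save_snapshot, the ".install_engine" directory, the pickle format)
are hypothetical, not the engine's actual interface.

    import os
    import pickle
    import tempfile

    def save_snapshot(cache, checkpoint_name, target_mountpoint=None):
        """Persist the data cache after a checkpoint completes.

        Prefer the install target ZFS dataset, where snapshots survive
        the application terminating; fall back to /tmp otherwise, where
        they only support resume within the same process.
        """
        filename = "%s.cache" % checkpoint_name
        if target_mountpoint and os.path.isdir(target_mountpoint):
            snap_dir = os.path.join(target_mountpoint, ".install_engine")
        else:
            # Per-pid directory so concurrent engine processes do not
            # overwrite each other's files.
            snap_dir = os.path.join(tempfile.gettempdir(),
                                    "engine.%d" % os.getpid())
        os.makedirs(snap_dir, exist_ok=True)
        path = os.path.join(snap_dir, filename)
        with open(path, "wb") as f:
            pickle.dump(cache, f)
        return path

Under a rule like this, a resume after the application terminates can
only count on the copies stored in the target dataset, which is what
restricts post-restart resumes to checkpoints whose snapshots were
saved there.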
I agree that snapshots and such are implementation details; even in DC
right now we do not expose those as an interface, but resumption is
specified to the application using a checkpoint identifier. So I'm
still not seeing much of a difference. It really seems to be about the
naming of what you're storing.

>>
>>> 2) The current design will remove all copies of the snapshot in
>>> /tmp when the application/engine terminates. If we want to allow
>>> resume from a copy of the snapshot in /tmp, we won't be cleaning up
>>> those snapshots, and over time we will clutter up /tmp with a lot
>>> of snapshot files.
>>>
>>
>> Well, this is likely to be true in the applications as well - DC
>> leaves snapshots around deliberately, but the installers probably
>> won't on successful completion. In any event, I think you'll need to
>> make the cleanup behavior controllable in some way, at the very
>> least for debug purposes. So, if they're going to be around...
>
> Yes, they will be around, but since each application will write to
> its own install target, all the snapshots will belong to that
> application, and there will only be a fixed number of snapshots in
> one install target regardless of how many times you run the
> application.
>
> On the other hand, all processes that use the engine will also use
> /tmp, and to make sure one process does not overwrite the files from
> another, we will probably be naming the files with the pid or
> something. So, every time the program is run, new copies of the
> files are created. If we don't clean them up when each process
> exits, /tmp might get very cluttered.

A nit: /var/run, not /tmp, for privileged processes, which is going to
be the primary case here.

>>
>>> Based on this, I don't think it is a good idea to "officially"
>>> support resuming from a data cache snapshot in /tmp. I can probably
>>> leave a backdoor in the code to enable it somehow if needed.
>>>
>>> I would like to hear your thoughts on this.
>>>
>>
>> I think I'd consider a little more closely why this would or would
>> not be useful to support, as I'm not sure the issues you've raised
>> are all that unique. For example, I'd think target discovery could
>> be relatively expensive on more complex storage topologies such that
>> it may be convenient to have restart capability post-TD.
>
> I do agree with you on this point; that's why I was considering
> providing a backdoor in the code, such as setting some debugging env
> variable or some such thing which will preserve the snapshot files
> in /tmp. However, by default, everything in /tmp will be cleaned up.

I would recommend you make it a formal part of the engine interface
and leave it to the applications how it might be used/exposed (by
environment variable or whatever).

Dave
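As a sketch of that closing suggestion, snapshot cleanup could be a
formal engine option rather than a debugging backdoor, with per-pid
naming under /var/run as noted above. The class, function, and
parameter names here are illustrative only, not the engine's actual
interface.

    import glob
    import os

    RUN_DIR = "/var/run"  # not /tmp: the engine runs privileged

    def snapshot_dir(pid=None):
        """Per-process snapshot directory, named by pid so concurrent
        engine processes cannot clobber each other's files."""
        return os.path.join(RUN_DIR,
                            "install_engine.%d" % (pid or os.getpid()))

    class InstallEngine(object):
        def __init__(self, preserve_snapshots=False):
            # Formal interface knob: the application decides whether
            # snapshots outside the target dataset are preserved at
            # exit (e.g. to avoid re-running an expensive step such as
            # target discovery) and how to expose the choice (env
            # variable, command-line flag, etc.).
            self.preserve_snapshots = preserve_snapshots

        def cleanup(self):
            if self.preserve_snapshots:
                return
            for path in glob.glob(os.path.join(snapshot_dir(),
                                               "*.cache")):
                os.remove(path)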
