Hi Karen,


> 
> Please see my response to your comments inline.
> 
> On 01/12/10 04:44 PM, Sarah Jelinek wrote:
>> Hi Karen,
>>
>> Thank you for the review. My comments inline...
>>
>>
>>> Hi Sarah,
>>>
>>> Here are my comments on the document.
>>>
>>> General:
>>> ---------------
>>>
>>> - Question: throughout the document, you choose "class" to refer to 
>>> the different
>>> pieces of the New Solaris Installation suite.  Why is "class" 
>>> chosen?  I generally
>>> see "class" being used in the object oriented programming context.  
>>> Is it implied that
>>> our implementation of the suite must be done in a language that 
>>> supports OO?
>>> Would it be more general to refer to the different pieces as "module" 
>>> or "component"?
>>
>> I use 'class' specifically to imply the OO'ness of the architecture. 
>> And yes, the language(s) we choose should support OO. Or, if we do 
>> decide some of this is better done in C (not likely, but just in 
>> case), we build the C data structures and functions in an OO manner.
> Since implementing the architecture in a language that supports OO is 
> one of the requirements,
> I think it should be explicitly spelled out in the document somewhere.

I don't think this implies we need a language that supports OO 
specifically. We can create OO-like code in C.
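
To illustrate, purely hypothetically, what OO-like code in C can look 
like: a struct of function pointers acts as a vtable, so callers dispatch 
through "methods" without language-level classes. The Checkpoint name and 
methods below are made up for illustration, not the actual design.

    /* Hypothetical sketch: OO-style "classes" in plain C via a struct
     * of function pointers. Names are illustrative only. */
    #include <stdio.h>

    typedef struct Checkpoint Checkpoint;

    struct Checkpoint {
        const char *name;
        int (*execute)(Checkpoint *self);            /* "virtual" method */
        int (*get_progress_estimate)(Checkpoint *self);
    };

    /* A concrete "subclass": same layout, its own implementations. */
    static int transfer_execute(Checkpoint *self)
    {
        printf("running checkpoint: %s\n", self->name);
        return 0;
    }

    static int transfer_estimate(Checkpoint *self)
    {
        (void)self;
        return 70;   /* e.g. this step is ~70% of total install time */
    }

    int main(void)
    {
        Checkpoint transfer = { "transfer", transfer_execute,
                                transfer_estimate };
        /* Callers dispatch through the function pointers, as they
         * would through virtual methods in an OO language. */
        transfer.execute(&transfer);
        printf("estimate: %d\n", transfer.get_progress_estimate(&transfer));
        return 0;
    }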

>>
>>
>>> - In various parts of the document, there's mention of certain 
>>> Python features
>>> being used, such as using the Python logging module for the engine.
>>> Would it make sense to spell out Python as the language that will be 
>>> used?
>>>
>>
>> Actually, we are not necessarily going to be using Python, and the 
>> implementation details don't belong in the architecture document 
>> anyway. I think the only reference to Python is the logging. I will 
>> remove this reference.
> OK
> 
>>
>>> Specific comments:
>>> -------------------------------
>>> Section 1:
>>> - The list of "architecturally significant points" did not include 
>>> any requirement
>>> for being able to create customization images (DC), and the
>>> ability to easily upgrade (snap).  Is that intentional?
>>
>> No, simply a shortened list. I think these are included in other 
>> requirements listed in the document. If not, I can certainly add them 
>> here.
>>
> I don't think they are anywhere else in the doc.  Having everything 
> in one place is better, I think.
> 

Ok, I will add this.

>>>
>>> Section 2:
>>> - This section mentioned that the engine offers functionality to 
>>> pause or stop the installation
>>> at each checkpoint, and to resume at those checkpoints.  Currently, 
>>> DC provides similar
>>> functionality if the "installation target" is on a ZFS dataset.  In 
>>> order to provide this functionality,
>>> do we also require the installation target to be a ZFS dataset?
>>>
>>
>> No, we won't require ZFS datasets. I think the difference between DC 
>> and the other Caiman installer apps is that it isn't our intent to 
>> kill the installer app when a checkpoint is completed. We don't have 
>> to. So, in effect, the application is in a pause state, and we have to 
>> 'resume', either at the next checkpoint or one previous.  The data we 
>> will need to retain will be retained in memory.
> Thanks for the explanation.


Yes, it can be retained in memory. However, for modifications to the 
filesystems, if you want to roll back that state then ZFS would be 
required. We need to go through each of the checkpoints for an install 
and see what would need to be done to be able to roll back that 
checkpoint.

>>
>>> Section 2.1, 2nd paragraph:
>>> - Question: Can checkpoints interact with the user?  If so, does the 
>>> engine need to coordinate that?
>>
>> I am not sure what you mean by interact with users. Certainly, 
>> checkpoints will be used in interactive installations, and that 
>> installation app will have to reap the data from the data collection 
>> once the checkpoint is completed. In this sense, yes, they can 
>> interact and yes the engine is responsible for managing this 
>> interaction. In the case of the liveCD GUI, for example, the GUI 
>> registers the appropriate checkpoints, target discovery, target 
>> instantiation, transfer and instance configuration. The application 
>> requests the engine to run the first checkpoint, target discovery, and 
>> this data is put in the data collection cache. The GUI application 
>> then has to read the data from the cache and allow users to make 
>> choices. After the user has completed their choices and chooses to 
>> start the installation, the application requests the engine to execute 
>> the remaining checkpoints that were registered.
> OK, based on your explanation, it sounds like the code in checkpoints 
> cannot directly interact with
> the user; for example, there shouldn't be code in the checkpoints asking 
> the user to type something.

Ah.. no, there shouldn't be anything like this in the checkpoints.
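
As a rough sketch of that division of labor, the flow for something like 
the liveCD GUI might look like the following. None of these names are the 
real API; they are placeholders standing in for engine calls that are 
still to be designed.

    #include <stdio.h>

    /* Stand-ins for engine calls; the real interfaces are TBD. */
    static void register_checkpoint(const char *name)
    {
        printf("registered %s\n", name);
    }

    static void run_checkpoint(const char *name)
    {
        printf("ran %s\n", name);
    }

    int main(void)
    {
        /* 1. The application registers its checkpoints up front. */
        register_checkpoint("target-discovery");
        register_checkpoint("target-instantiation");
        register_checkpoint("transfer");
        register_checkpoint("instance-configuration");

        /* 2. Run discovery only; results land in the data cache. */
        run_checkpoint("target-discovery");

        /* 3. The *application* reads the cache and prompts the user
         *    here; no checkpoint ever does its own prompting. */

        /* 4. User confirms; run the remaining registered checkpoints. */
        run_checkpoint("target-instantiation");
        run_checkpoint("transfer");
        run_checkpoint("instance-configuration");
        return 0;
    }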


>>
>>> If not, please add something in this paragraph to clarify.
>>> - This description mentions that the execution engine automatically 
>>> computes and reports
>>> progress as checkpoints are executed.  I wonder how realistic it is to
>>> report useful overall progress?  I suppose each checkpoint can 
>>> specify a "weight" for itself
>>> indicating whether it is a long checkpoint or a short one.  Then, the 
>>> engine can average
>>> it out somehow, and assign a weight.  But all these will be guesses; 
>>> I am not sure how accurately
>>> overall progress can be reported.
>>
>> The weighting and time of execution have to be supplied by the 
>> checkpoint. Certainly, the engine will do some computing of the 
>> progress totals and percentages, but it is based on input from the 
>> checkpoint. We do something similar today with the GUI, except that 
>> the overall percentage (weighting) is actually hardcoded in 
>> liborchestrator. It isn't totally accurate either.
> Yes, the weighting is hard coded in liborchestrator, because we just 
> make a guess as to what the weighting
> is.  Since the engine is more general purpose, it will not have any 
> knowledge of checkpoints that get
> registered to it.  That's why I asked.  I suppose the engine will just 
> trust that the checkpoints are providing
> accurate weightings.

Yes, it will have to trust the data it gets. The checkpoints have the 
best knowledge of the work they are doing, so it seems better to have 
them manage the progress estimates and reporting.

> 
>>
>> We haven't designed this yet, but I don't think that having the engine 
>> manage this calculation with input from the checkpoints makes the 
>> progress reporting less accurate.
> If the checkpoints are not giving accurate weightings, then the 
> progress reporting is not accurate.
> I think that's really not the engine's problem.  We will just assume 
> applications that register checkpoints
> know the checkpoints they provide.

Right.

>>
>>> Section 2.1, the list of methods ExecutionEngine needs to provide:
>>> - I think it is missing list_checkpoints() or get_checkpoints() so 
>>> applications would know what
>>> value to use when they want to pause/stop/resume.
>>>
>>
>> They are doing the registration of the checkpoints up front, at 
>> application startup time, so I would think they would know the 
>> checkpoints that are available. That said.. I am wondering if there is 
>> a need for this that I am not seeing. Maybe you can describe a use case 
>> where this might be interesting?
> Sorry I was not being clear.  What I was thinking when I wrote that was 
> for the engine to provide the
> list of valid resumable checkpoints, or to somehow validate a provided 
> checkpoint; basically, the ability
> for applications to use the resume feature.


I need to think about this some. The application has the initial list of 
checkpoints since it did the registration, so I am not sure a 
list_checkpoints() is needed. However, we do need a way to let the engine 
know what is resumable. Presumably the application will not try to 
resume a checkpoint that isn't resumable within the context of that 
application, but I could see this being needed for something like DC or 
any command-line interface.
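
One possibility, purely a sketch with made-up names, is to have the 
application mark resumability at registration time and let the engine 
validate resume requests against that, which avoids needing a separate 
list_checkpoints() call at all:

    /* Hypothetical sketch: a per-checkpoint "resumable" flag, set by
     * the application at registration, validated on resume. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *name;
        int resumable;      /* set by the application at registration */
    } Checkpoint;

    static Checkpoint registered[] = {
        { "target-discovery", 1 },
        { "transfer",         1 },
        { "instance-config",  0 },   /* assume it cannot safely re-run */
    };

    static int resume_ok(const char *name)
    {
        for (size_t i = 0; i < sizeof(registered) / sizeof(registered[0]);
             i++) {
            if (strcmp(registered[i].name, name) == 0)
                return registered[i].resumable;
        }
        return 0;   /* unknown checkpoint: reject the resume request */
    }

    int main(void)
    {
        printf("resume transfer: %s\n",
               resume_ok("transfer") ? "ok" : "rejected");
        printf("resume instance-config: %s\n",
               resume_ok("instance-config") ? "ok" : "rejected");
        return 0;
    }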

>>
>>> Section 2.1, the list of methods a checkpoint must implement is 
>>> missing the following:
>>> - stop(): this is needed to support cancel() in the engine
>>> - percent_completed(): this is useful for progress reporting for long 
>>> running checkpoints.
>>
>> There is a get_progress_estimate() for the checkpoint, which should 
>> give us the percentage, or whatever unit of measure we are using, 
>> toward completion.
>  From the description of this function, I didn't get the idea that this 
> function is something that's
> repeatedly called to get a progress estimate.  I thought it was used once 
> to report how long a certain
> checkpoint will take so the engine can compute and report progress.

Ah.. yes, you are right. We have a report_progress() method on the engine 
that allows checkpoints to report progress.
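
So get_progress_estimate() supplies the one-time weight, and 
report_progress() supplies the running updates. Below is a hypothetical 
sketch of how the engine might fold the two into one overall percentage; 
the names and numbers are made up, not the settled design.

    /* Hypothetical sketch: the engine computes a weighted average of
     * per-checkpoint progress. Weights come from the checkpoints
     * themselves, and the engine just trusts them. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        int weight;        /* relative cost, from get_progress_estimate() */
        int percent_done;  /* 0-100, updated via report_progress() */
    } CheckpointProgress;

    static int overall_percent(const CheckpointProgress *cps, int n)
    {
        int total_weight = 0, done = 0;

        for (int i = 0; i < n; i++)
            total_weight += cps[i].weight;
        if (total_weight == 0)
            return 0;
        for (int i = 0; i < n; i++)
            done += cps[i].weight * cps[i].percent_done;
        return done / total_weight;   /* weighted average, 0-100 */
    }

    int main(void)
    {
        CheckpointProgress cps[] = {
            { "target-discovery",     10, 100 },
            { "target-instantiation", 20, 100 },
            { "transfer",             60,  50 },
            { "instance-config",      10,   0 },
        };
        printf("overall: %d%%\n", overall_percent(cps, 4));   /* 60% */
        return 0;
    }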

>>
>> stop() should be added. I will do this.
>>
>>
> OK
>>
>>> Section 2.2, second bullet in the beginning:
>>> - Why just translate installation data into valid AI manifest?  It 
>>> would be more general to translate
>>> it into any manifest.  Other installers might be able to accept 
>>> manifests as input in the future.
>>
>> Well.. I guess I should remove the AI reference since you are about 
>> the 100th person to mention this :-). In general, the plan is to have 
>> the 'manifests' that our products use be the same, wherever possible. 
>> So, in theory a manifest from this could translate to another use.
>>
>>
>>
> OK
>>> Section 2.2, the methods provided by the DataCollection class:
>>> - clear_cache() and create_cache(), there's no description.  I 
>>> couldn't figure out how they
>>> will be used based on the name.
>>
>> I will add descriptions.
>>
>>
> OK
>>> Section 3.1, 2nd paragraph:
>>> - Do you mean "semantic validation" of user-provided manifests?
>>
>> No, I mean syntactic validation here. At this point in time, we have 
>> no plans to do semantic validation and if we did, the ManifestParser 
>> isn't likely the place this would happen.
> OK.  If that's the case, what's the difference between the first sentence 
> of the first paragraph and the first sentence
> of the 2nd paragraph?  One is "user provided installation XML 
> manifest".  The second is "user provided manifests".
> So, in the 2nd case, did you mean those manifests might not be in XML format?
>>

Ah.. no, I didn't mean that. It is a mistake. I will correct it.

>>> Section 3.2, the list of objects Target Discovery needs to discover:
>>> - "Installed instances of other operating systems....", what about 
>>> existing instances
>>> of OpenSolaris?
>>
>> Those are covered by the discovery of Boot Environments.
>>
> OK
>>>
>>> Section 3.2.1:
>>> - This whole section talks about the different classes that target 
>>> discovery needs to
>>> support and the interfaces/methods these classes are required to 
>>> implement.
>>> Target discovery is supposed to be a "read" operation, but many of 
>>> these classes
>>> have methods for "write" operations such as set_xxxxx(), snapshot(), 
>>> clone(), or other
>>> functions for implementing functionality, such as the 
>>> send()/receive() methods in
>>> the Dataset class.  I think it is very confusing to have the 
>>> description for all these
>>> in the section for Target discovery.  Why not have a completely new 
>>> section that just
>>> talks about all the classes and all the methods they require, and 
>>> perhaps talk about
>>> what functionality these methods can provide.
>>
>> I can see your point. I think maybe the better answer is to have one 
>> section that addresses target discovery and target instantiation. The 
>> classes are the same for both; the methods, of course, may not be. I 
>> will work on making this clearer.
> OK
>>
>>
>>>
>>> Section 3.2.1.1, description of the required methods for Partition 
>>> class and Slice Class
>>> - I think you also need get_assoc_disk() and set_assoc_disk() in 
>>> those 2 classes so they
>>> can associate themselves back to the disk that contains these 
>>> partitions/slices.
>>
>> I agree I need a method that allows us to build the association back 
>> to the parent. Maybe get_assoc_parent()? I have something similar in 
>> the LogicalTarget class, get_assoc_container(), but I should likely 
>> just use get_assoc_parent(). What do you think?
> Yes, get_assoc_parent() sounds good.  It will allow us to traverse "up" 
> the tree.
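
For illustration only, here is the kind of upward traversal a single 
get_assoc_parent() would enable; the node kinds and names below are made 
up, not the actual class design.

    /* Hypothetical sketch: walking "up" the target tree through a
     * common get_assoc_parent() method. */
    #include <stdio.h>

    typedef struct Target Target;

    struct Target {
        const char *name;   /* e.g. a disk, partition, or slice */
        Target *parent;     /* NULL at the top of the tree (the disk) */
    };

    static Target *get_assoc_parent(Target *t)
    {
        return t->parent;
    }

    int main(void)
    {
        Target disk      = { "disk c0t0d0", NULL };
        Target partition = { "partition 1", &disk };
        Target slice     = { "slice 0",     &partition };

        /* Walk from a slice back up to the disk that contains it. */
        for (Target *t = &slice; t != NULL; t = get_assoc_parent(t))
            printf("%s\n", t->name);
        return 0;
    }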
>>
>>
>>> Section 4.1, New Solaris Install UML diagram
>>> - Question: do ProgressReceiver, DataCollection and Logger all need 
>>> to talk to
>>> the checkpoints, because it is the checkpoints that will generate 
>>> the "data"
>>> for those classes?  Or is it true that all the "data" from the 
>>> checkpoint classes will
>>> be reported to the engine, and the engine will dispatch them to the 3 
>>> classes as appropriate?
>>
>>
>> The DataCollection object is open for communication from anyone. To do 
>> this, a method, get_data_collection(), is called on the ExecutionEngine 
>> to get the DataCollection object handle.
> After applications call get_data_collection() on the ExecutionEngine to 
> get the DataCollection object handle,
> does the application use the DataCollection object methods directly or 
> via the ExecutionEngine?
>>

Applications (consumers) call methods directly on the DataCollection object.
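
A tiny sketch of that access pattern, with all names illustrative: the 
engine only hands out the handle, and after that, calls go straight to 
the DataCollection object.

    #include <stdio.h>
    #include <string.h>

    typedef struct {
        char disks[256];     /* stand-in for cached discovery data */
    } DataCollection;

    typedef struct {
        DataCollection *dc;
    } ExecutionEngine;

    static DataCollection *get_data_collection(ExecutionEngine *e)
    {
        return e->dc;        /* the engine only hands out the handle */
    }

    int main(void)
    {
        DataCollection dc = { "" };
        ExecutionEngine engine = { &dc };

        /* e.g. a checkpoint stored its results in the cache... */
        strcpy(dc.disks, "c0t0d0");

        /* ...and the application reads it directly off the handle. */
        DataCollection *handle = get_data_collection(&engine);
        printf("discovered: %s\n", handle->disks);
        return 0;
    }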

>>
>>> - InstallApp needs an arrow to the DataCollection class also, because 
>>> a previous
>>> section in the doc says DataCollection is instantiated by the 
>>> app, and the app
>>> can access/modify the data outside of the engine.
>>
>> It is instantiated by the InstallApp, through the ExecutionEngine. It 
>> uses the method set_data_collection() to do this. I see what you mean 
>> by the text indicating it is instantiated by the InstallerApp, and it 
>> is, just as the ProgressReceiver and Logger are instantiated by the 
>> InstallerApp, via the ExecutionEngine. I think the diagram reflects 
>> this correctly.
> After they are instantiated via the ExecutionEngine, will InstallApps 
> call the methods in these classes directly or via the ExecutionEngine?


They can call them directly, in the case of DataCollection anyway. So, I 
think I need to update the diagram to reflect this.

>>> - Question: will the app be able to use the logger too?  To log 
>>> messages/errors?  If so, need
>>> an arrow.
>>
>> It can use the logger. It actually gets the logger and writes to the 
>> logger via the ExecutionEngine though, so it isn't direct. Although, I 
>> can see that in diagram 4.2, I do show the checkpoints directly 
>> logging messages to the logger. That seems broken, perhaps. The app 
>> should be able to log messages as well, but only through the 
>> ExecutionEngine interfaces to the logger. I will update the diagram to 
>> reflect this.
> OK
>>
>>> - Should the arrow from InstallApp to ExecutionEngine be 
>>> bi-directional?  Since InstallApp will
>>> be getting progress/errors from engine?
>>
>> The ProgressReceiver gets the progress reports from the 
>> ExecutionEngine.  The ProgressReceiver doesn't know about the 
>> ExecutionEngine, it simply gets updates from it so the association is 
>> uni-directional. The ProgressReceiver is generally the InstallerApp, 
>> but is managed via a separate class. The ProgressReceiver will also 
>> get the errors from the ExecutionEngine via the update() method. So, 
>> to answer your question, I don't think this should be bi-directional to 
>> the InstallerApp.
> ok
>>
>>
>>> - Under "FileSystem", only Dataset is listed, what about UFS file 
>>> systems?
>>
>> UFS and other filesystems will be supported for discovery, but not 
>> creation. Dataset is called out specifically as a subclass since it 
>> has unique methods and we do create them as part of our installations.
>>
> Well, DC does need to create a UFS file system for the ramdisk.

Ah.. that's true. Good point.
>>
>>> Section 4 and Section 5 general comment:
>>> - The engine is supposed to provide the ability to 
>>> pause/stop/resume.  It would be useful to
>>> have a UML diagram or use case to illustrate how that can be done.
>>
>> I am not sure how I would show that in a UML diagram, but I will think 
>> on this and see what I can do.
> I don't know how easy it is to show this using a UML diagram.  It 
> should be easy enough to show
> it as a use case.  Perhaps another DC use case?

Sure, I can try this.
>>
>>
>>> Section 5.1, item 4, instance configuration:
>>> - You mentioned that this checkpoint will reverse media-specific 
>>> customizations and apply
>>> system configuration options selected by the user.  I think that 
>>> system configuration options,
>>> and transferring the log file are common things that all 
>>> installers need to do,
>>> and most of the code can probably be shared
>>> among installers.  So, it might be useful for it to be its own 
>>> checkpoint class.  Reversing of
>>> media-specific customization should be in a separate checkpoint class 
>>> because it is very specific
>>> to how the media is constructed.
>>
>> Configuration options and how they are applied are pretty application 
>> specific. For example, the way configuration is specified in the GUI 
>> is different than for AI. Applying them may be different as well. The 
>> transfer of logs I can see is common, but having this as its own 
>> separate application class seems unnecessary. However, I can see where 
>> some of InstanceConfiguration might be broken down into multiple 
>> parts. I think we need to keep this in mind when working through the 
>> design of the CommonApplication classes and the ApplicationClasses 
>> that are specific to installers.
> OK
>>>
>>> Section 5.3, first paragraph, 2nd sentence:
>>> - ..... a manifest is used to drive the installation process.... I 
>>> think "construction process" sounds better
>>> than "installation process"
>>
>> Sounds better to me as well. I will change.
> OK
>>>
>>> Section 5.3, step 2:
>>> - .... create the ZFS datasets required for the construction 
>>> process.  Currently, both UFS and ZFS are
>>> supported by DC.  Does this imply that we will no longer support UFS?
>>
>> I am not sure that's what this was intended to imply; I think it was 
>> just an example of a possible use case. But that raises a good 
>> question.. what if we no longer supported UFS? What impact would that 
>> have?
> 
> In current DC, there's very little, if any, special code to handle 
> UFS.  So, it really does not matter
> whether we remove that support or not.  One thing I have heard is that 
> people use /tmp as the build area,
> which reportedly improves the image construction speed a lot, even 
> though they will lose
> the ability to stop/resume.
> 
> Given that OpenSolaris is required to run DC, and you would always have 
> at least 1 ZFS pool (rpool),
> it is not unreasonable to support only ZFS.

Ok. I will take a closer look at this.

Thanks again for your comments and review!

sarah
