First off, I'd like to thank you very much for your interest and involvement in Tashi. I've tried to respond to the specific issues listed:
Priority 1: Authentication/Encryption

I agree this is a high-priority item. A student working over the summer (Michael Wang) modified Tashi to use RPyC, which provides a user authentication mechanism as well as a secure channel for requests. We haven't done extensive testing, but it appears to provide most of what we want. It requires some manual configuration at this point, but I'd like to know whether this approach is for some reason unsatisfactory for you in general before I dig deeper.

Priority 2: Network configuration

I agree that this will likely be an ongoing issue. In our current infrastructure, we have a DMZ (with 10 public IPs), a general network, and several private VLANs. We have assumed control of the DMZ and the general network, but are having users run their own DNS and DHCP servers in the private VLANs. I agree completely with the strategy you suggest: implement what we need now with an eye toward future extensions.

Priority 3: Site-specific plugins

This is similar to the last point in that we need to implement what we need now while trying to keep it extensible, but we won't really know all the requirements until more sites are using Tashi.

Priority 4: VM scheduling model

The basic scheduler (primitive.py) doesn't do much in this space. We have, however, implemented a bridge that allows the use of Maui, a resource scheduler, to control VM creation. This should allow the use of more advanced scheduling techniques for things like priorities and quotas. A basic system of billing would also be possible by using this, but it would seem advantageous to have Tashi support a more direct and systematic form of billing.

Priority 5: Physical boot

We have looked at this a fair bit, and there seem to be two basic conclusions we have drawn.
One is that if we properly isolate physical machines (using VLANs, routing, and other techniques), we can keep a rogue DHCP server from affecting the entire cluster and limit its impact to a single private VLAN (presumably owned and managed by one user or group). We are working with others at HP on a project called PRS that is responsible for physical booting. It will automatically reprogram switches and other networking infrastructure to limit the access of an end-host, and set up servers to perform the PXE booting. The other conclusion is that, in general, current hardware lacks the ability to limit modifications to the BIOS and other system hardware by a privileged user in the operating system. We have thought of dealing with these problems by, as mentioned above, limiting the impact using network isolation, and by disincentivizing the latter behavior with a billing system that continues to bill a user until a machine is returned (i.e., it PXE boots a base image we provide). And as you mention, this feature is just beginning to materialize.

Priority 6: Multi-VM job control

This may be solvable by using Maui as the scheduler, but I agree that this is a scheduler-only change and shouldn't be tremendously difficult with respect to Tashi (synchronized operations are always a little challenging in a cluster).

To respond to your question about joining and proposing and developing solutions, I'd like to warmly welcome you to do so. I have sent this email to the tashi-dev mailing list and BCC'd all of the original recipients (to avoid exposing email addresses). I'd be happy to continue any discussion on the mailing list. You can join the mailing list by emailing [email protected]. Additionally, if you have code, patches, ideas, or documentation to contribute, sending it to the list is the right way to get it applied to SVN. The basic way forward is for us to continue this discussion by exchanging ideas and code.
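As an aside on Priority 6: the all-or-nothing group start could be sketched roughly as below. This is plain illustrative Python with hypothetical names (Host, place_group), not Tashi's actual scheduler interface; it just shows the "reserve the whole group or commit nothing" idea.

```python
# Sketch of an all-or-nothing VM group placement (hypothetical API,
# not Tashi's real scheduler; first-fit by memory for illustration).

class Host:
    def __init__(self, name, free_memory):
        self.name = name
        self.free_memory = free_memory

def place_group(vm_requests, hosts):
    """Return a {vm_name: host_name} placement for the whole group,
    or None if the group cannot be accommodated.  Capacity is tracked
    in a scratch copy, so nothing is committed unless every VM fits."""
    free = {h.name: h.free_memory for h in hosts}
    placement = {}
    for vm_name, mem in vm_requests:
        target = next((n for n, f in free.items() if f >= mem), None)
        if target is None:
            return None  # one VM doesn't fit -> reject the whole group
        free[target] -= mem
        placement[vm_name] = target
    return placement
```

Because placement is computed against a scratch copy of the free-capacity map, tear-down on partial failure reduces to simply discarding the map; the real scheduler would additionally need to release any VMs it had already activated.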
Assuming you want to get even more involved, we could look into making one or more of you committers after some further interactions.

In terms of testing, I haven't written much documentation. The procedure works roughly as follows:

1. Install on a small testbed (2-3 nodes) and test all basic features as well as any new functionality.
2. If the change affects the cluster manager, stop the scheduler, back up the CM's data, update the software, and restart the CM and scheduler on the production cluster.
3. Incrementally update the software on the nodes, simply killing the node manager process and restarting it (everything should automatically reload). Again, this is on our production cluster. Obviously, in cases where the data format checkpointed by the node manager has changed, the checkpoint must be updated between the exit and the restart.

Again, thank you very much for your time and energy. I appreciate the detailed analysis of the current system and look forward to working with you in the future.

- Michael

-----Original Message-----
From: Sheen, Robert
Sent: Thursday, September 17, 2009 6:02 AM
To: Ryan, Michael P
Subject: RE: Support of Tashi

Dear Ryan,

This is Robert Sheen at Taiwan HP. I would like to ask for your support in helping III resolve their questions about Tashi. III is planning to join Open Cirrus and has already installed Tashi on their site. Your help will be very helpful in speeding up the collaboration; thanks in advance. III's Dr. Hsieh is on the cc list. After studying the Tashi slides, he identified the known issues below. Dr. Hsieh would like to know the current status of these known issues, and, if III wants to join in proposing and developing solutions for them, how to proceed and what procedure to follow. Thanks!

A second question: III is drafting a test plan for the Tashi environment. Mr. Chen would like to ask whether there is any existing test procedure document to reference. Thanks!
• Priority 1: Authentication/Encryption
– Virtual cluster owner authentication has not been resolved in the current Tashi implementation
– Plan: select a user account management scheme soon and implement (probably via SSL)
• Priority 2: Network configuration
– Site-specific network configuration will probably be an on-going thorny issue. How many global IP addresses are available? Which private subnets are available? Do the physical cluster owners have control over local DHCP/DNS servers? Etc.
– Plan: implement something that works for the first few Tashi sites, architect the site-specific plugin to enable modification, adapt as new needs surface
• Priority 3: Site-specific plugins
– Are agents capable of doing all of the site-specific logic needed to create and manage VMs?
– Plan: Solicit feedback from partners to determine for which steps in VM creation/activation customization is critical
• Priority 4: VM scheduling model
– Tashi does not currently have a well-integrated scheduler that supports VM priorities, quotas, billing, etc.
– Plan: Implement features on an "as needed" basis
• Priority 5: Physical boot
– A number of security concerns have surfaced here if the owner of the physically-booted machine is not completely trusted (or if a trusted, but naïve, owner's machine becomes compromised). What if a DHCP server is started that competes with the cluster's server? If we rely on PXE boot to regain control, can we prevent a physical owner from reprogramming the BIOS to prevent PXE boot? What are the best monitoring/control options? Etc.
– Plan: do not offer physical boot in Tashi until the security model is better understood
• Priority 6: Multi-VM job control
– The current scheduling agent activates VMs one at a time.
– A transactional mechanism needs to be added that only starts a VM group if there is room to accommodate the entire group, and that enables easy tear-down if any portion of the group fails
– Plan: Extend the scheduler with such a feature; should be straightforward

Best Regards,
Robert Sheen 沈 仲 杰
HP TSG Pre-Sales Solution Manager
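Returning to step 3 of the upgrade procedure above: the "update the checkpoint between exit and restart" concern is the usual versioned-migration problem. A minimal sketch, assuming a dict-based checkpoint with hypothetical field and version names (Tashi's real node manager checkpoint format may differ):

```python
# Sketch of stepwise node-manager checkpoint migration between
# software versions (hypothetical format, not Tashi's actual layout).

CURRENT_VERSION = 2

def migrate_v1_to_v2(state):
    # Illustrative change: v2 renames "memory" to "memory_mb".
    state["memory_mb"] = state.pop("memory")
    state["version"] = 2
    return state

MIGRATIONS = {1: migrate_v1_to_v2}

def load_checkpoint(state):
    """Upgrade a checkpoint dict one version at a time until it
    matches the running software's format, then return it."""
    while state.get("version", 1) < CURRENT_VERSION:
        state = MIGRATIONS[state.get("version", 1)](state)
    return state
```

Running each migration as a single step between the old process's exit and the new one's restart keeps the node manager's "everything should automatically reload" behavior intact, since the restarted process only ever sees current-format state.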
