Chip,

To clarify as well, I too -1 this merge until the technical issues I have identified on the list and Review Board are addressed and the patch is re-reviewed.
Thanks,
-John

On May 30, 2013, at 11:41 AM, Chip Childers <chip.child...@sungard.com> wrote:

> Just so that we are 100% clear... I'm -1 (binding veto) on this merge right now due to the technical issues being discussed below. No single issue should be the focus; instead, we should start focusing on collaboratively fixing this branch up so that it comes into master with high quality and full consensus on the changes.
> I would propose that the issues being discussed below be broken out into several DISCUSS threads, where concerns can be debated and addressed within a reasonably defined scope for each discussion. Edison, as the proposer of this merge, I would imagine that you are in the best position to kick off the appropriate threads to get you the help and input needed.
> I am *NOT* saying that it's a veto due to timing. Timing is a procedural issue which, frankly, is something that I'd rather work to achieve consensus on. I only raised that point because this merge request contradicted our previous consensus on the appropriate timing of architectural changes merging into master (as documented at [1]).
> -chip
> [1] https://cwiki.apache.org/confluence/display/CLOUDSTACK/Releases#Releases-FeatureReleaseCycle
> On Thu, May 30, 2013 at 10:43:04AM -0400, John Burwell wrote:
>> Edison,
>> Please see my comments in-line.
>> Thanks,
>> -John
>> On May 29, 2013, at 5:15 PM, Edison Su <edison...@citrix.com> wrote:
>>>> -----Original Message-----
>>>> From: Chip Childers [mailto:chip.child...@sungard.com]
>>>> Sent: Wednesday, May 29, 2013 1:41 PM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: Re: [MERGE]object_store branch into master
>>>> On Tue, May 28, 2013 at 11:42:49PM +0000, Edison Su wrote:
>>>>>> -----Original Message-----
>>>>>> From: John Burwell [mailto:jburw...@basho.com]
>>>>>> Sent: Tuesday, May 28, 2013 8:43 AM
>>>>>> To: dev@cloudstack.apache.org
>>>>>> Subject: Re: [MERGE]object_store branch into master
>>>>>> All,
>>>>>> I have gone through a large chunk of this patch, and published my review thus far (https://reviews.apache.org/r/11277/). TL;DR is that this patch has a number of significant issues, which can be summarized as follows:
>>>>> Thanks for taking the time to review this patch.
>>>>>> 1. While it appears that the requirement of NFS for secondary storage has largely been removed, it has basically been guarded out with if blocks instead of pushed down as an implementation choice in the physical layer. Rather than exploiting polymorphism to vary behavior through a set of higher-level abstractions, the orchestration performs instanceof NFSTO checks. The concern is that future evolution of the secondary storage layer will still have dependencies on NFS.
>>>>> As long as the mgt server doesn't explicitly depend on NFS secondary storage during orchestration, and NFS storage can be used based on different configurations, then the resource-side "NFSTO checks" are not much of a concern to me at the current stage. We can add a "polymorphic high-level abstraction" at the resource side, as you suggested before, in the future. The current code change is already so huge that adding a new level of abstraction would make it even bigger.
>>>>> But adding this kind of "high level abstraction" is not hard any more, as all the storage-related commands only deal with DataTO/DataStoreTO. Right now there are no read/write data operations on these *TOs; one possible way to add a "high level abstraction" would be to add read/write data operations at each hypervisor resource based on the DataTO and DataStoreTO.
>>>>> My point is that refactoring the mgt server code is much, much harder than the resource code. If we can get the mgt server right, then we can move on to refactoring the resource code in the future.
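To make the resource-side abstraction being discussed above concrete, here is a rough sketch of one possible shape for it. Everything below is illustrative only: the interfaces are hypothetical stand-ins, not the actual DataTO/DataStoreTO types in the branch.

// Rough sketch only: these types are illustrative stand-ins, not the actual
// DataTO/DataStoreTO interfaces from the branch.
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

interface DataStoreTO { }                        // describes a store (NFS share, S3 bucket, ...)
interface DataTO { DataStoreTO getDataStore(); } // an object (template/volume/snapshot) on a store

// One adaptor per protocol, registered by each hypervisor resource, so the common
// code dispatches polymorphically instead of doing "instanceof NFSTO" checks.
interface StorageAdaptor {
    boolean handles(DataStoreTO store);
    InputStream read(DataTO data);
    void write(DataTO data, InputStream content);
}

final class StorageAdaptorRegistry {
    private final List<StorageAdaptor> adaptors = new ArrayList<StorageAdaptor>();

    void register(StorageAdaptor adaptor) {
        adaptors.add(adaptor);
    }

    StorageAdaptor adaptorFor(DataTO data) {
        for (StorageAdaptor adaptor : adaptors) {
            if (adaptor.handles(data.getDataStore())) {
                return adaptor;
            }
        }
        throw new IllegalArgumentException("No storage adaptor registered for " + data.getDataStore());
    }
}

The point is simply that each protocol registers an adaptor and the common code asks the registry for one, so the per-protocol instanceof checks drop out of the orchestration path.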
>> Due to their impact on dependent subsystems and lower layers, these types of design questions need to be addressed before a merge to master.
>>>>>> 2. In some scenarios, NFS is still a requirement for secondary storage. In particular, Xen users will still have to maintain an "NFS cache". Given the degradation of capability, I think it would be helpful to put a few more eyeballs on the Xen implementation to determine if we could avoid using an intermediate file system.
>>>>> NFS will not disappear soon for some hypervisors and some types of storage. But as long as the mgt server doesn't have this requirement, people can add new types of hypervisors (e.g. Hyper-V) and new types of storage (Ceph, SolidFire, etc.) which don't need NFS. The optimization for a particular hypervisor can be done by other people, who may be familiar with a certain type of storage or a certain hypervisor.
>> It feels like we have jumped to a solution without completely understanding the scope of the problem and the associated assumptions. We have a community of hypervisor experts whom we should consult to ensure we have the best solution. As such, I recommend mailing the list with the specific hypervisors and functions that you have been unable to interface with storage that does not present a filesystem. I do not recall seeing such a discussion on the list previously.
>>>>>> 3. I have the following concerns regarding potential race conditions and resource exhaustion in the NFS cache implementation.
>>>>>> - The current random allocator algorithm does not account for the amount of space that will be required for the operation (i.e. checking to ensure that the cache it picks has enough space to hold the object being downloaded), nor does it reserve that space. Given the long-running (in compute time) nature of these processes, the available space in a cache could be exhausted by a number of concurrently transferring templates and/or snapshots. By reserving space before the transfer, the allocator would be able to account for both pending operations and the current contents of the cache.
>>>>> 1. It's the current behavior of the 4.1 and master secondary storage code. We didn't introduce a new issue here.
>>>>> 2. Adding a new cache storage allocator here is much easier than adding it on the master branch.
>>>>> 3. I treat it as a bug, and plan to fix it after the merge.
>>>> We should not be merging in major changes with *known* bugs like this.
>>> First and foremost, it's a well-known issue on the master branch, regardless of whether this branch is merged or not. If we want to fix the issue, we can fix it either on the object_store branch or on master, but both will need time. As the 4.2 merge window is approaching, I don't think we can fix it in the next two or three days. If we miss the merge window, then who will fix it on master?
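For the reservation idea raised in point 3 above, a minimal sketch of a space-reserving allocator might look like the following. The class and method names are hypothetical and do not correspond to the branch's allocator or StorageCacheManager interfaces.

// Rough sketch of a space-reserving cache allocator. CacheStore and the method names
// are hypothetical; the real allocator/StorageCacheManager interfaces look different.
import java.util.List;

final class CacheStore {
    final String url;            // e.g. the NFS staging share
    final long capacityBytes;
    private long usedBytes;      // committed contents
    private long reservedBytes;  // in-flight transfers

    CacheStore(String url, long capacityBytes) {
        this.url = url;
        this.capacityBytes = capacityBytes;
    }

    synchronized boolean tryReserve(long bytes) {
        if (capacityBytes - usedBytes - reservedBytes < bytes) {
            return false;
        }
        reservedBytes += bytes;
        return true;
    }

    // Turn a reservation into real usage when the transfer completes...
    synchronized void commit(long bytes) {
        reservedBytes -= bytes;
        usedBytes += bytes;
    }

    // ...or give the space back on failure/timeout.
    synchronized void release(long bytes) {
        reservedBytes -= bytes;
    }
}

final class ReservingCacheAllocator {
    // Pick a cache that can hold the object and reserve the space up front, so a burst
    // of concurrent template/snapshot transfers cannot oversubscribe a single cache.
    CacheStore allocate(List<CacheStore> caches, long objectSizeBytes) {
        for (CacheStore cache : caches) {
            if (cache.tryReserve(objectSizeBytes)) {
                return cache;
            }
        }
        return null; // caller can fail fast, queue, or fall back to another policy
    }
}

A caller would reserve before starting a transfer, then commit on success or release on failure/timeout, which covers both pending operations and the current contents of the cache.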
>> As I understand the goals of this enhancement, we will support additional secondary storage types and remove the assumption that secondary storage will always be NFS or have a filesystem. As such, when a non-NFS type of secondary storage is employed, NFS is no longer the repository of record for this data. We can always exceed the available space in the repository of record, and the failure scenarios are relatively well understood (4.1.0) -- operations will fail quickly and obviously. However, as a transitory staging mechanism (4.2.0), the user's expectation is that the NFS storage will not be as reliable or as large. If the only solution we can provide for this problem is to recommend an NFS "cache" that is equal in size to the object store itself, then we have made little to no progress addressing our users' needs. Fundamentally, the role of NFS is different in 4.2.0 than in 4.1.0. Therefore, I disagree with the assertion that this issue is present in 4.1.0.
>> An additional risk in the object_store implementation is that we lead a user to believe their data has been stored in reliable storage (e.g. S3, Riak CS, etc.) when it may not have been. I saw no provision in the object_store branch to retry transfers if an object store transfer fails or the store becomes unavailable. In 4.0.0/4.1.0, if we can't connect to S3 or Swift, a background process continuously retries the upload until it succeeds.
>> Finally, I see this as a design issue rather than a bug. I don't think we should be merging branches that we know have a defect, let alone a design issue. A proper implementation will require refining/expanding the StorageAllocator interface and introducing a mechanism to track reservations (including proper TTL support). IMHO, these types of issues must be resolved before a feature can be merged into master.
>>>>>> - There appears to be no mechanism to age out the contents of the cache. Therefore, as implemented, it will grow unbounded. The only workaround for this problem would be to have an NFS cache whose size equals that of the object store.
>>>>> An admin can add multiple NFS cache storages into one zone, and it's the current behavior on the master branch. Again, we can add an aging policy later, as cache storage has its own pom project and its own implementation.
>>>> That doesn't make much sense to me. A cache is expected to be, by its nature, bounded by time / size / something... We need to implement some sort of bounds here.
>>>> It also shouldn't be a failure point if you have a cache miss or cache failure. What behavior occurs if there is only one cache and it's full?
>>> As I said before, it's the current behavior of the master branch. On master, there is no code to check the capacity of NFS secondary storage. I do plan to fix it on the object_store branch, but the code change is already so huge and refactoring other things took a lot of time, so my plan is to fix it later.
>> Given the different use of NFS in the object_store branch vs. current, I don't see the comparison in this case. In the current implementation, when we exhaust space, we are truly out of the resource. However, in the object_store branch, we have no provision to remove stale data, and we may report no space available when there is plenty of space available in the underlying object store. In this scenario, the NFS "cache" becomes an artificial limiter on the capacity of the system. I do not understand how we have this problem in current, since there the object store is only a backup of secondary storage -- not secondary storage itself.
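On the question of bounding the cache, a rough sketch of a time-based aging policy is below. The types are hypothetical, not the branch's classes; the only point is that, because the object store remains the repository of record, idle staged copies can be evicted safely and re-staged on demand.

// Rough sketch of an aging policy for the NFS staging area. CacheEntry and the eviction
// method are hypothetical, and real eviction would also delete the staged file on disk.
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class CacheEntry {
    final String objectId;   // template/volume/snapshot identifier
    final long sizeBytes;
    long lastAccessMillis = System.currentTimeMillis();

    CacheEntry(String objectId, long sizeBytes) {
        this.objectId = objectId;
        this.sizeBytes = sizeBytes;
    }
}

final class AgingStagingCache {
    // Access-ordered map, so iteration starts with the least recently used entries.
    private final Map<String, CacheEntry> entries = new LinkedHashMap<String, CacheEntry>(16, 0.75f, true);
    private final long maxIdleMillis;

    AgingStagingCache(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    synchronized void add(CacheEntry entry) {
        entries.put(entry.objectId, entry);
    }

    synchronized void touch(String objectId) {
        CacheEntry entry = entries.get(objectId);
        if (entry != null) {
            entry.lastAccessMillis = System.currentTimeMillis();
        }
    }

    // Run periodically: anything not read for maxIdleMillis is dropped from the staging
    // area. Because the object store remains the repository of record, this only frees
    // space; the data can always be staged again on the next cache miss.
    synchronized long evictIdleEntries() {
        long freedBytes = 0;
        long cutoff = System.currentTimeMillis() - maxIdleMillis;
        for (Iterator<CacheEntry> it = entries.values().iterator(); it.hasNext(); ) {
            CacheEntry entry = it.next();
            if (entry.lastAccessMillis < cutoff) {
                freedBytes += entry.sizeBytes;
                it.remove(); // deleting the staged file on the NFS share would happen here
            }
        }
        return freedBytes;
    }
}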
>>>>>> - The mechanism lacks robust error handling or retry logic. In particular, the behavior if/when a cache exhausts its available space appears non-deterministic.
>>>>> Again, the master branch has the same issue, and fixing it here will be easier than fixing it on the master branch.
>>>> Why not fix it now, since the cache mechanism is being touched?
>>> 1. We do need time to fix it -- at least one week.
>>> 2. What's the difference between fixing it on the object_store branch and fixing it after the merge, given that this issue is not introduced by the object_store branch?
>> It is my estimate that robust error handling will require design changes (e.g. introduction of a resource reservation mechanism, introduction of additional exception classes, enhancement of interfaces to provide more context regarding client intentions, etc.) yielding significant code impact. These changes need to be undertaken in a holistic manner with minimum risk to master. Fundamentally, we should not be merging code to master with known significant issues. When it goes to master, we should be saying, "To the best of my knowledge and developer testing, there are no blocker or critical issues." In my opinion, omission of robust error handling does not meet that standard.
>>>>>> - Generally, I see little consideration for alternative/exception flows. For example, what happens if I attempt to use a template/iso/snapshot that is in transit between the object store and the cache? Since these files can be very large (multiple GBs), we have to assume some transfer latency in all interactions.
>>>>> The management server mainly maintains the state of each object (volume/snapshot/template); we added a state machine for all the objects, and for all the objects on each storage, which is unavailable on the master branch. Take copying a template from the object store to the cache as an example: the state of the template entry on both the object storage and the cache storage will be changed to "Copying"; if the copy fails or times out, the state of the template entry on the object storage will be changed back to "Ready", while the state of the template entry on the cache storage will be changed to "Failed".
>>>>> As long as the mgt server maintains the state correctly, a user/admin can issue the same operation again through the UI/API in case an operation fails. That's one of the reasons there is no retry logic in the storage subsystem code. Another reason is that adding retry logic on top of an async API is a little bit difficult.
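A minimal sketch of the state transitions described above, using the state names Edison gives ("Copying", "Ready", "Failed"); the classes and methods are purely illustrative, not the branch's code.

// Rough sketch of the copy-operation state transitions; enum and class names are
// illustrative stand-ins rather than the actual state machine in the branch.
enum ObjectState { READY, COPYING, FAILED }

final class ObjectStateEntry {
    ObjectState state = ObjectState.READY;
}

final class CopyStateMachine {
    // Both sides are marked Copying while the transfer is in flight.
    void onCopyStarted(ObjectStateEntry source, ObjectStateEntry destination) {
        source.state = ObjectState.COPYING;
        destination.state = ObjectState.COPYING;
    }

    // Success: both entries become usable.
    void onCopySucceeded(ObjectStateEntry source, ObjectStateEntry destination) {
        source.state = ObjectState.READY;
        destination.state = ObjectState.READY;
    }

    // Failure or timeout: the source (object store) entry returns to Ready so the
    // operation can be reissued from the UI/API, while the cache entry is marked Failed.
    void onCopyFailedOrTimedOut(ObjectStateEntry source, ObjectStateEntry destination) {
        source.state = ObjectState.READY;
        destination.state = ObjectState.FAILED;
    }
}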
>>>>>> 4. The Image Cache abstraction is too tightly coupled to the core orchestration mechanism. I would recommend implementing it as a
>>>>> The only place coupled to the image cache abstraction is the DataMotionStrategy; in particular, AncientDataMotionStrategy->needCacheStorage, and AncientDataMotionStrategy will call StorageCacheManager->createCacheObject if cache storage is needed.
>>>>>> DataStore proxy that is applied as necessary. For example, the current Xen hypervisor implementation must interact with a file system. It seems most appropriate that the Xen hypervisor implementation would apply the proxy to its secondary storage DataStore instances. Therefore, the storage engine
>>>>> The issue that has puzzled me for a while is how and where to configure cache storage. Take S3 as an example: the implementation of the S3 provider should not need to care about cache storage at all; the S3 implementation should only care about how to deal with data stored in S3. S3 is a general-purpose storage, so the same implementation can be used by different hypervisors and in different situations. For example, if both Ceph and S3 are used by KVM, then definitely no NFS cache is needed, while for other cluster-wide primary storages supported by KVM, an NFS cache is needed.
>>>>> How and where do we present this kind of configuration, such that neither the primary storage provider nor the object storage provider implementation needs to be aware of it? My current solution is to put the decision in the code: the DataMotionStrategy is the place to decide whether to use cache storage or not. In the default implementation, cache storage is needed for all copy operations from S3 to primary storage; if people add a new storage and want to drop the cache storage for some reason, they can add a new DataMotionStrategy.
>>>>> I am not sure how the datastore proxy will solve the issue I mentioned above; I am glad to learn about your solution.
>>>> This looks like a good opportunity for some collaboration on the design!
>>> I am glad to hear about other solutions, or to have other people write code on the branch. Code is always welcome.
>> Fundamentally, there should be no dependency from Storage to Hypervisor. The scenario presented here makes me wonder whether data motion belongs in the Storage or the Hypervisor layer. Storage doesn't care whether the I/O is for data motion; its only concern is the movement of data from one location to another. I am beginning to feel that DataMotion belongs in the Hypervisor layer, and that the Storage layer should expose a more basic set of primitives that are composed by the service to orchestrate the required I/O operations. Can you shed more light on the reasoning for putting this service in Storage rather than Hypervisor?
>> To my mind, the storage services should present options that allow clients to declare their requirements. For example, if the Xen hypervisor implementation must perform an operation through a file system interface, it should express that need to the storage subsystem. In turn, the DataStore returned will fulfill that contract. From an implementation perspective, I suggest the cache mechanism be implemented as a proxy of the underlying DataStore implementation. There are likely other solutions -- I was merely trying to suggest approaches that would reduce the coupling for the purposes of discussion.
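To sketch the proxy suggestion in code -- with the caveat that every type below is an illustrative stand-in rather than the branch's DataStore interfaces -- the staging behavior could be a decorator that the hypervisor layer applies only when it actually needs a filesystem:

// Rough sketch of the proxy idea. All types here are hypothetical, and the
// staging-path handling is deliberately simplified.
import java.io.File;

interface DataObject {
    String getId();
}

// What a filesystem-bound consumer (e.g. a Xen resource) needs: a local/NFS-visible file.
interface FileBackedDataStore {
    File fetch(DataObject object);
}

// What any object store provider (S3, Swift, Riak CS, ...) can offer without knowing
// anything about caches: stream an object to a destination file.
interface ObjectStoreClient {
    void download(DataObject object, File destination);
}

// The proxy: applied by the hypervisor layer only when a filesystem is actually required,
// so the provider implementations and the orchestration stay cache-agnostic.
final class NfsStagingDataStore implements FileBackedDataStore {
    private final ObjectStoreClient objectStore;
    private final File nfsMountPoint;

    NfsStagingDataStore(ObjectStoreClient objectStore, File nfsMountPoint) {
        this.objectStore = objectStore;
        this.nfsMountPoint = nfsMountPoint;
    }

    public File fetch(DataObject object) {
        File staged = new File(nfsMountPoint, object.getId());
        if (!staged.exists()) {
            // Reservation and aging policies from the earlier sketches would hook in here.
            objectStore.download(object, staged);
        }
        return staged;
    }
}

A Xen resource that needs a mountable file would wrap its secondary store in such a proxy, while a KVM + Ceph + S3 combination could skip the proxy entirely, which keeps the S3 provider itself cache-agnostic.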
>>>>>> should only provide the facility, not attempt to determine when it is needed. Additionally, a proxy model would greatly reduce the size and complexity of the various classes (e.g. AncientDataMotionStrategy).
>>>>> The current AncientDataMotionStrategy is only about 400 LOC now; I removed some of the duplicated code. But I still want to know the details of your solution.
>>>>>> 5. While I have only reviewed roughly 50-60% of the patch thus far, I see little to no additional unit tests for the new code added. Integration tests have
>>>>> If the code is changing rapidly, then writing/maintaining unit test cases takes a huge effort, so we focused on the integration tests and Marvin tests at this stage.
>>>> We shouldn't be merging code that is "still changing rapidly". What do you mean by this comment?
>>> Oh, that should be "was changing rapidly" -- that was last month. In May, we moved on to integration tests, Marvin tests, and a lot of bug fixes.
>>>>>> been expanded, but many of the tests simply execute the service methods and do not verify the form of the operation results. There are also a tremendous number of ignored exceptions that should likely fail these tests. Therefore, we could have tests passing due to an ignored exception that should be failing.
>>>>> 1. Thanks for taking a look at the integration tests, which were very convenient for us during the implementation. They saved us a lot of time during testing.
>>>>> 2. Yes, the integration tests need to be improved as you suggested.
>>>> Let's do that please!
>> Without asserts, the tests are simply exercising the code to ensure that no exceptions are being thrown. They are not verifying that the test actually succeeded. Until all methods properly verify the results of the call, it is difficult to objectively gauge the quality of the branch. Additionally, the integration tests verify very few alternative and exception flows. I would like to see more of those scenarios exercised.
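As an example of the kind of verification being asked for here, a sketch of a test that asserts on the result and lets unexpected exceptions fail the test. The result type and the service call are stand-ins, not the branch's actual API.

// Rough sketch of a result-verifying test (JUnit 4). CopyResult and
// copyTemplateToCache() are hypothetical placeholders for the code under test.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class TemplateCopyTest {

    // Minimal stand-in result type for the sketch.
    static final class CopyResult {
        final boolean success;
        final String state;
        CopyResult(boolean success, String state) { this.success = success; this.state = state; }
    }

    private CopyResult copyTemplateToCache() throws Exception {
        // ... call into the storage subsystem under test ...
        return new CopyResult(true, "Ready");
    }

    @Test
    public void copyTemplateToCachePopulatesCacheEntry() throws Exception {
        // Declaring "throws Exception" (rather than catch-and-ignore) makes an
        // unexpected failure fail the test instead of silently passing.
        CopyResult result = copyTemplateToCache();

        assertNotNull("copy returned no result", result);
        assertTrue("copy reported failure", result.success);
        assertEquals("cache entry not in expected state", "Ready", result.state);
    }
}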
>>>>>> While I recognize the tremendous effort that has been expended to implement this capability, and the value of the feature, I see significant design and operational issues that must be addressed before it can be accepted into
>>>>> I would say the object_store branch is simply better than the master branch: almost all the issues (related to cache storage) you pointed out exist on the master branch as well, while fixing these issues on master is harder than on object_store. Frankly speaking, I don't want to read/understand/guess how it works on the master branch; it's a lot of hacks all over the place.
>>>>> Of course, these issues are still issues and we need to fix them, but when to fix them is arguable. I prefer fixing them after the merge. Here is the reason: if we agree that the issues you pointed out also exist on master, then do we want to fix them for 4.2 or not? If the answer is yes, then who will fix them, and how long will it take to fix them on master? If I offer to fix these issues after merging object_store into master, will that be acceptable? If the answer is no, then there is no technical reason not to merge object_store into master, as these issues are not introduced by the object_store branch.
>>>>>> master. In my opinion, the object_store branch represents a good first step, but it needs a few more review/refinement iterations before it will be ready for a master merge.
>> I agree with Chip. In addition to the issues I have mentioned, this is a massive change late in the cycle for little overall gain. We have not yet achieved the key goal our users want -- reliable, horizontally scalable secondary storage on commodity hardware (i.e. cheaper per GB with the same reliability). As such, I will take proven/released, "ugly" code over unproven/unreleased, "pretty" code every time.
>>>>>> Thanks,
>>>>>> -John