Chip,

To clarify as well, I too -1 this merge until the technical issues I have identified on the list and Review Board are addressed and the patch is re-reviewed.
Thanks,
-John

On May 30, 2013, at 11:41 AM, Chip Childers <chip.child...@sungard.com> wrote:

> Just so that we are 100% clear... I'm -1 (binding veto) on this merge right now due to the technical issues being discussed below. No single issue should be the focus; instead, we should start focusing on collaboratively fixing this branch up so that it comes into master with high quality and full consensus on the changes.
> I would propose that the issues being discussed below be broken out into several DISCUSS threads, where concerns can be debated and addressed within a reasonably defined scope for each discussion. Edison, as the proposer of this merge, I would imagine that you are in the best position to kick off the appropriate threads to get you the help and input needed.
> I am *NOT* saying that it's a veto due to timing. Timing is a procedural issue which, frankly, is something that I'd rather work to achieve consensus on. I only raised that point because this merge request contradicted our previous consensus on the appropriate timing of architectural changes merging into master (as documented at [1]).
> -chip
> [1] https://cwiki.apache.org/confluence/display/CLOUDSTACK/Releases#Releases-FeatureReleaseCycle
> On Thu, May 30, 2013 at 10:43:04AM -0400, John Burwell wrote:
>> Edison,
>> Please see my comments in-line.
>> Thanks,
>> -John
>> On May 29, 2013, at 5:15 PM, Edison Su <edison...@citrix.com> wrote:
>>>> -----Original Message-----
>>>> From: Chip Childers [mailto:chip.child...@sungard.com]
>>>> Sent: Wednesday, May 29, 2013 1:41 PM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: Re: [MERGE]object_store branch into master
>>>> On Tue, May 28, 2013 at 11:42:49PM +0000, Edison Su wrote:
>>>>>> -----Original Message-----
>>>>>> From: John Burwell [mailto:jburw...@basho.com]
>>>>>> Sent: Tuesday, May 28, 2013 8:43 AM
>>>>>> To: dev@cloudstack.apache.org
>>>>>> Subject: Re: [MERGE]object_store branch into master
>>>>>> All,
>>>>>> I have gone through a large chunk of this patch, and published my review thus far (https://reviews.apache.org/r/11277/). TL;DR is that this patch has a number of significant issues, which can be summarized as follows:
>>>>> Thanks for taking the time to review this patch.
>>>>>> 1. While it appears that the requirement of NFS for secondary storage has largely been removed, it has basically been guarded out with if blocks instead of pushed down as an implementation choice in the physical layer. Rather than exploiting polymorphism to vary behavior through a set of higher-level abstractions, the orchestration performs instanceof NFSTO checks. The concern is that future evolution of the secondary storage layer will still have dependencies on NFS.
>>>>> As long as the mgt server doesn't explicitly depend on NFS secondary storage during orchestration, and NFS storage can be used based on different configurations, then the resource-side "NFSTO checks" are not much of a concern to me at the current stage. We can add a "polymorphic high-level abstraction" at the resource side, as you suggested before, in the future. The current code change is already so huge that adding a new level of abstraction would make it even bigger.
>>>>> But adding this kind of "high level abstraction" is not hard any more, as all the storage-related commands only deal with DataTO/DataStoreTO. Right now there are no read/write data operations on these *TOs; one possible way to add a "high level abstraction" would be to add read/write data operations at each hypervisor resource based on the DataTO and DataStoreTO.
>>>>> My point is that refactoring the mgt server code is much, much harder than the resource code. If we can get the mgt server right, then we can move on to refactoring the resource code in the future.
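To make the resource-side abstraction being discussed above concrete, here is a rough sketch of one possible shape for it. Everything below is illustrative only: the interfaces are hypothetical stand-ins, not the actual DataTO/DataStoreTO types in the branch.

// Rough sketch only: these types are illustrative stand-ins, not the actual
// DataTO/DataStoreTO interfaces from the branch.
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

interface DataStoreTO { }                        // describes a store (NFS share, S3 bucket, ...)
interface DataTO { DataStoreTO getDataStore(); } // an object (template/volume/snapshot) on a store

// One adaptor per protocol, registered by each hypervisor resource, so the common
// code dispatches polymorphically instead of doing "instanceof NFSTO" checks.
interface StorageAdaptor {
    boolean handles(DataStoreTO store);
    InputStream read(DataTO data);
    void write(DataTO data, InputStream content);
}

final class StorageAdaptorRegistry {
    private final List<StorageAdaptor> adaptors = new ArrayList<StorageAdaptor>();

    void register(StorageAdaptor adaptor) {
        adaptors.add(adaptor);
    }

    StorageAdaptor adaptorFor(DataTO data) {
        for (StorageAdaptor adaptor : adaptors) {
            if (adaptor.handles(data.getDataStore())) {
                return adaptor;
            }
        }
        throw new IllegalArgumentException("No storage adaptor registered for " + data.getDataStore());
    }
}

The point is simply that each protocol registers an adaptor and the common code asks the registry for one, so the per-protocol instanceof checks drop out of the orchestration path.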
>> Due to their impact on dependent subsystems and lower layers, these types of design questions need to be addressed before a merge to master.
>>>>>> 2. In some scenarios, NFS is still a requirement for secondary storage. In particular, Xen users will still have to maintain an "NFS cache". Given the degradation of capability, I think it would be helpful to put a few more eyeballs on the Xen implementation to determine if we could avoid using an intermediate file system.
>>>>> NFS will not disappear soon for some hypervisors and some types of storage. But as long as the mgt server doesn't have this requirement, people can add new types of hypervisors (e.g. Hyper-V) and new types of storage (Ceph, SolidFire, etc.) which don't need NFS. The optimization for a particular hypervisor can be done by other people, who may be familiar with a certain type of storage or a certain hypervisor.
>> It feels like we have jumped to a solution without completely understanding the scope of the problem and the associated assumptions. We have a community of hypervisor experts whom we should consult to ensure we have the best solution. As such, I recommend mailing the list with the specific hypervisors and functions that you have been unable to interface with storage that does not present a filesystem. I do not recall seeing such a discussion on the list previously.
>>>>>> 3. I have the following concerns regarding potential race conditions and resource exhaustion in the NFS cache implementation.
>>>>>> - The current random allocator algorithm does not account for the amount of space that will be required for the operation (i.e. checking to ensure that the cache it picks has enough space to hold the object being downloaded), nor does it reserve that space. Given the long-running (in compute time) nature of these processes, the available space in a cache could be exhausted by a number of concurrently transferring templates and/or snapshots. By reserving space before the transfer, the allocator would be able to account for both pending operations and the current contents of the cache.
>>>>> 1. It's the current behavior of the 4.1 and master secondary storage code. We didn't introduce a new issue here.
>>>>> 2. Adding a new cache storage allocator here is much easier than adding it on the master branch.
>>>>> 3. I treat it as a bug, and plan to fix it after the merge.
>>>> We should not be merging in major changes with *known* bugs like this.
>>> First and foremost, it's a well-known issue on the master branch, regardless of whether this branch is merged or not. If we want to fix the issue, we can fix it either on the object_store branch or on master, but both will need time. As the 4.2 merge window is approaching, I don't think we can fix it in the next two or three days. If we miss the merge window, then who will fix it on master?
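For the reservation idea raised in point 3 above, a minimal sketch of a space-reserving allocator might look like the following. The class and method names are hypothetical and do not correspond to the branch's allocator or StorageCacheManager interfaces.

// Rough sketch of a space-reserving cache allocator. CacheStore and the method names
// are hypothetical; the real allocator/StorageCacheManager interfaces look different.
import java.util.List;

final class CacheStore {
    final String url;            // e.g. the NFS staging share
    final long capacityBytes;
    private long usedBytes;      // committed contents
    private long reservedBytes;  // in-flight transfers

    CacheStore(String url, long capacityBytes) {
        this.url = url;
        this.capacityBytes = capacityBytes;
    }

    synchronized boolean tryReserve(long bytes) {
        if (capacityBytes - usedBytes - reservedBytes < bytes) {
            return false;
        }
        reservedBytes += bytes;
        return true;
    }

    // Turn a reservation into real usage when the transfer completes...
    synchronized void commit(long bytes) {
        reservedBytes -= bytes;
        usedBytes += bytes;
    }

    // ...or give the space back on failure/timeout.
    synchronized void release(long bytes) {
        reservedBytes -= bytes;
    }
}

final class ReservingCacheAllocator {
    // Pick a cache that can hold the object and reserve the space up front, so a burst
    // of concurrent template/snapshot transfers cannot oversubscribe a single cache.
    CacheStore allocate(List<CacheStore> caches, long objectSizeBytes) {
        for (CacheStore cache : caches) {
            if (cache.tryReserve(objectSizeBytes)) {
                return cache;
            }
        }
        return null; // caller can fail fast, queue, or fall back to another policy
    }
}

A caller would reserve before starting a transfer, then commit on success or release on failure/timeout, which covers both pending operations and the current contents of the cache.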
>> As I understand the goals of this enhancement, we will support additional secondary storage types and remove the assumption that secondary storage will always be NFS or have a filesystem. As such, when a non-NFS type of secondary storage is employed, NFS is no longer the repository of record for this data. We can always exceed the available space in the repository of record, and the failure scenarios are relatively well understood (4.1.0) -- operations will fail quickly and obviously. However, as a transitory staging mechanism (4.2.0), the user's expectation is that the NFS storage will not be as reliable or as large. If the only solution we can provide for this problem is to recommend an NFS "cache" that is equal in size to the object store itself, then we have made little to no progress addressing our users' needs. Fundamentally, the role of NFS is different in 4.2.0 than in 4.1.0. Therefore, I disagree with the assertion that this issue is present in 4.1.0.
>> An additional risk in the object_store implementation is that we lead a user to believe their data has been stored in reliable storage (e.g. S3, Riak CS, etc.) when it may not have been. I saw no provision in the object_store branch to retry transfers if an object store transfer fails or the store becomes unavailable. In 4.0.0/4.1.0, if we can't connect to S3 or Swift, a background process continuously retries the upload until it succeeds.
>> Finally, I see this as a design issue rather than a bug. I don't think we should be merging branches that we know have a defect, let alone a design issue. A proper implementation will require refining/expanding the StorageAllocator interface and introducing a mechanism to track reservations (including proper TTL support). IMHO, these types of issues must be resolved before a feature can be merged into master.
>>>>>> - There appears to be no mechanism to age out the contents of the cache. Therefore, as implemented, it will grow unbounded. The only workaround for this problem would be to have an NFS cache whose size equals that of the object store.
>>>>> An admin can add multiple NFS cache storages into one zone, and it's the current behavior on the master branch. Again, we can add an aging policy later, as cache storage has its own pom project and its own implementation.
>>>> That doesn't make much sense to me. A cache is expected to be, by its nature, bounded by time / size / something... We need to implement some sort of bounds here.
>>>> It also shouldn't be a failure point if you have a cache miss or cache failure. What behavior occurs if there is only one cache and it's full?
>>> As I said before, it's the current behavior of the master branch. On master, there is no code to check the capacity of NFS secondary storage. I do plan to fix it on the object_store branch, but the code change is already so huge and refactoring other things took a lot of time, so my plan is to fix it later.
>> Given the different use of NFS in the object_store branch vs. current, I don't see the comparison in this case. In the current implementation, when we exhaust space, we are truly out of the resource. However, in the object_store branch, we have no provision to remove stale data, and we may report no space available when there is plenty of space available in the underlying object store. In this scenario, the NFS "cache" becomes an artificial limiter on the capacity of the system. I do not understand how we have this problem in current, since there the object store is only a backup of secondary storage -- not secondary storage itself.
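On the question of bounding the cache, a rough sketch of a time-based aging policy is below. The types are hypothetical, not the branch's classes; the only point is that, because the object store remains the repository of record, idle staged copies can be evicted safely and re-staged on demand.

// Rough sketch of an aging policy for the NFS staging area. CacheEntry and the eviction
// method are hypothetical, and real eviction would also delete the staged file on disk.
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class CacheEntry {
    final String objectId;   // template/volume/snapshot identifier
    final long sizeBytes;
    long lastAccessMillis = System.currentTimeMillis();

    CacheEntry(String objectId, long sizeBytes) {
        this.objectId = objectId;
        this.sizeBytes = sizeBytes;
    }
}

final class AgingStagingCache {
    // Access-ordered map, so iteration starts with the least recently used entries.
    private final Map<String, CacheEntry> entries = new LinkedHashMap<String, CacheEntry>(16, 0.75f, true);
    private final long maxIdleMillis;

    AgingStagingCache(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    synchronized void add(CacheEntry entry) {
        entries.put(entry.objectId, entry);
    }

    synchronized void touch(String objectId) {
        CacheEntry entry = entries.get(objectId);
        if (entry != null) {
            entry.lastAccessMillis = System.currentTimeMillis();
        }
    }

    // Run periodically: anything not read for maxIdleMillis is dropped from the staging
    // area. Because the object store remains the repository of record, this only frees
    // space; the data can always be staged again on the next cache miss.
    synchronized long evictIdleEntries() {
        long freedBytes = 0;
        long cutoff = System.currentTimeMillis() - maxIdleMillis;
        for (Iterator<CacheEntry> it = entries.values().iterator(); it.hasNext(); ) {
            CacheEntry entry = it.next();
            if (entry.lastAccessMillis < cutoff) {
                freedBytes += entry.sizeBytes;
                it.remove(); // deleting the staged file on the NFS share would happen here
            }
        }
        return freedBytes;
    }
}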
>>>>>> - The mechanism lacks robust error handling or retry logic. In particular, the behavior if/when a cache exhausts its available space appears non-deterministic.
>>>>> Again, the master branch has the same issue, and fixing it here will be easier than fixing it on the master branch.
>>>> Why not fix it now, since the cache mechanism is being touched?
>>> 1. We do need time to fix it -- at least one week.
>>> 2. What's the difference between fixing it on the object_store branch and fixing it after the merge, given that this issue is not introduced by the object_store branch?
>> It is my estimate that robust error handling will require design changes (e.g. introduction of a resource reservation mechanism, introduction of additional exception classes, enhancement of interfaces to provide more context regarding client intentions, etc.) yielding significant code impact. These changes need to be undertaken in a holistic manner with minimum risk to master. Fundamentally, we should not be merging code to master with known significant issues. When it goes to master, we should be saying, "To the best of my knowledge and developer testing, there are no blocker or critical issues." In my opinion, omission of robust error handling does not meet that standard.
>>>>>> - Generally, I see little consideration for alternative/exception flows. For example, what happens if I attempt to use a template/iso/snapshot that is in transit between the object store and the cache? Since these files can be very large (multiple GBs), we have to assume some transfer latency in all interactions.
>>>>> The management server mainly maintains the state of each object (volume/snapshot/template); we added a state machine for all the objects, and for all the objects on each storage, which is unavailable on the master branch. Take copying a template from the object store to the cache as an example: the state of the template entry on both the object storage and the cache storage will be changed to "Copying"; if the copy fails or times out, the state of the template entry on the object storage will be changed back to "Ready", while the state of the template entry on the cache storage will be changed to "Failed".
>>>>> As long as the mgt server maintains the state correctly, a user/admin can issue the same operation again through the UI/API in case an operation fails. That's one of the reasons there is no retry logic in the storage subsystem code. Another reason is that adding retry logic on top of an async API is a little bit difficult.
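A minimal sketch of the state transitions described above, using the state names Edison gives ("Copying", "Ready", "Failed"); the classes and methods are purely illustrative, not the branch's code.

// Rough sketch of the copy-operation state transitions; enum and class names are
// illustrative stand-ins rather than the actual state machine in the branch.
enum ObjectState { READY, COPYING, FAILED }

final class ObjectStateEntry {
    ObjectState state = ObjectState.READY;
}

final class CopyStateMachine {
    // Both sides are marked Copying while the transfer is in flight.
    void onCopyStarted(ObjectStateEntry source, ObjectStateEntry destination) {
        source.state = ObjectState.COPYING;
        destination.state = ObjectState.COPYING;
    }

    // Success: both entries become usable.
    void onCopySucceeded(ObjectStateEntry source, ObjectStateEntry destination) {
        source.state = ObjectState.READY;
        destination.state = ObjectState.READY;
    }

    // Failure or timeout: the source (object store) entry returns to Ready so the
    // operation can be reissued from the UI/API, while the cache entry is marked Failed.
    void onCopyFailedOrTimedOut(ObjectStateEntry source, ObjectStateEntry destination) {
        source.state = ObjectState.READY;
        destination.state = ObjectState.FAILED;
    }
}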
>>>>>> 4. The Image Cache abstraction is too tightly coupled to the core orchestration mechanism. I would recommend implementing it as a
>>>>> The only place coupled to the image cache abstraction is the DataMotionStrategy; in particular, AncientDataMotionStrategy->needCacheStorage, and AncientDataMotionStrategy will call StorageCacheManager->createCacheObject if cache storage is needed.
>>>>>> DataStore proxy that is applied as necessary. For example, the current Xen hypervisor implementation must interact with a file system. It seems most appropriate that the Xen hypervisor implementation would apply the proxy to its secondary storage DataStore instances. Therefore, the storage engine
>>>>> The issue that has puzzled me for a while is how and where to configure cache storage. Take S3 as an example: the implementation of the S3 provider should not need to care about cache storage at all; the S3 implementation should only care about how to deal with data stored in S3. S3 is a general-purpose storage, so the same implementation can be used by different hypervisors and in different situations. For example, if both Ceph and S3 are used by KVM, then definitely no NFS cache is needed, while for other cluster-wide primary storages supported by KVM, an NFS cache is needed.
>>>>> How and where do we present this kind of configuration, such that neither the primary storage provider nor the object storage provider implementation needs to be aware of it? My current solution is to put the decision in the code: the DataMotionStrategy is the place to decide whether to use cache storage or not. In the default implementation, cache storage is needed for all copy operations from S3 to primary storage; if people add a new storage and want to drop the cache storage for some reason, they can add a new DataMotionStrategy.
>>>>> I am not sure how the datastore proxy will solve the issue I mentioned above; I am glad to learn about your solution.
>>>> This looks like a good opportunity for some collaboration on the design!
>>> I am glad to hear about other solutions, or to have other people write code on the branch. Code is always welcome.
>> Fundamentally, there should be no dependency from Storage to Hypervisor. The scenario presented here makes me wonder whether data motion belongs in the Storage or the Hypervisor layer. Storage doesn't care whether the I/O is for data motion; its only concern is the movement of data from one location to another. I am beginning to feel that DataMotion belongs in the Hypervisor layer, and that the Storage layer should expose a more basic set of primitives that are composed by the service to orchestrate the required I/O operations. Can you shed more light on the reasoning for putting this service in Storage rather than Hypervisor?
>> To my mind, the storage services should present options that allow clients to declare their requirements. For example, if the Xen hypervisor implementation must perform an operation through a file system interface, it should express that need to the storage subsystem. In turn, the DataStore returned will fulfill that contract. From an implementation perspective, I suggest the cache mechanism be implemented as a proxy of the underlying DataStore implementation. There are likely other solutions -- I was merely trying to suggest approaches that would reduce the coupling for the purposes of discussion.
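To sketch the proxy suggestion in code -- with the caveat that every type below is an illustrative stand-in rather than the branch's DataStore interfaces -- the staging behavior could be a decorator that the hypervisor layer applies only when it actually needs a filesystem:

// Rough sketch of the proxy idea. All types here are hypothetical, and the
// staging-path handling is deliberately simplified.
import java.io.File;

interface DataObject {
    String getId();
}

// What a filesystem-bound consumer (e.g. a Xen resource) needs: a local/NFS-visible file.
interface FileBackedDataStore {
    File fetch(DataObject object);
}

// What any object store provider (S3, Swift, Riak CS, ...) can offer without knowing
// anything about caches: stream an object to a destination file.
interface ObjectStoreClient {
    void download(DataObject object, File destination);
}

// The proxy: applied by the hypervisor layer only when a filesystem is actually required,
// so the provider implementations and the orchestration stay cache-agnostic.
final class NfsStagingDataStore implements FileBackedDataStore {
    private final ObjectStoreClient objectStore;
    private final File nfsMountPoint;

    NfsStagingDataStore(ObjectStoreClient objectStore, File nfsMountPoint) {
        this.objectStore = objectStore;
        this.nfsMountPoint = nfsMountPoint;
    }

    public File fetch(DataObject object) {
        File staged = new File(nfsMountPoint, object.getId());
        if (!staged.exists()) {
            // Reservation and aging policies from the earlier sketches would hook in here.
            objectStore.download(object, staged);
        }
        return staged;
    }
}

A Xen resource that needs a mountable file would wrap its secondary store in such a proxy, while a KVM + Ceph + S3 combination could skip the proxy entirely, which keeps the S3 provider itself cache-agnostic.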
>>>>>> should only provide the facility, not attempt to determine when it is needed. Additionally, a proxy model would greatly reduce the size and complexity of the various classes (e.g. AncientDataMotionStrategy).
>>>>> The current AncientDataMotionStrategy is only about 400 LOC now; I removed some of the duplicated code. But I still want to know the details of your solution.
>>>>>> 5. While I have only reviewed roughly 50-60% of the patch thus far, I see little to no additional unit tests for the new code added. Integration tests have
>>>>> If the code is changing rapidly, then writing/maintaining unit test cases takes a huge effort, so we focused on the integration tests and Marvin tests at this stage.
>>>> We shouldn't be merging code that is "still changing rapidly". What do you mean by this comment?
>>> Oh, that should be "was changing rapidly" -- that was last month. In May, we moved on to integration tests, Marvin tests, and a lot of bug fixes.
>>>>>> been expanded, but many of the tests simply execute the service methods and do not verify the form of the operation results. There are also a tremendous number of ignored exceptions that should likely fail these tests. Therefore, we could have tests passing due to an ignored exception that should be failing.
>>>>> 1. Thanks for taking a look at the integration tests, which were very convenient for us during the implementation. They saved us a lot of time during testing.
>>>>> 2. Yes, the integration tests need to be improved as you suggested.
>>>> Let's do that please!
>> Without asserts, the tests are simply exercising the code to ensure that no exceptions are being thrown. They are not verifying that the test actually succeeded. Until all methods properly verify the results of the call, it is difficult to objectively gauge the quality of the branch. Additionally, the integration tests verify very few alternative and exception flows. I would like to see more of those scenarios exercised.
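As an example of the kind of verification being asked for here, a sketch of a test that asserts on the result and lets unexpected exceptions fail the test. The result type and the service call are stand-ins, not the branch's actual API.

// Rough sketch of a result-verifying test (JUnit 4). CopyResult and
// copyTemplateToCache() are hypothetical placeholders for the code under test.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class TemplateCopyTest {

    // Minimal stand-in result type for the sketch.
    static final class CopyResult {
        final boolean success;
        final String state;
        CopyResult(boolean success, String state) { this.success = success; this.state = state; }
    }

    private CopyResult copyTemplateToCache() throws Exception {
        // ... call into the storage subsystem under test ...
        return new CopyResult(true, "Ready");
    }

    @Test
    public void copyTemplateToCachePopulatesCacheEntry() throws Exception {
        // Declaring "throws Exception" (rather than catch-and-ignore) makes an
        // unexpected failure fail the test instead of silently passing.
        CopyResult result = copyTemplateToCache();

        assertNotNull("copy returned no result", result);
        assertTrue("copy reported failure", result.success);
        assertEquals("cache entry not in expected state", "Ready", result.state);
    }
}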
>>>>>> While I recognize the tremendous effort that has been expended to implement this capability, and the value of the feature, I see significant design and operational issues that must be addressed before it can be accepted into
>>>>> I would say the object_store branch is simply better than the master branch: almost all the issues (related to cache storage) you pointed out exist on the master branch as well, while fixing these issues on master is harder than on object_store. Frankly speaking, I don't want to read/understand/guess how it works on the master branch; it's a lot of hacks all over the place.
>>>>> Of course, these issues are still issues and we need to fix them, but when to fix them is arguable. I prefer fixing them after the merge. Here is the reason: if we agree that the issues you pointed out also exist on master, then do we want to fix them for 4.2 or not? If the answer is yes, then who will fix them, and how long will it take to fix them on master? If I offer to fix these issues after merging object_store into master, will that be acceptable? If the answer is no, then there is no technical reason not to merge object_store into master, as these issues are not introduced by the object_store branch.
>>>>>> master. In my opinion, the object_store branch represents a good first step, but it needs a few more review/refinement iterations before it will be ready for a master merge.
>> I agree with Chip. In addition to the issues I have mentioned, this is a massive change late in the cycle for little overall gain. We have not yet achieved the key goal our users want -- reliable, horizontally scalable secondary storage on commodity hardware (i.e. cheaper per GB with the same reliability). As such, I will take proven/released, "ugly" code over unproven/unreleased, "pretty" code every time.
>>>>>> Thanks,
>>>>>> -John