Just so that we are 100% clear...  I'm -1 (binding veto) on this merge right
now due to the technical issues being discussed below.  Rather than focusing
on any single issue, we should start working collaboratively to fix this
branch up so that it comes into master with high quality and full consensus
on the changes.

I would propose that the issues being discussed below be broken out into
several DISCUSS threads, where concerns can be debated and addressed
within a reasonably defined scope for each discussion.  Edison, as the
proposer of this merge, I would imagine that you are in the best
position to kick off the appropriate threads to get the help and input
you need.

I am *NOT* saying that this is a veto due to timing.  Timing is a
procedural issue, and frankly one I'd rather work toward consensus on.
I only raised that point because this merge request contradicted our
previous consensus on the appropriate timing for architectural changes
to merge into master (as documented at [1]).

-chip

[1] 
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Releases#Releases-FeatureReleaseCycle

On Thu, May 30, 2013 at 10:43:04AM -0400, John Burwell wrote:
> Edison,
> 
> Please see my comment in-line.
> 
> Thanks,
> -John
> 
> On May 29, 2013, at 5:15 PM, Edison Su <edison...@citrix.com> wrote:
> 
> > 
> > 
> >> -----Original Message-----
> >> From: Chip Childers [mailto:chip.child...@sungard.com]
> >> Sent: Wednesday, May 29, 2013 1:41 PM
> >> To: dev@cloudstack.apache.org
> >> Subject: Re: [MERGE]object_store branch into master
> >> 
> >> On Tue, May 28, 2013 at 11:42:49PM +0000, Edison Su wrote:
> >>> 
> >>> 
> >>>> -----Original Message-----
> >>>> From: John Burwell [mailto:jburw...@basho.com]
> >>>> Sent: Tuesday, May 28, 2013 8:43 AM
> >>>> To: dev@cloudstack.apache.org
> >>>> Subject: Re: [MERGE]object_store branch into master
> >>>> 
> >>>> All,
> >>>> 
> >>>> I have gone through a large chunk of this patch and published my
> >>>> review thus far (https://reviews.apache.org/r/11277/).  TL;DR is that
> >>>> this patch has a number of significant issues which can be summarized
> >>>> as follows:
> >>> 
> >>> 
> >>> Thanks for your time to review this patch.
> >>> 
> >>>> 
> >>>> 1. While it appears that the requirement of NFS for secondary storage
> >>>> has largely been removed, it has basically been guarded with if checks
> >>>> instead of pushed down as an implementation detail in the physical
> >>>> layer.  Rather than exploiting polymorphism to vary behavior through a
> >>>> set of higher-level abstractions, the orchestration performs instanceof
> >>>> NFSTO checks.  The concern is that future evolution of the secondary
> >>>> storage layer will still have dependencies on NFS.
> >>> 
> >>> As long as the mgt server doesn't explicitly depend on NFS secondary
> >>> storage during orchestration, and NFS storage can be used based on
> >>> different configurations, the resource-side "NFSTO checks" are not much
> >>> of a concern to me at the current stage.  We can add a "polymorphic
> >>> high-level abstraction" at the resource side, as you suggested before,
> >>> in the future.  The current code change is already huge; adding a new
> >>> level of abstraction would make it even bigger.  But adding this kind
> >>> of high-level abstraction is not hard any more, as all the
> >>> storage-related commands only deal with DataTO/DataStoreTO.  Right now,
> >>> there are no read/write data operations on these *TOs; one possible way
> >>> to add a high-level abstraction would be to add read/write data
> >>> operations in each hypervisor resource based on the DataTO and
> >>> DataStoreTO.
> >>> My point is that refactoring the mgt server code is much, much harder
> >>> than the resource code.  If we can get the mgt server right, then we
> >>> can move on to refactoring the resource code in the future.
> 
> Due to their impact on dependent subsystems and lower layers, these types of 
> design questions need to be addressed before a merge to master. 
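
As one illustration of the refactoring being discussed here, the following is
a minimal, hypothetical sketch of dispatching on the store type instead of
performing instanceof checks in the orchestration.  None of the names below
are the actual CloudStack interfaces; this is only a sketch of the pattern.

    import java.util.HashMap;
    import java.util.Map;

    interface DataStoreTO { String getType(); }   // e.g. "nfs", "s3", "swift"
    interface DataTO { String getPath(); }

    interface StoreHandler {
        void copyTo(DataTO data, DataStoreTO dest);   // store-specific transfer
    }

    class NfsStoreHandler implements StoreHandler {
        public void copyTo(DataTO data, DataStoreTO dest) {
            // mount-and-copy logic lives only here, in the physical layer
        }
    }

    class S3StoreHandler implements StoreHandler {
        public void copyTo(DataTO data, DataStoreTO dest) {
            // object-store upload/download logic lives only here
        }
    }

    class StorageOrchestrator {
        private final Map<String, StoreHandler> handlers = new HashMap<>();

        StorageOrchestrator() {
            handlers.put("nfs", new NfsStoreHandler());
            handlers.put("s3", new S3StoreHandler());
        }

        // The orchestration stays generic: no "instanceof NfsTO" anywhere.
        void copy(DataTO data, DataStoreTO dest) {
            StoreHandler handler = handlers.get(dest.getType());
            if (handler == null) {
                throw new IllegalArgumentException("no handler for " + dest.getType());
            }
            handler.copyTo(data, dest);
        }
    }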
> 
> >>> 
> >>>> 
> >>>> 2. In some scenarios, NFS is still a requirement for secondary
> >>>> storage.  In particular, Xen users will still have to maintain an
> >>>> "NFS cache".  Given the degradation of capability, I think it would
> >>>> be helpful to put a few more eyeballs on the Xen implementation to
> >>>> determine if we could avoid using an intermediate file system.
> >>> 
> >>> NFS will not disappear soon for some hypervisors and some types of
> >>> storage.  But as long as the mgt server doesn't have this requirement,
> >>> people can add new types of hypervisors (e.g. Hyper-V) and new types of
> >>> storage (Ceph, SolidFire, etc.) which don't need NFS.  The optimization
> >>> for a particular hypervisor can be done by other people, who may be
> >>> familiar with a certain type of storage or a certain hypervisor.
> 
> It feels like we have jumped to a solution without completely understanding 
> the scope of the problem and the associated assumptions.  We have a community 
> of hypervisor experts who we should consult to ensure we have the best 
> solution.  As such, I recommend mailing the list with the specific 
> hypervisors and functions that you have been unable to interface to storage 
> that does not present a filesystem.  I do not recall seeing such a discussion 
> on the list previously.
> 
> >>> 
> >>>> 3. I have the following concerns regarding potential race conditions
> >>>> and resource exhaustion in the NFS cache implementation.
> >>>> 
> >>>>    - The current random allocator algorithm does not account for the
> >>>> amount of space that will be required for the operation (i.e. checking
> >>>> to ensure that the cache it picks has enough space to hold the object
> >>>> being downloaded), nor does it reserve the space.  Given the
> >>>> long-running (in compute time) nature of these processes, the
> >>>> available space in a cache could be exhausted by a number of
> >>>> concurrently transferring templates and/or snapshots.  By reserving
> >>>> space before the transfer, the allocator would be able to account for
> >>>> both pending operations and the current contents of the cache.
> >>> 
> >>> 1. It's the current behavior of the 4.1 and master secondary storage
> >>> code.  We didn't introduce a new issue here.
> >>> 2. Adding a new cache storage allocator here is much easier than adding
> >>> it on the master branch.
> >>> 3. I treat it as a bug, and plan to fix it after the merge.
> >> 
> >> We should not be merging in major changes with *known* bugs like this.
> > 
> > First and foremost, it's a well-known issue on the master branch, no 
> > matter whether the branch is merged or not.
> > If we want to fix the issue, we can either fix it on the object_store 
> > branch or on master, but both will need time. 
> > As the 4.2 merge window is approaching, I don't think we can fix it in 
> > these two or three days.
> > If we miss the merge window, then who will fix it on master?
> 
> As I understand the goals of this enhancement, we will support additional 
> secondary storage types and remove the assumption that secondary storage 
> will always be NFS or have a filesystem.  As such, when a non-NFS type of 
> secondary storage is employed, NFS is no longer the repository of record for 
> this data.  We can always exceed available space in the repository of record, 
> and the failure scenarios are relatively well understood (4.1.0) -- 
> operations will fail quickly and obviously.  However, as a transitory staging 
> storage mechanism (4.2.0), the user's expectation is that the NFS storage 
> will not be as reliable or as large.  If the only solution we can provide for 
> this problem is to recommend an NFS "cache" that is equal to the size of the 
> object store itself, then we have made little to no progress addressing our 
> users' needs.  Fundamentally, the role of NFS is different in 4.2.0 than in 
> 4.1.0.  Therefore, I disagree with the assertion that this issue is present 
> in 4.1.0.
> 
> An additional risk in the object_store implementation is that we lead a user 
> to believe their data has been stored in reliable storage (e.g. S3, Riak CS, 
> etc.) when it may not have been.  I saw no provision in the object_store 
> branch to retry transfers if a transfer fails or the object store becomes 
> unavailable.  In 4.0.0/4.1.0, if we can't connect to S3 or Swift, a 
> background process continuously retries the upload until successful. 
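
For context, the kind of background retry behavior being referred to could be
sketched roughly as follows.  This is a hypothetical illustration only; the
class and method names are not the actual 4.0/4.1 code.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class UploadRetrier {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        interface Uploader { boolean tryUpload(String objectKey); }

        // Re-attempt the upload at a fixed interval until it succeeds, so a
        // temporarily unreachable object store does not lose data silently.
        void uploadUntilSuccessful(Uploader uploader, String objectKey,
                                   long intervalSeconds) {
            scheduler.scheduleWithFixedDelay(() -> {
                if (uploader.tryUpload(objectKey)) {
                    scheduler.shutdown();   // stop retrying once the upload lands
                }
            }, 0, intervalSeconds, TimeUnit.SECONDS);
        }
    }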
> 
> Finally, I see this issue as a design issue rather than a bug.  I don't 
> think we should be merging branches that we know have a defect, let alone a 
> design issue.  A proper implementation will require refining/expanding the 
> StorageAllocator interface and introducing a mechanism to track reservations 
> (including proper TTL support).  IMHO, these types of issues must be 
> resolved before a feature can be merged into master.
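
To make the reservation idea concrete, here is a minimal, hypothetical sketch
of an allocator that reserves space before a transfer starts, so concurrent
transfers cannot collectively overcommit a cache.  It is not the actual
StorageAllocator interface, and TTL-based expiry of stale reservations is
left out.

    import java.util.List;
    import java.util.Optional;

    class CacheStore {
        private final long capacityBytes;
        private long reservedBytes;           // includes in-flight transfers

        CacheStore(long capacityBytes) { this.capacityBytes = capacityBytes; }

        synchronized boolean tryReserve(long bytes) {
            if (reservedBytes + bytes > capacityBytes) {
                return false;                 // would overcommit this cache
            }
            reservedBytes += bytes;
            return true;
        }

        synchronized void release(long bytes) {
            reservedBytes = Math.max(0, reservedBytes - bytes);
        }
    }

    class ReservingCacheAllocator {
        // Pick a cache that can actually hold the object, reserving the space
        // up front; the caller must release() after the transfer completes or
        // when a reservation TTL expires (not shown here).
        Optional<CacheStore> allocate(List<CacheStore> caches, long objectSize) {
            for (CacheStore c : caches) {
                if (c.tryReserve(objectSize)) {
                    return Optional.of(c);
                }
            }
            return Optional.empty();          // no cache has room: fail fast
        }
    }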
> 
> > 
> >> 
> >>> 
> >>>>    - There appears no mechanism to age out contents of the cache.
> >>>> Therefore, as implemented, it will grow unbounded.  The only
> >>>> workaround for this problem would be to have an NFS cache whose size
> >>>> equals that of the object store.
> >>> 
> >>> 
> >>> Admins can add multiple NFS cache storages into one zone, and it's the
> >>> current behavior on the master branch.  Again, we can add an aging
> >>> policy later, as cache storage has its own pom project and its own
> >>> implementation.
> >> 
> >> That doesn't make much sense to me.  A cache is expected to be, by its
> >> nature, bounded by time / size / something...  We need to implement some
> >> sort of bounds here.
> >> 
> >> It also shouldn't be a failure point if you have a cache miss or cache
> >> failure.  What behaviour occurs if there is only one and it's full?
> > 
> > As I said before, it's the current behavior of the master branch.  On 
> > master, there is no code to check the capacity of NFS secondary storage.
> > I do have a plan to fix it on the object_store branch, but the code 
> > change is so huge and has taken so much time to refactor other things 
> > that my plan is to fix it later.
> 
> Given the different use of NFS in the object_store branch vs. current, I 
> don't see the comparison in this case.  In the current implementation, when 
> we exhaust space, we are truly out of resource.  However, in the object_store 
> branch, we have no provision to remove stale data and we may report no space 
> available when there is plenty of space available in the underlying object 
> store.  In this scenario, the NFS "cache" becomes an artificial limiter on 
> the capacity of the system.  I do not understand how we have this problem in 
> the current implementation, since there the object store is only a backup of 
> secondary storage -- not secondary storage itself.
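
One possible aging policy (of many) could look roughly like the sketch below:
evict least-recently-used entries that are not pinned by an in-flight
operation until usage falls below a high-water mark.  The names and the
policy itself are hypothetical, offered only to make the "bounded cache"
discussion concrete.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class CacheEntry {
        final String path;
        final long sizeBytes;
        final long lastUsedMillis;
        final boolean inUse;                  // e.g. currently being copied

        CacheEntry(String path, long sizeBytes, long lastUsedMillis, boolean inUse) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.lastUsedMillis = lastUsedMillis;
            this.inUse = inUse;
        }
    }

    class CacheEvictor {
        // Returns the entries to delete so that used space drops below the
        // high-water mark, oldest-first, skipping anything still in use.
        List<CacheEntry> selectForEviction(List<CacheEntry> entries,
                                           long usedBytes, long highWaterBytes) {
            List<CacheEntry> candidates = new ArrayList<>(entries);
            candidates.sort(Comparator.comparingLong((CacheEntry e) -> e.lastUsedMillis));
            List<CacheEntry> victims = new ArrayList<>();
            long projected = usedBytes;
            for (CacheEntry e : candidates) {
                if (projected <= highWaterBytes) break;
                if (e.inUse) continue;        // never evict data in active use
                victims.add(e);
                projected -= e.sizeBytes;
            }
            return victims;
        }
    }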
> 
> > 
> >> 
> >>> 
> >>> 
> >>>>    - The mechanism lacks robust error handling or retry logic.  In
> >>>> particular, the behavior if/when a cache exhausts available space
> >>>> appears non-deterministic.
> >>> 
> >>> Again, the master branch has the same issue, and fixing it here will be
> >>> easier than fixing it on the master branch.
> >> 
> >> Why not fix it now, since the cache mechanism is being touched?
> > 
> > 1. We do need time to fix it.  At least one week.
> > 2. What's the difference between fixing it on the object_store branch and 
> > fixing it after the merge, given that this issue is not introduced by the 
> > object_store branch?
> 
> It is my estimate that robust error handling will require design changes 
> (e.g. introduction of a resource reservation mechanism, introduction of 
> additional exception classes, enhancement of interfaces to provide more 
> context regarding client intentions, etc.) yielding significant code impact.  
> These changes need to be undertaken in a holistic manner with minimum risk 
> to master.  Fundamentally, we should not be merging code to master with 
> known significant issues.  When it goes to master, we should be saying, "To 
> the best of my knowledge and developer testing, there are no blocker or 
> critical issues."  In my opinion, omission of robust error handling does not 
> meet that standard. 
> 
> > 
> >> 
> >>> 
> >>>>    - Generally, I see little consideration for alternative/exception
> >>>> flows.  For example, what happens if I attempt to use a
> >>>> template/iso/snapshot that is in transit between the object store and
> >>>> the cache?  Since these files can be very large (multiple GBs), we
> >>>> have to assume some transfer latency in all interactions.
> >>> 
> >>> The management server mainly maintains the state of each object
> >>> (volume/snapshot/template).  We added a state machine for all the
> >>> objects, and for all the objects on each storage, which is unavailable
> >>> on the master branch.  Take copying a template from object store to
> >>> cache as an example: the state of the template entry on object storage
> >>> and cache storage will be changed to "Copying"; if copying fails or
> >>> times out, then the state of the template entry on object storage will
> >>> be changed to "Ready", while the state of the template entry on cache
> >>> storage will be changed to "Failed".
> >>> As long as the mgt server maintains the state correctly, the user/admin
> >>> can issue the same operation through the UI/API in case an operation
> >>> fails.  That's one of the reasons there is no retry logic in the
> >>> storage subsystem code.  Another reason is that adding retry logic on
> >>> top of an async API would be a little bit difficult.
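
A minimal sketch of the per-store state transitions described above, with
simplified, hypothetical names (this is not the actual CloudStack state
machine code):

    enum ObjectState { READY, COPYING, FAILED }

    class TemplateOnStore {
        private ObjectState state = ObjectState.READY;

        ObjectState getState() { return state; }

        // Both the object-store entry and the cache entry move to COPYING
        // when the copy starts.
        void startCopy() {
            if (state != ObjectState.READY) {
                throw new IllegalStateException("cannot start copy from " + state);
            }
            state = ObjectState.COPYING;
        }

        // On success the entry ends up READY.
        void copySucceeded() { state = ObjectState.READY; }

        // On failure or timeout the source (object store) entry returns to
        // READY while the destination (cache) entry is marked FAILED, so the
        // admin can simply reissue the operation -- which is why no retry
        // loop is built into the subsystem itself.
        void copyFailed(boolean isSourceEntry) {
            state = isSourceEntry ? ObjectState.READY : ObjectState.FAILED;
        }
    }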
> >>> 
> >>>> 
> >>>> 4. The Image Cache abstraction is too tightly coupled to the core
> >>>> orchestration mechanism.  I would recommend implementing it as a
> >>> 
> >>> The only place coupled with the image cache abstraction is the data
> >>> motion strategy; in particular, AncientDataMotionStrategy ->
> >>> needCacheStorage, and AncientDataMotionStrategy will call
> >>> StorageCacheManager -> createCacheObject if cache storage is needed.
> >>> 
> >>>> DataStore proxy that is applied as necessary.  For example, the
> >>>> current Xen hypervisor implementation must interact with a file
> >>>> system.  It seems most appropriate that the Xen hypervisor
> >>>> implementation would apply the proxy to its secondary storage
> >>>> DataStore instances.  Therefore, the storage engine
> >>> 
> >>> The issue that has puzzled me for a while is how and where to configure
> >>> cache storage.  Take S3 as an example: the implementation of the S3
> >>> provider should not need to care about cache storage at all; the S3
> >>> implementation should only care about how to deal with data stored in
> >>> S3.  S3 is a general-purpose storage, so the same implementation can be
> >>> used by different hypervisors and in different situations.  For
> >>> example, if both Ceph and S3 are used by KVM, then definitely no NFS
> >>> cache is needed, while for other cluster-wide primary storages
> >>> supported by KVM, an NFS cache is needed.
> >>> How and where do we present this kind of configuration, such that
> >>> neither the primary storage provider nor the object storage provider
> >>> implementation needs to be aware of it?  My current solution is to put
> >>> the decision in the code: DataMotionStrategy is the place to decide
> >>> whether to use cache storage or not.  In the default implementation,
> >>> cache storage is needed for all copy operations from S3 to primary
> >>> storage, while if people add a new storage and want to remove cache
> >>> storage for a certain reason, they can add a new DataMotionStrategy.
> >>> 
> >>> I am not sure how the datastore proxy will solve the issue I mentioned
> >>> above; I am glad to hear your solution.
> >> 
> >> This looks like a good opportunity for some collaboration on the design!
> > 
> > I am glad to hear other solutions, or to have other people write code on 
> > the branch.
> > Code is always welcome.
> 
> Fundamentally, there should be no dependency from Storage to Hypervisor.  The 
> scenario presented here makes me wonder whether or not data motion belongs in 
> the Storage or Hypervisor layers.  Storage doesn't care if the I/O is for 
> data motion.  Its only concern is the movement of data from one location to 
> another.  I am beginning to feel that DataMotion belongs in the Hypervisor 
> layer, and the Storage layer should expose a more basic set of primitives 
> that are composed by the service to orchestrate the required I/O operations.  
> Can you shed more light on the reasoning for putting this service in Storage 
> rather than Hypervisor?
> 
> To my mind, the storage services should present options that allow clients 
> to declare their requirements.  For example, if the Xen hypervisor 
> implementation must perform an operation through a file system interface, it 
> should express that need to the storage subsystem.  In turn, the DataStore 
> returned will fulfill that contract.  From an implementation perspective, I 
> suggest the cache mechanism be implemented as a proxy of the underlying 
> DataStore implementation.  There are likely other solutions -- I was merely 
> trying to suggest approaches that would reduce the coupling for the purposes 
> of discussion.
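
To make the proxy suggestion concrete, a minimal, hypothetical sketch of a
file-staging wrapper applied around whatever DataStore the client receives,
only when the client declares a filesystem requirement.  The interface and
method names below are illustrative, not the actual CloudStack API.

    interface DataStore {
        String read(String objectKey);        // returns a location/handle
    }

    class S3DataStore implements DataStore {
        public String read(String objectKey) {
            return "s3://bucket/" + objectKey; // no filesystem semantics
        }
    }

    // Decorator: stages the object onto an NFS path and hands back a file
    // path, leaving the underlying store implementation unaware of any cache.
    class FileStagingDataStore implements DataStore {
        private final DataStore backing;
        private final String nfsMountPoint;

        FileStagingDataStore(DataStore backing, String nfsMountPoint) {
            this.backing = backing;
            this.nfsMountPoint = nfsMountPoint;
        }

        public String read(String objectKey) {
            String source = backing.read(objectKey);
            // ... copy 'source' into the staging area here ...
            return nfsMountPoint + "/" + objectKey;
        }
    }

    class Example {
        static DataStore storeFor(boolean clientNeedsFilesystem) {
            DataStore raw = new S3DataStore();
            // Only clients that declare a filesystem requirement get the
            // proxy; neither the orchestration nor the S3 provider decides.
            return clientNeedsFilesystem
                    ? new FileStagingDataStore(raw, "/mnt/secstorage-cache")
                    : raw;
        }
    }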
> 
> > 
> >> 
> >>> 
> >>>> should only provide the facility, not attempt to determine when it is
> >>>> needed.  Additionally, a proxy model would greatly reduce the size and
> >>>> complexity of the various classes (e.g. AncientDataMotionStrategy).
> >>> 
> >>> The current AncientDataMotionStrategy is only about 400 LOC now; I
> >>> removed some of the duplicated code.  But I still want to know the
> >>> details of your solution.
> >>> 
> >>>> 
> >>>> 5. While I have only reviewed roughly 50-60% of the patch thus far,
> >>>> I see little to no additional unit tests for the new code added.
> >>>> Integration tests have
> >>> 
> >>> If the code is changing rapidly, then writing/maintaining unit test
> >>> cases takes huge effort, so we focused on the integration tests and
> >>> Marvin tests at this stage.
> >> 
> >> We shouldn't be merging code that is "still changing rapidly".  What do you
> >> mean by this comment?
> > 
> > Oh, it should be "was changing rapidly" -- that was last month.  In May, 
> > we moved on to integration tests, Marvin tests, and a lot of bug fixes.
> > 
> >> 
> >>> 
> >>>> been expanded, but many of the tests simply execute the service
> >>>> methods and do not verify the form of the operation results.  There
> >>>> are also a tremendous number of ignored exceptions that should likely
> >>>> fail these tests.  Therefore, we could have tests passing due to an
> >>>> ignored exception that should be failing.
> >>> 
> >>> 1. Thanks for taking a look at the integration tests, which are very
> >>> convenient for us during the implementation.  They save us a lot of
> >>> time during testing.
> >>> 2. Yes, the integration tests need to be improved as you suggested.
> >> 
> >> Let's do that please!
> 
> Without asserts, the tests are simply exercising the code to ensure that no 
> exceptions are being thrown.  They are not verifying that the test actually 
> succeeded.  Until all methods properly verify the results of the call, it is 
> difficult to objectively gauge the quality of the branch.  Additionally, the 
> integration tests verify very few alternative and exception flows.  I would 
> like to see more of those scenarios exercised.
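
As a small illustration of the difference being discussed, a hypothetical
JUnit 4 sketch: one test only exercises the call and swallows exceptions,
the other asserts on the outcome.  The copyTemplateToCache() call and its
result type are stand-ins, not the real service API.

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNotNull;
    import org.junit.Test;

    public class TemplateCopyTest {

        static class CopyResult {
            final String state;
            final String path;
            CopyResult(String state, String path) { this.state = state; this.path = path; }
        }

        private CopyResult copyTemplateToCache(String templateId) {
            // stand-in for the real service call under test
            return new CopyResult("Ready", "/cache/" + templateId);
        }

        // Weak: swallowing the exception means this "passes" even on failure.
        @Test
        public void copyTemplate_exerciseOnly() {
            try {
                copyTemplateToCache("tmpl-1");
            } catch (Exception ignored) {
            }
        }

        // Better: assert on the outcome, and let any exception fail the test.
        @Test
        public void copyTemplate_verifiesResult() {
            CopyResult result = copyTemplateToCache("tmpl-1");
            assertNotNull(result);
            assertEquals("Ready", result.state);
            assertEquals("/cache/tmpl-1", result.path);
        }
    }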
> 
> >> 
> >>> 
> >>>> 
> >>>> While I recognize the tremendous effort that has been expended to
> >>>> implement this capability and value of the feature, I see
> >>>> significant design and operational issues that must be addressed
> >>>> before it can be accepted into
> >>> 
> >>> I would say the object_store branch is simply better than the master
> >>> branch; almost all of the issues (related to cache storage) you pointed
> >>> out also exist on the master branch, while fixing these issues on
> >>> master is harder than on the object_store branch.  Frankly speaking, I
> >>> don't want to read/understand/guess how it works on the master branch
> >>> -- it's a lot of hacks all over the place.
> >>> Of course, these issues are still issues and we need to fix them, but
> >>> when to fix them is arguable.  I prefer fixing them after the merge.
> >>> Here is the reason: if we agree that the issues you pointed out also
> >>> exist on master, then do we want to fix them in 4.2 or not?  If the
> >>> answer is yes, then who will fix them, and how long will it take to fix
> >>> them on master?  If I offer to fix these issues after merging
> >>> object_store into master, will that be acceptable?  If the answer is
> >>> no, then there is no technical reason not to merge object_store into
> >>> master, as these issues are not introduced by the object_store branch.
> >>> 
> >>> 
> >>>> master.  In my opinion, the object_store represents a good first
> >>>> step, but it needs a few more review/refinement iterations before it
> >>>> will be ready for a master merge.
> 
> I agree with Chip.  In addition to the issues I have mentioned, this is a 
> massive change late in the cycle for little overall gain.  We have not yet 
> achieved the key goal our users want -- reliable, horizontally scalable 
> secondary storage on commodity hardware (i.e. cheaper per GB with the same 
> reliability). As such, I will take proven/released, "ugly" code over 
> unproven/unreleased, "pretty" code every time.  
> 
> >>>> 
> >>>> Thanks,
> >>>> -John
> 
> 
