> -----Original Message-----
> From: John Burwell [mailto:jburw...@basho.com]
> Sent: Thursday, May 30, 2013 7:43 AM
> To: dev@cloudstack.apache.org
> Subject: Re: [MERGE]object_store branch into master
>
> Edison,
>
> Please see my comment in-line.
>
> Thanks,
> -John
>
> On May 29, 2013, at 5:15 PM, Edison Su <edison...@citrix.com> wrote:
>
> >> -----Original Message-----
> >> From: Chip Childers [mailto:chip.child...@sungard.com]
> >> Sent: Wednesday, May 29, 2013 1:41 PM
> >> To: dev@cloudstack.apache.org
> >> Subject: Re: [MERGE]object_store branch into master
> >>
> >> On Tue, May 28, 2013 at 11:42:49PM +0000, Edison Su wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: John Burwell [mailto:jburw...@basho.com]
> >>>> Sent: Tuesday, May 28, 2013 8:43 AM
> >>>> To: dev@cloudstack.apache.org
> >>>> Subject: Re: [MERGE]object_store branch into master
> >>>>
> >>>> All,
> >>>>
> >>>> I have gone through a large chunk of this patch, and published my review thus far (https://reviews.apache.org/r/11277/). TL;DR is that this patch has a number of significant issues, which can be summarized as follows:
> >>>
> >>> Thanks for your time to review this patch.
> >>>
> >>>> 1. While it appears that the requirement of NFS for secondary storage has largely been removed, it has basically been blocked out instead of pushed down as an implementation choice in the physical layer. Rather than exploiting polymorphism to vary behavior through a set of higher-level abstractions, the orchestration performs instanceof NFSTO checks. The concern is that future evolution of the secondary storage layer will still have dependencies on NFS.
> >>>
> >>> As long as the mgt server doesn't explicitly depend on NFS secondary storage during orchestration, and the NFS storage can be used based on different configurations, then the resource-side "NFSTO checks" are not much of a concern to me at the current stage. We can add a "polymorphic high-level abstraction" at the resource side, as you suggested before, in the future. The current code change is already huge, and adding a new level of abstraction would make it even bigger. But adding this kind of "high-level abstraction" is not hard any more, as all the storage-related commands only deal with DataTO/DataStoreTO. Right now there are no read/write data operations on these TOs; one possible way to add a "high-level abstraction" would be to add read/write data operations in each hypervisor resource based on the DataTO and DataStoreTO.
> >>> My point is that refactoring the mgt server code is much, much harder than the resource code. If we can get the mgt server right, then we can move on to refactoring the resource code in the future.
>
> Due to their impact on dependent subsystems and lower layers, these types of design questions need to be addressed before a merge to master.
>
> >>>> 2. In some scenarios, NFS is still a requirement for secondary storage. In particular, Xen users will still have to maintain an "NFS cache". Given the degradation of capability, I think it would be helpful to put a few more eyeballs on the Xen implementation to determine if we could avoid using an intermediate file system.
> >>>
> >>> NFS will not disappear soon for some hypervisors and some types of storage. But as long as the mgt server doesn't have this requirement, people can add new types of hypervisors (e.g. Hyper-V) and new types of storage (Ceph, SolidFire, etc.) which don't need NFS. The optimization for a particular hypervisor can be done by other people, who may be familiar with a particular type of storage or a certain hypervisor.
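To make that resource-side idea concrete, here is a rough sketch of the kind of interface I have in mind. It is illustrative only: the nested ObjectTO/StoreTO types are stand-ins for the real DataTO/DataStoreTO, and nothing with these names exists on the branch.

    // Illustrative sketch only -- not an existing interface on the branch.
    // The idea: each hypervisor resource implements read/write purely in terms
    // of the transfer objects, so the orchestration code never needs
    // instanceof NfsTO-style checks to decide how to move bytes.
    import java.io.IOException;
    import java.io.InputStream;

    public interface HypervisorStorageAccess {

        // Placeholder stand-ins for the real DataTO/DataStoreTO transfer objects.
        interface ObjectTO { String getPath(); }
        interface StoreTO { String getUrl(); }

        // Open a read stream for a template/volume/snapshot on any store
        // (NFS, S3, Swift, ...); the implementation hides the protocol details.
        InputStream openRead(ObjectTO object, StoreTO store) throws IOException;

        // Write an object to a destination store (mount + copy, HTTP PUT,
        // multipart upload, ...), again without the caller knowing the protocol.
        void write(ObjectTO object, StoreTO destStore, InputStream data) throws IOException;
    }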
> It feels like we have jumped to a solution without completely understanding the scope of the problem and the associated assumptions. We have a community of hypervisor experts whom we should consult to ensure we have the best solution. As such, I recommend mailing the list with the specific hypervisors and functions that you have been unable to interface with storage that does not present a filesystem. I do not recall seeing such a discussion on the list previously.
If people use zone-wide primary storage, like Ceph or SolidFire, then suddenly there is no need for NFS cache storage, as zone-wide storage can be treated as both primary and secondary storage, with S3 as the backup storage. It's a simple but powerful solution. Why can't we just add code to support these exciting new solutions? It's hard to do on the master branch; that's why Min and I worked hard to refactor the code and remove the NFS secondary storage dependency from the management server as much as possible. As we all know, NFS secondary storage is not scalable, no matter how fancy an aging policy or how advanced a capacity planner you have. And that's one of the reasons I don't care that much about the issue with NFS cache storage; couldn't we put our energy into cloud-style storage solutions instead of into the un-scalable storage?

> >>>> 3. I have the following concerns regarding potential race conditions and resource exhaustion in the NFS cache implementation.
> >>>>
> >>>>    - The current random allocator algorithm does not account for the amount of space that will be required for the operation (i.e. checking to ensure that the cache it picks has enough space to hold the object being downloaded), nor does it reserve the space. Given the long-running (in compute time) nature of these processes, the available space in a cache could be exhausted by a number of concurrently transferring templates and/or snapshots. By reserving space before the transfer, the allocator would be able to account for both pending operations and the current contents of the cache.
> >>>
> >>> 1. It's the current behavior of the 4.1 and master secondary storage code. We didn't introduce a new issue here.
> >>> 2. Adding a new cache storage allocator is much easier than adding it on the master branch.
> >>> 3. I treat it as a bug, and plan to fix it after the merge.
> >>
> >> We should not be merging in major changes with *known* bugs like this.
> >
> > First and foremost, it's a well-known issue on the master branch, whether we merge the branch or not.
> > If we want to fix the issue, we either fix it on the object_store branch or on the master branch, but both will need time.
> > As the 4.2 merge window is approaching, I don't think we can fix it in these two or three days.
> > If we miss the merge window, then who will fix it on master?
>
> As I understand the goals of this enhancement, we will support additional secondary storage types and remove the assumption that secondary storage will always be NFS or have a filesystem. As such, when a non-NFS type of secondary storage is employed, NFS is no longer the repository of record for this data. We can always exceed available space in the repository of record, and the failure scenarios are relatively well understood (4.1.0) -- operations will fail quickly and obviously. However, as a transitory staging storage mechanism (4.2.0), the expectation of the user is that the NFS storage will not be as reliable or large. If the only solution we can provide for this problem is to recommend an NFS "cache" that is equal to the size of the object store itself, then we have made little to no progress addressing our user's

No, that's not true. Admins can add multiple NFS cache storages if they want; there is no requirement that the NFS storage be the same size as the object store. I can't be that stupid.
It's the same thing we are doing on the master branch: the admin knows that one NFS secondary storage is not enough, so they can add multiple NFS secondary storages. And on the master branch, there is no capacity planner for NFS secondary storage; the code just randomly chooses one of the NFS secondary storages, even if one of them is full. Yes, NFS secondary storage on master can be full, and there is no way to age anything out. The current object_store branch has the same behavior: the admin can add multiple NFS cache storages, and there is no capacity planner. Meanwhile, in case the NFS cache storage is full, the admin can simply remove the DB entries related to the cached objects and clean up the NFS cache storage, and then everything just works.
From an implementation point of view, I don't think there is any difference.

> needs. Fundamentally, the role of NFS is different in 4.2.0 than in 4.1.0. Therefore, I disagree with the assertion that the issue is present in 4.1.0.

The role of NFS may have changed, but they share the same problems: no capacity planner, no aging-out policy.

> An additional risk in the object_store implementation is that we lead a user to believe their data has been stored in reliable storage (e.g. S3, Riak CS, etc.) when it may not. I saw no provision in the object_store to retry transfers if

I don't know from which code you drew this conclusion. Could you point it out in the code? AFAIK, an object can only be either stored in S3 or not stored in S3; I don't see how an object can end up in a wrong state.

> the object_store transfer fails or becomes unavailable. In 4.0.0/4.1.0, if we can't connect to S3 or Swift, a background process continuously retries the upload until successful.

Here is the interesting question: how does the mgt server or the admin know that the background process pushed the objects into S3 successfully? There is no guarantee the background process will succeed, and there is no status tracking for this background process, right? What I am doing on the object_store branch is this: if pushing an object into S3 fails, then the whole backup process fails, and the admin or user needs to send another API request to push the object into S3. This guarantees that the operation either succeeds or fails, instead of sitting in the unknown state we have on the master branch.

> Finally, I see this issue as a design issue rather than a bug. I don't think we should

Again, I don't think it's a design issue. As I said above, it's a bug, and both the master branch and object_store have the same bug. It can be fixed, and it is easier to fix on object_store than on master. And it's not an important issue compared to supporting cloud-style storage solutions.

> be merging branches that we know have a defect, let alone a design issue. A proper implementation will require refining/expanding the StorageAllocator interface, and introducing a mechanism to track reservations (including proper TTL support). IMHO, these types of issues must be resolved before a feature can be merged into master.

You are welcome to add new improvements, or a new RFC, but I don't think this is a reason to block the merge. Let me ask the question directly: do you agree with most of our changes or not, regardless of the cache storage issue, which to me is not an important one at all? If you agree with our interfaces and most of the implementation, do you think the cache storage issue is really a design issue?
There is one interface called StorageCacheAllocator, and it has a simple implementation called StorageCacheRandomAllocator, which is only 57 lines of code. Of course, it doesn't have a fancy aging-out policy or a capacity planner, but if anybody writes a more sophisticated StorageCacheAllocator, the problem is just solved. How can that be a design issue?
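To show how small that fix could be, here is a rough sketch of a capacity-aware allocator. It is illustrative only: the real StorageCacheAllocator interface has a different signature, and CacheStore below is a stand-in type, but the whole idea is just to skip stores that cannot hold the object and to reserve the space up front.

    // Illustrative sketch, not code on the branch. The real StorageCacheAllocator
    // works against the storage engine's own types; CacheStore here is a stand-in.
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    public class CapacityAwareCacheAllocatorSketch {

        public interface CacheStore {
            long getCapacityBytes();
            long getUsedBytes();          // should include space already reserved
            boolean reserve(long bytes);  // atomically reserve space; false if it no longer fits
        }

        // Pick the cache store with the most free space that can hold the object,
        // and reserve that space before the transfer starts, so concurrent
        // transfers are accounted for instead of a store being chosen at random.
        public CacheStore allocate(List<CacheStore> stores, long objectSizeBytes) {
            Optional<CacheStore> best = stores.stream()
                    .filter(s -> s.getCapacityBytes() - s.getUsedBytes() >= objectSizeBytes)
                    .max(Comparator.comparingLong(s -> s.getCapacityBytes() - s.getUsedBytes()));
            if (best.isPresent() && best.get().reserve(objectSizeBytes)) {
                return best.get();
            }
            return null; // caller can fail fast or fall back to a direct copy
        }
    }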
> >>>>    - There appears to be no mechanism to age out contents of the cache. Therefore, as implemented, it will grow unbounded. The only workaround for this problem would be to have an NFS cache whose size equals that of the object store.
> >>>
> >>> Admins can add multiple NFS cache storages to one zone, and it's the current behavior on the master branch. Again, we can add an aging policy later, as the cache storage has its own pom project and its own implementation.
> >>
> >> That doesn't make much sense to me. A cache is expected to be, by its nature, bounded by time / size / something... We need to implement some sort of bounds here.
> >> It also shouldn't be a failure point if you have a cache miss or cache failure. What behaviour occurs if there is only 1 and it's full?
> >
> > As I said before, it's the current behavior of the master branch. On the master branch, there is no code to check the capacity of NFS secondary storage.
> > I do have a plan to fix it on the object_store branch, but the code change is so huge and refactoring other stuff took a lot of time, so my plan is to fix it later.
>
> Given the different use of NFS in the object_store branch vs. current, I don't see the comparison in this case. In the current implementation, when we exhaust space, we are truly out of resource. However, in the object_store branch, we have no provision to remove stale data and we may report no space available when there is plenty of space available in the underlying object store. In this scenario, the NFS "cache" becomes an artificial limiter on the capacity of the system. I do not understand how we have this problem in current since the object store is only a backup of secondary storage -- not secondary storage itself.

As I said before, no matter what the role of the NFS storage is, it shares the same issues: the NFS storage can run out of capacity in both cases, and there is no capacity planner and no aging policy.

> >>>>    - The mechanism lacks robust error handling or retry logic. In particular, the behavior if/when a cache exhausts available space appears non-deterministic.
> >>>
> >>> Again, the master branch has the same issue, and fixing the issue here will be easier than fixing it on the master branch.
> >>
> >> Why not fix it now, since the cache mechanism is being touched?
> >
> > 1. We do need time to fix it. At least one week.
> > 2. What's the difference between fixing it on the object_store branch now and fixing it after the merge? This issue is not introduced by the object_store branch.
>
> It is my estimate that robust error handling will require design changes (e.g. introduction of a resource reservation mechanism, introduction of additional exception classes, enhancement of interfaces to provide more context regarding client intentions, etc.) yielding significant code impact. These changes need to be undertaken in a holistic manner with minimum risk to master. Fundamentally, we should not be merging code to master with known significant issues. When it goes to master, we should be saying, "To the best of my knowledge and developer testing, there are no blocker or critical issues." In my opinion, omission of robust error handling does not meet that standard.
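For what it's worth, the reservation tracking John describes does not have to be big either. The sketch below is entirely hypothetical (nothing like it exists on either branch); it only shows that TTL-based reservations fit in a few dozen lines, so an abandoned transfer eventually stops pinning cache space.

    // Hypothetical sketch of reservation tracking with a TTL; not code on either
    // branch. A reservation is taken before a transfer starts and released when
    // it finishes; expired reservations are dropped so failed transfers do not
    // pin space forever.
    import java.util.Iterator;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public class CacheReservationTracker {
        private static final long TTL_MILLIS = 2 * 60 * 60 * 1000L; // 2 hours, arbitrary

        private static final class Reservation {
            final long bytes;
            final long createdAt = System.currentTimeMillis();
            Reservation(long bytes) { this.bytes = bytes; }
        }

        private final Map<String, Reservation> reservations = new ConcurrentHashMap<>();

        // Reserve space before a transfer starts; returns a token to release later.
        public String reserve(long bytes) {
            String token = UUID.randomUUID().toString();
            reservations.put(token, new Reservation(bytes));
            return token;
        }

        // Release after the transfer completes (or fails).
        public void release(String token) {
            reservations.remove(token);
        }

        // Total bytes currently reserved, dropping reservations past their TTL.
        public long reservedBytes() {
            long now = System.currentTimeMillis();
            long total = 0;
            for (Iterator<Reservation> it = reservations.values().iterator(); it.hasNext();) {
                Reservation r = it.next();
                if (now - r.createdAt > TTL_MILLIS) {
                    it.remove(); // expired: assume the transfer was abandoned
                } else {
                    total += r.bytes;
                }
            }
            return total;
        }
    }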
To be realistic, on the mgt server there is only one class that depends on cache storage, and there is only one interface that needs to be implemented to solve the issue. Why do we need a redesign?

> >>>>    - Generally, I see little consideration for alternative/exception flows. For example, what happens if I attempt to use a template/iso/snapshot in transit to the object store to/from the cache? Since these files can be very large (multiple GBs), we have to assume some transfer latency in all interactions.
> >>>
> >>> The management server mainly maintains the state of each object (volume/snapshot/template); we added a state machine for all the objects, and for all the objects on each storage, which is unavailable on the master branch. Take copying a template from the object store to the cache as an example: the state of the template entry on the object storage and the cache storage will be changed to "Copying"; if the copy fails or times out, then the state of the template entry on the object storage will be changed to "Ready", while the state of the template entry on the cache storage will be changed to "Failed".
> >>> As long as the mgt server maintains the state correctly, the user/admin can issue the same operation through the UI/API in case an operation fails. That's one of the reasons there is no retry logic in the storage subsystem code. Another reason is that adding retry logic on top of an async API would be a little bit difficult.
> >>>
> >>>> 4. The Image Cache abstraction is too tightly coupled to the core orchestration mechanism. I would recommend implementing it as a
> >>>
> >>> The only place coupled with the image cache abstraction is the DataMotionStrategy; in particular, AncientDataMotionStrategy -> needCacheStorage, and AncientDataMotionStrategy will call StorageCacheManager -> createCacheObject if cache storage is needed.
> >>>
> >>>> DataStore proxy that is applied as necessary. For example, the current Xen hypervisor implementation must interact with a file system. It seems most appropriate that the Xen hypervisor implementation would apply the proxy to its secondary storage DataStore instances. Therefore, the storage engine
> >>>
> >>> The issue that I struggled with for a while is how and where to configure cache storage. Take S3 as an example: the implementation of the S3 provider should not need to care about cache storage at all; the S3 implementation should only care about how to deal with data stored in S3. S3 is a general-purpose storage, and the same implementation can be used by different hypervisors and in different situations. For example, if both Ceph and S3 are used by KVM, then definitely no NFS cache is needed, while for other cluster-wide primary storages supported by KVM, an NFS cache is needed.
> >>> How and where do we present this kind of configuration, so that neither the primary storage provider nor the object storage provider implementation needs to be aware of it? My current solution is to write the decision in the code: DataMotionStrategy is the place to decide whether to use cache storage or not. In the default implementation, cache storage is needed for all copy operations from S3 to primary storage; if people add a new storage and want to remove the cache storage for some reason, they can add a new DataMotionStrategy.
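Boiled down, the decision looks roughly like the sketch below. It is illustrative only: the real AncientDataMotionStrategy works against DataObject/DataStore and the async framework, so the enum, types, and helper methods here are simplified stand-ins.

    // Simplified illustration of the decision described above; the real code
    // uses DataObject/DataStore and async callbacks, not these stand-in types.
    public class CacheDecisionSketch {

        enum StoreType { S3, SWIFT, NFS, ZONE_WIDE_PRIMARY, CLUSTER_PRIMARY }

        // Cache (staging) storage is only needed when copying from an object
        // store to storage the hypervisor cannot fill directly.
        boolean needCacheStorage(StoreType src, StoreType dest) {
            boolean srcIsObjectStore = (src == StoreType.S3 || src == StoreType.SWIFT);
            // e.g. Xen, or KVM with cluster-wide primary storage, per the discussion above
            boolean destNeedsStaging = (dest == StoreType.CLUSTER_PRIMARY);
            return srcIsObjectStore && destNeedsStaging;
        }

        void copy(Object data, StoreType src, StoreType dest) {
            if (needCacheStorage(src, dest)) {
                Object cached = createCacheObject(data);  // stage on the NFS cache first
                copyDirect(cached, dest);
            } else {
                copyDirect(data, dest);                   // e.g. Ceph/SolidFire zone-wide storage
            }
        }

        // Placeholders standing in for StorageCacheManager and the copy machinery.
        private Object createCacheObject(Object data) { return data; }
        private void copyDirect(Object data, StoreType dest) { /* no-op in this sketch */ }
    }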
> >>> I am not sure how the DataStore proxy would solve the issue I mentioned above; I am glad to hear your solution.
> >>
> >> This looks like a good opportunity for some collaboration on the design!
> >
> > I am glad to hear other solutions, or to have other people write code on the branch. Code is always welcome.
>
> Fundamentally, there should be no dependency from Storage to Hypervisor. The scenario presented here makes me wonder whether or not data motion belongs in the Storage or Hypervisor layer. Storage doesn't care if the I/O is for data motion. Its only concern is the movement of data from one location to another. I am beginning to feel that DataMotion belongs in the Hypervisor layer and the Storage layer should expose a more basic set of primitives that are composed by the service to orchestrate the required I/O operations. Can you shed more light on the reasoning for putting this service in Storage rather than Hypervisor?
>
> To my mind, the storage services should present options that allow clients to declare their requirements. For example, if the Xen hypervisor implementation must perform an operation through a file system interface, it should express that need to the storage subsystem. In turn, the DataStore returned will fulfill that contract. From an implementation perspective, I suggest the cache mechanism be implemented as a proxy of the underlying DataStore implementation. There are likely other solutions -- I was merely trying to suggest approaches that would reduce the coupling for the purposes of discussion.
>
> >>>> should only provide the facility, not attempt to determine when it is needed. Additionally, a proxy model would greatly reduce the size and complexity of the various classes (e.g. AncientDataMotionStrategy).
> >>>
> >>> The current AncientDataMotionStrategy is only about 400 LOC now; I removed some of the duplicated code. But I still want to know the details of your solution.
> >>>
> >>>> 5. While I have only reviewed roughly 50-60% of the patch thus far, I see little to no additional unit tests for the new code added. Integration tests have
> >>>
> >>> If the code is changing rapidly, then writing/maintaining unit test cases takes a huge effort; we focused on the integration tests and Marvin tests at this stage.
> >>
> >> We shouldn't be merging code that is "still changing rapidly". What do you mean by this comment?
> >
> > Oh, it should be "was changing rapidly" in the last month. In May, we moved on to integration tests, Marvin tests, and a lot of bug fixes.
> >
> >>>> been expanded, but many of the tests simply execute the service methods and do not verify the form of the operation results. There are also a tremendous number of ignored exceptions that should likely fail these tests. Therefore, we could have tests passing due to an ignored exception that should be failing.
> >>>
> >>> 1. Thanks for taking a look at the integration tests; they are very convenient for us during the implementation and saved us a lot of time during testing.
> >>> 2. Yes, the integration tests need to be improved as you suggested.
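As an illustration of the kind of check being asked for here (a JUnit-style example in Java; the Marvin tests are Python, but the principle is the same, and the helper names below are invented for the example): assert on the result of the call, and fail the test explicitly instead of swallowing the exception.

    // Illustrative only -- TemplateResult and createTemplateOnObjectStore are
    // made-up stand-ins, not APIs on the branch. The point is simply to verify
    // the outcome of the operation rather than just exercising the code path.
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNotNull;
    import static org.junit.Assert.fail;

    import org.junit.Test;

    public class TemplateCopyTestSketch {

        @Test
        public void testCopyTemplateToObjectStore() {
            try {
                TemplateResult result = createTemplateOnObjectStore("test-template");
                // Stopping here would only prove no exception was thrown.
                // Verify the operation actually succeeded:
                assertNotNull("no result returned", result);
                assertEquals("template not in Ready state", "Ready", result.getState());
                assertEquals("size on object store does not match source",
                        result.getSourceSize(), result.getStoredSize());
            } catch (Exception e) {
                // Do not swallow the exception -- fail the test explicitly.
                fail("copy to object store failed: " + e.getMessage());
            }
        }

        // Stand-ins for illustration only.
        static class TemplateResult {
            String getState() { return "Ready"; }
            long getSourceSize() { return 0; }
            long getStoredSize() { return 0; }
        }

        private TemplateResult createTemplateOnObjectStore(String name) {
            return new TemplateResult();
        }
    }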
> >> Let's do that please!
>
> Without asserts, the tests are simply exercising the code to ensure that no exceptions are being thrown. They are not verifying that the test actually succeeded. Until all methods properly verify the results of the call, it is difficult to objectively gauge the quality of the branch. Additionally, the integration tests verify very few alternative and exception flows. I would like to see more of those scenarios exercised.
>
> >>>> While I recognize the tremendous effort that has been expended to implement this capability and the value of the feature, I see significant design and operational issues that must be addressed before it can be accepted into
> >>>
> >>> I would say the object_store branch is just better than the master branch: almost all the issues (related to cache storage) you pointed out also exist on the master branch, while fixing these issues on the master branch is harder than on the object_store branch. Frankly speaking, I don't want to read/understand/guess how it works on the master branch; it's a lot of hacks all over the place.
> >>> Of course, these issues are still issues, and we need to fix them, but when to fix them is arguable. I prefer fixing them after the merge. Here is the reason: if we agree that the issues you pointed out also exist on master, then do we want to fix them in 4.2 or not? If the answer is yes, then who will fix them, and how long will it take to fix them on master? If I offer to fix these issues after merging object_store into master, will that be acceptable? If the answer is no, then there is no technical reason not to merge object_store into master, as these issues are not introduced by the object_store branch.
> >>>
> >>>> master. In my opinion, the object_store represents a good first step, but it needs a few more review/refinement iterations before it will be ready for a master merge.
>
> I agree with Chip. In addition to the issues I have mentioned, this is a massive change late in the cycle for little overall gain. We have not yet achieved the key goal our users want -- reliable, horizontally scalable secondary storage on commodity hardware (i.e. cheaper per GB with the same reliability). As such, I will take proven/released, "ugly" code over unproven/unreleased, "pretty" code every time.
>
> >>>> Thanks,
> >>>> -John