On Thu, Mar 29, 2012 at 9:16 AM, Aaron Birkland <[email protected]> wrote:
>
>> So here's a provocative question to start: Assuming for a moment that
>> the core Fedora object model (versioning warts and all) stays the same
>> for 4.0, would something like this interface actually be compatible
>> with the major objectives we've talked about with respect to High
>> Level Storage?
>
> Here's my perspective:
>
> HighlevelStorage was designed as a data-oriented interface that
> explicitly made the fedora object a fundamental and atomic unit of work
> with respect to storage and associated "data-oriented" services that
> might be plugged in. This was a key simplification with clear
> boundaries that would enable storage implementations the flexibility to
> adopt a variety of locking, optimization, and/or communication
> strategies within each unit of work - as it is guaranteed that each unit
> of work is "complete" and fully defined with respect to a single fedora
> object. Transactions could later be laid on top of that, but would not
> change the fact that each individual operation within a transaction
> would be a complete-object-version unit of work.

Understood. The single Fedora object add/update/delete being the
fundamental atomic unit of work seems like a reasonable place to start,
and is an assumption I also made with fcrepo-store. Going after a
smaller level of granularity (e.g. the datastream) would be a different
challenge. If folks want to go down that road as a thought exercise, I'm
happy to, but in my mind there's already a pretty easy answer for anyone
whose use cases really provoke different-datastream-same-object-update
concurrency: use atomistic content modeling.

> setContent() could possibly be problematic in that light, I'm not sure.
> For example, one potential use case of HighLevelStorage is that the
> storage impl might decide a managed datastream's physical storage
> location based upon some property of the object (content model, for
> example). Do the semantics of setContent() allow a FedoraStore impl to
> "make note that some content is available, hold onto a reference to the
> InputStreams, but only act upon it in response to update(), possibly
> making storage decisions based upon the content of the FedoraObject"?

Well, the intent was certainly that implementations could make storage
decisions based on other aspects of the object, because calls to
setContent are meant to occur *after* the necessary calls to
add/updateObject. But I now see the fundamental problem with splitting
them at this level: object-level update atomicity just wouldn't be
possible without transactions.

However, I still don't see exactly how the alternative could be done.
That is, making all content available somehow within the passed-in
FedoraObject instance without resorting to passing around actual managed
content streams all the time. Maybe the DatastreamVersion class could
have a getManagedContent() method that returns an InputStream if the
control group is "M". In the case where the FedoraObject is being
updated, it could return null, indicating that the intent is to keep the
value the same?

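To make the getManagedContent() idea easier to poke at, here is a rough
sketch. Everything in it is hypothetical: this DatastreamVersion is a
stand-in written for illustration, not the existing Fedora or
fcrepo-store class, and describeUpdate() only shows how a store impl
might read the null-means-unchanged convention.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical "dumb" value object; not the actual Fedora class.
final class DatastreamVersion {
    private final String id;
    private final String controlGroup;   // "M" = managed content
    private InputStream managedContent;  // null = leave stored content as is

    DatastreamVersion(String id, String controlGroup) {
        this.id = id;
        this.controlGroup = controlGroup;
    }

    String getId() { return id; }

    String getControlGroup() { return controlGroup; }

    // Returns the new content for a managed ("M") datastream, or null
    // when the datastream is not managed or its content is unchanged.
    InputStream getManagedContent() {
        return "M".equals(controlGroup) ? managedContent : null;
    }

    // Plain setter, no validation, in keeping with the value-object idea.
    void setManagedContent(InputStream content) {
        this.managedContent = content;
    }
}

final class ManagedContentSketch {
    // Assumed interpretation a store impl might apply inside update().
    static String describeUpdate(DatastreamVersion dsv) {
        if (!"M".equals(dsv.getControlGroup())) {
            return "no managed content";
        }
        return dsv.getManagedContent() == null
                ? "keep existing content"
                : "replace content";
    }

    public static void main(String[] args) {
        DatastreamVersion dsv = new DatastreamVersion("DS1.0", "M");
        System.out.println(describeUpdate(dsv));
        dsv.setManagedContent(new ByteArrayInputStream(
                "hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(describeUpdate(dsv));
    }
}
```

The sketch leaves open who closes the stream and what happens if it is
read twice, which is part of why large content inside a value object is
hard to reason about.
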
In general, I like the idea of the FedoraObject instances being "dumb"
value objects that don't do any sort of computation or validation, but
just provide getters/setters. But things seem to get harder to reason
about when the objects need to encapsulate arbitrarily large content
streams. Do you have any ideas for how things might look at this level?

Which brings me to a related, but less important question: If managed
content can be passed into storage by value in this way, why should the
serialized FOXML (or whatever) actually hold any kind of reference to
it? The pid+dsId+dsVersionId thing isn't actually useful once the object
has been stored.

> While I don't consider lock-free concurrent updates to be fundamental to
> HighLevelStorage per se, the interface was designed to explicitly
> declare a handle to prior state in order to provide flexibility and
> avoid the need for explicit locking and shared-state. Forcing the use
> of internal or external locks and/or transactions limits the opportunity
> to leverage certain kinds of horizontal scalability. Indeed, the
> initial motivation for HighLevelStorage for me was to horizontally-scale
> fedora itself by eliminating shared state and locking between instances,
> utilizing only the native capabilities of the storage impl (in this case
> HBase). With the FedoraStore interface as it stands right now, locking
> (or single-object transactions) *must* be used in order to create fairly
> lengthy critical section, making such horizontal scaling more
> complicated and less effective.

The current FedoraStore split between setContent and
update(FedoraObject) does make it impossible to ensure atomic updates
without resorting to higher-level locks or transactions.

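For reference, the "handle to prior state" approach from the quoted
text can be sketched as a conditional replace. This is only an
in-memory illustration under assumed names (ObjectVersion and
OptimisticStore are made up here); a real store impl would have to map
the same check onto whatever conditional write its backend provides.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical immutable stand-in for one stored version of an object.
final class ObjectVersion {
    final String pid;
    final String label;

    ObjectVersion(String pid, String label) {
        this.pid = pid;
        this.label = label;
    }
}

// Hypothetical in-memory store; no explicit locks, no transactions.
final class OptimisticStore {
    private final ConcurrentMap<String, ObjectVersion> objects =
            new ConcurrentHashMap<>();

    void add(ObjectVersion obj) {
        if (objects.putIfAbsent(obj.pid, obj) != null) {
            throw new IllegalStateException("already exists: " + obj.pid);
        }
    }

    ObjectVersion get(String pid) {
        return objects.get(pid);
    }

    // Succeeds only if oldObj is still the current version. ObjectVersion
    // does not override equals(), so the comparison is by identity.
    // A false return means another writer got there first; the caller
    // re-reads and retries.
    boolean update(ObjectVersion oldObj, ObjectVersion newObj) {
        return objects.replace(oldObj.pid, oldObj, newObj);
    }
}
```

With this shape, two callers holding the same oldObj cannot both
succeed, so the store impl alone decides the winner.
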
Assuming for a second that we came up with a design that merged the two
capabilities into a single atomic operation (so there was just
update(FedoraObject), where the FedoraObject somehow provides access to
the content streams), then I think we'd be in a pretty good, though not
ideal, position: Ensuring per-object-update atomicity would at least be
a concern that could be fully handled by the store implementation.

Btw, after the mention of Rich Hickey on today's call (thanks, Adam), I
found this excellent talk of his from '09 that is absolutely relevant to
what we're talking about here:

http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey

> Used in the same place as ILowlevelStorage, providing a reference to the
> "to-be-replaced" version upon update is a fairly natural thing to do.
> DOManager would need to retrieve the old version of an object anyway in
> order to correctly populate the updated version, so there really is no
> additional overhead in supplying a reference to it to the storage impl.
> In fact, having a reference to both versions of the object may even make
> certain implementations of HighLevelStorage plugins more efficient.
> Consider a plugin that calculates the diff of triples to send off for
> indexing. It would be handy to have the metadata of the old version
> right there in order to be able to dereference the proper datastream for
> comparison, especially if that datastream is not versionable.

If there's little cost to doing it, great. But the question of what to
do about managed content still looms.

I'm personally able to reason better about this stuff after looking
at/writing actual code.
So here's a snippet for consideration, assuming a more HLStorage-like
design that takes the oldObj, newObj pair:

    FedoraStoreSession session = fedoraStore.getSession();
    try {
      // Read the current version; it doubles as the handle to prior state.
      FedoraObject oldObj = session.getObject(pid);
      // Deep copy, then change only what needs to change.
      FedoraObject newObj = oldObj.copy();
      newObj.label("new label value");
      session.update(oldObj, newObj);
    } finally {
      session.close();
    }

Now, unless someone has gone a bit wild with datastreams,
FedoraObject.copy(), a "deep" copy, is going to be fairly cheap on its
own. But what do we actually do with managed datastream content?

- Chris

_______________________________________________
Fedora-commons-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
