On Thu, Mar 29, 2012 at 9:16 AM, Aaron Birkland <[email protected]> wrote:
>
>> So here's a provocative question to start: Assuming for a moment that
>> the core Fedora object model (versioning warts and all) stays the same
>> for 4.0, would something like this interface actually be compatible
>> with the major objectives we've talked about with respect to High
>> Level Storage?
>
> Here's my perspective:
>
> HighlevelStorage was designed as a data-oriented interface that
> explicitly made the fedora object a fundamental and atomic unit of work
> with respect to storage and associated "data-oriented" services that
> might be plugged in.  This was a key simplification with clear
> boundaries that would enable storage implementations the flexibility to
> adopt a variety of locking, optimization, and/or communication
> strategies within each unit of work - as it is guaranteed that each unit
> of work is "complete" and fully defined with respect to a single fedora
> object.  Transactions could later be laid on top of that,  but would not
> change the fact that each individual operation within a transaction
> would be a complete-object-version unit of work.

Understood. Making the add/update/delete of a single Fedora object the
fundamental atomic unit of work seems like a reasonable place to
start, and it's an assumption I also made with fcrepo-store. Going
after a finer level of granularity (e.g. the datastream) would be a
different challenge. I'm happy to go down that road as a thought
exercise if folks want to, but in my mind there's already a pretty
easy answer for anyone whose use cases really call for
different-datastream-same-object update concurrency: use atomistic
content modeling.

> setContent() could possibly be problematic in that light, I'm not sure.
> For example, one potential use case of HighLevelStorage is that the
> storage impl might decide a managed datastream's physical storage
> location based upon some property of the object (content model, for
> example).  Do the semantics of setContent() allow a FedoraStore impl to
> "make note that some content is available, hold onto a reference to the
> InputStreams, but only act upon it in response to update(), possibly
> making storage decisions based upon the content of the FedoraObject"?

Well, the intent was certainly that implementations could make storage
decisions based on other aspects of the object, since calls to
setContent are meant to occur *after* the necessary calls to
add/updateObject. But I now see the fundamental problem with splitting
them at this level: object-level update atomicity just isn't possible
without transactions.
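
To make that concrete, the current split implies a call sequence
roughly like this (illustrative only; I'm glossing over the exact
signatures, and pid/"DS1"/"DS1.1"/contentStream are just stand-ins):

// Metadata first (add/updateObject)...
session.updateObject(newObj);
// ...then the bytes for each new managed datastream version:
session.setContent(pid, "DS1", "DS1.1", contentStream);
// A crash or error between those two calls leaves the object
// half-updated, which is exactly the atomicity problem above.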

However, I still don't see exactly how the alternative could be done.
That is, making all content available somehow within the passed-in
FedoraObject instance without resorting to passing around actual
managed content streams all the time.

Maybe the DatastreamVersion class could have a getManagedContent()
method that returns an InputStream when the control group is "M". And
in the case where the FedoraObject is being updated, that could be
null, indicating that the intent is to keep the value the same?
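
Here's a rough sketch of what I'm picturing, just to have something
concrete to poke at (all of the names here are placeholders, not a
proposal for the final API):

import java.io.InputStream;

// Placeholder sketch of a "dumb" DatastreamVersion value object.
public class DatastreamVersion {

  private String controlGroup;        // "M", "X", "E", or "R"
  private InputStream managedContent; // only meaningful for "M"

  public String controlGroup() { return controlGroup; }

  public void controlGroup(String controlGroup) {
    this.controlGroup = controlGroup;
  }

  // Returns the content stream for managed ("M") datastreams. On
  // update, a null here would mean "keep the stored content as-is".
  public InputStream getManagedContent() {
    return "M".equals(controlGroup) ? managedContent : null;
  }

  public void setManagedContent(InputStream managedContent) {
    this.managedContent = managedContent;
  }
}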

In general, I like the idea of the FedoraObject instances being "dumb"
value objects that don't do any sort of computation or validation, but
just provide getters/setters. But things seem to get harder to reason
about when the objects need to encapsulate arbitrarily large content
streams. Do you have any ideas for how things might look at this
level?

Which brings me to a related, but less important question: If managed
content can be passed into storage by value in this way, why should
the serialized FOXML (or whatever) actually hold any kind of reference
to it?  The pid+dsId+dsVersionId thing isn't actually useful once the
object has been stored.

> While I don't consider lock-free concurrent updates to be fundamental to
> HighLevelStorage per se, the interface was designed to explicitly
> declare a handle to prior state in order to provide flexibility and
> avoid the need for explicit locking and shared-state.   Forcing the use
> of internal or external locks and/or transactions limits the opportunity
> to leverage certain kinds of horizontal scalability.  Indeed, the
> initial motivation for HighLevelStorage for me was to horizontally-scale
> fedora itself by eliminating shared state and locking between instances,
> utilizing only the native capabilities of the storage impl (in this case
> HBase).   With the FedoraStore interface as it stands right now, locking
> (or single-object transactions) *must* be used in order to create fairly
> lengthy critical section, making such horizontal scaling more
> complicated and less effective.

The current FedoraStore split between setContent and
update(FedoraObject) does make it impossible to ensure atomic updates
without resorting to higher-level locks or transactions.

Assuming for a second that we came up with a design that merged the
two capabilities into a single atomic operation (so there was just
update(FedoraObject), where the FedoraObject somehow provides access
to the content streams), I think we'd be in a pretty good, though not
ideal, position: ensuring per-object-update atomicity would at least
be a concern that could be fully handled by the store implementation.
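
For discussion's sake, that merged interface might look something like
this (purely a sketch; names and signatures are illustrative, not a
proposal):

// Placeholder sketch of a merged, per-object-atomic session interface.
public interface FedoraStoreSession {

  // Returns the stored object identified by pid.
  FedoraObject getObject(String pid);

  // Adds or replaces the object as a single atomic unit of work.
  // Managed content travels with the object (e.g. via the
  // DatastreamVersion.getManagedContent() idea above), so there's no
  // separate setContent() call to split the update.
  void update(FedoraObject obj);

  void close();
}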

Btw, after the mention of Rich Hickey on today's call (thanks, Adam),
I found this excellent talk of his from '09 that is absolutely
relevant to what we're talking about here:

http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey

> Used in the same place as ILowlevelStorage, providing a reference to the
> "to-be-replaced" version upon update is a fairly natural thing to do.
> DOManager would need to retrieve the old version of an object anyway in
> order to correctly populate the updated version, so there really is no
> additional overhead in supplying a reference to it to the storage impl.
> In fact, having a reference to both versions of the object may even make
> certain implementations of HighLevelStorage plugins more efficient.
> Consider a plugin that calculates the diff of triples to send off for
> indexing.  It would be handy to have the metadata of the old version
> right there in order to be able to dereference the proper datastream for
> comparison, especially if that datastream is not versionable.

If there's little cost to doing it, great. But the question of what to
do about managed content still looms. I'm personally able to reason
better about this stuff after looking at/writing actual code. So
here's a snippet for consideration, assuming a more HLStorage-like
design that takes the oldObj, newObj pair:

FedoraStoreSession session = fedoraStore.getSession();
try {
  // Start from the stored version and modify a copy of it.
  FedoraObject oldObj = session.getObject(pid);
  FedoraObject newObj = oldObj.copy();
  newObj.label("new label value");
  // Hand both versions to the store as one atomic update.
  session.update(oldObj, newObj);
} finally {
  session.close();
}

Now, unless someone has gone a bit wild with datastreams,
FedoraObject.copy(), a "deep" copy, is going to be fairly cheap on its
own. But what do we actually do with managed datastream content?
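
If we went with something like the getManagedContent() idea above, I
imagine it would look roughly like this in that snippet (again, names
like datastream() and addVersion() are just made up for illustration):

// Attach new managed content to a new datastream version on the copy.
DatastreamVersion newVersion = new DatastreamVersion();
newVersion.controlGroup("M");
newVersion.setManagedContent(
    new java.io.ByteArrayInputStream("new bytes".getBytes()));
newObj.datastream("DS1").addVersion(newVersion);

// Versions whose managed content is left null would mean "keep
// whatever the store already has for them".
session.update(oldObj, newObj);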

- Chris
