On 9.5.16 11:43, Chetan Mehrotra wrote:
To highlight: as mentioned earlier, the user of the proposed API ties
itself to implementation details of Oak, and if those change later, that
code would also need to change. Or, as Ian summed it up:

if the API is introduced it should create an out-of-band agreement with
the consumers of the API to act responsibly.

So what does "to act responsibly" actually mean? Are we even in a position to precisely specify this? Experience tells me that we only find out about those semantics after the fact, when dealing with painful and expensive customer escalations.

And even if we could, it would tie Oak into very tight constraints on how it may and may not behave. Constraints that would turn out prohibitively expensive for future evolution. Furthermore, a huge amount of resources would be required to formalise such constraints via test coverage to guard against regressions.



The method is to be used for those important cases where you do rely on
implementation details to get optimal performance in very specific
scenarios. It's like DocumentNodeStore making use of some Mongo-specific
API to perform some critical operation with better performance after
checking that the underlying DocumentStore is Mongo-based.
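
For concreteness, that pattern looks roughly like this (a sketch only;
the Mongo-specific operation itself is elided):

    import org.apache.jackrabbit.oak.plugins.document.DocumentStore;
    import org.apache.jackrabbit.oak.plugins.document.mongo.MongoDocumentStore;

    void runOptimised(DocumentStore store) {
        if (store instanceof MongoDocumentStore) {
            MongoDocumentStore mongo = (MongoDocumentStore) store;
            // take a Mongo-specific code path for better performance,
            // e.g. a bulk operation the generic API cannot express
        } else {
            // fall back to the generic, implementation-agnostic path
        }
    }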

Right, but the Mongo-specific API is a (hopefully) well thought through API, whereas with your proposal there are a lot of open questions and concerns, as per my last mail.

Mongo (like any other COTS database) also doesn't, for good reason, give you direct access to its internal file handles.



I have seen the discussion of JCR-3534 and other related issues but still
do not see any conclusion on how to answer such queries where direct
access to blobs is required for performance. This issue is not about
exposing the blob reference for remote access but more about an optimal
path for in-VM access.

One bottom line of the discussions in that issue is that we came to a conclusion only after clarifying the specifics of the use case. That is something I'm still missing here. The case you brought forward is too general to serve as a guideline for a solution. Quite to the contrary, to me it looks like a solution to some problem I'm still trying to understand.



who owns the resource? Who coordinates (concurrent) access to it and how?
What are the correctness and performance implications here (races,
deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more
like implementing a CommitHook: if implemented incorrectly it would cause
issues, deadlocks etc. But then we assume that anyone implementing that
interface would take proper care in the implementation.

But a commit hook is an internal SPI. It is not advertised to the whole world as a public API.
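
For reference, that SPI boils down to a single callback, roughly:

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    public interface CommitHook {
        // validate and/or transform a commit before it is persisted
        NodeState processCommit(NodeState before, NodeState after,
                CommitInfo info) throws CommitFailedException;
    }

Anyone plugging in here runs inside the commit path, with all the care
that implies.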



it limits implementation freedom and hinders further evolution
(chunking, de-duplication, content-based addressing, compression, gc, etc.)
for data stores.

As mentioned earlier, some parts of an API indicate a closer dependency
on how things work (like an SPI, or a ConsumerType API in OSGi terms). By
using such an API, client code definitely ties itself to Oak
implementation details, but that should not limit how those details
evolve. When they change, client code needs to adapt accordingly. Oak can
express that by incrementing the minor version of the exported package to
indicate the change in behavior.
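
For example, such a change could be signalled in the exported package's
package-info.java (package name and versions are only illustrative):

    // package-info.java of the exported API package
    @org.osgi.annotation.versioning.Version("1.3.0") // bumped from 1.2.0
    package org.apache.jackrabbit.oak.api.blob;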

Which IMO is completely contradictory. Such an API would prevent us from refactoring internal storage formats if a new format couldn't implement the API (e.g. because of chunking, compression, de-duplication, etc.).


Can't we come up with an API that allows the blobs to stay under control
of Oak?

The code needs to work either at the OS level, say with a file handle, or
with, say, an S3 object. So I do not see a way it can work without having
access to those details.

Again, why? What's the precise use case here? If this really is the conclusion, then a corollary would be that those binaries must not go into Oak.


FWIW there is code out there which reverse-engineers the blobId to access
the actual binary. People do it to get decent throughput in image
rendition logic for large-scale deployments. The proposal here was to
formalize that approach by providing a proper API. If we do not provide
such an API, then the only way for them would be to continue relying on
reverse-engineering the blobId!
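
To illustrate the kind of hack I mean (a sketch assuming the default
FileDataStore layout, where a blobId of the form "<sha1>#<length>" maps
to a digest-derived path; none of this is contractual):

    String digest = blobId.substring(0, blobId.indexOf('#'));
    File blobFile = new File(dataStoreRoot, String.format("%s/%s/%s/%s",
            digest.substring(0, 2), digest.substring(2, 4),
            digest.substring(4, 6), digest));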

This is hardly a good argument. Formalising other people's hacks means making us liable. What *we* need to do is understand their use case and come up with a clean solution.



If not, this is probably an indication that those blobs shouldn't go into
Oak, but just references to them, as Francesco already proposed. Anything
else is neither fish nor fowl: you can't have the JCR goodies but at the
same time access underlying resources at will.

That's a fine argument to make. But the users here have a real problem to
solve, which we should not ignore. Oak-based systems are being proposed
for large asset deployments where one of the primary requirements is the
handling/processing of hundreds of TB of binary data. So for such cases
we would then have to recommend not using the JCR Binary abstraction and
managing the binaries on your own. That would then solve both problems
(though it might break lots of tooling built on top of the JCR API to
manage those binaries)!

But then we need to provide means to enable such customers to do their processing. Not punch holes into Oak's architecture that we would probably be fighting for the rest of the product's lifecycle.



Thinking more: another approach I can suggest is that people implement
their own BlobStore (maybe by extending ours) and provide this API there,
i.e. an API which takes a blob id and provides the required details. This
way we "outsource" the problem. Would that be acceptable?
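
Something like this (a sketch; the accessor method is the hypothetical
part):

    import java.io.File;
    import org.apache.jackrabbit.oak.spi.blob.BlobStore;

    public interface DirectAccessBlobStore extends BlobStore {
        // hypothetical: resolve a blobId to the file backing it; valid
        // only because the custom store, not Oak, owns the on-disk layout
        File getFileForBlob(String blobId);
    }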

That would actually be an acceptable solution: any custom code is outside of Oak's liability. However, I'd prefer an approach where we come up with a blob store implementation that supports whatever the use case is here, but without leaking internals.

Michael



Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig <mdue...@apache.org> wrote:


Hi,

I very much share Francesco's concerns here. Unconditionally exposing
access to operating system resources underlying Oak's inner workings is
troublesome for various reasons:

- who owns the resource? Who coordinates (concurrent) access to it and
how? What are the correctness and performance implications here (races,
deadlock, corruptions, JCR semantics)?

- it limits implementation freedom and hinders further evolution
(chunking, de-duplication, content-based addressing, compression, gc, etc.)
for data stores.

- bypassing JCR's security model

Pretty much all of this has been discussed in the scope of
https://issues.apache.org/jira/browse/JCR-3534 and
https://issues.apache.org/jira/browse/OAK-834. So I suggest reviewing
those discussions before we jump to conclusions.


Also, what is the use case requiring such a vast API surface? Can't we
come up with an API that allows the blobs to stay under the control of
Oak? If not, this is probably an indication that those blobs shouldn't go
into Oak, but just references to them, as Francesco already proposed.
Anything else is neither fish nor fowl: you can't have the JCR goodies
but at the same time access underlying resources at will.

Michael




On 5.5.16 11:00, Francesco Mari wrote:

This proposal introduces a huge leak of abstractions and has deep security
implications.

I guess that the reason for this proposal is that some users of Oak would
like to perform some operations on binaries in a more performant way by
leveraging the way those binaries are stored. If this is the case, I
suggest those users evaluate an application-level solution implemented on
top of the JCR API.

If a user needs to store some important binary data (files, images, etc.)
in an S3 bucket or on the file system for performance reasons, this
shouldn't affect how Oak handles blobs internally. If some assets are of
special interest to the user, then the user should bypass Oak and take
care of the storage of those assets directly. Oak can be used to store
*references* to those assets, which can then be used in user code to
manipulate the assets in the user's own business logic.
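
For example (plain JCR; the property name and URI are only illustrative):

    Node asset = session.getNode("/content/assets/image1");
    asset.setProperty("externalBinaryRef", "s3://my-bucket/assets/image1.png");
    session.save();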

If the scenario I outlined is not what inspired this proposal, I would
like
to know more about the reasons why this proposal was brought up. Which
problems are we going to solve with this API? Is there a more concrete use
case that we can use as a driving example?

2016-05-05 10:06 GMT+02:00 Davide Giannella <dav...@apache.org>:

On 04/05/2016 17:37, Ian Boston wrote:

Hi,
If the File or URL is writable, will writing to the location cause issues
for Oak? IIRC some Oak DS implementations use a digest of the content to
determine the location in the DS, so changing the content via Oak will
change the location, but changing the content via the File or URL won't.
If I didn't remember correctly, then ignore the concern. Fully supportive
of the approach, as a consumer of Oak. The locations will very probably
leak outside the context of an Oak session, so the API contract should
make it clear that code using a direct location needs to behave
responsibly.


It's a reasonable concern, and I'm not into the details of the
implementation. It's worth keeping in mind, though, and remembering that
if we want to adapt to URL or File, we may have to come up with some sort
of read-only version of those.

For the File class, IIRC, we could force/use the setReadOnly() and
setWritable() methods. I remember those being quite expensive in time,
though.
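
Something along these lines (a sketch; the resolver is hypothetical):

    File f = resolveBlobFile(blobId); // hypothetical resolver
    f.setReadOnly(); // marks the file read-only on disk for everyone,
                     // and is best effort: it returns false on failure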

Davide





