Some more points around the proposed callback-based approach:

1. Possible security concern / enforcing read-only access to the exposed file -
The file provided within the BlobProcessor callback can be a symlink
created by an OS user account which only has read-only access. The symlink
can be removed once the callback returns.
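
Roughly, the implementation could look like the sketch below. This is just an
illustration - FileBackedAdaptableBlob is a made-up adapter name and the exact
wiring is an assumption; only BlobProcessor/AdaptableBlob refer to the strawman
interfaces quoted further down.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SymlinkScopedAccess {
    // Sketch only: FileBackedAdaptableBlob is a hypothetical adapter, not an existing class
    void processViaSymlink(Path blobFile, BlobProcessor processor) throws IOException {
        // symlink lives in a location owned by an OS account with read-only access to the data store files
        Path link = Files.createTempDirectory("blob-access").resolve(blobFile.getFileName());
        Files.createSymbolicLink(link, blobFile);
        try {
            processor.process(new FileBackedAdaptableBlob(link.toFile()));
        } finally {
            // the handle is only valid for the duration of the callback
            Files.deleteIfExists(link);
        }
    }
}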

2. S3 DataStore security concern - For the S3 DataStore we would only be
exposing the S3 object identifier, and the client code would still need the
AWS credentials to connect to the bucket and perform the required copy
operation.
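
For illustration, the client-side code with the AWS SDK might look like the
sketch below. It assumes adaptTo(...) hands out the bucket name and key of the
backing S3 object - that shape is an assumption, not a committed API.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

class RenditionCopier {
    // Sketch only: the client supplies its own AWS credentials via the default provider chain
    void copyRendition(String srcBucket, String srcKey, String dstBucket, String dstKey) {
        AmazonS3 s3 = new AmazonS3Client();
        // server-side copy within S3, no binary data is streamed through the JVM
        s3.copyObject(srcBucket, srcKey, dstBucket, dstKey);
    }
}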

3. Possibility of further optimization in S3DataStore processing -
Currently when reading a binary from the S3DataStore the binary content is
*always* spooled to a local temporary file (in the local cache) and then an
InputStream is opened on that file. So even if the code only needs to read
the initial few bytes of the stream, the whole file has to be read. This
happens because with the current JCR Binary API we are not in control of
the lifetime of the exposed InputStream. So if, say, we expose the
InputStream we cannot determine until when the backing S3 SDK resources
need to be held.

Also, the current S3DataStore always creates a local copy. With a
callback-based approach we can safely expose this file, which would allow
layers above to avoid spooling the content again locally for processing.
And with the callback boundary we can later do the required cleanup.
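
As an example of what the consuming code might look like against the strawman
API quoted below - the adaptTo(File.class) part is an assumption about what the
S3DataStore would expose once the binary is in the local cache:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class HeaderSniffer {
    // Sketch of a client of the proposed callback API
    byte[] readHeader(BlobStore blobStore, String blobId) {
        final byte[] header = new byte[16];
        blobStore.process(blobId, new BlobProcessor() {
            @Override
            public void process(AdaptableBlob blob) {
                File file = blob.adaptTo(File.class);
                try (FileInputStream in = new FileInputStream(file)) {
                    in.read(header); // only the first few bytes are needed
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                // do not keep 'file' around: it is only valid inside this callback
            }
        });
        return header;
    }
}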


Chetan Mehrotra

On Mon, May 9, 2016 at 7:15 PM, Chetan Mehrotra <chetan.mehro...@gmail.com>
wrote:

> Had an offline discussion with Michael on this and explained the use case
> requirement in more detail. One concern that has been raised is that such
> a generic adaptTo API is too inviting for improper use, and Oak does not
> have any context around when this URL is exposed and for how long it is used.
>
> So instead of having a generic adaptTo API at the JCR level we can have a
> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
> Once we have a consensus we can go over the details.
>
> interface BlobProcessor {
>        void process(AdaptableBlob blob);
> }
>
> Where AdaptableBlob is
>
> public interface AdaptableBlob {
>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
> }
>
> The BlobProcessor instance can be passed via the BlobStore API. So the
> client would look up a BlobStore service (i.e. use the Oak-level API) and
> pass it the ContentIdentity of the JCR Binary, aka the blobId:
>
> interface BlobStore {
>      void process(String blobId, BlobProcessor processor);
> }
>
> The approach ensures
>
> 1. That any blob handle exposed is only guaranteed for the duration
> of the 'process' invocation
> 2. There is no guarantee on the validity of the blob handle (File, S3
> object) beyond the callback. So one should not collect the passed File
> handle for later use
>
> Hopefully this should address some of the concerns raised in this thread.
> Looking forward to feedback :)
>
> Chetan Mehrotra
>
> On Mon, May 9, 2016 at 6:24 PM, Michael Dürig <mdue...@apache.org> wrote:
>
>>
>>
>> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>>
>>> To highlight - as mentioned earlier, the user of the proposed API is tying
>>> itself to implementation details of Oak, and if these change later then
>>> that code would also need to be changed. Or as Ian summed it up:
>>>
>>>> if the API is introduced it should create an out of band agreement with
>>>> the consumers of the API to act responsibly.
>>>
>>
>> So what does "to act responsibly" actually mean? Are we even in a
>> position to precisely specify this? Experience tells me that we only find
>> out about those semantics after the fact when dealing with painful and
>> expensive customer escalations.
>>
>> And even if we could, it would tie Oak into very tight constraints on how
>> it has to behave and how not. Constraints that would turn out prohibitively
>> expensive for future evolution. Furthermore a huge amount of resources
>> would be required to formalise such constraints via test coverage to guard
>> against regressions.
>>
>>
>>
>>> The method is to be used for those important cases where you do rely on
>>> implementation details to get optimal performance in very specific
>>> scenarios. It's like DocumentNodeStore making use of some Mongo-specific
>>> API to perform some important critical operation to achieve better
>>> performance by checking if the underlying DocumentStore is Mongo based.
>>>
>>
>> Right, but the Mongo-specific API is a (hopefully) well thought-through
>> API, whereas with your proposal there are a lot of open questions and
>> concerns as per my last mail.
>>
>> Mongo (and any other COTS DB) for good reasons also doesn't give you direct
>> access to its internal file handles.
>>
>>
>>
>>> I have seen the discussion on JCR-3534 and other related issues but still
>>> do not see any conclusion on how to answer such queries where direct
>>> access to blobs is required for performance reasons. This issue is not
>>> about exposing the blob reference for remote access but more about an
>>> optimal path for in-VM access.
>>>
>>
>> One bottom line of the discussions in that issue is that we came to a
>> conclusion after clarifying the specifics of the use case. Something I'm
>> still missing here. The case you brought forward is too general to serve as
>> a guideline for a solution. Quite to the contrary, to me it looks like a
>> solution to some problem (I'm trying to understand).
>>
>>
>>
>>>> who owns the resource? Who coordinates (concurrent) access to it and how?
>>>> What are the correctness and performance implications here (races,
>>>> deadlock, corruptions, JCR semantics)?
>>>
>>> The client code would need to be implemented in a proper way. It's more
>>> like implementing a CommitHook. If implemented in an incorrect way it
>>> would cause issues, deadlocks, etc. But then we assume that anyone
>>> implementing that interface would take proper care in the implementation.
>>>
>>
>> But a commit hook is an internal SPI. It is not advertised to the whole
>> world as a public API.
>>
>>
>>
>>>> it limits implementation freedom and hinders further evolution
>>>> (chunking, de-duplication, content based addressing, compression, gc,
>>>> etc.) for data stores.
>>>
>>> As mentioned earlier, some parts of an API indicate a closer dependency on
>>> how things work (like an SPI, or a ConsumerType API in OSGi terms). By
>>> using such an API client code definitely ties itself to Oak implementation
>>> details, but that should not limit how the Oak implementation details
>>> evolve. So when they change, client code needs to adapt itself
>>> accordingly. Oak can express that by incrementing the minor version of the
>>> exported package to indicate the change in behavior.
>>>
>>
>> Which IMO is completely contradictory. Such an API would prevent us from
>> refactoring internal storage formats if a new format couldn't implement the
>> API (e.g. because of chunking, compression, deduplication etc).
>>
>>
>>>> Can't we come up with an API that allows the blobs to stay under control
>>>> of Oak?
>>>
>>> The code needs to work either at the OS level (say a file handle) or at
>>> the S3 object level. So I do not see a way it can work without having
>>> access to those details.
>>>
>>
>> Again, why? What's the precise use case here? If this really is the
>> conclusion, then a corollary would be that those binaries must not go into
>> Oak.
>>
>>
>>> FWIW there is code out there which reverse engineers the blobId to access
>>> the actual binary. People do it so as to get decent throughput in image
>>> rendition logic for large scale deployments. The proposal here was to
>>> formalize that approach by providing a proper API. If we do not provide
>>> such an API then the only way for them would be to continue relying on
>>> reverse engineering the blobId!
>>>
>>
>> This is hardly a good argument. Formalising other people's hacks means
>> making us liable. What *we* need to do is understand their use case and
>> come up with a clean solution.
>>
>>
>>
>>>> If not, this is probably an indication that those blobs shouldn't go into
>>>> Oak but just references to it as Francesco already proposed. Anything else
>>>> is neither fish nor fowl: you can't have the JCR goodies but at the same
>>>> time access underlying resources at will.
>>>
>>> That's a fine argument to make. But then users here have a real problem to
>>> solve which we should not ignore. Oak-based systems are being proposed for
>>> large asset deployments where one of the primary requirements is asset
>>> handling/processing of 100s of TB of binary data. So we would then have to
>>> recommend for such cases to not use the JCR Binary abstraction and manage
>>> the binaries on your own. That would then solve both problems (though that
>>> might break lots of tooling built on top of the JCR API to manage those
>>> binaries)!
>>>
>>
>> But then we need to provide means to enable such customers to do their
>> processing. Not punch holes into Oak's architecture that we would probably
>> be fighting for the rest of the product's lifecycle.
>>
>>
>>
>>> Thinking more - another approach that I can then suggest is that people
>>> implement their own BlobStore (maybe by extending ours) and provide this
>>> API there, i.e. one which takes the blob id and provides the required
>>> details. This way we "outsource" the problem. Would that be acceptable?
>>>
>>
>> That's actually an acceptable solution. Any custom code is outside of
>> Oak's liability. However, I'd prefer an approach where we come up with a
>> blob store implementation that supports whatever the use case is here. But
>> without leaking internals.
>>
>> Michael
>>
>>
>>
>>
>>> Chetan Mehrotra
>>>
>>> On Mon, May 9, 2016 at 2:28 PM, Michael Dürig <mdue...@apache.org>
>>> wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> I very much share Francesco's concerns here. Unconditionally exposing
>>>> access to operating system resources underlying Oak's inner workings is
>>>> troublesome for various reasons:
>>>>
>>>> - who owns the resource? Who coordinates (concurrent) access to it and
>>>> how? What are the correctness and performance implications here (races,
>>>> deadlock, corruptions, JCR semantics)?
>>>>
>>>> - it limits implementation freedom and hinders further evolution
>>>> (chunking, de-duplication, content based addressing, compression, gc,
>>>> etc.)
>>>> for data stores.
>>>>
>>>> - bypassing JCR's security model
>>>>
>>>> Pretty much all of this has been discussed in the scope of
>>>> https://issues.apache.org/jira/browse/JCR-3534 and
>>>> https://issues.apache.org/jira/browse/OAK-834. So I suggest reviewing
>>>> those discussions before we jump to conclusions.
>>>>
>>>>
>>>> Also what is the use case requiring such a vast API surface? Can't we
>>>> come
>>>> up with an API that allows the blobs to stay under control of Oak? If
>>>> not,
>>>> this is probably an indication that those blobs shouldn't go into Oak
>>>> but
>>>> just references to it as Francesco already proposed. Anything else is
>>>> neither fish nor fowl: you can't have the JCR goodies but at the same
>>>> time
>>>> access underlying resources at will.
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>>
>>>> On 5.5.16 11:00 , Francesco Mari wrote:
>>>>
>>>> This proposal introduces a huge leak of abstractions and has deep
>>>>> security
>>>>> implications.
>>>>>
>>>>> I guess that the reason for this proposal is that some users of Oak would
>>>>> like to perform some operations on binaries in a more performant way by
>>>>> leveraging the way those binaries are stored. If this is the case, I
>>>>> suggest those users evaluate an application-level solution implemented on
>>>>> top of the JCR API.
>>>>>
>>>>> If a user needs to store some important binary data (files, images,
>>>>> etc.) in an S3 bucket or on the file system for performance reasons,
>>>>> this shouldn't affect how Oak handles blobs internally. If some assets
>>>>> are of special interest for the user, then the user should bypass Oak
>>>>> and take care of the storage of those assets directly. Oak can be used
>>>>> to store *references* to those assets, which can be used in user code to
>>>>> manipulate the assets in its own business logic.
>>>>>
>>>>> If the scenario I outlined is not what inspired this proposal, I would
>>>>> like
>>>>> to know more about the reasons why this proposal was brought up. Which
>>>>> problems are we going to solve with this API? Is there a more concrete
>>>>> use
>>>>> case that we can use as a driving example?
>>>>>
>>>>> 2016-05-05 10:06 GMT+02:00 Davide Giannella <dav...@apache.org>:
>>>>>
>>>>> On 04/05/2016 17:37, Ian Boston wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>> If the File or URL is writable, will writing to the location cause
>>>>>>> issues
>>>>>>> for Oak ?
>>>>>>> IIRC some Oak DS implementations use a digest of the content to
>>>>>>> determine
>>>>>>> the location in the DS, so changing the content via Oak will change
>>>>>>> the
>>>>>>> location, but changing the content via the File or URL won't. If I
>>>>>>> didn't
>>>>>>> remember correctly, then ignore the concern.  Fully supportive of the
>>>>>>> approach, as a consumer of Oak. The locations will certainly probably
>>>>>>>
>>>>>>> leak
>>>>>>
>>>>>> outside the context of an Oak session so the API contract should make
>>>>>>> it
>>>>>>> clear that the code using a direct location needs to behave
>>>>>>> responsibly.
>>>>>>>
>>>>>>>
>>>>>>> It's a reasonable concern and I'm not in the details of the
>>>>>> implementation. It's worth to keep in mind though and remember if we
>>>>>> want to adapt to URL or File that maybe we'll have to come up with
>>>>>> some
>>>>>> sort of read-only version of such.
>>>>>>
>>>>>> For the File class, IIRC, we could force/use the setReadOnly(),
>>>>>> setWritable() methods. I remember those to be quite expensive in time
>>>>>> though.
>>>>>>
>>>>>> Davide
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>
