[DiSCUSS] - highly vs rarely used data

2017-06-23 Thread Tommaso Teofili
Hi all,

I was recently at a conference [1], where I attended an interesting
keynote about data management [2] (I think it refers to this 2016 paper
[3]).

Apart from the approaches proposed to solve the data management problem
(e.g. get rid of DBMSs!) I got interested in the discussion about how we
deal with the increasing amount of data that we have to manage (also
because of some issues we have [4]).
In many systems only a very small subset of the data is used, because the
information users really need is mostly the most recently ingested data
(e.g. social networks). While that doesn't always apply to content
repositories in general (e.g. if you build a CMS on top of one), I think
it's interesting to consider whether we can optimize our persistence layer
to work better with highly used data (e.g. more recent) and use less
space/CPU for data that is used more rarely.

For example, putting this together with the incremental indexing section of
the paper [3], I was thinking (but that's already a solution rather than
"just" a discussion) that perhaps we could simply avoid indexing *some*
content until it's needed (e.g. the first time a query falls back to
traversal, index that content so that the next query over the same data
will be faster), but that's just an example.
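
A minimal sketch of the "index on first traversal" idea, assuming a toy
in-memory store. The class and method names (LazilyIndexedStore, query,
traversalCount) are hypothetical and not part of Oak; node updates and
removals are left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these types are not part of Oak.
class LazilyIndexedStore {
    // path -> (property name -> value)
    private final Map<String, Map<String, String>> nodes = new HashMap<>();
    // property name -> (value -> matching paths); only built when first queried
    private final Map<String, Map<String, List<String>>> indexes = new HashMap<>();
    private int traversals = 0;

    void addNode(String path, Map<String, String> properties) {
        nodes.put(path, new HashMap<>(properties));
        // Keep already-built indexes current; unindexed properties cost nothing here.
        for (Map.Entry<String, String> p : properties.entrySet()) {
            Map<String, List<String>> index = indexes.get(p.getKey());
            if (index != null) {
                index.computeIfAbsent(p.getValue(), v -> new ArrayList<>()).add(path);
            }
        }
    }

    List<String> query(String property, String value) {
        Map<String, List<String>> index = indexes.get(property);
        if (index == null) {
            // First query on this property: traverse once and keep the result as an index.
            index = buildIndexByTraversal(property);
            indexes.put(property, index);
        }
        return index.getOrDefault(value, new ArrayList<>());
    }

    private Map<String, List<String>> buildIndexByTraversal(String property) {
        traversals++;
        Map<String, List<String>> index = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> node : nodes.entrySet()) {
            String v = node.getValue().get(property);
            if (v != null) {
                index.computeIfAbsent(v, k -> new ArrayList<>()).add(node.getKey());
            }
        }
        return index;
    }

    int traversalCount() {
        return traversals;
    }
}
```

The first query on a property pays for one traversal; later queries on the
same property are answered from the index that traversal produced.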

What do others think ?
Regards,
Tommaso

[1] : http://www.iccs-meeting.org/iccs2017/
[2] : http://www.iccs-meeting.org/iccs2017/keynote-lectures/#Ailamaki
[3] : https://infoscience.epfl.ch/record/219993/files/p12-pavlovic.pdf
[4] : https://issues.apache.org/jira/browse/OAK-5192


Re: [DiSCUSS] - highly vs rarely used data

2017-06-26 Thread Michael Dürig


Hi,

I agree we should have a better look at access patterns, not only for 
indexing. I recently came across a repository with about 65% of its 
content in the version store. That content is pretty much archived and 
never accessed. Yet it fragments the index and thus impacts general 
access times.


Michael



Re: [DiSCUSS] - highly vs rarely used data

2017-06-26 Thread Davide Giannella
On 26/06/2017 09:00, Michael Dürig wrote:
>
> I agree we should have a better look at access patterns, not only for
> indexing. I recently came across a repository with about 65% of its
> content in the version store. That content is pretty much archived and
> never accessed. Yet it fragments the index and thus impacts general
> access times.

I may say something stupid as usual, but here I can see, for example, that
such content could be "moved to a slower repository". Speaking of segment,
for example, it could be stored in a compressed segment (rather than plain
tar), and the repository could automatically configure the indexes to skip
that part and/or additionally create an ad-hoc index that would be async by
definition and run every, let's say, 10 seconds.

We would gain on the repository size and indexing speed.

Just a couple of ideas off the top of my head.

Davide




Re: [DiSCUSS] - highly vs rarely used data

2017-06-26 Thread Matt Ryan
Hi,

With respect to Oak data stores, this is something I am hoping to support
later this year after the implementation of the CompositeDataStore (which
I'm still working on).

First, the assumption is that there would be a working CompositeDataStore
that can manage multiple data stores, and can select a data store for a
blob based on something like a JCR property (I'm still figuring this part
out).  In such a case, it would be possible to add a property to blobs that
can be archived, and then the CompositeDataStore could store them in a
different location - think AWS Glacier if there were a Glacier-compatible
data store.  Of course this would require that we also support an access
pattern in Oak where Oak knows that a blob can be retrieved but cannot
reply to a request with the requested blob immediately.  Instead Oak would
have to give a response indicating "I can get it, but it will take a while"
and suggest when it might be available.

That's just one example.  I believe once I figure out the
CompositeDataStore it will be able to support a lot of neat scenarios on
the blob store side of things anyway.

-MR



Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Thomas Mueller
Hi,

I guess you are talking about Amazon Glacier. Did you know about "Expedited
retrievals", by the way?
https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/
 - it looks like it's more than just "slow" + "fast".

About deciding which binaries to move to the slow storage: It would be good if 
that's automatic. Couldn't that be based on access frequency + recency? If a 
binary is not accessed for some time, it is moved to slow storage. I would add: 
if it was not accessed for some time, _plus_ it was rarely accessed before. 
Reason: for caching, it is well known that not only the recency, but also 
frequency, are important to predict if an entry will be needed in the near 
future. To do that, we could maintain a log that tells you when, and how many 
times, a binary was read. Maybe Amazon / Azure keep some info about that, but 
let's assume not (or not in such a way we want or can use). 

For example, each client appends the blob ids that it reads to a file. Multiple 
such files could be merged. To save space for such files (probably not needed, 
but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed 
multiple times.
* Maybe you don't care about smallish binaries (smaller than 1 MB for example), 
or care less about them. So, for example only move files larger than 1 MB. That 
means no need to add an entry.
* A bloom filter or similar could be used (so you would retain x% too many 
entries). Or even simpler: only write the first x characters of the binary id. 
That way, we retain x% too much in fast storage, but save time, space, and 
memory for maintenance.
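
A minimal sketch of such a read log, assuming an in-memory list in place
of the append-only file. The names (BlobAccessLog, recordRead) and all
parameters are hypothetical and not part of Oak. It combines three of the
space-saving ideas above: a small cache to suppress repeated ids, a size
threshold, and writing only a prefix of the id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Oak.
class BlobAccessLog {
    private final long minSizeBytes;
    private final int prefixLength;
    // Stands in for the append-only file each client would write.
    private final List<String> log = new ArrayList<>();
    private final Map<String, Boolean> recentlyLogged;

    BlobAccessLog(long minSizeBytes, int prefixLength, final int cacheSize) {
        this.minSizeBytes = minSizeBytes;
        this.prefixLength = prefixLength;
        // An access-ordered LinkedHashMap used as a small LRU cache.
        this.recentlyLogged = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Records a read; returns true if an entry was appended to the log. */
    boolean recordRead(String blobId, long sizeBytes) {
        if (sizeBytes < minSizeBytes) {
            return false; // small binaries are never moved, so no entry is needed
        }
        // Writing only a prefix keeps a few extra binaries in fast storage but saves space.
        String key = blobId.length() > prefixLength
                ? blobId.substring(0, prefixLength)
                : blobId;
        if (recentlyLogged.put(key, Boolean.TRUE) != null) {
            return false; // already written recently, skip the duplicate
        }
        log.add(key);
        return true;
    }

    List<String> entries() {
        return log;
    }
}
```

Note that suppressing duplicates through the cache trades away exact read
counts; a real implementation would have to decide how much frequency
information it needs to keep.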

Regards,
Thomas






Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Bertrand Delacretaz
On Fri, Jun 30, 2017 at 10:44 AM, Thomas Mueller wrote:
> ...About deciding which binaries to move to the slow storage: It would be 
> good if that's automatic...

From my perspective as an Oak user I would like to have control on that.

It would be nice for Oak to make *suggestions* about moving things to
cold storage, but there might be application constraints that need to
be accounted for.

-Bertrand


Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Thomas Mueller
> From my perspective as an Oak user I would like to have control on that.
> It would be nice for Oak to make *suggestions* about moving things to
> cold storage, but there might be application constraints that need to
> be accounted for.

That sounds reasonable. What would be the "API" for this? Let's say the API is: 
configure a path that _allows_ binaries to be migrated to cold storage. It's 
not allowed for all other paths. The default configuration could be: allow for 
/jcr:system/jcr:versionStorage, don't allow anywhere else. This could be 
implemented using automatic moving (as I have described), _plus_ a background 
job that, twice a month, traverses all nodes and reads the first few bytes of 
all nodes that are _not_ in /jcr:system/jcr:versionStorage. The traversal could 
additionally do some reporting, for example how many binaries there are, how
many times they were read, and how much money you could save if configured
like this.

For automatic moving, behaviour could be:

- To move to cold storage: configuration would be needed: size, access
frequency, recency (e.g. only move binaries larger than 1 MB that were not
accessed for one month, and that were accessed only once in the month before
that).

- When trying to access a binary that is in cold storage: you get an exception 
saying the binary is in cold storage. Plus, if configured, the binary would 
automatically be read from cold storage, so it's available within x minutes 
(configurable) when re-read.

- Bulk copy from cold storage to regular storage: This might be needed to 
create a full backup. We might need an API for this. 
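
The example rule in the first item (larger than 1 MB, not read for one
month, read at most once in the month before that) can be sketched as a
small predicate. This is only an illustration under assumptions: the class
ColdStoragePolicy, its parameters, and the idea of passing in a list of
read timestamps are all hypothetical, not Oak API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical sketch; this class is not part of Oak.
class ColdStoragePolicy {
    private final long minSizeBytes;
    private final Duration idlePeriod;      // no reads allowed in this most recent window
    private final Duration lookbackPeriod;  // window before the idle period
    private final int maxReadsInLookback;

    ColdStoragePolicy(long minSizeBytes, Duration idlePeriod,
                      Duration lookbackPeriod, int maxReadsInLookback) {
        this.minSizeBytes = minSizeBytes;
        this.idlePeriod = idlePeriod;
        this.lookbackPeriod = lookbackPeriod;
        this.maxReadsInLookback = maxReadsInLookback;
    }

    /** Returns true if the binary qualifies for cold storage at time 'now'. */
    boolean shouldMove(long sizeBytes, List<Instant> reads, Instant now) {
        if (sizeBytes <= minSizeBytes) {
            return false; // size criterion
        }
        Instant idleStart = now.minus(idlePeriod);
        Instant lookbackStart = idleStart.minus(lookbackPeriod);
        int readsInLookback = 0;
        for (Instant read : reads) {
            if (read.isAfter(idleStart)) {
                return false; // recency criterion: read too recently
            }
            if (!read.isBefore(lookbackStart)) {
                readsInLookback++; // frequency criterion: count reads in the earlier window
            }
        }
        return readsInLookback <= maxReadsInLookback;
    }
}
```

With both periods set to 30 days and maxReadsInLookback set to 1, this
matches the "one month idle, at most one read in the month before" example.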

Regards,
Thomas



Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Matt Ryan
As I've been thinking about this I wouldn't do it based on last accessed
time, at least not directly.  Using the example of moving infrequently used
blobs to cold storage, I would use a property on the node, e.g.
"archiveState=toArchive".  In this case the property can be clearly tied to
that purpose.  This can be done under the complete control of a user, who can
choose to designate "all blobs under this folder can be archived" simply by
setting the property on all the nodes.  Or a background process can run
that understands the automatic archival logic, if it is enabled and
configured, and this process goes through the tree e.g. once a week and
marks any nodes that should be archived simply by changing the archiveState.

Having more than two supported archiveStates allows a query to
differentiate between nodes that are designated for archival but are not
archived yet, and nodes that are actually moved to cold storage.  This can
be useful for example if a GUI that is browsing the repo wants to mark
nodes that are archived with some sort of decorator, so users know not to
try to open it unless they intend to unarchive it.

Using a property directly specified for this purpose gives us more direct
control over how it is being used I think.
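
A minimal sketch of what more than two archive states could look like.
Only the value "toArchive" comes from the example above; the other state
names ("active", "archived") and the enum itself are assumptions made for
illustration.

```java
// Hypothetical sketch; only the "toArchive" value appears in the proposal above.
enum ArchiveState {
    ACTIVE("active"),          // assumed default when the property is absent
    TO_ARCHIVE("toArchive"),   // designated for archival, still in regular storage
    ARCHIVED("archived");      // assumed name for "actually moved to cold storage"

    private final String propertyValue;

    ArchiveState(String propertyValue) {
        this.propertyValue = propertyValue;
    }

    String propertyValue() {
        return propertyValue;
    }

    /** Maps the stored property value (or null if unset) to a state. */
    static ArchiveState fromProperty(String value) {
        if (value == null) {
            return ACTIVE;
        }
        for (ArchiveState state : values()) {
            if (state.propertyValue.equals(value)) {
                return state;
            }
        }
        throw new IllegalArgumentException("Unknown archiveState: " + value);
    }

    /** A GUI could use this to decide whether to show an "archived" decorator. */
    boolean isInColdStorage() {
        return this == ARCHIVED;
    }
}
```

Keeping TO_ARCHIVE separate from ARCHIVED is what lets a query tell
"marked but still readable" apart from "slow to open".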



Re: [DiSCUSS] - highly vs rarely used data

2017-07-03 Thread Thomas Mueller
Hi,

> a property on the node, e.g. "archiveState=toArchive"

I wonder if we _can_ easily write to the version store? Also, some nodetypes 
don't allow such properties? It might need to be a hidden property, but then 
you can't use the JCR API. Or maintain this data in a "shadow" structure (not 
with the nodes), which would complicate move operations.

If I was a customer, I wouldn't want to *manually* mark / unmark binaries to
be moved to / from long-term storage. I would probably just want to rely on
automatic management. But I'm not a customer, so my opinion is not that 
relevant (

> Using a property directly specified for this purpose gives us more direct 
> control over how it is being used I think.

Sure, but it also comes with some complexities.

Regards,
Thomas





Re: [DiSCUSS] - highly vs rarely used data

2017-07-03 Thread Tommaso Teofili
I am sure there are both use cases for automatic vs manual/controlled
collection of unused data, however if I were a user I would personally not
want to care about this. While I'd be happy to know that my repo is faster
/ smaller / cleaner / whatever it'd sound overly complex to deal with JCR
and Oak constraints and behaviours from the application layer.
IMHO if we want to have such a feature in Oak to save resources, it should
be the persistence layer's responsibility to say "hey, this content has not
been accessed for ages, let's try to reclaim some resources from it" (which
could mean moving it to cold storage, compressing it or anything else).

My 2 cents,
Tommaso





Re: [DiSCUSS] - highly vs rarely used data

2017-07-04 Thread Julian Sedding
From my experience working with customers, I can pretty much guarantee
that sooner or later:

(a) the implementation of an automatism is not *quite* what they need/want
(b) they want to be able to manually select (or more likely override)
whether a file can be archived

Thus I suggest to come up with a pluggable "strategy" interface and
provide a sensible default implementation. The default will be fine
for most customers/users, but advanced use-cases can be implemented by
substituting the implementation. Implementations could then also
respect manually set flags (=properties) if desired.
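
To make the suggestion a bit more tangible, here is a minimal sketch of
such a strategy interface with a default implementation that respects a
manually set flag. All names (ArchivalStrategy, BinaryCandidate,
DefaultArchivalStrategy) and the chosen criteria are hypothetical; nothing
like this exists in Oak.

```java
// Hypothetical sketch of the "strategy" idea; none of these types exist in Oak.
final class BinaryCandidate {
    final String path;
    final long sizeBytes;
    final int daysSinceLastRead;
    final Boolean manualFlag; // null means no manual flag (property) is set

    BinaryCandidate(String path, long sizeBytes, int daysSinceLastRead, Boolean manualFlag) {
        this.path = path;
        this.sizeBytes = sizeBytes;
        this.daysSinceLastRead = daysSinceLastRead;
        this.manualFlag = manualFlag;
    }
}

interface ArchivalStrategy {
    /** Decides whether the given binary may be moved to cold storage. */
    boolean mayArchive(BinaryCandidate candidate);
}

class DefaultArchivalStrategy implements ArchivalStrategy {
    private final long minSizeBytes;
    private final int minIdleDays;

    DefaultArchivalStrategy(long minSizeBytes, int minIdleDays) {
        this.minSizeBytes = minSizeBytes;
        this.minIdleDays = minIdleDays;
    }

    @Override
    public boolean mayArchive(BinaryCandidate candidate) {
        if (candidate.manualFlag != null) {
            return candidate.manualFlag; // a manual override always wins
        }
        // Otherwise fall back to the automatic size + recency rule.
        return candidate.sizeBytes > minSizeBytes
                && candidate.daysSinceLastRead >= minIdleDays;
    }
}
```

An advanced deployment could substitute its own implementation, for
example a path-based one written as a lambda:
`ArchivalStrategy s = c -> c.path.startsWith("/jcr:system/jcr:versionStorage");`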

A much more important and difficult question to answer IMHO is how to
deal with the slow retrieval of archived content. And if needed, how
to expose the slow availability (i.e. unavailable now but available
later) to the end user (or application layer). To me this sounds
tricky if we want to stick to the JCR API.

Regards
Julian





Re: [DiSCUSS] - highly vs rarely used data

2017-07-05 Thread Davide Giannella
On 04/07/2017 11:48, Julian Sedding wrote:
> A much more important and difficult question to answer IMHO is how to
> deal with the slow retrieval of archived content. And if needed, how
> to expose the slow availability (i.e. unavailable now but available
> later) to the end user (or application layer). To me this sounds
> tricky if we want to stick to the JCR API.

I think we should NOT touch the JCR API but rather use/expose the Oak
API for such features. A consuming application could then leverage one,
the other, or both.

If we are going to touch the JCR API we should probably sit down a while
and think about JCR API 3 ;)

Davide




Re: [DiSCUSS] - highly vs rarely used data

2017-07-05 Thread Thomas Mueller
Hi,

> (a) the implementation of an automatism is not *quite* what they need/want
> (b) they want to be able to manually select (or more likely override)
> whether a file can be archived

Well, behind the scenes, we anyway need a way to move entries to / from cold 
storage. But in my view, that's low-level API, and I wouldn't expose it first, 
but instead concentrate on implementing an automatic solution, that has no API 
(except for some config options). If it later turns out the low-level API is 
needed, it can still be added. I wouldn't introduce that as public API right 
from the start, just because we _think_ it _might_ be needed at some point 
later. Because having to maintain the API is expensive.

What I would introduce right from the start is a way to measure which binaries 
were read recently, and how frequently. But even for that, there is no public 
API needed first (except for maybe logging some statistics).

> Thus I suggest to come up with a pluggable "strategy" interface

That is too abstract for me. I think it is very important to have a concrete
behaviour and API, otherwise discussing it is not possible.

> A much more important and difficult question to answer IMHO is how to deal 
> with the slow retrieval of archived content.

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an 
exception saying so, and load the binary into hot storage. A few minutes later, 
re-reading will not throw an exception as it's in hot storage. So, there is no 
API change needed, except for a new exception class (subclass of 
RepositoryException). An application can catch those exceptions and deal with 
them in a special way (write that the binary is not currently available). 
Possibly the new exception could have a method "doNotMoveBinary()" in case 
moving is not needed, but by default the binary should be moved, so that old 
applications don't have to be changed at all (backward compatibility).
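
A minimal sketch of that behaviour, under assumptions: the store is a toy
in-memory stand-in, the class names are hypothetical, and a local
StandInRepositoryException replaces javax.jcr.RepositoryException so the
example compiles without the JCR API jar. Only the method name
doNotMoveBinary() is taken from the suggestion above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for javax.jcr.RepositoryException, which the proposal would subclass.
class StandInRepositoryException extends Exception {
    StandInRepositoryException(String message) {
        super(message);
    }
}

// Hypothetical exception type; the class name is an assumption.
class BinaryInColdStorageException extends StandInRepositoryException {
    private final Runnable cancelRestore;

    BinaryInColdStorageException(String blobId, Runnable cancelRestore) {
        super("Binary " + blobId + " is in cold storage");
        this.cancelRestore = cancelRestore;
    }

    /** Opts out of the restore that is otherwise scheduled by default. */
    void doNotMoveBinary() {
        cancelRestore.run();
    }
}

// Toy two-tier store; a real one would restore asynchronously over minutes.
class TieredBinaryStore {
    private final Map<String, byte[]> hot = new HashMap<>();
    private final Map<String, byte[]> cold = new HashMap<>();
    private final Set<String> pendingRestores = new HashSet<>();

    void putCold(String id, byte[] data) {
        cold.put(id, data);
    }

    byte[] read(String id) throws BinaryInColdStorageException {
        byte[] data = hot.get(id);
        if (data != null) {
            return data;
        }
        if (!cold.containsKey(id)) {
            throw new IllegalArgumentException("Unknown binary: " + id);
        }
        // Default: schedule the restore, then tell the caller to retry later.
        pendingRestores.add(id);
        throw new BinaryInColdStorageException(id, () -> pendingRestores.remove(id));
    }

    /** Simulates the passage of time until scheduled restores have finished. */
    void completePendingRestores() {
        for (String id : pendingRestores) {
            hot.put(id, cold.remove(id));
        }
        pendingRestores.clear();
    }
}
```

An old application that simply fails on the first read and retries later
gets the binary without any code change, which is the backward
compatibility argument made above.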

What is your concrete suggestion?

Regards,
Thomas 



Re: [DiSCUSS] - highly vs rarely used data

2017-07-10 Thread Bertrand Delacretaz
On Tue, Jul 4, 2017 at 12:48 PM, Julian Sedding wrote:
> ...I suggest to come up with a pluggable "strategy" interface and
> provide a sensible default implementation...

Big +1 to that, requirements can vary widely IMO, also depending on
the characteristics of whatever cold storage is used.

> ...A much more important and difficult question to answer IMHO is how to
> deal with the slow retrieval of archived content

Throw an exception maybe? BinaryNotAvailableAtThisTime, including an
ETA for availability. The application can then decide how to handle
that.

-Bertrand


Re: [DiSCUSS] - highly vs rarely used data

2017-07-10 Thread Chetan Mehrotra
I would prefer a 2 phase implementation here

A - CompositeBlobStore
----------------------

Have support for multiple BlobStores plugged into an Oak setup and
provide an API for the layer above to select which BlobStore should be
used. This forms the lowermost layer in the stack. Such a feature should
support

1. Selecting which store a binary should be written to
2. How a binary gets read
3. Blob GC

B - BinaryStorage Support
-------------------------

Once we have A implemented, the layer above can implement some logic
to manage where binaries are stored without requiring major changes in
core. For example, Oak can extend the current extension point in
BlobStatsCollector to allow plugging in a custom stats collector. This
can then be used by the application to build logic to move content based
on various heuristics

1. Path Based
2. Access Based

The application can then use a standard API to "copy/move" a binary from
one store to another.
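
A minimal sketch of layer A, assuming toy in-memory stores. The names
(CompositeBlobStoreSketch, InMemoryBlobStore, storeForPath) are
hypothetical and do not reflect Oak's BlobStore API. It covers store
selection on write, reads across stores, and a move operation; Blob GC is
left out.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy store; stands in for a real BlobStore implementation.
class InMemoryBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();

    void put(String id, byte[] data) { blobs.put(id, data); }
    byte[] get(String id) { return blobs.get(id); }
    byte[] remove(String id) { return blobs.remove(id); }
    boolean contains(String id) { return blobs.containsKey(id); }
}

// Hypothetical composite; not Oak API.
class CompositeBlobStoreSketch {
    private final Map<String, InMemoryBlobStore> stores = new LinkedHashMap<>();
    // Path-based heuristic supplied by the layer above: node path -> store name.
    private final Function<String, String> storeForPath;

    CompositeBlobStoreSketch(Function<String, String> storeForPath) {
        this.storeForPath = storeForPath;
    }

    void addStore(String name, InMemoryBlobStore store) {
        stores.put(name, store);
    }

    /** Writes to the store selected for the path; returns the store name used. */
    String write(String path, String blobId, byte[] data) {
        String name = storeForPath.apply(path);
        requireStore(name).put(blobId, data);
        return name;
    }

    /** Reads from whichever store holds the blob, or returns null if none does. */
    byte[] read(String blobId) {
        String name = locate(blobId);
        return name == null ? null : stores.get(name).get(blobId);
    }

    /** Returns the name of the store holding the blob, or null. */
    String locate(String blobId) {
        for (Map.Entry<String, InMemoryBlobStore> e : stores.entrySet()) {
            if (e.getValue().contains(blobId)) {
                return e.getKey();
            }
        }
        return null;
    }

    /** Moves a blob to the named store; a no-op if it is already there or unknown. */
    void move(String blobId, String targetName) {
        InMemoryBlobStore target = requireStore(targetName);
        String sourceName = locate(blobId);
        if (sourceName != null && !sourceName.equals(targetName)) {
            target.put(blobId, stores.get(sourceName).remove(blobId));
        }
    }

    private InMemoryBlobStore requireStore(String name) {
        InMemoryBlobStore store = stores.get(name);
        if (store == null) {
            throw new IllegalArgumentException("No such store: " + name);
        }
        return store;
    }
}
```

An access-based heuristic from layer B would then only need to call
move() with the ids it has decided to relocate.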

We can also provide some out-of-the-box implementation, but the key thing
here is that it should be built on top of Oak Core and hence be pluggable.

Given that we have been discussing enhancements in the Binary area for a
long time now [1], it would be better to get #A implemented now with an
eye on the requirements of #B, so that we make some progress here.

Chetan Mehrotra
[1] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase


Re: [DiSCUSS] - highly vs rarely used data

2017-07-11 Thread Thomas Mueller
Hi,

On 10.07.17, 11:18, "Bertrand Delacretaz" wrote:
> Throw an exception maybe? BinaryNotAvailableAtThisTime, including an
> ETA for availability. The application can then decide how to handle
> that.

Bertrand, this is exactly what I have suggested in two previous mails:

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an 
exception saying so, and load the binary into hot storage. A few minutes later, 
re-reading will not throw an exception as it's in hot storage. So, there is no 
API change needed, except for a new exception class (subclass of 
RepositoryException). An application can catch those exceptions and deal with 
them in a special way (write that the binary is not currently available). 
Possibly the new exception could have a method "doNotMoveBinary()" in case 
moving is not needed, but by default the binary should be moved, so that old 
applications don't have to be changed at all (backward compatibility).

Regards,
Thomas
 



Re: [DiSCUSS] - highly vs rarely used data

2017-07-11 Thread Bertrand Delacretaz
Hi Thomas,

On Tue, Jul 11, 2017 at 3:14 PM, Thomas Mueller wrote:
> ...if it's in cold storage, throw an exception saying so, and load the binary 
> into hot storage.
> A few minutes later, re-reading will not throw an exception as it's in hot 
> storage

Ok great, sorry that I missed that earlier.

Note that the exception should not prevent the client from getting the
rest of the data (other properties) of the same Node - I suppose
that's natural if the exception is thrown when calling
Binary.getStream().

-Bertrand