Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Matt Ryan
As I've been thinking about this I wouldn't do it based on last accessed
time, at least not directly.  Using the example of moving infrequently used
blobs to cold storage, I would use a property on the node, e.g.
"archiveState=toArchive".  In this case the property can be clearly tied to
that purpose.  This can be done under the complete control of a user, who can
choose to designate "all blobs under this folder can be archived" simply by
setting the property on all the nodes.  Or a background process can run
that understands the automatic archival logic, if it is enabled and
configured, and this process goes through the tree e.g. once a week and
marks any nodes that should be archived simply by changing the archiveState.

Having more than two supported archiveStates allows a query to
differentiate between nodes that are designated for archival but are not
archived yet, and nodes that are actually moved to cold storage.  This can
be useful for example if a GUI that is browsing the repo wants to mark
nodes that are archived with some sort of decorator, so users know not to
try to open them unless they intend to unarchive them.

Using a property directly specified for this purpose gives us more direct
control over how it is being used I think.
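A rough sketch of that lifecycle, in Python for illustration (the property name "archiveState" and the value "toArchive" come from the proposal above; the value "archived", the function names, and the dict-based node tree are hypothetical stand-ins for the JCR API):

```python
# Sketch of a three-state archive lifecycle driven by a node property.
# States: absent (normal), "toArchive" (designated), "archived" (in cold storage).

def mark_subtree(nodes, folder):
    """User-driven: designate every node under a folder for archival."""
    for path, props in nodes.items():
        if path.startswith(folder) and props.get("archiveState") != "archived":
            props["archiveState"] = "toArchive"

def archive_pass(nodes, move_blob):
    """Background job (e.g. weekly): move designated blobs, record the new state."""
    for path, props in nodes.items():
        if props.get("archiveState") == "toArchive":
            move_blob(path)                     # push binary to cold storage
            props["archiveState"] = "archived"  # a GUI can now show a decorator

nodes = {
    "/content/old/a": {},
    "/content/old/b": {"archiveState": "archived"},
    "/content/new/c": {},
}
mark_subtree(nodes, "/content/old")
moved = []
archive_pass(nodes, moved.append)
```

The two entry points mirror the two workflows described above: an explicit user action on a subtree, and a periodic background process that only flips the property.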

On Fri, Jun 30, 2017 at 6:46 AM, Thomas Mueller 
wrote:

> > From my perspective as an Oak user I would like to have control over that.
> > It would be nice for Oak to make *suggestions* about moving things to
> > cold storage, but there might be application constraints that need to
> > be accounted for.
>
> That sounds reasonable. What would be the "API" for this? Let's say the
> API is: configure a path that _allows_ binaries to be migrated to cold
> storage. It's not allowed for all other paths. The default configuration
> could be: allow for /jcr:system/jcr:versionStorage, don't allow anywhere
> else. This could be implemented using automatic moving (as I have
> described), _plus_ a background job that, twice a month, traverses all
> nodes and reads the first few bytes of all nodes that are _not_ in
> /jcr:system/jcr:versionStorage. The traversal could additionally do some
> reporting, for example how many binaries there were, how many times they
> were read, and how much money could be saved with this configuration.
>
> For automatic moving, behaviour could be:
>
> - To move to cold storage: configuration would be needed: size, access
> frequency, recency (e.g. only move binaries larger than 1 MB that were not
> accessed for one month, and that were accessed only once in the month before
> that).
>
> - When trying to access a binary that is in cold storage: you get an
> exception saying the binary is in cold storage. Plus, if configured, the
> binary would automatically be read from cold storage, so it's available
> within x minutes (configurable) when re-read.
>
> - Bulk copy from cold storage to regular storage: This might be needed to
> create a full backup. We might need an API for this.
>
> Regards,
> Thomas
>
>


Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Thomas Mueller
> From my perspective as an Oak user I would like to have control over that.
> It would be nice for Oak to make *suggestions* about moving things to
> cold storage, but there might be application constraints that need to
> be accounted for.

That sounds reasonable. What would be the "API" for this? Let's say the API is: 
configure a path that _allows_ binaries to be migrated to cold storage. It's 
not allowed for all other paths. The default configuration could be: allow for 
/jcr:system/jcr:versionStorage, don't allow anywhere else. This could be 
implemented using automatic moving (as I have described), _plus_ a background 
job that, twice a month, traverses all nodes and reads the first few bytes of 
all nodes that are _not_ in /jcr:system/jcr:versionStorage. The traversal could 
additionally do some reporting, for example how many binaries there were, how 
many times they were read, and how much money could be saved with this 
configuration.

For automatic moving, behaviour could be:

- To move to cold storage: configuration would be needed: size, access 
frequency, recency (e.g. only move binaries larger than 1 MB that were not 
accessed for one month, and that were accessed only once in the month before 
that).

- When trying to access a binary that is in cold storage: you get an exception 
saying the binary is in cold storage. Plus, if configured, the binary would 
automatically be read from cold storage, so it's available within x minutes 
(configurable) when re-read.

- Bulk copy from cold storage to regular storage: This might be needed to 
create a full backup. We might need an API for this. 
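The thresholds above (larger than 1 MB, not accessed for a month, at most one read in the month before that) could be expressed as a predicate like this sketch. Python for illustration only; the function name, the timestamp-list representation, and the defaults are assumptions, not an existing Oak API:

```python
import time

MB = 1024 * 1024
MONTH = 30 * 24 * 3600  # seconds, rough

def should_move_to_cold(size_bytes, read_times, now=None,
                        min_size=MB, idle=MONTH, max_prior_reads=1):
    """True if the binary is large, was not read for `idle` seconds,
    and was read at most `max_prior_reads` times in the period before."""
    now = time.time() if now is None else now
    if size_bytes <= min_size:
        return False                 # small binaries stay in fast storage
    recent = [t for t in read_times if t > now - idle]
    if recent:
        return False                 # accessed within the last month
    prior = [t for t in read_times if now - 2 * idle < t <= now - idle]
    return len(prior) <= max_prior_reads
```

Combining recency and frequency this way matches the caching argument made later in the thread: a binary read many times a month ago is a worse archival candidate than one read once.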

Regards,
Thomas



Re: Intend to backport OAK-5949 - XPath: string literals parsed as identifiers

2017-06-30 Thread Davide Giannella
On 29/06/2017 13:26, Thomas Mueller wrote:
> I'd like to backport OAK-5949 to the maintenance branches. It is a parsing 
> error for XPath queries, so that the string literal '@' is not parsed 
> correctly.

+1



Re: Intend to backport OAK-6391 - With FastQuerySize, getSize() returns -1 if there are exactly 21 rows

2017-06-30 Thread Davide Giannella
On 29/06/2017 13:38, Thomas Mueller wrote:
> Hi,
> 
> I'd like to backport OAK-6391 to the maintenance branches. The query result 
> getSize() method is often used, and it is important that the result is as 
> accurate as possible (even though the spec allows to return -1).
>

+1



Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Bertrand Delacretaz
On Fri, Jun 30, 2017 at 10:44 AM, Thomas Mueller
 wrote:
> ...About deciding which binaries to move to the slow storage: It would be 
> good if that's automatic...

From my perspective as an Oak user I would like to have control over that.

It would be nice for Oak to make *suggestions* about moving things to
cold storage, but there might be application constraints that need to
be accounted for.

-Bertrand


Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Thomas Mueller
Hi,

I guess you talk about Amazon Glacier. Did you know about "Expedited 
retrievals" by the way? 
https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/
 - it looks like it's more than just "slow" + "fast".

About deciding which binaries to move to the slow storage: It would be good if 
that's automatic. Couldn't that be based on access frequency + recency? If a 
binary is not accessed for some time, it is moved to slow storage. I would add: 
if it was not accessed for some time, _plus_ it was rarely accessed before. 
Reason: for caching, it is well known that not only the recency, but also 
frequency, are important to predict if an entry will be needed in the near 
future. To do that, we could maintain a log that tells you when, and how many 
times, a binary was read. Maybe Amazon / Azure keep some info about that, but 
let's assume not (or not in a form we can or want to use). 

For example, each client appends the blob ids that it reads to a file. Multiple 
such files could be merged. To save space for such files (probably not needed, 
but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed 
multiple times.
* Maybe you don't care about smallish binaries (smaller than 1 MB for example), 
or care less about them. So, for example, only move files larger than 1 MB; then 
there is no need to add a log entry for smaller ones.
* A bloom filter or similar could be used (so you would retain x% too many 
entries). Or even simpler: only write the first x characters of the binary id. 
That way, we retain x% too much in fast storage, but save time, space, and 
memory for maintenance.
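A sketch of such an append log with the space savers described above (LRU dedup cache, skipping small binaries, id-prefix truncation). Python for illustration; the class name, cache size, and prefix length are arbitrary choices, and a list stands in for the append-only file:

```python
from collections import OrderedDict

class BlobReadLog:
    """Appends the ids of blobs that were read, with three space savers:
    an LRU cache to avoid repeatedly writing recently logged ids, a size
    cutoff for binaries that would never be moved anyway, and id truncation
    (keeping only the first `prefix_len` characters, at the cost of
    retaining a few extra blobs in fast storage on prefix collisions)."""

    def __init__(self, cache_size=1000, prefix_len=8, min_size=1024 * 1024):
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.prefix_len = prefix_len
        self.min_size = min_size
        self.entries = []  # stands in for an append-only file

    def record_read(self, blob_id, size_bytes):
        if size_bytes <= self.min_size:
            return             # small binaries are never moved, so don't log
        key = blob_id[:self.prefix_len]
        if key in self.cache:
            self.cache.move_to_end(key)
            return             # already logged recently, skip the write
        self.cache[key] = True
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        self.entries.append(key)
```

Files produced this way by multiple clients could then be merged offline to compute per-blob read counts for the move decision.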

Regards,
Thomas


On 26.06.17, 18:10, "Matt Ryan"  wrote:

Hi,

With respect to Oak data stores, this is something I am hoping to support
later this year after the implementation of the CompositeDataStore (which
I'm still working on).

First, the assumption is that there would be a working CompositeDataStore
that can manage multiple data stores, and can select a data store for a
blob based on something like a JCR property (I'm still figuring this part
out).  In such a case, it would be possible to add a property to blobs that
can be archived, and then the CompositeDataStore could store them in a
different location - think AWS Glacier if there were a Glacier-compatible
data store.  Of course this would require that we also support an access
pattern in Oak where Oak knows that a blob can be retrieved but cannot
reply to a request with the requested blob immediately.  Instead Oak would
have to give a response indicating "I can get it, but it will take a while"
and suggest when it might be available.
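That "I can get it, but it will take a while" access pattern could look roughly like this sketch. All class and method names here are hypothetical (the real CompositeDataStore API was still being designed at the time), and dicts stand in for the hot and cold stores:

```python
import time

class BlobNotAvailableYet(Exception):
    """Raised when a blob exists but currently lives in cold storage."""
    def __init__(self, blob_id, eta_seconds):
        super().__init__(f"{blob_id} is archived; retry in ~{eta_seconds}s")
        self.eta_seconds = eta_seconds

class CompositeStoreSketch:
    RETRIEVAL_ETA = 4 * 3600  # e.g. a Glacier-style standard retrieval window

    def __init__(self, hot, cold):
        self.hot = hot       # dict blob_id -> bytes
        self.cold = cold     # dict blob_id -> bytes
        self.pending = {}    # blob_id -> time the retrieval was requested

    def read(self, blob_id):
        if blob_id in self.hot:
            return self.hot[blob_id]
        if blob_id in self.cold:
            # start (or continue) an async retrieval, then tell the caller
            # when the blob is expected to be available
            self.pending.setdefault(blob_id, time.time())
            raise BlobNotAvailableYet(blob_id, self.RETRIEVAL_ETA)
        raise KeyError(blob_id)

    def retrieval_finished(self, blob_id):
        """Called by the background fetcher once the blob has arrived."""
        self.hot[blob_id] = self.cold[blob_id]
        self.pending.pop(blob_id, None)
```

The key point is that the first read of an archived blob fails fast with an estimate rather than blocking for hours, and a later read succeeds normally.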

That's just one example.  I believe once I figure out the
CompositeDataStore it will be able to support a lot of neat scenarios on
the blob store side of things anyway.

-MR

On Mon, Jun 26, 2017 at 2:22 AM, Davide Giannella  wrote:

> On 26/06/2017 09:00, Michael Dürig wrote:
> >
> > I agree we should have a better look at access patterns, not only for
> > indexing. I recently came across a repository with about 65% of its
> > content in the version store. That content is pretty much archived and
> > never accessed. Yet it fragments the index and thus impacts general
> > access times.
>
> I may say something stupid as usual, but here I can see for example that
> such content could be "moved to a slower repository". So for example
> speaking of segment, it could be stored in a compressed segment (rather
> than plain tar) and the repository could either automatically configure
> the indexes to skip that part and/or additionally create an ad-hoc index
> which could be async by definition, running every, let's say, 10 seconds.
>
> We would gain on the repository size and indexing speed.
>
> Just a couple of ideas off the top of my head.
>
> Davide
>
>
>




Re: Nodetype index

2017-06-30 Thread Thomas Mueller
Hi,

Right now, there is only one nodetype index. So, if you add a nodetype / mixin 
to that index (as you know, the list of nodetypes / mixins is a multi-valued 
property), then you need to reindex that index, which needs to read all the 
nodes.

The alternative would be to have multiple nodetype indexes. A patch for that is 
welcome! If you have that, then instead of changing the nodetype index, you 
create a new index. This also needs to read all the nodes.

So, that would be more convenient (even though in both cases indexing a new 
nodetype takes about the same time).

> Is this a design choice to only allow one nodetype index?

Not a design choice, just how it is implemented right now.

> I have no way to make it contained in the project itself to make an index 
> based on its mixin type.

There is a way, you need some code to extend the existing nodetype index, and 
then reindex.
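For reference, that would mean editing the index definition at /oak:index/nodetype, roughly like the following sketch (the mixin name is a placeholder, and the exact default value of declaringNodeTypes varies by Oak version):

```
/oak:index/nodetype
  - jcr:primaryType = oak:QueryIndexDefinition
  - type = "property"
  - propertyNames = ["jcr:primaryType", "jcr:mixinTypes"]
  - declaringNodeTypes = [..., "myproject:MyMixin"]   <- add your mixin here
  - reindex = true                                    <- triggers the full reindex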

Regards,
Thomas



 On 29.06.17, 16:42, "Roy Teeuwen"  wrote:

Hey all,

Some time ago I asked about creating an oak index based on the node type 
(primary type or mixin type), after which I was pointed to the nodetype index. 
I have to say though that there is a serious drawback to this index:

I have two separate projects, both having some code based on a mixinType. 
But seeing as there can only be one nodetype index, I have no way to make it 
contained in the project itself to make an index based on its mixin type.
Is this a design choice to only allow one nodetype index? Is there a 
workaround for this?

Thanks!
Roy