Re: [DiSCUSS] - highly vs rarely used data
As I've been thinking about this, I wouldn't do it based on last-accessed time, at least not directly. Using the example of moving infrequently used blobs to cold storage, I would use a property on the node, e.g. "archiveState=toArchive". That way the property is clearly tied to its purpose. It can be set in complete control of a user, who can choose to designate "all blobs under this folder can be archived" simply by setting the property on all the nodes. Or a background process can run that understands the automatic archival logic, if it is enabled and configured, and this process goes through the tree, e.g. once a week, and marks any nodes that should be archived simply by changing the archiveState.

Having more than two supported archiveStates allows a query to differentiate between nodes that are designated for archival but are not archived yet, and nodes that have actually been moved to cold storage. This can be useful, for example, if a GUI that is browsing the repo wants to mark archived nodes with some sort of decorator, so users know not to try to open them unless they intend to unarchive them.

Using a property directly specified for this purpose gives us more direct control over how it is being used, I think.

On Fri, Jun 30, 2017 at 6:46 AM, Thomas Mueller wrote:
> > From my perspective as an Oak user I would like to have control on that.
> > It would be nice for Oak to make *suggestions* about moving things to
> > cold storage, but there might be application constraints that need to
> > be accounted for.
>
> That sounds reasonable. What would be the "API" for this? Let's say the
> API is: configure a path that _allows_ binaries to be migrated to cold
> storage. It's not allowed for all other paths. The default configuration
> could be: allow for /jcr:system/jcr:versionStorage, don't allow anywhere
> else.
>
> This could be implemented using automatic moving (as I have described),
> _plus_ a background job that, twice a month, traverses all nodes and
> reads the first few bytes of all nodes that are _not_ in
> /jcr:system/jcr:versionStorage. The traversal could additionally do some
> reporting, for example how many binaries there were, how many times they
> were read, and how much money you could save if configured like this.
>
> For automatic moving, behaviour could be:
>
> - To move to cold storage: configuration would be needed: size, access
> frequency, recency (e.g. only move binaries larger than 1 MB that were
> not accessed for one month, and that were accessed only once in the
> month before that).
>
> - When trying to access a binary that is in cold storage: you get an
> exception saying the binary is in cold storage. Plus, if configured, the
> binary would automatically be read from cold storage, so it's available
> within x minutes (configurable) when re-read.
>
> - Bulk copy from cold storage to regular storage: This might be needed
> to create a full backup. We might need an API for this.
>
> Regards,
> Thomas
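The three-state lifecycle described above could be sketched like this (a minimal illustration only; the state names and method names are assumptions, not an existing Oak API):

```java
// Sketch of a three-state archive lifecycle for a node. The property name
// "archiveState" and the state names below are illustrative assumptions.
public class ArchiveStateDemo {

    enum ArchiveState {
        NONE,       // normal node, not eligible for archival
        TO_ARCHIVE, // designated for archival, still in regular storage
        ARCHIVED    // actually moved to cold storage
    }

    // A GUI would decorate only nodes that are really in cold storage,
    // so users know that opening one means an unarchive round-trip.
    static boolean needsDecorator(ArchiveState state) {
        return state == ArchiveState.ARCHIVED;
    }

    // The weekly background job would pick up nodes that a user (or the
    // automatic policy) marked TO_ARCHIVE and flip them to ARCHIVED once
    // the blob has actually been moved.
    static ArchiveState afterBackgroundMove(ArchiveState state) {
        return state == ArchiveState.TO_ARCHIVE ? ArchiveState.ARCHIVED : state;
    }

    public static void main(String[] args) {
        ArchiveState s = ArchiveState.TO_ARCHIVE;
        System.out.println(needsDecorator(s)); // designated, not yet moved
        s = afterBackgroundMove(s);
        System.out.println(needsDecorator(s)); // now in cold storage
    }
}
```

The point of the third state is exactly what the mail describes: a query (or GUI) can tell "designated for archival" apart from "actually archived".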
Re: [DiSCUSS] - highly vs rarely used data
> From my perspective as an Oak user I would like to have control on that.
> It would be nice for Oak to make *suggestions* about moving things to
> cold storage, but there might be application constraints that need to
> be accounted for.

That sounds reasonable. What would be the "API" for this? Let's say the API is: configure a path that _allows_ binaries to be migrated to cold storage. It's not allowed for all other paths. The default configuration could be: allow for /jcr:system/jcr:versionStorage, don't allow anywhere else.

This could be implemented using automatic moving (as I have described), _plus_ a background job that, twice a month, traverses all nodes and reads the first few bytes of all nodes that are _not_ in /jcr:system/jcr:versionStorage. The traversal could additionally do some reporting, for example how many binaries there were, how many times they were read, and how much money you could save if configured like this.

For automatic moving, behaviour could be:

- To move to cold storage: configuration would be needed: size, access frequency, recency (e.g. only move binaries larger than 1 MB that were not accessed for one month, and that were accessed only once in the month before that).

- When trying to access a binary that is in cold storage: you get an exception saying the binary is in cold storage. Plus, if configured, the binary would automatically be read from cold storage, so it's available within x minutes (configurable) when re-read.

- Bulk copy from cold storage to regular storage: This might be needed to create a full backup. We might need an API for this.

Regards,
Thomas
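The automatic-move rule proposed above (size, access frequency, recency) could be sketched as a simple predicate. This is an illustration only, using the example thresholds from the mail (1 MB, idle for one month, at most one read in the month before); none of these names exist in Oak:

```java
// Sketch of the proposed automatic-move rule: move a binary to cold storage
// only if it is large enough, was not accessed in the last month, and was
// accessed at most once in the month before that. Thresholds are the
// examples from the mail, not an existing Oak configuration.
public class ColdStoragePolicy {

    static final long MIN_SIZE_BYTES = 1024L * 1024L; // 1 MB
    static final int  MAX_PRIOR_MONTH_READS = 1;

    static boolean shouldMove(long sizeBytes,
                              int readsLastMonth,
                              int readsMonthBefore) {
        return sizeBytes > MIN_SIZE_BYTES
                && readsLastMonth == 0
                && readsMonthBefore <= MAX_PRIOR_MONTH_READS;
    }

    public static void main(String[] args) {
        // 5 MB binary, idle last month, read once the month before: move it.
        System.out.println(shouldMove(5 * 1024 * 1024, 0, 1));
        // Same binary but read last month: keep it in regular storage.
        System.out.println(shouldMove(5 * 1024 * 1024, 3, 1));
        // Small binary: never moved, regardless of access pattern.
        System.out.println(shouldMove(100 * 1024, 0, 0));
    }
}
```

All three inputs (size, frequency, recency) would come from the access log discussed elsewhere in the thread; the predicate itself is the easy part.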
Re: Intend to backport OAK-5949 - XPath: string literals parsed as identifiers
On 29/06/2017 13:26, Thomas Mueller wrote:
> I'd like to backport OAK-5949 to the maintenance branches. It is a parsing
> error for XPath queries, so that the string literal '@' is not parsed
> correctly.

+1
Re: Intend to backport OAK-6391 - With FastQuerySize, getSize() returns -1 if there are exactly 21 rows
On 29/06/2017 13:38, Thomas Mueller wrote:
> Hi,
>
> I'd like to backport OAK-6391 to the maintenance branches. The query result
> getSize() method is often used, and it is important that the result is as
> accurate as possible (even though the spec allows to return -1).

+1
Re: [DiSCUSS] - highly vs rarely used data
On Fri, Jun 30, 2017 at 10:44 AM, Thomas Mueller wrote:
> ...About deciding which binaries to move to the slow storage: It would be
> good if that's automatic...

From my perspective as an Oak user I would like to have control on that. It would be nice for Oak to make *suggestions* about moving things to cold storage, but there might be application constraints that need to be accounted for.

-Bertrand
Re: [DiSCUSS] - highly vs rarely used data
Hi,

I guess you're talking about Amazon Glacier. Did you know about "Expedited retrievals", by the way? https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/ - it looks like it's more than just "slow" + "fast".

About deciding which binaries to move to the slow storage: It would be good if that's automatic. Couldn't that be based on access frequency + recency? If a binary is not accessed for some time, it is moved to slow storage. I would add: if it was not accessed for some time, _plus_ it was rarely accessed before. Reason: for caching, it is well known that not only recency but also frequency is important to predict whether an entry will be needed in the near future.

To do that, we could maintain a log that tells you when, and how many times, a binary was read. Maybe Amazon / Azure keep some info about that, but let's assume not (or not in such a way that we want or can use it). For example, each client appends the blob ids that it reads to a file. Multiple such files could be merged. To save space for such files (probably not needed, but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed multiple times.

* Maybe you don't care about smallish binaries (smaller than 1 MB, for example), or care less about them. So, for example, only move files larger than 1 MB. That means no need to add an entry for small ones.

* A bloom filter or similar could be used (so you would retain x% too many entries). Or even simpler: only write the first x characters of the binary id. That way, we retain x% too much in fast storage, but save time, space, and memory for maintenance.

Regards,
Thomas

On 26.06.17, 18:10, "Matt Ryan" wrote:

Hi,

With respect to Oak data stores, this is something I am hoping to support later this year, after the implementation of the CompositeDataStore (which I'm still working on).
First, the assumption is that there would be a working CompositeDataStore that can manage multiple data stores, and can select a data store for a blob based on something like a JCR property (I'm still figuring this part out). In such a case, it would be possible to add a property to blobs that can be archived, and then the CompositeDataStore could store them in a different location - think AWS Glacier, if there were a Glacier-compatible data store.

Of course this would require that we also support an access pattern in Oak where Oak knows that a blob can be retrieved but cannot reply to a request with the requested blob immediately. Instead Oak would have to give a response indicating "I can get it, but it will take a while" and suggest when it might be available.

That's just one example. I believe once I figure out the CompositeDataStore it will be able to support a lot of neat scenarios on the blob store side of things, anyway.

-MR

On Mon, Jun 26, 2017 at 2:22 AM, Davide Giannella wrote:
> On 26/06/2017 09:00, Michael Dürig wrote:
> >
> > I agree we should have a better look at access patterns, not only for
> > indexing. I recently came across a repository with about 65% of its
> > content in the version store. That content is pretty much archived and
> > never accessed. Yet it fragments the index and thus impacts general
> > access times.
>
> I may say something stupid as usual, but here I can see for example that
> such content could be "moved to a slower repository". So for example,
> speaking of segment, it could be stored in a compressed segment (rather
> than plain tar), and the repository could either automatically configure
> the indexes to skip such part, or/and additionally create an ad-hoc index
> which could be async by definition, running every, let's say, 10 seconds.
>
> We would gain on the repository size and indexing speed.
>
> Just a couple of ideas off the top of my head.
>
> Davide
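Thomas's access-log sketch earlier in the thread (dedupe cache, skip small binaries, truncate ids to the first x characters) could look roughly like this. All class, field, and method names here are illustrative assumptions, not Oak code:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the proposed blob access log: skip smallish binaries, avoid
// repeatedly writing the same id via a cache, and keep only the first few
// characters of each id to save space (at the cost of retaining a few
// false positives in fast storage).
public class BlobAccessLog {

    static final long MIN_SIZE_BYTES = 1024L * 1024L; // ignore < 1 MB
    static final int  ID_PREFIX_CHARS = 8;            // truncated id length

    // In a real implementation this would be a bounded cache in front of an
    // append-only file per client; one in-memory set keeps the sketch short.
    final Set<String> loggedPrefixes = new LinkedHashSet<>();

    void recordRead(String blobId, long sizeBytes) {
        if (sizeBytes < MIN_SIZE_BYTES) {
            return; // smallish binaries: don't care (or care less)
        }
        String prefix = blobId.substring(
                0, Math.min(ID_PREFIX_CHARS, blobId.length()));
        loggedPrefixes.add(prefix); // set membership doubles as the cache
    }

    public static void main(String[] args) {
        BlobAccessLog log = new BlobAccessLog();
        log.recordRead("a1b2c3d4e5f6...", 5_000_000); // logged (prefix only)
        log.recordRead("a1b2c3d4e5f6...", 5_000_000); // repeat read: no-op
        log.recordRead("ffff0000aaaa...", 10_000);    // too small: skipped
        System.out.println(log.loggedPrefixes);
    }
}
```

Merging such per-client files and counting occurrences per time window would then give the frequency + recency inputs for the move decision.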
Re: Nodetype index
Hi,

Right now, there is only one nodetype index. So, if you add a nodetype / mixin to that index (as you know, the list of nodetypes / mixins is a multi-valued property), then you need to reindex that index, which needs to read all the nodes.

The alternative would be to have multiple nodetype indexes. A patch for that is welcome! If you have that, then instead of changing the nodetype index, you create a new index. This also needs to read all the nodes. So, that would be more convenient (even though in both cases indexing a new nodetype takes about the same time).

> Is this a design choice to only allow one nodetype index?

Not a design choice, just how it is implemented right now.

> I have no way to make it contained in the project itself to make an index
> based on its mixin type.

There is a way: you need some code to extend the existing nodetype index, and then reindex.

Regards,
Thomas

On 29.06.17, 16:42, "Roy Teeuwen" wrote:

Hey all,

Some time ago I asked about creating an oak index based on the node type (primary type or mixin type), after which I was pointed to the nodetype index. I have to say, though, that there is a serious drawback to this index: I have two separate projects, both having some code based on a mixinType. But seeing as there can only be one nodetype index, I have no way to make it contained in the project itself to make an index based on its mixin type.

Is this a design choice to only allow one nodetype index? Is there a workaround for this?

Thanks!
Roy
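For reference, extending the nodetype index amounts to adding the type to its multi-valued declaringNodeTypes property and triggering a reindex. A sketch of what such an index definition looks like (index name "myMixinIndex" and the mixin "my:ProjectMixin" are placeholders; the exact definition should be checked against the Oak property-index documentation):

```
/oak:index/myMixinIndex
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "property"
  - propertyNames = ["jcr:mixinTypes"]
  - declaringNodeTypes = ["my:ProjectMixin"]
  - reindex = true
```

Setting reindex = true is what causes the full node traversal Thomas mentions, so it has roughly the same cost whether you extend the existing nodetype index or create a project-local one like this.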