[DiSCUSS] - highly vs rarely used data
Hi all,

recently I was at a conference [1] where I attended an interesting keynote about data management [2] (I think it refers to this 2016 paper [3]). Apart from the approaches proposed to solve the data management problem (e.g. get rid of DBMSs!), I got interested in the discussion about how we deal with the increasing amount of data that we have to manage (also because of some issues we have [4]).

In many systems only a very small subset of the data is used, because the information users really need is mostly the most recently ingested data (e.g. social networks). While that doesn't always apply to content repositories in general (e.g. if you build a CMS on top of one), I think it's interesting to consider whether we can optimize our persistence layer to work better with highly used data (e.g. more recent) and use less space/CPU for data that is used more rarely.

For example, putting this together with the incremental indexing section of the paper [3], I was thinking (but that's already a solution rather than "just" a discussion) that perhaps we could simply avoid indexing *some* content until it's needed (e.g. the first time a query has to traverse, then index so that the next query over the same data will be faster), but that's just an example.

What do others think?

Regards,
Tommaso

[1] : http://www.iccs-meeting.org/iccs2017/
[2] : http://www.iccs-meeting.org/iccs2017/keynote-lectures/#Ailamaki
[3] : https://infoscience.epfl.ch/record/219993/files/p12-pavlovic.pdf
[4] : https://issues.apache.org/jira/browse/OAK-5192
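The "index only when first needed" idea above can be sketched in a few lines. This is only an illustration under simplifying assumptions: content is a flat map from path to a single property value, and the class name LazyPropertyIndex and its methods are hypothetical, not Oak APIs. The sketch also ignores content changes made after the index is built.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lazy index: nothing is indexed until the first query arrives. */
class LazyPropertyIndex {
    private final Map<String, String> valueByPath;   // content: path -> property value
    private Map<String, List<String>> pathsByValue;  // stays null until the first query
    private int traversals = 0;

    LazyPropertyIndex(Map<String, String> valueByPath) {
        this.valueByPath = valueByPath;
    }

    /** Returns all paths whose property equals the given value. */
    List<String> query(String value) {
        if (pathsByValue == null) {
            // First query: traverse all content once and build the index as we go.
            traversals++;
            pathsByValue = new HashMap<>();
            for (Map.Entry<String, String> e : valueByPath.entrySet()) {
                pathsByValue.computeIfAbsent(e.getValue(), v -> new ArrayList<>())
                            .add(e.getKey());
            }
        }
        // Every query, including the first, is answered from the index.
        return pathsByValue.getOrDefault(value, Collections.emptyList());
    }

    /** Number of full traversals performed so far (0 or 1 in this sketch). */
    int traversalCount() {
        return traversals;
    }
}
```

The first query pays the traversal cost; later queries over the same data are answered from the index without traversing again.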
Re: [DiSCUSS] - highly vs rarely used data
Hi,

I agree we should have a better look at access patterns, not only for indexing. I recently came across a repository with about 65% of its content in the version store. That content is pretty much archived and never accessed. Yet it fragments the index and thus impacts general access times.

Michael

On 23.06.17 10:22, Tommaso Teofili wrote:
> [...] I think it's interesting to think about whether we can optimize our
> persistence layer to work better with highly used data (e.g. more recent)
> and use less space/cpu for data that is used more rarely. [...]
Re: [DiSCUSS] - highly vs rarely used data
On 26/06/2017 09:00, Michael Dürig wrote:
> I agree we should have a better look at access patterns, not only for
> indexing. I recently came across a repository with about 65% of its
> content in the version store. That content is pretty much archived and
> never accessed. Yet it fragments the index and thus impacts general
> access times.

I may say something stupid as usual, but here I can see, for example, that such content could be "moved to a slower repository". Speaking of the segment store, it could be stored in compressed segments (rather than plain tar), and the repository could automatically configure the indexes to skip that part and/or create an ad-hoc index that is async by definition and runs every, let's say, 10 seconds.

We would gain on repository size and indexing speed.

Just a couple of ideas off the top of my head.

Davide
Re: [DiSCUSS] - highly vs rarely used data
Hi,

With respect to Oak data stores, this is something I am hoping to support later this year after the implementation of the CompositeDataStore (which I'm still working on).

First, the assumption is that there would be a working CompositeDataStore that can manage multiple data stores, and can select a data store for a blob based on something like a JCR property (I'm still figuring this part out). In such a case, it would be possible to add a property to blobs that can be archived, and then the CompositeDataStore could store them in a different location - think AWS Glacier if there were a Glacier-compatible data store.

Of course this would require that we also support an access pattern in Oak where Oak knows that a blob can be retrieved but cannot reply to a request with the requested blob immediately. Instead Oak would have to give a response indicating "I can get it, but it will take a while" and suggest when it might be available.

That's just one example. I believe once I figure out the CompositeDataStore it will be able to support a lot of neat scenarios on the blob store side of things anyway.

-MR

On Mon, Jun 26, 2017 at 2:22 AM, Davide Giannella wrote:
> [...] here I can see for example that such content could be "moved to a
> slower repository". [...]
Re: [DiSCUSS] - highly vs rarely used data
Hi,

I guess you are talking about Amazon Glacier. Did you know about "Expedited retrievals", by the way? https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/ - it looks like it's more than just "slow" + "fast".

About deciding which binaries to move to the slow storage: it would be good if that's automatic. Couldn't that be based on access frequency + recency? If a binary is not accessed for some time, it is moved to slow storage. I would add: if it was not accessed for some time, _plus_ it was rarely accessed before. Reason: for caching, it is well known that not only recency but also frequency is important to predict whether an entry will be needed in the near future.

To do that, we could maintain a log that tells you when, and how many times, a binary was read. Maybe Amazon / Azure keep some info about that, but let's assume not (or not in a way we want or can use). For example, each client appends the blob ids that it reads to a file. Multiple such files could be merged. To save space for such files (probably not needed, but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed multiple times.
* Maybe you don't care about smallish binaries (smaller than 1 MB for example), or care less about them. So, for example, only move files larger than 1 MB. That means there is no need to add a log entry for smaller ones.
* A bloom filter or similar could be used (so you would retain x% too many entries). Or even simpler: only write the first x characters of the binary id. That way, we retain x% too much in fast storage, but save time, space, and memory for maintenance.

Regards,
Thomas

On 26.06.17, 18:10, "Matt Ryan" wrote:
> [...] it would be possible to add a property to blobs that can be
> archived, and then the CompositeDataStore could store them in a different
> location - think AWS Glacier if there were a Glacier-compatible data
> store. [...]
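A minimal sketch of the read log and the "recency plus frequency" rule described above, kept in memory for brevity. The class BlobAccessLog, its method names, and the choice to treat a never-read blob as a candidate are assumptions made for illustration; nothing here is an existing Oak API, and the file-based log, merging, and bloom filter are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory read log: how often and when each blob id was read. */
class BlobAccessLog {
    private static final class Stats {
        long reads;
        Instant lastRead;
    }

    private final Map<String, Stats> statsById = new HashMap<>();

    /** Records one read of the given blob at the given time. */
    void recordRead(String blobId, Instant when) {
        Stats s = statsById.computeIfAbsent(blobId, id -> new Stats());
        s.reads++;
        if (s.lastRead == null || when.isAfter(s.lastRead)) {
            s.lastRead = when;
        }
    }

    /**
     * A blob qualifies for slow storage if it is larger than the size threshold,
     * has not been read for longer than maxIdle, and was read at most maxReads
     * times in total (recency AND frequency, not recency alone).
     */
    boolean isColdCandidate(String blobId, long sizeBytes, Instant now,
                            long sizeThresholdBytes, Duration maxIdle, long maxReads) {
        if (sizeBytes <= sizeThresholdBytes) {
            return false; // small binaries are not worth moving
        }
        Stats s = statsById.get(blobId);
        if (s == null) {
            return true; // assumption: never read since logging began counts as cold
        }
        boolean idleTooLong = Duration.between(s.lastRead, now).compareTo(maxIdle) > 0;
        return idleTooLong && s.reads <= maxReads;
    }
}
```

With a 1 MB threshold, a 30-day idle limit and maxReads of 1, a large binary read once two months ago qualifies, while one read yesterday, or one read several times before going idle, does not.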
Re: [DiSCUSS] - highly vs rarely used data
On Fri, Jun 30, 2017 at 10:44 AM, Thomas Mueller wrote:
> ...About deciding which binaries to move to the slow storage: It would be
> good if that's automatic...

From my perspective as an Oak user I would like to have control over that.

It would be nice for Oak to make *suggestions* about moving things to cold storage, but there might be application constraints that need to be accounted for.

-Bertrand
Re: [DiSCUSS] - highly vs rarely used data
> From my perspective as an Oak user I would like to have control on that.
> It would be nice for Oak to make *suggestions* about moving things to
> cold storage, but there might be application constraints that need to
> be accounted for.

That sounds reasonable. What would be the "API" for this? Let's say the API is: configure a path that _allows_ binaries to be migrated to cold storage. It's not allowed for all other paths. The default configuration could be: allow for /jcr:system/jcr:versionStorage, don't allow anywhere else.

This could be implemented using automatic moving (as I have described), _plus_ a background job that, twice a month, traverses all nodes and reads the first few bytes of the binaries of all nodes that are _not_ in /jcr:system/jcr:versionStorage. The traversal could additionally do some reporting, for example how many binaries there were, how many times they were read, and how much money you could save if it was configured like this.

For automatic moving, behaviour could be:

- To move to cold storage: configuration would be needed: size, access frequency, recency (e.g. only move binaries larger than 1 MB that were not accessed for one month, and that were accessed only once in the month before that).

- When trying to access a binary that is in cold storage: you get an exception saying the binary is in cold storage. Plus, if configured, the binary would automatically be read from cold storage, so it's available within x minutes (configurable) when re-read.

- Bulk copy from cold storage to regular storage: This might be needed to create a full backup. We might need an API for this.

Regards,
Thomas
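The path-based "API" proposed above (a configured set of paths under which migration is allowed, defaulting to the version store) could look roughly like this. The class ColdStoragePathPolicy and its methods are hypothetical names for illustration; only the default path comes from the message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical policy: binaries may move to cold storage only under allowed roots. */
class ColdStoragePathPolicy {
    private final List<String> allowedRoots;

    ColdStoragePathPolicy(List<String> allowedRoots) {
        this.allowedRoots = new ArrayList<>(allowedRoots);
    }

    /** Default from the proposal: allow the version store, nothing else. */
    static ColdStoragePathPolicy defaults() {
        return new ColdStoragePathPolicy(
                Collections.singletonList("/jcr:system/jcr:versionStorage"));
    }

    /** True if the node path is an allowed root or lies below one. */
    boolean mayMigrate(String nodePath) {
        for (String root : allowedRoots) {
            // Compare with a trailing slash so "/a/bc" does not match root "/a/b".
            if (nodePath.equals(root) || nodePath.startsWith(root + "/")) {
                return true;
            }
        }
        return false;
    }
}
```

The size, frequency, and recency thresholds from the list above would be checked separately, after this path check has passed.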
Re: [DiSCUSS] - highly vs rarely used data
As I've been thinking about this, I wouldn't do it based on last accessed time, at least not directly.

Using the example of moving infrequently used blobs to cold storage, I would use a property on the node, e.g. "archiveState=toArchive". In this case the property can be clearly tied to that purpose. This can be done under the complete control of a user, who can choose to designate "all blobs under this folder can be archived" simply by setting the property on all the nodes. Or a background process can run that understands the automatic archival logic, if it is enabled and configured, and this process goes through the tree e.g. once a week and marks any nodes that should be archived simply by changing the archiveState.

Having more than two supported archiveStates allows a query to differentiate between nodes that are designated for archival but are not archived yet, and nodes that are actually moved to cold storage. This can be useful, for example, if a GUI that is browsing the repo wants to mark nodes that are archived with some sort of decorator, so users know not to try to open them unless they intend to unarchive them.

Using a property directly specified for this purpose gives us more direct control over how it is being used, I think.

On Fri, Jun 30, 2017 at 6:46 AM, Thomas Mueller wrote:
> [...] To move to cold storage: configuration would be needed: size, access
> frequency, recency [...]
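One way to picture an archiveState property with more than two values is a small state machine. Only the value "toArchive" appears in the message; the other values ("active", "archived", "toRestore") and the allowed transitions are assumptions made for this sketch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical values for an "archiveState" node property and their transitions. */
enum ArchiveState {
    ACTIVE("active"),
    TO_ARCHIVE("toArchive"),   // marked for archival, binary still in regular storage
    ARCHIVED("archived"),      // binary actually moved to cold storage
    TO_RESTORE("toRestore");   // restore requested, binary not back yet

    private final String propertyValue;

    ArchiveState(String propertyValue) {
        this.propertyValue = propertyValue;
    }

    /** The string that would be stored in the property. */
    String propertyValue() {
        return propertyValue;
    }

    /** States that may directly follow this one (an assumed lifecycle). */
    Set<ArchiveState> allowedNext() {
        switch (this) {
            case ACTIVE:     return EnumSet.of(TO_ARCHIVE);
            case TO_ARCHIVE: return EnumSet.of(ARCHIVED, ACTIVE); // ACTIVE = mark withdrawn
            case ARCHIVED:   return EnumSet.of(TO_RESTORE);
            default:         return EnumSet.of(ACTIVE);           // TO_RESTORE
        }
    }

    boolean canMoveTo(ArchiveState target) {
        return allowedNext().contains(target);
    }

    /** Parses a stored property value back into a state. */
    static ArchiveState fromProperty(String value) {
        for (ArchiveState s : values()) {
            if (s.propertyValue.equals(value)) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown archiveState: " + value);
    }
}
```

A GUI could then decorate nodes whose state is ARCHIVED, while a query for TO_ARCHIVE finds nodes that are designated but not yet moved.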
Re: [DiSCUSS] - highly vs rarely used data
Hi,

> a property on the node, e.g. "archiveState=toArchive"

I wonder if we _can_ easily write to the version store? Also, some nodetypes don't allow such properties? It might need to be a hidden property, but then you can't use the JCR API. Or maintain this data in a "shadow" structure (not with the nodes), which would complicate move operations.

If I were a customer, I wouldn't want to *manually* mark / unmark binaries to be moved to / from long-term storage. I would probably just want to rely on automatic management. But I'm not a customer, so my opinion is not that relevant.

> Using a property directly specified for this purpose gives us more direct
> control over how it is being used I think.

Sure, but it also comes with some complexities.

Regards,
Thomas
Re: [DiSCUSS] - highly vs rarely used data
I am sure there are use cases for both automatic and manual/controlled collection of unused data; however, if I were a user I would personally not want to care about this. While I'd be happy to know that my repo is faster / smaller / cleaner / whatever, it sounds overly complex to deal with JCR and Oak constraints and behaviours from the application layer.

IMHO if we want to have such a feature in Oak to save resources, it should be the persistence layer's responsibility to say "hey, this content has not been accessed for ages, let's try to reclaim some resources from it" (which could mean moving it to cold storage, compressing it, or anything else).

My 2 cents,
Tommaso

On Mon, 3 Jul 2017 at 15:46, Thomas Mueller wrote:
> [...] If I was a customer, I wouldn't want to *manually* mark / unmark
> binaries to be moved to / from long time storage. I would probably just
> want to rely on automatic management. [...]
Re: [DiSCUSS] - highly vs rarely used data
From my experience working with customers, I can pretty much guarantee that sooner or later:

(a) the implementation of an automatism is not *quite* what they need/want
(b) they want to be able to manually select (or more likely override) whether a file can be archived

Thus I suggest coming up with a pluggable "strategy" interface and providing a sensible default implementation. The default will be fine for most customers/users, but advanced use cases can be implemented by substituting the implementation. Implementations could then also respect manually set flags (=properties) if desired.

A much more important and difficult question to answer IMHO is how to deal with the slow retrieval of archived content. And, if needed, how to expose the slow availability (i.e. unavailable now but available later) to the end user (or application layer). To me this sounds tricky if we want to stick to the JCR API.

Regards
Julian

On Mon, Jul 3, 2017 at 4:33 PM, Tommaso Teofili wrote:
> [...] IMHO if we want to have such a feature in Oak to save resources, it
> should be the persistence responsibility to say "hey, this content is not
> being accessed for ages, let's try to claim some resources from it" [...]
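A pluggable strategy with a default implementation and a manual override, as suggested above, might be shaped like the following. All names (ArchiveStrategy, BinaryInfo, SizeAndIdleStrategy, ManualFlagFirstStrategy) and the specific default rule are hypothetical; the thread does not define such an interface.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical facts about one binary that a strategy can look at. */
final class BinaryInfo {
    final String path;
    final long sizeBytes;
    final Instant lastRead;
    final Boolean manualArchiveFlag; // null means nobody set a flag manually

    BinaryInfo(String path, long sizeBytes, Instant lastRead, Boolean manualArchiveFlag) {
        this.path = path;
        this.sizeBytes = sizeBytes;
        this.lastRead = lastRead;
        this.manualArchiveFlag = manualArchiveFlag;
    }
}

/** The pluggable decision point. */
interface ArchiveStrategy {
    boolean shouldArchive(BinaryInfo info, Instant now);
}

/** An assumed default: archive large binaries that have been idle long enough. */
class SizeAndIdleStrategy implements ArchiveStrategy {
    private final long sizeThresholdBytes;
    private final Duration maxIdle;

    SizeAndIdleStrategy(long sizeThresholdBytes, Duration maxIdle) {
        this.sizeThresholdBytes = sizeThresholdBytes;
        this.maxIdle = maxIdle;
    }

    @Override
    public boolean shouldArchive(BinaryInfo info, Instant now) {
        return info.sizeBytes > sizeThresholdBytes
                && Duration.between(info.lastRead, now).compareTo(maxIdle) > 0;
    }
}

/** Wraps another strategy and lets a manually set flag win when present. */
class ManualFlagFirstStrategy implements ArchiveStrategy {
    private final ArchiveStrategy fallback;

    ManualFlagFirstStrategy(ArchiveStrategy fallback) {
        this.fallback = fallback;
    }

    @Override
    public boolean shouldArchive(BinaryInfo info, Instant now) {
        if (info.manualArchiveFlag != null) {
            return info.manualArchiveFlag;
        }
        return fallback.shouldArchive(info, now);
    }
}
```

Users who are happy with automatic behaviour get the default; an application with special constraints substitutes or wraps it without changes to the core.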
Re: [DiSCUSS] - highly vs rarely used data
On 04/07/2017 11:48, Julian Sedding wrote:
> A much more important and difficult question to answer IMHO is how to
> deal with the slow retrieval of archived content. And if needed, how
> to expose the slow availability (i.e. unavailable now but available
> later) to the end user (or application layer). To me this sounds
> tricky if we want to stick to the JCR API.

I think we should NOT touch the JCR API but rather use/expose the Oak API for such features, and have a consuming application leverage one, the other, or both.

If we are going to touch the JCR API we should probably sit down a while and think about JCR API 3 ;)

Davide
Re: [DiSCUSS] - highly vs rarely used data
Hi,

> (a) the implementation of an automatism is not *quite* what they need/want
> (b) they want to be able to manually select (or more likely override)
> whether a file can be archived

Well, behind the scenes, we need a way to move entries to / from cold storage anyway. But in my view, that's low-level API, and I wouldn't expose it first, but instead concentrate on implementing an automatic solution that has no API (except for some config options). If it later turns out the low-level API is needed, it can still be added. I wouldn't introduce that as public API right from the start, just because we _think_ it _might_ be needed at some point later. Because having to maintain the API is expensive.

What I would introduce right from the start is a way to measure which binaries were read recently, and how frequently. But even for that, there is no public API needed at first (except for maybe logging some statistics).

> Thus I suggest to come up with a pluggable "strategy" interface

That is too abstract for me. I think it is very important to have a concrete behaviour and API, otherwise discussing it is not possible.

> A much more important and difficult question to answer IMHO is how to deal
> with the slow retrieval of archived content.

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an exception saying so, and load the binary into hot storage. A few minutes later, re-reading will not throw an exception as it's in hot storage. So, there is no API change needed, except for a new exception class (subclass of RepositoryException). An application can catch those exceptions and deal with them in a special way (write that the binary is not currently available). Possibly the new exception could have a method "doNotMoveBinary()" in case moving is not needed, but by default the binary should be moved, so that old applications don't have to be changed at all (backward compatibility).

What is your concrete suggestion?

Regards,
Thomas
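The "throw, restore in the background, succeed on re-read" behaviour proposed above can be sketched as follows. To stay self-contained the exception extends java.lang.Exception here, whereas the proposal is a subclass of javax.jcr.RepositoryException; the names BinaryInColdStorageException and HotColdReaderSketch, the estimated-delay field, and the explicit completeRestores() step standing in for a background job are all assumptions. The optional doNotMoveBinary() method is not modelled.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical exception; the proposal would subclass javax.jcr.RepositoryException. */
class BinaryInColdStorageException extends Exception {
    private final Duration estimatedDelay;

    BinaryInColdStorageException(String blobId, Duration estimatedDelay) {
        super("Binary " + blobId + " is in cold storage, retry in about " + estimatedDelay);
        this.estimatedDelay = estimatedDelay;
    }

    /** Rough hint for when a re-read is expected to succeed. */
    Duration getEstimatedDelay() {
        return estimatedDelay;
    }
}

/** Toy two-tier reader: a cold read fails but schedules a restore to hot storage. */
class HotColdReaderSketch {
    private final Map<String, byte[]> hot = new HashMap<>();
    private final Map<String, byte[]> cold = new HashMap<>();
    private final Set<String> restoring = new HashSet<>();
    private final Duration restoreDelay;

    HotColdReaderSketch(Duration restoreDelay) {
        this.restoreDelay = restoreDelay;
    }

    void putCold(String blobId, byte[] data) {
        cold.put(blobId, data);
    }

    byte[] read(String blobId) throws BinaryInColdStorageException {
        byte[] data = hot.get(blobId);
        if (data != null) {
            return data;
        }
        if (!cold.containsKey(blobId)) {
            throw new IllegalArgumentException("Unknown blob: " + blobId);
        }
        restoring.add(blobId); // by default the read itself triggers the move
        throw new BinaryInColdStorageException(blobId, restoreDelay);
    }

    /** Stands in for the background job that finishes pending restores. */
    void completeRestores() {
        for (String id : restoring) {
            hot.put(id, cold.remove(id));
        }
        restoring.clear();
    }
}
```

An application that knows about the new exception can catch it and report that the binary is temporarily unavailable; an older application simply sees a failed read and succeeds when it retries later.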
Re: [DiSCUSS] - highly vs rarely used data
On Tue, Jul 4, 2017 at 12:48 PM, Julian Sedding wrote:
> ...I suggest to come up with a pluggable "strategy" interface and
> provide a sensible default implementation...

Big +1 to that, requirements can vary widely IMO, also depending on the characteristics of whatever cold storage is used.

> ...A much more important and difficult question to answer IMHO is how to
> deal with the slow retrieval of archived content

Throw an exception maybe? BinaryNotAvailableAtThisTime, including an ETA for availability. The application can then decide how to handle that.

-Bertrand
Re: [DiSCUSS] - highly vs rarely used data
I would prefer a two-phase implementation here.

A - CompositeBlobStore - Have support for multiple BlobStores plugged within an Oak setup and provide an API for the layer above to select which BlobStore should be used. This forms the lowermost layer in the stack. Such a feature should support:

1. Selecting which store a binary should be written to
2. How a binary gets read
3. Blob GC

B - BinaryStorage Support

Once we have A implemented, the layer above can implement some logic to manage where binaries are stored without requiring major changes in core. For example, Oak can extend the current extension point in BlobStatsCollector to allow plugging in a custom stats collector. This can then be used by the application to build logic to move content based on various heuristics:

1. Path based
2. Access based

The application can then use a standard API to "copy/move" a binary from one store to another. We can also provide some out-of-the-box implementation, but the key thing here is that it should be built on top of Oak Core and hence be pluggable.

Given that we have been discussing enhancements in the Binary area for a long time now [1], it would be better to get #A implemented now with an eye on the requirements of #B, so that we make some progress here.

Chetan Mehrotra

[1] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
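Phase A above can be illustrated with a toy composite that routes writes to a named delegate, reads from whichever delegate holds the blob, and offers the "copy/move" primitive that phase B would build on. The interface SketchBlobStore and the classes below are hypothetical stand-ins for this sketch, not Oak's BlobStore API, and blob GC is not shown.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical minimal store contract used only for this sketch. */
interface SketchBlobStore {
    void put(String id, byte[] data);
    byte[] get(String id);        // returns null if the blob is absent
    boolean remove(String id);
}

class InMemorySketchBlobStore implements SketchBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();

    @Override public void put(String id, byte[] data) { blobs.put(id, data); }
    @Override public byte[] get(String id) { return blobs.get(id); }
    @Override public boolean remove(String id) { return blobs.remove(id) != null; }
}

/** Routes operations across several named delegate stores. */
class CompositeBlobStoreSketch {
    private final Map<String, SketchBlobStore> delegates = new LinkedHashMap<>();

    void addDelegate(String name, SketchBlobStore store) {
        delegates.put(name, store);
    }

    /** 1. The caller selects which store a binary is written to. */
    void write(String storeName, String id, byte[] data) {
        delegate(storeName).put(id, data);
    }

    /** 2. Reads try the delegates in registration order. */
    byte[] read(String id) {
        for (SketchBlobStore store : delegates.values()) {
            byte[] data = store.get(id);
            if (data != null) {
                return data;
            }
        }
        return null;
    }

    /** Copy to the target first, then delete from the source. */
    boolean move(String id, String from, String to) {
        SketchBlobStore source = delegate(from);
        byte[] data = source.get(id);
        if (data == null) {
            return false;
        }
        if (from.equals(to)) {
            return true; // nothing to do; avoids deleting the only copy
        }
        delegate(to).put(id, data);
        return source.remove(id);
    }

    private SketchBlobStore delegate(String name) {
        SketchBlobStore store = delegates.get(name);
        if (store == null) {
            throw new IllegalArgumentException("No such store: " + name);
        }
        return store;
    }
}
```

Path-based or access-based heuristics would live above this layer and only call write and move, which is what keeps them pluggable.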
Re: [DiSCUSS] - highly vs rarely used data
Hi,

On 10.07.17, 11:18, "Bertrand Delacretaz" wrote:
> Throw an exception maybe? BinaryNotAvailableAtThisTime, including an
> ETA for availability. The application can then decide how to handle
> that.

Bertrand, this is exactly what I have suggested in two previous mails:

> My concrete suggestion would be, as I wrote: if it's in cold storage,
> throw an exception saying so, and load the binary into hot storage. A few
> minutes later, re-reading will not throw an exception as it's in hot
> storage. So, there is no API change needed, except for a new exception
> class (subclass of RepositoryException). An application can catch those
> exceptions and deal with them in a special way (write that the binary is
> not currently available). Possibly the new exception could have a method
> "doNotMoveBinary()" in case moving is not needed, but by default the
> binary should be moved, so that old applications don't have to be changed
> at all (backward compatibility).

Regards,
Thomas
Re: [DiSCUSS] - highly vs rarely used data
Hi Thomas,

On Tue, Jul 11, 2017 at 3:14 PM, Thomas Mueller wrote:
> ...if it's in cold storage, throw an exception saying so, and load the binary
> into hot storage.
> A few minutes later, re-reading will not throw an exception as it's in hot
> storage

Ok great, sorry that I missed that earlier.

Note that the exception should not prevent the client from getting the rest of the data (other properties) of the same Node - I suppose that's natural if the exception is thrown when calling Binary.getStream().

-Bertrand