I'm well aware of all this :) A few points here:
1/ Having a single GTS with different retentions per datapoint should be banned, because on the operator side it is very hard to manage. From our experience, we will enforce a single retention policy per user account. You can still have the same GTS with different retention policies, but spread across different accounts. This way we can have an autonomous system that cleans accounts based on the account-defined TTL.

2/ The process you're proposing is already what we do, but it doesn't work: it is way too slow for highly dynamic environments where you create more series than you delete. For big accounts with tens of millions of series or more, the FIND/META/DELETE process (sketched for reference after the quoted messages below) simply does not keep up, hence the idea of identifying deletes in the Directory itself through the internal scanner.

3/ Regarding the last-activity example, it is perfectly acceptable to me that batch-produced data pushed 2 years ago with a TTL of one year could have its datapoints purged. If the user wants a bigger TTL, it is up to them to define it, but the TTL should be associated with the time at which the datapoints were pushed, not with their own timestamp value. In analytics for example, if you have a forensic job and need to compute datapoints for the next 6 months, your series could have a 10-year lifetime, but in the end, once you are finished, the job is done, so the TTL is there to help the customer clean up their dataset. As you said with .dpts, since it is customer-specific, the cleaning process should also be customer-scoped.

The proposed solution may not be the best, but there is currently no existing solution to this problem. Still, I'm open to any other idea that eases delete operations, if you see an alternative. Otherwise, if you agree on the statement, we can start working on a PR.

On Friday, 28 February 2020 08:57:11 UTC+1, Mathias Herberts wrote:
>
> The other important point is that last activity tracks when the GTS were last updated (or had their attributes modified), but it does not tell anything about which datapoints were written. This means that a series updated 2 years ago with datapoints which had a TTL set to 1 year could very well have data in the current 1-year period ending now, if the datapoints written 2 years ago were then in the future, with an HBase cell timestamp set to that of the datapoints (again, see https://blog.senx.io/all-there-is-to-know-about-warp-10-tokens/ and more specifically the .dpts attribute).
>
> On Friday, February 28, 2020 at 8:54:08 AM UTC+1, Mathias Herberts wrote:
>>
>> Hi,
>>
>> the TTL is not linked to the GTS itself but to each datapoint pushed to it. As the TTL can be set in the Token (see https://blog.senx.io/all-there-is-to-know-about-warp-10-tokens/), a single GTS can have datapoints with differing TTLs.
>>
>> As of today, the purge of what you call dead series can be performed with a combination of last activity, FIND, META and DELETE (or /find, /meta, /delete) in the following way:
>>
>> 1) Identify the series with no activity after a cut-off timestamp (via FIND or /find)
>> 2) Mark those GTS with a special attribute (via META or /meta)
>> 3) Fully delete the GTS selected using the special attribute set in 2) (via DELETE or /delete)
>>
>> The overall process could be made a little simpler if support for quiet after / active after were added to the /delete endpoint; so far it has been withheld intentionally to avoid accidental deletes caused by a misinterpretation of the last-activity window semantics.
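For reference, the flow described in the quoted steps above, expressed against the HTTP endpoints, roughly looks like the sketch below. This is a minimal illustration only: the host, tokens, class, label and attribute names are placeholders, and the quietafter parameter on /find assumes activity tracking is enabled, so check the exact parameter names against your Warp 10 version.

  # 1) Identify the series with no activity after a cut-off timestamp (ms since epoch)
  curl -g -H 'X-Warp10-Token: READ_TOKEN' \
    'https://warp10.example.org/api/v0/find?selector=my.class{}&quietafter=1546300800000'

  # 2) Mark the matching GTS with a special attribute; the class and full label set
  #    in each metadata line must identify the GTS exactly
  curl -H 'X-Warp10-Token: WRITE_TOKEN' \
    --data-binary 'my.class{dc=paris}{deadseries=true}' \
    'https://warp10.example.org/api/v0/meta'

  # 3) Fully delete the GTS carrying that attribute
  curl -g -H 'X-Warp10-Token: WRITE_TOKEN' \
    'https://warp10.example.org/api/v0/delete?selector=my.class{deadseries=true}&deleteall=true'

With tens of millions of series per account, it is iterating over steps 1 to 3 at this granularity that becomes prohibitively slow, which is exactly why point 2/ argues for doing the scan inside the Directory itself.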
>> On Wednesday, February 26, 2020 at 7:37:52 PM UTC+1, Steven Le Roux wrote:
>>>
>>> (This topic is a follow-up to the GitHub issue: https://github.com/senx/warp10-platform/issues/674)
>>>
>>> There are a few ways to manage data retention, and one we pushed for in the past was to support TTLs, so that a datapoint can be stored with an internal HBase insert time according to this TTL.
>>>
>>> If an operator implements a TTL-based data eviction policy, a situation can occur where, if a series does not receive any new datapoints during the TTL period, there are no points left for that series, yet the series itself still exists.
>>>
>>> I've called this the Dead Series pattern, and we've thought of different ways to answer this need.
>>>
>>> The first one would be to carry the TTL from the token into the metadata. The Directory would then be TTL-aware and could run a routine to garbage-collect dead series that have a TTL in their metadata structure. We could then process a find on a selector and compare the last activity (LA) against the TTL field.
>>>
>>> The second one would be to add a specific Egress call alongside /update, /meta, /delete, /fetch and /find, for example /clean?ttl=, so that the TTL is not forged into the metadata structure but passed as a parameter to a dedicated method. This way we can still implement the cleaning process inside the Directory directly, which would: scan the series like a FIND, compare the last activity with the given TTL, and delete the series directly. The problem here is that it would require querying each Directory to make this happen. So I propose that this routine could be enabled on a special Directory dedicated to this job, which would push a delete message on the Kafka metadata topic so that all Directories consume it.
>>>
>>> I feel that the second proposition is more efficient and less intrusive than the first one. The first one requires modifying the token and metadata structures and offers less flexibility, whereas the second would let a user trigger a clean with an arbitrary TTL value (one week for example, while on the operator side it could be the TTL defined on the platform).
>>>
>>> Also, since we rely on TTLs based on the LSM implementation in HBase, we decouple the series from the datapoints, but the TTL applies to the datapoints only. This mechanism is a proposition to help customers manage the entire dataset, by completing the metadata part.
>>>
>>> What do you think?
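To make the /clean proposal quoted above a bit more concrete, the call could look something like the sketch below. This is purely illustrative: the endpoint does not exist today, and its name, parameters and semantics are exactly what this proposal is about, not an existing API; host, token and selector are placeholders.

  # Hypothetical: drop every series matching the selector whose last activity
  # is older than now minus the given TTL (here 30 days, expressed in ms)
  curl -g -H 'X-Warp10-Token: WRITE_TOKEN' \
    'https://warp10.example.org/api/v0/clean?selector=my.class{}&ttl=2592000000'

Server-side, per the second proposition, the dedicated Directory would handle this like a FIND restricted to series whose last activity predates the TTL window, then publish delete messages on the Kafka metadata topic so that every Directory consumes them and drops the matching metadata.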
