It is not possible to specify multiple selectors in a /delete or DELETE request. The reason is that multiple selectors could match the same GTS several times, and since no deduplication is done on the Directory side, multiple delete messages could be sent for the same GTS, which would be inefficient.
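For illustration, a minimal sketch (Python, against the standard /api/v0/delete HTTP endpoint) of issuing one request per selector; the host, token, selectors and the deleteall parameter below are placeholders or assumptions to verify against the delete documentation of your Warp 10 version.

```python
# Minimal sketch: /delete accepts a single selector per request, so one
# request is issued per selector. Host, token, selectors and the
# 'deleteall' parameter are placeholders/assumptions to verify against
# the delete endpoint documentation of your Warp 10 version.
import requests

DELETE_URL = "http://warp10.example.org:8080/api/v0/delete"  # placeholder host
WRITE_TOKEN = "WRITE_TOKEN"                                  # placeholder token

selectors = [
    "app.metric.one{dc=eu}",
    "app.metric.two{dc=eu}",
]

for selector in selectors:
    resp = requests.get(
        DELETE_URL,
        params={"selector": selector, "deleteall": ""},  # wipe all datapoints of the matching GTS
        headers={"X-Warp10-Token": WRITE_TOKEN},
    )
    resp.raise_for_status()
    print(selector, "->", resp.text.strip())
```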
The selector can contain regular expressions for both class and labels, so you can use this approach if what you are trying to match can be expressed as a regexp, or you can issue multiple /delete or DELETE queries if not.

On Monday, April 6, 2020 at 3:46:14 PM UTC+2, A. Hébert wrote:

We tested and are currently running the new delete version, and it is working great: we are able to delete time series based on their last activity. However, I didn't find any way to send several selectors to a delete endpoint. Is that possible?

On Monday, March 9, 2020 at 7:08:54 PM UTC+1, A. Hébert wrote:

Hello, I just opened a first PR on this subject: https://github.com/senx/warp10-platform/pull/687. Let me know your thoughts about it.

On Monday, March 9, 2020 at 11:05:45 AM UTC+1, A. Hébert wrote:

Sorry if my answer led you to think we had a performance issue. That isn't the point of this topic: the DELETE endpoint is great for deleting data, but how do we handle massive META deletes? The current delete process isn't trivial for a Warp 10 user. The proposal of a clean endpoint would allow a user to simply delete series and their META. For example, they could delete empty/invalid/unused series (that still have points, which will be deleted later once they expire with a TTL). Using the TTL on such an entry point can be considered optional here; it acts more as a safeguard to ensure that no recent series are deleted.

On Saturday, March 7, 2020 at 3:11:42 PM UTC+1, Mathias Herberts wrote:

Can you give some details as to the number of classes, the number of GTS per class, etc., the output of FINDSTATS and the size of the arrays returned by FINDSETS? Do your FIND requests contain activeafter/quietafter specifiers?

On Friday, March 6, 2020 at 5:54:25 PM UTC+1, A. Hébert wrote:

Hello, technically it's working as expected. However, for large accounts (around 50 million series), deleting empty series with this method takes time: a FIND with high cardinality is slow (5-60 s depending on the selector), and you then have to produce a META message before applying the DELETE. This also means the Directory has three messages to process to complete a clean of those empty series. That's why we are thinking about how to simplify it, and we came up with the idea of a /clean handled directly inside the Directory.

The idea of the "clean" endpoint is really to be able to clean series according to a selector and a TTL. As mentioned, it can remove a series that still has points; for me that is a valid trade-off, and it can even be useful to delete only the META entry of an unused series and let HBase purge the datapoints once they reach their TTL.

On Tuesday, March 3, 2020 at 4:33:10 PM UTC+1, Mathias Herberts wrote:

Can you be more specific about what "doesn't work"?

On Tuesday, March 3, 2020 at 10:23:48 AM UTC+1, Steven Le Roux wrote:

I'm well aware of all this :)

A few points here:

1/ Having one GTS with different retention per point should be banned, since on the operator side it is very hard to manage. From our experience, we will enforce a single retention policy per user account. You can still have the same GTS with different retention policies, but spread across different accounts.
This way we can have an autonomous system that cleans accounts based on the account-defined TTL.

2/ The process you're proposing is already what we do, but it doesn't work: it's way too slow for highly dynamic environments where you create more series than you delete. For big accounts with more than dozens of millions of series, the FIND/META/DELETE process is just not workable, hence the idea of identifying deletes on the Directory itself through the internal scanner.

3/ Regarding the example with last activity, it's perfectly acceptable to me that batch-produced data pushed two years ago with a TTL of one year could have its datapoints purged. If the user wants a bigger TTL, it's up to them to define it, but the TTL should also be associated with the time at which the datapoints were pushed, not with their own timestamp value. In analytics, for example, if you have a forensic job and need to compute datapoints for the next six months, your series could have a ten-year lifetime, but you know that when you're finished the job is done, so the TTL is there to help the customer clean their dataset.

Like you said with .dpts, since it is customer-specific, the clean process should also be customer-scoped. The proposed solution may not be the best, but there is currently no existing solution to this problem. Still, I'm open to any other idea that eases delete operations if you see an alternative.

Otherwise, if you agree with this, we can start working on a PR.

On Friday, 28 February 2020 08:57:11 UTC+1, Mathias Herberts wrote:

The other important point is that last activity tracks when the GTS was last updated (or had its attributes modified), but it does not tell you anything about which datapoints were written. This means that a series updated two years ago with datapoints which had a TTL set to one year could very well have data in the one-year period ending now, if the datapoints written two years ago were then in the future, with an HBase cell timestamp set to that of the datapoints (again, see https://blog.senx.io/all-there-is-to-know-about-warp-10-tokens/ and more specifically the .dpts attribute).

On Friday, February 28, 2020 at 8:54:08 AM UTC+1, Mathias Herberts wrote:

Hi,

the TTL is not linked to the GTS itself but to each datapoint pushed to it. As the TTL can be set in the Token (see https://blog.senx.io/all-there-is-to-know-about-warp-10-tokens/), a single GTS can have datapoints with differing TTLs.
As of today, the purge of what you call dead series can be performed with a combination of last activity, FIND, META and DELETE (or /find, /meta, /delete) in the following way:

1) Identify the series with no activity after a cut-off timestamp (via FIND or /find)
2) Mark those GTS with a special attribute (via META or /meta)
3) Fully delete the GTS you select using the special attribute set in 2 (via DELETE or /delete)

The overall process could be made a little simpler if support for quiet after/active after were added to the /delete endpoint; so far it has been withheld intentionally to avoid accidental deletes caused by a misinterpretation of the last-activity window semantics.

On Wednesday, February 26, 2020 at 7:37:52 PM UTC+1, Steven Le Roux wrote:

(This topic is a follow-up to the GitHub issue: https://github.com/senx/warp10-platform/issues/674)

There are a few ways to manage data retention, and one we pushed in the past was to support TTLs, so that datapoints can be stored with an internal HBase insert time according to this TTL.

In case an operator implements a TTL-based data eviction policy, a situation can occur where, if a series has no new datapoints pushed during the TTL period, there are no points left for the series but the series still exists.

I've called this the Dead Series pattern, and we've thought of different ways to answer this need.

The first one would be to track the TTL from the token into the metadata. The Directory would then be TTL-aware and could run a routine to garbage-collect dead series that have a TTL in their metadata structure. We could then process a find on a selector and compare the last activity (LA) with the TTL field.

The second one would be to add a specific Egress call, alongside /update, /meta, /delete, /fetch and /find, for example a /clean?ttl= , so that the TTL is not forged into the metadata structure but passed as a parameter to a specific method. This way we can still implement the cleaning process inside the Directory directly, which would scan the series like a FIND, compare the last activity with the given TTL, and delete the series directly. The problem here is that it would require querying each Directory to make it happen. So I propose that this routine could be enabled on a special Directory dedicated to this job, which would push a delete message on the Kafka metadata topic so that all Directories consume it.
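As a purely hypothetical sketch of what such a call could look like (the /clean endpoint does not exist in Warp 10; the path, the ttl parameter, its unit and the semantics below are all assumptions drawn from the proposal above):

```python
# Hypothetical sketch of the proposed /clean call. This endpoint does not
# exist in Warp 10 today; path, parameters and semantics are assumptions
# based on the proposal above (drop the metadata of series whose last
# activity is older than the given TTL).
import requests

CLEAN_URL = "http://warp10.example.org:8080/api/v0/clean"  # hypothetical endpoint
WRITE_TOKEN = "WRITE_TOKEN"                                # placeholder token

ONE_WEEK_US = 7 * 24 * 3600 * 1_000_000  # assumed unit: microseconds (platform default)

resp = requests.get(
    CLEAN_URL,
    params={
        "selector": r"~app\.metric\..*{}",  # series to consider (regexp class selector)
        "ttl": str(ONE_WEEK_US),            # series quiet for longer than this get purged
    },
    headers={"X-Warp10-Token": WRITE_TOKEN},
)
resp.raise_for_status()
print(resp.text)
```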
The first one requires to modify the >>>>>>>>>> token >>>>>>>>>> and metadata structures and offers less flexibility where the second >>>>>>>>>> could >>>>>>>>>> enable a clean process by a user on an arbitrary TTL value (one week >>>>>>>>>> for >>>>>>>>>> example, while on the operator side, it could be the TTL defined on >>>>>>>>>> the >>>>>>>>>> platform). >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Also, since we rely on TTL based on the LSM implementation in >>>>>>>>>> HBase, we decorrelate the series from the data points, but TTL >>>>>>>>>> applies on >>>>>>>>>> the datapoints only. This mecanism is a proposition to help >>>>>>>>>> customers >>>>>>>>>> manage the entire dataset, by completing the metadata part. >>>>>>>>>> >>>>>>>>>> What do you think ? >>>>>>>>>> >>>>>>>>> -- You received this message because you are subscribed to the Google Groups "Warp 10 users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/warp10-users/a1c810ce-649c-4569-bf93-df23b69b8c5f%40googlegroups.com.
