(This topic is a follow-up to the GitHub issue:
https://github.com/senx/warp10-platform/issues/674)

There are a few ways to manage data retention, and one we've pushed in the
past is TTL support, so that data points are stored in HBase with a TTL
applied to their internal insert time.

When an operator implements a TTL-based data eviction policy, a situation
can occur where a series receives no new data points during the TTL period:
every data point of the series eventually expires, yet the series itself
still exists in the Directory.

I've called this the Dead Series pattern, and we've thought of different
ways to address it.


The first one would be to propagate the TTL from the token into the metadata.
The Directory would then be TTL-aware and could run a routine to garbage
collect dead series carrying a TTL in their Metadata structure: process a
FIND on a selector and compare each series' last activity (LA) with its TTL
field, as sketched below.
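
A minimal Java sketch of such a GC pass. SeriesMeta is only a stand-in for
the Directory's Metadata structure; the ttlMs and lastActivityMs fields are
the hypothetical additions this proposal would require, nothing like them
exists today:

    import java.util.ArrayList;
    import java.util.List;

    // Stand-in for the Directory's Metadata structure, extended with the
    // hypothetical fields this proposal would add.
    class SeriesMeta {
      final String selector;
      final long ttlMs;           // TTL propagated from the token (hypothetical)
      final long lastActivityMs;  // time of the last pushed data point (hypothetical)

      SeriesMeta(String selector, long ttlMs, long lastActivityMs) {
        this.selector = selector;
        this.ttlMs = ttlMs;
        this.lastActivityMs = lastActivityMs;
      }
    }

    class DirectoryGc {
      // Collect the dead series: those carrying a TTL whose last activity
      // is older than that TTL.
      static List<SeriesMeta> collectDead(List<SeriesMeta> all, long nowMs) {
        List<SeriesMeta> dead = new ArrayList<>();
        for (SeriesMeta m : all) {
          if (m.ttlMs > 0 && nowMs - m.lastActivityMs > m.ttlMs) {
            dead.add(m);
          }
        }
        return dead;
      }
    }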

The second one would be to add a specific Egress call alongside /update,
/meta, /delete, /fetch and /find, for example /clean?ttl=, so that the TTL
is not baked into the metadata structure but passed as a parameter to a
dedicated method. This way the cleaning process can still be implemented
inside the Directory itself, which would: scan the series like a FIND,
compare the LastActivity with the given TTL, and delete the matching series
directly. The problem here is that it would require querying every
directory. I therefore propose that this routine be enabled on a special
directory specialized in this job, which would push a delete message on the
Kafka metadata topic so that all directories consume it (see the sketch
below).
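
A rough sketch of that routine on the specialized directory, reusing the
SeriesMeta stand-in from the previous sketch. The "metadata" topic name and
the serializeDelete() encoding are assumptions; only the Kafka producer API
is real:

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    class DirectoryCleaner {
      // Walk the series matched like a FIND, compare LastActivity with the
      // given TTL, and publish a delete message for each dead series so
      // that every directory consuming the metadata topic drops it.
      static void clean(List<SeriesMeta> matches, long ttlMs,
                        KafkaProducer<byte[], byte[]> producer) {
        long nowMs = System.currentTimeMillis();
        for (SeriesMeta m : matches) {
          if (nowMs - m.lastActivityMs > ttlMs) {
            byte[] payload = serializeDelete(m);
            producer.send(new ProducerRecord<>("metadata", payload));
          }
        }
      }

      // Hypothetical: encode a Directory delete message for this series.
      static byte[] serializeDelete(SeriesMeta m) {
        return ("DELETE " + m.selector).getBytes(StandardCharsets.UTF_8);
      }
    }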


I feel the second proposal is more efficient and less intrusive than the
first one. The first requires modifying the token and metadata structures
and offers less flexibility, whereas the second would let a user trigger a
clean on an arbitrary TTL value (one week for example, while on the
operator side it could be the TTL defined on the platform).
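
For example, a user-triggered clean with a one-week TTL might look like
this. The /api/v0/clean endpoint and its ttl parameter are the proposal
itself (the TTL is expressed in milliseconds here, but the unit is open),
and the host and token are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CleanCall {
      public static void main(String[] args) throws Exception {
        // Hypothetical endpoint from this proposal: drop every series the
        // token can see that has been inactive for more than one week.
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("https://warp.example.org/api/v0/clean?ttl=604800000"))
            .header("X-Warp10-Token", "WRITE_TOKEN")
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode());
      }
    }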


Also, since we rely on the TTL provided by HBase's LSM implementation, the
series are decorrelated from their data points: the TTL applies to the data
points only. This mechanism is a proposal to help customers manage the
entire dataset by covering the metadata side as well.
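
For context, the existing retention is a TTL on the HBase column family
that stores the data points, along these lines; the table and family names
below are illustrative, not necessarily Warp 10's actual schema:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SetDataTtl {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          // A TTL on the data table's column family: HBase drops expired
          // cells at compaction time, but the Directory metadata is untouched.
          ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
              .newBuilder(Bytes.toBytes("v"))   // illustrative family name
              .setTimeToLive(7 * 24 * 3600)     // one week, in seconds
              .build();
          admin.modifyColumnFamily(TableName.valueOf("data"), cf);
        }
      }
    }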

What do you think?
