# T-Digest Design Proposal
## Introduction

[Redis Stack has supported a new probabilistic data structure, t-digest](https://redis.io/blog/t-digest-in-redis-stack/). In #2423, we plan to support t-digest as well.

## Basics of T-Digest

The original paper is [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023) [1]. Thanks to the blog [T-Digest in Python](https://www.kuniga.me/blog/2021/11/29/t-digest.html) [7] and the [slide](https://blog.bcmeng.com/pdf/TDigest.pdf) [8], I have a better understanding of t-digest.

The main idea of t-digest is to divide the data into small bins and store only the mean and count of each bin; these bins are called centroids. This compresses a range of data points into a single centroid with just a mean and a weight: the mean is the average of the data in the bin, and the weight is the number of data points compressed into it. This kind of lossy summary is called a sketch. Sketching is necessary when we need to deal with large amounts of data. We use these centroids with interpolation to estimate quantiles of the data. We lose some precision on the original data, but in exchange we gain the ability to easily merge digests and calculate quantiles. We use the potential function `k` to control the distribution of the bins; it takes a parameter $\delta$ that controls the error of the distribution.

### metadata

In the metadata, we should store the compression $(1/\delta)$, the total weight, and the minimum and maximum of the t-digest. For the temporary input buffer, we should track the remaining buffer space, which triggers merging: when an item is pushed, `buffer_free` is decreased, and when `buffer_free` reaches zero, we merge the buffer into the centroids and reset `buffer_free`.

```
         +---------+---------+---------+---------+---------------+---------------+---------------+---------------+-------------+
key =>   | flags   | expire  | version | size    | compression   | total_weight  | minimum       | maximum       | buffer_free |
         | (1byte) | (Ebyte) | (8byte) | (Sbyte) | double(8byte) | double(8byte) | double(8byte) | double(8byte) | (Sbyte)     |
         +---------+---------+---------+---------+---------------+---------------+---------------+---------------+-------------+
```

### centroids

Centroids should be sorted by mean, and each centroid has a weight; both the mean and the weight are double-precision numbers. So, if we want to keep the order of the means, we need a way to serialize a double such that the serialized form preserves the original order. Here is an encoding from [duckdb](https://github.com/duckdb/duckdb/blob/34a3acc6b3354be86fe593d09e0702ab5eafe757/src/include/duckdb/common/radix.hpp#L112-L161) [9] that converts a double into an 8-byte uint64 while keeping its original order.

```cpp
inline uint64_t EncodeDouble(double x) {
    uint64_t buff;

    //! zero
    if (x == 0) {
        buff = 0;
        buff += (1ULL << 63);
        return buff;
    }
    // nan
    if (Value::IsNan(x)) {
        return ULLONG_MAX;
    }
    //! infinity
    if (x > DBL_MAX) {
        return ULLONG_MAX - 1;
    }
    //! -infinity
    if (x < -DBL_MAX) {
        return 0;
    }
    buff = Load<uint64_t>(const_data_ptr_cast(&x));
    if (buff < (1ULL << 63)) { //! +0 and positive numbers
        buff += (1ULL << 63);
    } else {                   //! negative numbers
        buff = ~buff;          //! complement 1
    }
    return buff;
}
```

Here is the subkey layout of a centroid. `centroid` is an enum tag for the centroid subkey type, and `mean` is the order-preserving serialized double.

```
                              +----------------+
key|version|centroid|mean =>  | weight         |
                              | (8byte) double |
                              +----------------+
```
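Reading a centroid back requires the inverse mapping, and the encoded uint64 must be written big-endian into the subkey so that RocksDB's bytewise key comparison agrees with the numeric order of the means. Below is a minimal sketch of both pieces; `DecodeDouble` and `PutOrderedDouble` are hypothetical helper names for this proposal, not duckdb or kvrocks API, and the sketch assumes the duckdb-style encoding above.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical inverse of the duckdb-style EncodeDouble above.
inline double DecodeDouble(uint64_t buff) {
    if (buff == UINT64_MAX) {  // nan
        return std::numeric_limits<double>::quiet_NaN();
    }
    if (buff == UINT64_MAX - 1) {  // infinity
        return std::numeric_limits<double>::infinity();
    }
    if (buff == 0) {  // -infinity
        return -std::numeric_limits<double>::infinity();
    }
    if (buff >= (1ULL << 63)) {
        buff -= (1ULL << 63);  // zero and positive numbers: undo the sign-bit flip
    } else {
        buff = ~buff;  // negative numbers: undo the one's complement
    }
    double x;
    std::memcpy(&x, &buff, sizeof(x));
    return x;
}

// Hypothetical helper: store the encoded uint64 big-endian so that the
// lexicographic byte order of the subkey matches the numeric order.
inline void PutOrderedDouble(uint8_t out[8], uint64_t encoded) {
    for (int i = 7; i >= 0; --i) {
        out[i] = static_cast<uint8_t>(encoded & 0xFF);
        encoded >>= 8;
    }
}
```

For any finite, non-zero double `x`, `DecodeDouble(EncodeDouble(x)) == x`; `-0.0` round-trips to `+0.0`, which is harmless for quantile estimation.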
### temporary buffer

The temp buffer is just a list of doubles. We should limit the total size of the buffer, and acquire a lock when trying to merge it into the centroids.

Since we already have a way to serialize a double into an order-preserving form, we can use the same encoding for the temporary double buffer. To avoid collisions between items with the same value, we should add a unique identifier to each item; a uint64 millisecond timestamp works for simplicity. `buffer` is an enum tag for the buffer subkey type, `value` is the order-preserving serialized double to be inserted, and `id` is the unique identifier for collision avoidance.

```
key|version|buffer|value|id => NULL
```

### concurrency safety

Pushing to the temporary list should acquire the lock, because it updates the metadata. The merge action should also acquire the lock; this covers both merging the buffer into the centroids and merging with other digests. When we calculate a percentile or the CDF, we first merge the buffer and take a snapshot; the calculation itself then runs without holding the lock.

## References

[1] [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023)
[2] <https://github.com/tdunning/t-digest>
[3] <https://issues.apache.org/jira/browse/ARROW-11367>
[4] <https://github.com/apache/arrow/pull/9310>
[5] <https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/tdigest.cc>
[6] <https://redis.io/docs/latest/develop/data-types/probabilistic/t-digest/>
[7] [T-Digest in Python](https://www.kuniga.me/blog/2021/11/29/t-digest.html)
[8] <https://blog.bcmeng.com/pdf/TDigest.pdf>
[9] <https://github.com/duckdb/duckdb>

GitHub link: https://github.com/apache/kvrocks/discussions/2542