# T-Digest Design Proposal

## Introduction

[Redis Stack supports a new probabilistic data structure, t-digest](https://redis.io/blog/t-digest-in-redis-stack/).

In #2423, we plan to support t-digest as well.

## Basics of T-Digest

The original paper is [Computing Extremely Accurate Quantiles Using 
t-Digests](https://arxiv.org/abs/1902.04023) [1].

Thanks to the blog [T-Digest in Python](https://www.kuniga.me/blog/2021/11/29/t-digest.html) [7] and the [slides](https://blog.bcmeng.com/pdf/TDigest.pdf) [8], I have a better understanding of t-digest.

The main idea of t-digest is to divide the data into small bins and store the mean and count of each bin; these (mean, count) pairs are called centroids.

This compresses a range of data into a single centroid with just a mean and a weight. The mean is the average of the values in the bin, and the weight is the number of values compressed into the bin.

This kind of summary is called a sketch. Sketches are necessary when we need to deal with large amounts of data.

We use these centroids, together with interpolation, to estimate the quantiles of the data.

We lose some precision of the original data, but in exchange we can easily merge digests and calculate quantiles.
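As a toy illustration (not the real merging algorithm), a bin of values can be collapsed into a single centroid by keeping only a running mean and a count:

```cpp
#include <cstdio>

// Toy sketch: compress a bin of values into one centroid (mean + weight).
struct Centroid {
  double mean = 0.0;
  double weight = 0.0;
  void Add(double v) {
    weight += 1.0;
    mean += (v - mean) / weight;  // incremental running mean
  }
};

int main() {
  Centroid c;
  for (double v : {1.0, 2.0, 2.0, 3.0}) c.Add(v);
  std::printf("mean=%.2f weight=%.0f\n", c.mean, c.weight);  // mean=2.00 weight=4
}
```

Two digests can later be merged by combining their centroid lists, without ever revisiting the original values.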

We use the potential (scale) function `k` to control the distribution of the bins.

The function takes a parameter $\delta$ that controls the error of the approximation.
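For example, the first scale function in the paper [1] is

$$k_1(q) = \frac{\delta}{2\pi} \sin^{-1}(2q - 1)$$

where $q$ is the quantile, i.e. the fraction of the total weight to the left of a centroid. A bin may keep absorbing points only while $k(q_{\text{right}}) - k(q_{\text{left}}) \le 1$; because $k_1$ is steep near $q = 0$ and $q = 1$, bins in the tails stay small (high accuracy) while bins near the median grow large (high compression).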

### metadata

In the metadata, we should store the compression $(1/\delta)$, the `total_weight`, and the minimum and maximum of the t-digest.

For the temporary input buffer, we should also track the remaining free space, which serves as the merge trigger.
When an item is pushed, `buffer_free` is decremented. When `buffer_free` reaches zero, we merge the buffer into the centroids and reset `buffer_free`.

            +----------+----------+-----------+----------+----------------+----------------+----------------+----------------+-------------+
    key =>  |  flags   |  expire  |  version  |   size   |  compression   |  total_weight  |    minimum     |    maximum     | buffer_free |
            | (1byte)  | (Ebyte)  |  (8byte)  | (Sbyte)  | double(8byte)  | double(8byte)  | double(8byte)  | double(8byte)  |   (Sbyte)   |
            +----------+----------+-----------+----------+----------------+----------------+----------------+----------------+-------------+
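As a sketch, the t-digest-specific fields could map to a struct like the following (field names are illustrative, not final):

```cpp
#include <cstdint>

// A minimal sketch of the t-digest metadata, assuming hypothetical names;
// flags/expire/version/size come from the common kvrocks metadata header.
struct TDigestMetadata {
  double compression;    // 1/delta, fixed when the key is created
  double total_weight;   // total number of observed values (merged + buffered)
  double minimum;        // smallest value seen so far
  double maximum;        // largest value seen so far
  uint64_t buffer_free;  // remaining buffer slots before a merge is triggered
};
```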

### centroids

Each centroid has a mean and a weight, and centroids must stay sorted by mean. The mean and weight are both double-precision numbers.
So, if we want to keep centroids ordered by mean, we need a way to serialize a double so that the byte order after serialization matches the numeric order before it.
Here is one such encoding, adapted from [duckdb](https://github.com/duckdb/duckdb/blob/34a3acc6b3354be86fe593d09e0702ab5eafe757/src/include/duckdb/common/radix.hpp#L112-L161) [9], which maps a double to an 8-byte uint64 while preserving its order:

```cpp
#include <cfloat>
#include <cmath>
#include <cstdint>
#include <cstring>

inline uint64_t EncodeDouble(double x) {
  uint64_t buff;
  //! +0 and -0 map to the same code, just above all negative numbers
  if (x == 0) {
    return 1ULL << 63;
  }
  //! NaN sorts last
  if (std::isnan(x)) {
    return UINT64_MAX;
  }
  //! +infinity sorts just below NaN
  if (x > DBL_MAX) {
    return UINT64_MAX - 1;
  }
  //! -infinity sorts first
  if (x < -DBL_MAX) {
    return 0;
  }
  std::memcpy(&buff, &x, sizeof(buff));
  if (buff < (1ULL << 63)) {  //! positive numbers: set the sign bit
    buff += (1ULL << 63);
  } else {  //! negative numbers: flip all bits to reverse their order
    buff = ~buff;
  }
  return buff;
}
```

Here is the subkey layout of a centroid.

`centroid` is an enum value marking the centroid subkey type.
`mean` is the order-preserving serialized double.

                                  +----------------+
    key|version|centroid|mean =>  |      weight    |
                                  | (8byte) double |
                                  +----------------+
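A minimal sketch of composing this subkey, assuming illustrative helper names (the actual kvrocks encoders may differ):

```cpp
#include <cstdint>
#include <string>

// Illustrative subkey types; the real enum would live with the other kvrocks types.
enum class TDigestSubkeyType : uint8_t { kCentroid = 1, kBuffer = 2 };

// Append a uint64 big-endian so lexicographic byte order matches numeric order.
inline void AppendFixed64BE(std::string *out, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8) {
    out->push_back(static_cast<char>((v >> shift) & 0xFF));
  }
}

// key|version|centroid|mean; the weight is stored as the row's value.
inline std::string MakeCentroidSubkey(const std::string &key, uint64_t version, double mean) {
  std::string subkey = key;
  AppendFixed64BE(&subkey, version);
  subkey.push_back(static_cast<char>(TDigestSubkeyType::kCentroid));
  AppendFixed64BE(&subkey, EncodeDouble(mean));  // order-preserving encoding from above
  return subkey;
}
```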

### temporary buffer

The temporary buffer is just a list of doubles. We should limit the total size of the buffer and acquire a lock when merging it into the centroids.

Since we already have an order-preserving serialization for doubles, we can use the same encoding for the temporary buffer.

To avoid collisions between items with the same value, we should add a unique identifier to each item. It can be a uint64 millisecond timestamp for simplicity.

`buffer` is an enum value marking the buffer subkey type.
`value` is the order-preserving serialized double to be inserted.
`id` is a unique identifier used to avoid collisions.

      key|version|buffer|value|id => NULL
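Following the same pattern (again with the illustrative helpers from the centroid sketch), the buffer subkey appends the order-preserving value and then the identifier:

```cpp
// key|version|buffer|value|id => NULL; reuses AppendFixed64BE and EncodeDouble.
inline std::string MakeBufferSubkey(const std::string &key, uint64_t version,
                                    double value, uint64_t id) {
  std::string subkey = key;
  AppendFixed64BE(&subkey, version);
  subkey.push_back(static_cast<char>(TDigestSubkeyType::kBuffer));
  AppendFixed64BE(&subkey, EncodeDouble(value));  // keeps entries ordered by value
  AppendFixed64BE(&subkey, id);                   // e.g. a millisecond timestamp
  return subkey;
}
```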

### concurrency safety

Pushing to the temporary buffer should acquire the lock, since the push also updates the metadata.

The merge action should also acquire the lock. This covers both merging the buffer into the centroids and merging with other digests.

When we calculate a percentile or the CDF, we should first merge the buffer and take a snapshot.

The calculation itself then runs without holding the lock.
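A minimal in-memory sketch of this locking scheme (the names and data layout are illustrative; the real implementation would use kvrocks' per-key locks and RocksDB snapshots):

```cpp
#include <algorithm>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative only: writes and merges run under the lock; quantile queries
// merge the buffer under the lock, then compute on a snapshot without it.
class TDigestLockingSketch {
 public:
  void Add(double value) {
    std::lock_guard<std::mutex> guard(mu_);  // push also updates metadata (buffer_free)
    buffer_.push_back(value);
    if (buffer_.size() >= kBufferLimit) MergeBufferLocked();
  }

  double Quantile(double q) {
    std::vector<std::pair<double, double>> snapshot;  // (mean, weight)
    {
      std::lock_guard<std::mutex> guard(mu_);
      MergeBufferLocked();    // fold pending values into the centroids
      snapshot = centroids_;  // snapshot taken while holding the lock
    }
    // Lock-free estimation on the snapshot; a nearest-centroid lookup stands
    // in for the real interpolation.
    double total = 0;
    for (const auto &c : snapshot) total += c.second;
    double target = q * total, cum = 0;
    for (const auto &c : snapshot) {
      cum += c.second;
      if (cum >= target) return c.first;
    }
    return snapshot.empty() ? 0.0 : snapshot.back().first;
  }

 private:
  static constexpr size_t kBufferLimit = 1024;

  void MergeBufferLocked() {
    // Naive stand-in for t-digest compression: one centroid per value.
    for (double v : buffer_) centroids_.emplace_back(v, 1.0);
    std::sort(centroids_.begin(), centroids_.end());
    buffer_.clear();
  }

  std::mutex mu_;
  std::vector<double> buffer_;
  std::vector<std::pair<double, double>> centroids_;
};
```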

## References

[1] [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023)

[2] <https://github.com/tdunning/t-digest>

[3] <https://issues.apache.org/jira/browse/ARROW-11367>

[4] <https://github.com/apache/arrow/pull/9310>

[5] <https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/tdigest.cc>

[6] <https://redis.io/docs/latest/develop/data-types/probabilistic/t-digest/>

[7] [T-Digest in Python](https://www.kuniga.me/blog/2021/11/29/t-digest.html)

[8] <https://blog.bcmeng.com/pdf/TDigest.pdf>

[9] <https://github.com/duckdb/duckdb>
