scheler opened a new issue, #15704:
URL: https://github.com/apache/pinot/issues/15704
Requesting comments on the following proposal to add `quantileIndex` support
for fast percentile queries.
**Description**:
Currently, Apache Pinot does not provide a native, automatic way to optimize
percentile and quantile queries. Users often resort to manually creating and
ingesting sketches (such as KLL or t-digest) outside of Pinot, which results in
redundant data, complex workflows, and inefficient query execution. The
sketches must be merged at query time, leading to full scans of the data, which
defeats the purpose of using sketches for optimization.
This proposal suggests adding a quantileIndex to Pinot. This index type
would allow the system to automatically build and store quantile sketches
(e.g., KLL sketches) during segment creation, enabling fast percentile queries
without requiring full scans or manual sketch creation.
**Motivation**:
- Simplifies user experience: No need to manage sketch columns or manual
  aggregation.
- Improves query performance: Sketches are pre-merged at the segment level,
  reducing the need for row-level processing at query time.
- Aligns with Pinot's indexing model: Like existing indexes (inverted, bloom,
  range), the quantileIndex can be treated as an optional, transparent
  optimization.
**Proposed Solution**:
Implement the quantileIndex as a new index type in Pinot that supports the
creation and querying of KLL sketches (initially), with potential for
supporting other sketch types in the future.
During segment generation, Pinot would generate a single sketch for each
quantile-indexed column.
At query time, the planner can leverage this index to quickly return
approximate percentile values without scanning all rows.
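
To make the build-then-merge flow concrete, here is a minimal Python sketch. It is not KLL and not Pinot code: it swaps in a much simpler equal-weight point summary so the example stays self-contained, and every name in it (`compress`, `build_segment_summary`, `merge`, `quantile`) is hypothetical. It only illustrates the shape of the idea: one small summary per segment at build time, a cheap merge at query time.

```python
from typing import Iterable, List, Tuple

# A summary is a value-sorted list of (value, weight) pairs.
# NOTE: this is a toy stand-in for a KLL sketch, not the KLL algorithm.
Summary = List[Tuple[float, float]]


def compress(points: Summary, k: int) -> Summary:
    """Shrink a sorted weighted list to at most k equal-weight points."""
    if len(points) <= k:
        return list(points)
    total = sum(weight for _, weight in points)
    step = total / k          # weight carried by each retained point
    out: Summary = []
    cumulative = 0.0
    target = step / 2         # mid-rank of the first of k equal buckets
    for value, weight in points:
        cumulative += weight
        # Emit one representative for every bucket mid-rank we have passed.
        while len(out) < k and cumulative >= target:
            out.append((value, step))
            target += step
    return out


def build_segment_summary(values: Iterable[float], k: int) -> Summary:
    """Segment-creation step: summarize one column of one segment."""
    return compress([(float(v), 1.0) for v in sorted(values)], k)


def merge(summaries: Iterable[Summary], k: int) -> Summary:
    """Query-time step: combine per-segment summaries without touching rows."""
    merged = sorted(point for summary in summaries for point in summary)
    return compress(merged, k)


def quantile(summary: Summary, q: float) -> float:
    """Return the first value whose cumulative weight reaches rank q."""
    if not summary:
        raise ValueError("empty summary")
    total = sum(weight for _, weight in summary)
    cumulative = 0.0
    for value, weight in summary:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return summary[-1][0]


# Two hypothetical segments of 1000 rows each.
segment_a = build_segment_summary(range(0, 1000), k=200)
segment_b = build_segment_summary(range(1000, 2000), k=200)
table_level = merge([segment_a, segment_b], k=200)
print(quantile(table_level, 0.95))  # close to, but not exactly, the true p95
```

The query-time cost here depends on the number of segments and on k, not on the number of rows. A real implementation would presumably use a proper KLL sketch, which gives formal rank-error guarantees that this toy summary does not.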
**Storage Impact**:
- Sketch Storage: Each segment will contain one merged KLL sketch per
  quantile-indexed column. This will add some storage overhead, but the sketch
  is compact compared to raw data.
- Segment-level Storage: The sketch will increase storage per segment, but the
  space is used efficiently to provide significant performance benefits during
  percentile queries.
- Selective Indexing: Users can choose which columns to apply the
  quantileIndex to, allowing for targeted optimization and minimal storage
  overhead for less critical columns.
**Benefits**:
- Faster percentile queries by using pre-computed, merged sketches.
- Reduced ingestion and storage overhead compared to storing a sketch per row.
- No change to existing query syntax: Users can continue using PERCENTILE or
  PERCENTILEEST functions as usual.
**Interface Definitions**:
1. Setting Up the Index in the Schema:
To apply the quantileIndex on a column, the user would define the index type
in the schema configuration. The index can be configured with optional
parameters, like k, which controls the accuracy of the KLL sketch.
Example Schema Configuration:
```
{
  "columns": [
    {
      "name": "latency_ms",
      "dataType": "FLOAT",
      "indexTypes": ["quantileIndex"],
      "quantileIndexConfig": {
        "k": 200
      }
    }
  ]
}
```
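In this example, the quantileIndex is applied to the latency_ms column with
the parameter k set to 200. In a KLL sketch, k controls the trade-off between
sketch size and accuracy: a larger k retains more items and lowers the rank
error, at the cost of more storage.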
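
As a rough illustration of how such a config might be read and validated, here is a minimal Python sketch. The function name `quantile_index_columns`, the default of 200 when `k` is omitted, and the accepted range of 8 to 65535 are all assumptions of this sketch (the range is what I understand the Apache DataSketches KLL implementation to accept). None of them is specified by the proposal.

```python
import json
from typing import Dict

DEFAULT_K = 200           # assumption: default when "k" is omitted
MIN_K, MAX_K = 8, 65535   # assumption: bounds borrowed from DataSketches KLL

EXAMPLE_SCHEMA = """
{
  "columns": [
    {
      "name": "latency_ms",
      "dataType": "FLOAT",
      "indexTypes": ["quantileIndex"],
      "quantileIndexConfig": {"k": 200}
    }
  ]
}
"""


def quantile_index_columns(schema_json: str) -> Dict[str, int]:
    """Map each quantile-indexed column name to its validated k."""
    schema = json.loads(schema_json)
    result: Dict[str, int] = {}
    for column in schema.get("columns", []):
        if "quantileIndex" not in column.get("indexTypes", []):
            continue  # column does not request the index
        k = column.get("quantileIndexConfig", {}).get("k", DEFAULT_K)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(k, bool) or not isinstance(k, int):
            raise ValueError(f"k must be an integer, got {k!r}")
        if not MIN_K <= k <= MAX_K:
            raise ValueError(f"k={k} is outside [{MIN_K}, {MAX_K}]")
        result[column["name"]] = k
    return result


print(quantile_index_columns(EXAMPLE_SCHEMA))
```

Rejecting a bad `k` at config-load time, rather than at segment build time, would surface mistakes before any data is ingested.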
2. Querying Percentiles:
The query syntax for percentiles would remain the same as it is in Pinot
today. When a quantileIndex is present for the column, Pinot will automatically
use the pre-merged sketch to speed up the percentile calculation.
Example SQL Query:
`SELECT PERCENTILE(latency_ms, 95) FROM my_table`
The query planner will recognize the quantileIndex and efficiently return
the 95th percentile using the precomputed KLL sketch data.
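
One point the planner would need to settle is when the pre-merged sketch is actually usable. The Python sketch below shows one possible rule, under an assumption the proposal does not state: since a single sketch per segment summarizes every row, it can only answer queries that select every row, so any query with a filter falls back to the existing scan path. `PercentileQuery`, `choose_plan`, and the plan names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class PercentileQuery:
    column: str
    percentile: float
    filter_expr: Optional[str] = None  # None means the query has no WHERE clause


def choose_plan(query: PercentileQuery, indexed_columns: Set[str]) -> str:
    """Pick the execution path for one percentile aggregation."""
    has_index = query.column in indexed_columns
    selects_all_rows = query.filter_expr is None
    if has_index and selects_all_rows:
        # The segment-level sketch covers exactly the rows being aggregated.
        return "QUANTILE_INDEX"
    # Assumption: a filtered query cannot reuse a whole-segment sketch.
    return "SCAN"


indexed = {"latency_ms"}
print(choose_plan(PercentileQuery("latency_ms", 95), indexed))
print(choose_plan(PercentileQuery("latency_ms", 95, "region = 'eu'"), indexed))
```

If filtered queries are expected to benefit as well, the design would need something finer-grained than one sketch per segment, which is outside what this issue describes.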
**Supported Primitive Data Types for quantileIndex**:
The quantileIndex can be used for numeric columns where percentile and
quantile calculations are meaningful.
- INT
- LONG
- FLOAT
- DOUBLE
- BIG_DECIMAL
These data types hold ordered numeric values, for which percentiles are well
defined. The KLL sketch (or a similar sketch) will be computed over these
columns during segment creation, allowing fast approximate percentile queries
at query time.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]