[jira] [Comment Edited] (HIVE-26221) Add histogram-based column statistics

Alessandro Solimando (Jira) Thu, 30 Jun 2022 02:13:03 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560941#comment-17560941
 ]


Alessandro Solimando edited comment on HIVE-26221 at 6/30/22 9:12 AM:
----------------------------------------------------------------------

Thanks [~Chunwei Lei] for your interest. There is already a WIP PR that is 
almost ready for review (I need to fix a conflict and update some test output 
files). I have linked it to the ticket in case you want to take a look 
before it is finalized.

Regarding support for strings, these are the considerations we have made so far:
 * KLL sketches support only _float_: we could of course use an encoding 
respecting the lexicographical ordering of strings,
 * there is no general way to use KLL sketches for equality predicates: they 
are naturally suited to range predicates, because for equality we need a 
notion of "immediate predecessor/successor" to get the cardinality of the range 
_<pred(elem), elem>_ or _<elem, succ(elem)>_, and this is trivial only for data 
types mapping to the integer family (due to how the 
[getCDF(float[] splitPoints)|https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/kll/KllFloatsSketch.html#getCDF-float:A-]
 method works),
 * strings seem to be more frequently involved in equality predicates, for 
which 
[ItemsSketch|https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html]
 is more suitable; we are exploring this angle in a parallel ongoing project.
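
To make the predecessor/successor point concrete, here is a minimal sketch in 
Python. It is not the Hive or DataSketches code: an exact empirical CDF over a 
sorted list stands in for the approximate CDF a KLL sketch would return, the 
inclusive "<= x" convention is an assumption, and all function names are 
hypothetical.

```python
from bisect import bisect_right


def cdf(sorted_values, x):
    """Fraction of values <= x. Exact stand-in for an approximate sketch CDF."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def range_selectivity(sorted_values, low, high):
    """Estimated fraction of rows with low < value <= high."""
    return cdf(sorted_values, high) - cdf(sorted_values, low)


def int_equality_selectivity(sorted_values, v):
    """For integers pred(v) = v - 1, so 'col = v' is the range (v - 1, v]."""
    return range_selectivity(sorted_values, v - 1, v)


# Hypothetical integer column with repeated values.
column = sorted([1, 2, 2, 2, 3, 5, 5, 8, 8, 8])

eq_2 = int_equality_selectivity(column, 2)   # rows equal to 2
range_2_5 = range_selectivity(column, 2, 5)  # rows with 2 < value <= 5
```

For floats or strings there is no comparable _pred(v)_ to plug into the last 
function, which is why the same trick does not carry over in general.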



was (Author: asolimando):
Thanks [~Chunwei Lei] for your interest, there is a WIP PR already which is 
almost ready for review (need to fix a conflict and update some test output 
files). I have linked it already to the ticket in case you want to take a look 
before it is finalized.

Regarding support for strings, these are the considerations we have made so far:
 * KLL sketches support only _float_: we could of course use an encoding 
respecting the lexicographical ordering of strings,
 * there is no general way to use KLL sketches for equality predicates: they 
are naturally tailored for range predicates, because for equality we need a 
notion of "immediate predecessor/successor" to get the cardinality of range 
_<pred(elem), elem>_ or _<elem, succ(elem)>_, and this is trivial only for data 
types mapping to the integer family (due to how 
[getCDF()|https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/kll/KllFloatsSketch.html#getCDF-float:A-]
 method works),
 * strings seem to be more frequently involved in equality predicates, for 
which 
[ItemsSketch|https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html]
 is more suitable, we are exploring this angle in a parallel on-going project


> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>
> Hive does not support histogram statistics, which are particularly useful for 
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a 
> hard-coded value of 1/3 (see 
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column 
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form 
> table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data 
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. 
> Datasketches are small, stateful programs that process massive data-streams 
> and can provide approximate answers, with mathematical guarantees, to 
> computationally difficult queries orders-of-magnitude faster than 
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution 
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric 
> families) and temporal data types (date and timestamp).
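
As an illustration of why a fixed selectivity is a poor fit for skewed data, 
the following minimal Python sketch compares the hard-coded 1/3 mentioned in 
the description with a CDF-based estimate. It is not the proposed 
implementation: an exact empirical CDF stands in for the KLL sketch, and the 
data and function names are hypothetical.

```python
from bisect import bisect_right

# The hard-coded range selectivity mentioned in the description.
FIXED_RANGE_SELECTIVITY = 1 / 3


def cdf_selectivity_leq(sorted_values, bound):
    """Estimated selectivity of 'col <= bound' from an (here exact) CDF."""
    return bisect_right(sorted_values, bound) / len(sorted_values)


# Hypothetical skewed column: 90 rows hold the value 1, 10 rows hold 2..11.
skewed = sorted([1] * 90 + list(range(2, 12)))

# True selectivity of the predicate 'col <= 1', computed by scanning the data.
actual = sum(1 for v in skewed if v <= 1) / len(skewed)

# CDF-based estimate of the same predicate.
estimate = cdf_selectivity_leq(skewed, 1)

# Error of each approach relative to the true selectivity.
fixed_error = abs(FIXED_RANGE_SELECTIVITY - actual)
cdf_error = abs(estimate - actual)
```

With a real sketch the CDF-based estimate would be approximate rather than 
exact, with an error governed by the sketch's accuracy parameter.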



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
