[ https://issues.apache.org/jira/browse/HIVE-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560941#comment-17560941 ]
Alessandro Solimando edited comment on HIVE-26221 at 6/30/22 9:12 AM:
----------------------------------------------------------------------

Thanks [~Chunwei Lei] for your interest. There is already a WIP PR, which is almost ready for review (I need to fix a conflict and update some test output files). I have linked it to the ticket in case you want to take a look before it is finalized.

Regarding support for strings, these are the considerations we have made so far:
* KLL sketches support only _float_: we could of course use an encoding respecting the lexicographical ordering of strings,
* there is no general way to use KLL sketches for equality predicates: they are naturally tailored for range predicates, because for equality we need a notion of "immediate predecessor/successor" to get the cardinality of the range _<pred(elem), elem>_ or _<elem, succ(elem)>_, and this is trivial only for data types mapping to the integer family (due to how the [getCDF(float[] splitPoints)|https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/kll/KllFloatsSketch.html#getCDF-float:A-] method works),
* strings seem to be more frequently involved in equality predicates, for which [ItemsSketch|https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html] is more suitable; we are exploring this angle in a parallel ongoing project.


> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored into the Hive metastore at the table (or partition) level.
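
To illustrate how a CDF could replace the fixed 1/3 estimate for range predicates, and why equality is only straightforward for integer-like types, here is a minimal sketch. It is not the code from the PR and it does not use the DataSketches library: an exact empirical CDF over a sorted array stands in for the approximate CDF a KLL sketch would provide. The class and method names (CdfSelectivitySketch, rangeSelectivity, integerEqualitySelectivity) are hypothetical, and the CDF here is assumed to count values strictly below the split point.

```java
import java.util.Arrays;

/**
 * Illustration only: an exact empirical CDF standing in for a KLL sketch.
 * All names are hypothetical; none of this is Hive or DataSketches API.
 */
final class CdfSelectivitySketch {
    private final float[] sorted;

    CdfSelectivitySketch(float[] values) {
        this.sorted = values.clone();
        Arrays.sort(this.sorted);
    }

    /** Fraction of values strictly less than x (assumed CDF convention). */
    double cdf(float x) {
        if (sorted.length == 0) {
            return 0.0;
        }
        // Binary search for the first index whose value is >= x.
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    /** Selectivity of "lower <= col < upper" as a difference of two CDF values. */
    double rangeSelectivity(float lower, float upper) {
        return Math.max(0.0, cdf(upper) - cdf(lower));
    }

    /**
     * Selectivity of "col = v" for an integer-valued column, using the range
     * [v, succ(v)) with succ(v) = v + 1. Assumes |v| < 2^24 so that v and
     * v + 1 are exactly representable as float.
     */
    double integerEqualitySelectivity(int v) {
        return rangeSelectivity((float) v, (float) v + 1f);
    }

    public static void main(String[] args) {
        // Skewed integer column: eight 1s, one 2, one 3.
        float[] column = {1, 1, 1, 1, 1, 1, 1, 1, 2, 3};
        CdfSelectivitySketch stats = new CdfSelectivitySketch(column);
        System.out.println("col >= 2 AND col < 4: " + stats.rangeSelectivity(2f, 4f));
        System.out.println("col = 1: " + stats.integerEqualitySelectivity(1));
    }
}
```

On this skewed column the range 2 <= col < 4 matches 2 of 10 rows, well below the fixed 1/3, while col = 1 matches 8 of 10. For strings or arbitrary floats there is no obvious succ(v), which is the limitation described in the comment above.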
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
> Datasketches are small, stateful programs that process massive data-streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders-of-magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)