[ 
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831286
 ]

ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Dec/22 07:28
            Start Date: 06/Dec/22 07:28
    Worklog Time Spent: 10m 
      Work Description: dengzhhu653 commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1040592119


##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/StatisticsTestUtils.java:
##########
@@ -109,4 +135,116 @@ public static HyperLogLog createHll(String... values) {
     }
     return hll;
   }
+
+  /**
+   * Creates an HLL object initialized with the given values.
+   * @param values the values to be added
+   * @return an HLL object initialized with the given values.
+   */
+  public static HyperLogLog createHll(double... values) {
+    HyperLogLog hll = HyperLogLog.builder().build();
+    Arrays.stream(values).forEach(hll::addDouble);
+    return hll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(float... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (float value : values) {
+      kll.update(value);
+    }
+    return kll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(double... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (double value : values) {
+      kll.update(Double.valueOf(value).floatValue());
+    }
+    return kll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(long... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (long value : values) {
+      kll.update(value);
+    }
+    return kll;
+  }
+
+  /**
+   * Checks if expected and computed statistics data are equal.
+   * @param expected expected statistics data
+   * @param computed computed statistics data
+   */
+  public static void assertEqualStatistics(ColumnStatisticsData expected, ColumnStatisticsData computed) {
+    if (expected.getSetField() != computed.getSetField()) {
+      throw new IllegalArgumentException("Expected data is of type " + expected.getSetField()
+          + " while computed data is of type " + computed.getSetField());
+    }
+
+    Class<?> dataClass = null;
+    switch (expected.getSetField()) {
+    case DATE_STATS:
+      dataClass = DateColumnStatsData.class;
+      break;
+    case LONG_STATS:
+      dataClass = LongColumnStatsData.class;
+      break;
+    case DOUBLE_STATS:
+      dataClass = DoubleColumnStatsData.class;
+      break;
+    case DECIMAL_STATS:
+      dataClass = DecimalColumnStatsData.class;
+      break;
+    case TIMESTAMP_STATS:
+      dataClass = TimestampColumnStatsData.class;
+      break;
+    default:
+      // it's an unsupported class for KLL, no special treatment needed
+      Assert.assertEquals(expected, computed);
+      return;
+    }
+    assertEqualStatistics(expected, computed, dataClass);
+  }
+
+  private static <X> void assertEqualStatistics(

Review Comment:
   This function only compares the `histogram`, so it tells us little when 
either `computedHasHistograms` or `expectedHasHistograms` is false. Could we 
compare the `ColumnStatisticsData` with `Assert.assertEquals(expected, 
computed);`, as we did in Line 219?
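
To make the concern above concrete, here is a minimal, self-contained sketch. It does not use the Hive classes: `Stats`, `histogramOnlyMatch` and `fullMatch` are hypothetical stand-ins, and the histogram-only branch is an assumption about how such a check might behave, not a copy of the method under review.

```java
import java.util.Arrays;

public final class StatsComparisonSketch {

  /** Hypothetical stand-in for a statistics object; not a Hive class. */
  public static final class Stats {
    final long numNulls;
    final long numDVs;
    final float[] histogram; // null means "no histogram"

    public Stats(long numNulls, long numDVs, float[] histogram) {
      this.numNulls = numNulls;
      this.numDVs = numDVs;
      this.histogram = histogram;
    }
  }

  /** Compares histograms only, and only when both sides have one (assumed behavior). */
  public static boolean histogramOnlyMatch(Stats expected, Stats computed) {
    if (expected.histogram == null || computed.histogram == null) {
      return true; // nothing is compared, so differences elsewhere go unnoticed
    }
    return Arrays.equals(expected.histogram, computed.histogram);
  }

  /** Compares every field, including whether a histogram is present. */
  public static boolean fullMatch(Stats expected, Stats computed) {
    return expected.numNulls == computed.numNulls
        && expected.numDVs == computed.numDVs
        && Arrays.equals(expected.histogram, computed.histogram);
  }

  public static void main(String[] args) {
    // Neither side has a histogram, but the other fields differ.
    Stats expected = new Stats(0, 10, null);
    Stats computed = new Stats(5, 3, null);
    System.out.println(histogramOnlyMatch(expected, computed)); // true: mismatch is missed
    System.out.println(fullMatch(expected, computed));          // false: mismatch is caught
  }
}
```

In this sketch the histogram-only check passes for two objects that clearly differ, which is the situation a whole-object equality assertion would catch.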





Issue Time Tracking
-------------------

    Worklog Id:     (was: 831286)
    Time Spent: 4.5h  (was: 4h 20m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for 
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a 
> hard-coded value of 1/3 (see 
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column 
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form 
> table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates the [KLL data 
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. 
> Datasketches are small, stateful programs that process massive data-streams 
> and can provide approximate answers, with mathematical guarantees, to 
> computationally difficult queries orders-of-magnitude faster than 
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution 
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric 
> families) and temporal data types (date and timestamp).
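
As an illustration of why a CDF helps with range predicates, the following sketch estimates selectivity as the difference of two CDF values. It is a simplified stand-in under stated assumptions: the class `CdfSelectivity` and its methods are hypothetical, and an exact CDF over a sorted array replaces the approximate CDF a KLL sketch would provide. It uses neither the DataSketches nor the Hive API.

```java
import java.util.Arrays;

/** Hypothetical illustration; not part of Hive or DataSketches. */
public final class CdfSelectivity {

  private CdfSelectivity() {}

  /** Fraction of values <= x in a sorted array (an exact empirical CDF). */
  public static double cdf(double[] sorted, double x) {
    if (sorted.length == 0) {
      return 0.0;
    }
    // Binary search for the first index holding a value greater than x.
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= x) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return (double) lo / sorted.length;
  }

  /** Estimated selectivity of the predicate lower < col <= upper. */
  public static double selectivity(double[] sorted, double lower, double upper) {
    return Math.max(0.0, cdf(sorted, upper) - cdf(sorted, lower));
  }

  public static void main(String[] args) {
    // Skewed column: most values are small, a few are very large.
    double[] sorted = {1, 1, 1, 1, 2, 2, 3, 50, 100, 1000};
    Arrays.sort(sorted);
    // Narrow range over the dense region versus a wide range over the sparse tail.
    System.out.println(selectivity(sorted, 0, 3));
    System.out.println(selectivity(sorted, 3, 1000));
  }
}
```

On this skewed sample the narrow range (0, 3] selects about 70% of the rows and the much wider range (3, 1000] about 30%, whereas a fixed 1/3 estimate would treat both predicates the same.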



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
