[GitHub] [hive] dengzhhu653 commented on a diff in pull request #3137: HIVE-26221: Add histogram-based column statistics

GitBox Mon, 05 Dec 2022 00:52:36 -0800


dengzhhu653 commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1039312469



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java:
##########
@@ -51,73 +52,94 @@ public ColumnStatisticsObj 
aggregate(List<ColStatsObjWithSourceInfo> colStatsWit
     checkStatisticsList(colStatsWithSourceInfo);
 
     ColumnStatisticsObj statsObj = null;
-    String colType = null;
+    String colType;
     String colName = null;
     // check if all the ColumnStatisticsObjs contain stats and all the ndv are
     // bitvectors
     boolean doAllPartitionContainStats = partNames.size() == 
colStatsWithSourceInfo.size();
     NumDistinctValueEstimator ndvEstimator = null;
+    KllHistogramEstimator histogramEstimator = null;
+    boolean areAllNDVEstimatorsMergeable = true;
+    boolean areAllHistogramEstimatorsMergeable = true;
     for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) {
       ColumnStatisticsObj cso = csp.getColStatsObj();
       if (statsObj == null) {
         colName = cso.getColName();
         colType = cso.getColType();
         statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, 
colType,
             cso.getStatsData().getSetField());
-        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName,
-            doAllPartitionContainStats);
+        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, 
doAllPartitionContainStats);
       }
-      LongColumnStatsDataInspector longColumnStatsData = 
longInspectorFromStats(cso);
-      if (longColumnStatsData.getNdvEstimator() == null) {
-        ndvEstimator = null;
-        break;
-      } else {
-        // check if all of the bit vectors can merge
-        NumDistinctValueEstimator estimator = 
longColumnStatsData.getNdvEstimator();
+      LongColumnStatsDataInspector columnStatsData = 
longInspectorFromStats(cso);
+
+      // check if we can merge NDV estimators
+      if (columnStatsData.getNdvEstimator() == null) {
+        areAllNDVEstimatorsMergeable = false;
+      } else if (areAllNDVEstimatorsMergeable) {
+        NumDistinctValueEstimator estimator = 
columnStatsData.getNdvEstimator();
         if (ndvEstimator == null) {
           ndvEstimator = estimator;
         } else {
-          if (ndvEstimator.canMerge(estimator)) {
-            continue;
-          } else {
-            ndvEstimator = null;
-            break;
+          if (!ndvEstimator.canMerge(estimator)) {
+            areAllNDVEstimatorsMergeable = false;
+          }
+        }
+      }
+      // check if we can merge histogram estimators
+      if (columnStatsData.getHistogramEstimator() == null) {

Review Comment:
   To keep things simple, can we call 
   ```java
    // merge what can be merged and keep the one with the biggest cardinality
         KllHistogramEstimator mergedKllHistogramEstimator = 
mergeHistograms(colStatsWithSourceInfo);
         if (mergedKllHistogramEstimator != null) {
           
columnStatisticsData.getLongStats().setHistogram(mergedKllHistogramEstimator.serialize());
         }
   ``` 
   directly to aggregate the histogram statistics instead of introducing 
`areAllHistogramEstimatorsMergeable` and `histogramEstimator` via iterating 
over the `colStatsWithSourceInfo`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hive] dengzhhu653 commented on a diff in pull request #3137: HIVE-26221: Add histogram-based column statistics

Reply via email to