[jira] [Work logged] (HIVE-27000) Improve the modularity of the *ColumnStatsMerger classes

ASF GitHub Bot (Jira) Tue, 31 Jan 2023 10:13:08 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-27000?focusedWorklogId=842662&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-842662
 ]


ASF GitHub Bot logged work on HIVE-27000:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 31/Jan/23 18:12
            Start Date: 31/Jan/23 18:12
    Worklog Time Spent: 10m 
      Work Description: asolimando commented on code in PR #3997:
URL: https://github.com/apache/hive/pull/3997#discussion_r1092311907


##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DateColumnStatsMerger.java:
##########
@@ -43,64 +46,57 @@ public void merge(ColumnStatisticsObj aggregateColStats, 
ColumnStatisticsObj new
     DateColumnStatsDataInspector aggregateData = 
dateInspectorFromStats(aggregateColStats);
     DateColumnStatsDataInspector newData = dateInspectorFromStats(newColStats);
 
-    setLowValue(aggregateData, newData);
-    setHighValue(aggregateData, newData);
-
-    aggregateData.setNumNulls(aggregateData.getNumNulls() + 
newData.getNumNulls());
-    if (aggregateData.getNdvEstimator() == null || newData.getNdvEstimator() 
== null) {
-      aggregateData.setNumDVs(Math.max(aggregateData.getNumDVs(), 
newData.getNumDVs()));
-    } else {
-      NumDistinctValueEstimator oldEst = aggregateData.getNdvEstimator();
-      NumDistinctValueEstimator newEst = newData.getNdvEstimator();
-      final long ndv;
-      if (oldEst.canMerge(newEst)) {
-        oldEst.mergeEstimators(newEst);
-        ndv = oldEst.estimateNumDistinctValues();
-        aggregateData.setNdvEstimator(oldEst);
-      } else {
-        ndv = Math.max(aggregateData.getNumDVs(), newData.getNumDVs());
-      }
-      LOG.debug("Use bitvector to merge column {}'s ndvs of {} and {} to be 
{}", aggregateColStats.getColName(),
-          aggregateData.getNumDVs(), newData.getNumDVs(), ndv);
-      aggregateData.setNumDVs(ndv);
+    Date lowValue = mergeLowValue(getLowValue(aggregateData), 
getLowValue(newData));
+    if (lowValue != null) {
+      aggregateData.setLowValue(lowValue);
+    }
+    Date highValue = mergeHighValue(getHighValue(aggregateData), 
getHighValue(newData));
+    if (highValue != null) {

Review Comment:
   Thanks @akshat0395 for your comment, you are right, but unfortunately the 
problem lies in the `*ColumnStatsDataInspector` objects, which inherits from 
the corresponding `*ColumnStatsData`, which don't have a common ancestor.
   
   For this reason it's hard to write a generic method, we would need to rely 
on generic for the   data type (say `Double`) and anyway a big switch for 
handling the *ColumnStatsDataInspector objects and casting, it would rapidly 
defeat the purpose.
   
   This lack of class hierarchy for column statistics comes directly from the 
thrift interface objects for the column statistics in the metastore (see 
[here](https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift)).
   
   HIVE-26313 could be a chance to overcome that, and define a better hierarchy 
for those classes, since we won't anyway be having a 1-1 mapping from a column 
in the metastore for column statistics and a Java class, but we will have to 
parse a (probably Json) blob and do the "mapping" ourselves.
   
   In summary, I'd rather leave this for when we will have a better class 
hierarchy for those objects (hopefully with HIVE-26313). WDYT?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 842662)
    Time Spent: 1h  (was: 50m)

> Improve the modularity of the *ColumnStatsMerger classes
> --------------------------------------------------------
>
>                 Key: HIVE-27000
>                 URL: https://issues.apache.org/jira/browse/HIVE-27000
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> *ColumnStatsMerger classes contain a lot of duplicate code which is not 
> specific to the data type, and that could therefore be lifted to a common 
> parent class.
> This phenomenon is bound to become even worse if we keep enriching further 
> our supported set of statistics as we did in the context of HIVE-26221.
> The current ticket aims at improving the modularity and code reuse of the 
> *ColumnStatsMerger classes, while improving unit-test coverage to cover all 
> classes and support more use-cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Work logged] (HIVE-27000) Improve the modularity of the *ColumnStatsMerger classes

Reply via email to