Re: [PR] HIVE-28080: Propagate statistics from a source table to the materialized CTE [hive]

via GitHub Tue, 20 Feb 2024 02:51:48 -0800


kasakrisz commented on code in PR #5089:
URL: https://github.com/apache/hive/pull/5089#discussion_r1495606754



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -8174,6 +8176,9 @@ protected Operator genFileSinkPlan(String dest, QB qb, Operator input)
 
     FileSinkOperator fso = (FileSinkOperator) output;
     fso.getConf().setTable(destinationTable);
+    if (destTableIsMaterialization) {
+      ctx.addMaterializedTableSource(destinationTable.getFullTableName(), fso);

Review Comment:
   Can this be merged with `Context.addMaterializedTable` and called in `materializeCTE`?
   `SemanticAnalyzer` has `getSinkOp()`, so we can get the `FileSinkOperator` later from the analyzer instance.
   
   Is the whole `FileSinkOperator` needed?
   ```
   Context.addMaterializedTable(String cteName, Table table, Statistics statistics);
   ```
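
To make the suggestion concrete, here is a minimal, self-contained sketch of a single registration point that stores only the statistics instead of the whole `FileSinkOperator`. `ContextSketch`, `StatsSketch`, and `getMaterializedTableStats` are hypothetical stand-ins written for this comment, not the real Hive classes or methods, and the table is modeled as a plain `String` name to keep the example runnable on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hive's Statistics; only a row count is modeled.
final class StatsSketch {
  final long numRows;

  StatsSketch(long numRows) {
    this.numRows = numRows;
  }
}

// Hypothetical stand-in for the part of Context that tracks materialized CTEs.
final class ContextSketch {
  private final Map<String, String> cteNameToTable = new HashMap<>();
  private final Map<String, StatsSketch> tableToStats = new HashMap<>();

  // One call registers both the table and the statistics of the plan that fills it.
  void addMaterializedTable(String cteName, String fullTableName, StatsSketch statistics) {
    cteNameToTable.put(cteName, fullTableName);
    tableToStats.put(fullTableName, statistics);
  }

  String getMaterializedTable(String cteName) {
    return cteNameToTable.get(cteName);
  }

  // Assumed accessor: looked up later when the scan over the materialized table is planned.
  StatsSketch getMaterializedTableStats(String fullTableName) {
    return tableToStats.get(fullTableName);
  }
}
```

With this shape, `materializeCTE` would be the only caller, and `genFileSinkPlan` would not need to touch the context at all.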
   



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -12143,6 +12148,27 @@ samplePredicate, true, new SampleDesc(ts.getNumerator(),
 
     Operator output = putOpInsertMap(op, rwsch);
 
+    if (tab.isMaterializedTable()) {
+      final FileSinkOperator source = ctx.getMaterializedTableSource(tab.getFullTableName());
+      final Statistics stats = source.getStatistics().clone();
+      final List<ColStatistics> sourceColStatsList = stats.getColumnStats();
+      final List<String> colNames = tab.getCols().stream().map(FieldSchema::getName).collect(Collectors.toList());
+      if (sourceColStatsList.size() != colNames.size()) {
+        throw new IllegalStateException(String.format(
+            "The size of col stats must be equal to that of schema. Stats = %s, Schema = %s",
+            sourceColStatsList, colNames));
+      }
+      final List<ColStatistics> colStatsList = new ArrayList<>(sourceColStatsList.size());
+      for (int i = 0; i < sourceColStatsList.size(); i++) {
+        final ColStatistics colStats = sourceColStatsList.get(i).clone();
+        // FileSinkOperator stores column stats with internal names such as "_col1"
+        colStats.setColumnName(colNames.get(i));
+        colStatsList.add(colStats);
+      }
+      stats.setColumnStats(colStatsList);

Review Comment:
   What is the reason for cloning the `ColStatistics` objects again?
   `Statistics.clone()` already clones them:
   
https://github.com/apache/hive/blob/5b76949da6fe65364a4e3766680871167131157f/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L210-L216
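
A minimal sketch of the alternative, assuming the outer `clone()` already deep-copies the per-column entries: the copies can then be renamed in place without a second `clone()`. `ColStatsSketch`, `StatsHolderSketch`, and `RenameSketch` are hypothetical stand-ins, not the Hive classes, and the sketch assumes the column stats live in a plain list. If the real class indexes them by column name, the renamed entries would still need to be re-registered, for example through `setColumnStats`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ColStatistics: a mutable name plus one statistic.
final class ColStatsSketch implements Cloneable {
  private String columnName;
  private final long countDistinct;

  ColStatsSketch(String columnName, long countDistinct) {
    this.columnName = columnName;
    this.countDistinct = countDistinct;
  }

  String getColumnName() { return columnName; }
  void setColumnName(String columnName) { this.columnName = columnName; }
  long getCountDistinct() { return countDistinct; }

  @Override
  public ColStatsSketch clone() {
    return new ColStatsSketch(columnName, countDistinct);
  }
}

// Hypothetical stand-in for Statistics whose clone() deep-copies every column entry.
final class StatsHolderSketch implements Cloneable {
  private final List<ColStatsSketch> columnStats;

  StatsHolderSketch(List<ColStatsSketch> columnStats) {
    this.columnStats = columnStats;
  }

  List<ColStatsSketch> getColumnStats() { return columnStats; }

  @Override
  public StatsHolderSketch clone() {
    List<ColStatsSketch> copy = new ArrayList<>(columnStats.size());
    for (ColStatsSketch cs : columnStats) {
      copy.add(cs.clone());
    }
    return new StatsHolderSketch(copy);
  }
}

final class RenameSketch {
  // Clones once, then maps internal names such as "_col1" to the schema names by position.
  static StatsHolderSketch renameToSchema(StatsHolderSketch source, List<String> colNames) {
    StatsHolderSketch stats = source.clone();
    List<ColStatsSketch> cols = stats.getColumnStats();
    if (cols.size() != colNames.size()) {
      throw new IllegalStateException("The size of col stats must be equal to that of schema");
    }
    for (int i = 0; i < cols.size(); i++) {
      cols.get(i).setColumnName(colNames.get(i)); // safe: these are already copies
    }
    return stats;
  }
}
```

The source statistics keep their internal names because only the cloned entries are mutated.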



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

