kasakrisz commented on code in PR #6525:
URL: https://github.com/apache/hive/pull/6525#discussion_r3395275563


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -1993,6 +1982,15 @@ public static boolean checkCanProvideColumnStats(Table table) {
     return !table.isNonNative() || table.getStorageHandler().canProvideColStatistics(table);
   }
 
+  /**
+   * Whether a table's statistics may be used for plan-shape optimizations such as semijoin
+   * reduction or map-join conversion, where relying on stale stats only affects performance,
+   * never correctness.
+   */
+  public static boolean checkCanProvideStatsForOpt(Table table) {
+    return checkCanProvideStats(table) || StatsSetupConst.areBasicStatsUptoDate(table.getParameters());

Review Comment:
   How is statistics accuracy ensured for external tables?
   By their very nature, external tables can be modified by third-party tools 
outside of Hive. In these scenarios, the metadata stored in the Hive Metastore 
that indicates statistics accuracy is not updated. Consequently, we could end 
up with a false "stats are accurate" signal.
   
   Please consider the following scenario:
   1. An external table is created.
   2. 1M records are inserted via Hive -> statistics are accurate.
   3. A query requiring the Dynamic Partition Pruning (DPP) optimization is 
executed -> with this patch, the query might run faster.
   4. Additional data files containing 10M records are added directly to the 
table path to simulate a third-party tool. At this point, the table actually 
contains 11M records, but the metadata stats still report only 1M and are 
still marked as accurate. Whether or not DPP is triggered then becomes 
irrelevant, because the plan is built from the stale 1M-row count and cannot 
reflect the much larger actual data volume.
   
   Given this risk, I don't see the benefit of this change.
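
The scenario above can be sketched as a small, self-contained Java model. This is only an illustration under stated assumptions, not Hive code: `StaleStatsSketch`, `ExternalTable`, and its methods are hypothetical names, and the `statsAccurate` parameter is a simplified stand-in for the accuracy marker that Hive keeps in the table parameters. It shows the flag and the row count staying unchanged after files are added outside Hive.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleStatsSketch {

    // Hypothetical stand-in for an external table: metastore parameters
    // plus the row count that really exists on storage.
    static final class ExternalTable {
        final Map<String, String> params = new HashMap<>();
        long actualRows = 0;

        // A write that goes through Hive updates data and metadata together.
        void insertViaHive(long rows) {
            actualRows += rows;
            params.put("numRows", Long.toString(actualRows));
            params.put("statsAccurate", "true"); // simplified accuracy marker (assumption)
        }

        // A third-party tool drops files into the table path:
        // the data changes, the metastore parameters do not.
        void addFilesExternally(long rows) {
            actualRows += rows;
        }

        boolean statsMarkedAccurate() {
            return Boolean.parseBoolean(params.getOrDefault("statsAccurate", "false"));
        }

        long reportedRows() {
            return Long.parseLong(params.getOrDefault("numRows", "0"));
        }
    }

    public static void main(String[] args) {
        ExternalTable table = new ExternalTable();
        table.insertViaHive(1_000_000L);       // step 2: stats are accurate
        table.addFilesExternally(10_000_000L); // step 4: stats silently go stale

        System.out.println("marked accurate: " + table.statsMarkedAccurate());
        System.out.println("reported rows:   " + table.reportedRows());
        System.out.println("actual rows:     " + table.actualRows);
    }
}
```

In this model, a check that trusts the accuracy marker alone would still accept the table after step 4, even though the reported row count no longer matches the data.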



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

