soumyakanti3578 commented on code in PR #6496:
URL: https://github.com/apache/hive/pull/6496#discussion_r3321143557
##########
ql/src/test/org/apache/hadoop/hive/ql/stats/TestStatsUtils.java:
##########
@@ -565,4 +569,44 @@ void testGetColStatisticsTimestampType() {
assertEquals(1700000000L, range.maxValue.longValue(), "maxValue mismatch
for TIMESTAMP");
}
+ @Test
+ void testEstimateStatsForMissingColsHandlesEmptyList() {
+ HiveConf conf = new HiveConf();
+
+ ColumnInfo columnInfoA = new ColumnInfo("a", TypeInfoFactory.intTypeInfo,
"t", false);
+
+ List<ColStatistics> allColumnStats =
StatsUtils.estimateStatsForMissingCols(
+ List.of("a"), Collections.emptyList(), conf, 0, List.of(columnInfoA));
+
+ assertEquals(1, allColumnStats.size());
+ }
+
+ @Test
+ void testEstimateStatsForMissingColsCombinesExistingStatsAndEstimations() {
+ HiveConf conf = new HiveConf();
+
+ ColumnInfo colNeededButNotExists = new ColumnInfo("neededButNotExists",
TypeInfoFactory.intTypeInfo, "t", false);
+ ColumnInfo colNeededAndExists = new ColumnInfo("neededAndExists",
TypeInfoFactory.intTypeInfo, "t", false);
+ ColumnInfo colNotNeededButExists = new ColumnInfo("notNeededButExists",
TypeInfoFactory.intTypeInfo, "t", false);
+ ColumnInfo colNotNeededNotExists = new ColumnInfo("notNeededNotExists",
TypeInfoFactory.intTypeInfo, "t", false);
+
+ ColStatistics colStatNeededAndExists = new ColStatistics();
+ colStatNeededAndExists.setColumnName(colNeededAndExists.getInternalName());
+ ColStatistics colStatNotNeededButExists = new ColStatistics();
+
colStatNotNeededButExists.setColumnName(colNotNeededButExists.getInternalName());
+
+ List<ColStatistics> allColumnStats =
StatsUtils.estimateStatsForMissingCols(
+ List.of(colNeededAndExists.getInternalName(),
colNeededButNotExists.getInternalName()),
+ List.of(colStatNeededAndExists, colStatNotNeededButExists),
+ conf,
+ 0,
+ List.of(colNeededButNotExists, colNeededAndExists,
colNotNeededButExists, colNotNeededNotExists));
+
+ assertEquals(3, allColumnStats.size());
+ assertEquals(allColumnStats.get(0), colStatNeededAndExists);
+ assertEquals(allColumnStats.get(1), colStatNotNeededButExists);
+ assertTrue(allColumnStats.get(2).isEstimated());
+ assertEquals(allColumnStats.get(2).getColumnName(),
colNeededButNotExists.getInternalName());
Review Comment:
nit: I think except the first `assertEquals` all of them have the order of
arguments flipped?
##########
ql/src/test/org/apache/hadoop/hive/ql/stats/TestStatsUtils.java:
##########
@@ -565,4 +569,44 @@ void testGetColStatisticsTimestampType() {
assertEquals(1700000000L, range.maxValue.longValue(), "maxValue mismatch
for TIMESTAMP");
}
+ @Test
+ void testEstimateStatsForMissingColsHandlesEmptyList() {
+ HiveConf conf = new HiveConf();
+
+ ColumnInfo columnInfoA = new ColumnInfo("a", TypeInfoFactory.intTypeInfo,
"t", false);
+
+ List<ColStatistics> allColumnStats =
StatsUtils.estimateStatsForMissingCols(
+ List.of("a"), Collections.emptyList(), conf, 0, List.of(columnInfoA));
Review Comment:
Do you think we can have a situation where `neededColumnNames` is empty and
`existingColStats` is non-empty? If yes, we should probably add a test for that
too?
##########
ql/src/test/org/apache/hadoop/hive/ql/stats/TestStatsUtils.java:
##########
@@ -18,6 +18,8 @@
package org.apache.hadoop.hive.ql.stats;
+import static org.junit.Assert.assertFalse;
Review Comment:
I think this is unused, as highlighted by sonar too.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -239,22 +239,33 @@ public static long getNumRows(HiveConf conf,
List<ColumnInfo> schema, Table tabl
return aggregateStat.getNumRows();
}
- private static void estimateStatsForMissingCols(List<String> neededColumns,
List<ColStatistics> columnStats,
- HiveConf conf, long nr, List<ColumnInfo> schema) {
+ /**
+ * Estimates column statistics for columns specified in {@code
neededColumnNames}
+ * that do not already have statistics in the {@code existingColStats} list.
+ *
+ * @return A {@link List} of {@link ColStatistics} objects containing
+ * both the provided existing statistics and the newly estimated ones.
+ */
+ static List<ColStatistics> estimateStatsForMissingCols(
+ List<String> neededColumnNames, List<ColStatistics> existingColStats,
HiveConf conf, long nr,
+ List<ColumnInfo> schema) {
- Set<String> neededCols = new HashSet<>(neededColumns);
- Set<String> colsWithStats = new HashSet<>();
+ Set<String> neededCols = new HashSet<>(neededColumnNames);
+ Set<String> columnNamesWithStats = new HashSet<>(existingColStats.size());
Review Comment:
nit: Sonar's suggestion of `HashSet.newHashSet(int numMappings)` is more
efficient. Not a blocker for this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]