Re: [PR] HIVE-29147: Iceberg: Table-level column stats filter support [hive]

via GitHub Tue, 02 Sep 2025 13:02:59 -0700


ayushtkn commented on code in PR #5724:
URL: https://github.com/apache/hive/pull/5724#discussion_r2317030550



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##########
@@ -223,18 +233,18 @@ static Snapshot getTableSnapshot(Table table, String 
snapshotRef) {
     return table.currentSnapshot();
   }
 
-  static Optional<Path> getColStatsPath(Table table) {
+  static Path getColStatsPath(Table table) {
     return getColStatsPath(table, table.currentSnapshot().snapshotId());
   }
 
-  static Optional<Path> getColStatsPath(Table table, long snapshotId) {
+  static Path getColStatsPath(Table table, long snapshotId) {

Review Comment:
   If we are changing the return type, can't we change it to String? I see we 
are either checking for `null` or during `readColStats` reading `Path`  and 
then converting it to `String`. We can avoid the path creating in this method 
itself



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -1864,19 +1861,6 @@ public void addResourcesForCreateTable(Map<String, 
String> tblProps, HiveConf hi
     }
   }
 
-  /**
-   * Check the operation type of all snapshots which are newer than the 
specified. The specified snapshot is excluded.
-   * @param hmsTable table metadata stored in Hive Metastore
-   * @param since the snapshot preceding the oldest snapshot which should be 
checked.
-   *              The value null means all should be checked.
-   * @return null if table is empty, true if all snapshots are {@link 
SnapshotContext.WriteOperationType#APPEND}s,
-   * false otherwise.
-   *
-   * @deprecated
-   * <br>Use {@link HiveStorageHandler#getSnapshotContexts(
-   * org.apache.hadoop.hive.ql.metadata.Table hmsTable, SnapshotContext since)}
-   * and check {@link SnapshotContext.WriteOperationType#APPEND}.equals({@link 
SnapshotContext#getOperation()}).
-   */

Review Comment:
   Why are you removing the javadoc here?



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -637,7 +644,7 @@ private boolean writeColStats(List<ColumnStatistics> 
colStats, Table tbl) {
       return true;
 
     } catch (Exception e) {
-      LOG.warn("Unable to invalidate or merge stats: {}", e.getMessage());
+      LOG.warn("Unable to invalidate or merge column stats: {}", 
e.getMessage());

Review Comment:
   if the stats file was generated, but we weren't able to update or write 
column stat, shouldn't we delete the ``statsPath`` as well as a cleanup here?



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -601,20 +601,27 @@ private boolean writeColStats(List<ColumnStatistics> 
colStats, Table tbl) {
         long snapshotId = tbl.currentSnapshot().snapshotId();
         long snapshotSequenceNumber = tbl.currentSnapshot().sequenceNumber();
 
-        colStats.forEach(statsObj -> {
-          byte[] serializeColStats = SerializationUtils.serialize(statsObj);
-          puffinWriter.add(
-            new Blob(
-              ColumnStatisticsObj.class.getSimpleName(),
-              ImmutableList.of(1),
-              snapshotId,
-              snapshotSequenceNumber,
-              ByteBuffer.wrap(serializeColStats),
-              PuffinCompressionCodec.NONE,
-              ImmutableMap.of("partition",
-                  String.valueOf(statsObj.getStatsDesc().getPartName()))
-            ));
+        colStats.forEach(stats -> {
+          boolean isTblLevel = stats.getStatsDesc().isIsTblLevel();
+
+          for (Serializable statsObj : isTblLevel ? stats.getStatsObj() : 
Collections.singletonList(stats)) {
+            byte[] serializeColStats = SerializationUtils.serialize(statsObj);
+            puffinWriter.add(
+              new Blob(
+                ColumnStatisticsObj.class.getSimpleName(),
+                ImmutableList.of(isTblLevel ? tbl.spec().schema().findField(
+                    ((ColumnStatisticsObj) statsObj).getColName()).fieldId() : 
1),

Review Comment:
   Is this under some lock. Curious about the case of schema evolution, like if 
the column gets dropped in the iceberg table before we fetch the table in 
``setColStatistics``



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29147: Iceberg: Table-level column stats filter support [hive]

Reply via email to