Re: [PR] Spark 4.0: Fix source location in stats file copy plan in RewriteTablePathSparkAction [iceberg]

via GitHub Thu, 21 Aug 2025 17:16:44 -0700


dramaticlly commented on code in PR #13881:
URL: https://github.com/apache/iceberg/pull/13881#discussion_r2291672854



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java:
##########
@@ -963,6 +963,59 @@ public void testTableWithManyStatisticFiles() throws IOException {
         iterations * 2 + 1, iterations, iterations, iterations, iterations * 6 + 1, result);
   }
 
+  @Test
+  public void testStatisticsFileSourcePath() throws IOException {
+    String sourceTableLocation = newTableLocation();
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put("format-version", "2");
+    String tableName = "v2tblwithstats";
+    Table sourceTable =
+        createMetastoreTable(sourceTableLocation, properties, "default", tableName, 1);
+
+    // Compute table statistics to generate a .stats file
+    actions().computeTableStats(sourceTable).execute();
+
+    assertThat(sourceTable.statisticsFiles())
+        .as("Should include 1 statistics file after compute stats")
+        .hasSize(1);
+
+    String targetTableLocation = targetTableLocation();
+    RewriteTablePath.Result result =
+        actions()
+            .rewriteTablePath(sourceTable)
+            .rewriteLocationPrefix(sourceTableLocation, targetTableLocation)
+            .execute();
+
+    checkFileNum(3, 1, 1, 1, 7, result);
+
+    // Read the file list to verify statistics file paths
+    List<Tuple2<String, String>> filesToMove = readPathPairList(result.fileListLocation());
+
+    // Find the statistics file entry in the file list
+    Tuple2<String, String> statsFilePathPair = null;
+    for (Tuple2<String, String> pathPair : filesToMove) {
+      if (pathPair._1().endsWith(".stats")) {
+        statsFilePathPair = pathPair;
+        break;
+      }
+    }
+
+    assertThat(statsFilePathPair).as("Should find statistics file in file list").isNotNull();
+
+    // Verify the source path points to the actual source location, not staging
+    assertThat(statsFilePathPair._1())
+        .as("Statistics file source should point to source table location")
+        .startsWith(sourceTableLocation);
+    assertThat(statsFilePathPair._1())
+        .as("Statistics file source should NOT point to staging directory")
+        .doesNotContain("staging");

Review Comment:
   nit: these two assertions can be combined:
   
   ```java
   assertThat(statsFilePathPair._1())
       .as("Statistics file source should point to source table location, not staging")
       .startsWith(sourceTableLocation)
       .doesNotContain("staging");
   ```



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java:
##########
@@ -963,6 +963,59 @@ public void testTableWithManyStatisticFiles() throws IOException {
         iterations * 2 + 1, iterations, iterations, iterations, iterations * 6 + 1, result);
   }
 
+  @Test
+  public void testStatisticsFileSourcePath() throws IOException {
+    String sourceTableLocation = newTableLocation();
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put("format-version", "2");
+    String tableName = "v2tblwithstats";
+    Table sourceTable =
+        createMetastoreTable(sourceTableLocation, properties, "default", tableName, 1);
+
+    // Compute table statistics to generate a .stats file
+    actions().computeTableStats(sourceTable).execute();
+
+    assertThat(sourceTable.statisticsFiles())
+        .as("Should include 1 statistics file after compute stats")
+        .hasSize(1);
+
+    String targetTableLocation = targetTableLocation();
+    RewriteTablePath.Result result =
+        actions()
+            .rewriteTablePath(sourceTable)
+            .rewriteLocationPrefix(sourceTableLocation, targetTableLocation)
+            .execute();
+
+    checkFileNum(3, 1, 1, 1, 7, result);
+
+    // Read the file list to verify statistics file paths
+    List<Tuple2<String, String>> filesToMove = readPathPairList(result.fileListLocation());
+
+    // Find the statistics file entry in the file list
+    Tuple2<String, String> statsFilePathPair = null;
+    for (Tuple2<String, String> pathPair : filesToMove) {
+      if (pathPair._1().endsWith(".stats")) {
+        statsFilePathPair = pathPair;
+        break;
+      }
+    }

Review Comment:
   nit: this loop can also be replaced with a stream:
   
   ```java
   Tuple2<String, String> statsFilePathPair =
       filesToMove.stream()
           .filter(pathPair -> pathPair._1().endsWith(".stats"))
           .findFirst()
           .orElse(null);
   ```



##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##########
@@ -404,10 +404,7 @@ private Set<Pair<String, String>> statsFileCopyPlan(
       Preconditions.checkArgument(
           before.fileSizeInBytes() == after.fileSizeInBytes(),
           "Before and after path rewrite, statistic file size should be same");
-      result.add(
-          Pair.of(
-              RewriteTablePathUtil.stagingPath(before.path(), sourcePrefix, stagingDir),
-              after.path()));
+      result.add(Pair.of(before.path(), after.path()));

Review Comment:
   Good catch. I think we don't need to open and rewrite the content of the stats
   file, so it only needs to be copied from source to target, instead of from
   staging to target.
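
To illustrate the point about the copy plan, here is a minimal standalone sketch, not the Iceberg implementation. The class `StatsCopyPlanSketch`, its helper `rewritePrefix`, and the example paths are all hypothetical; the only idea taken from the change is that an unrewritten stats file is paired as (original source path, target path), with no staging path involved.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical stand-in for the copy-plan logic; it does not use any Iceberg classes.
final class StatsCopyPlanSketch {

  // Replace a leading sourcePrefix with targetPrefix (assumed prefix-swap semantics).
  static String rewritePrefix(String path, String sourcePrefix, String targetPrefix) {
    if (!path.startsWith(sourcePrefix)) {
      throw new IllegalArgumentException(
          "Path " + path + " does not start with " + sourcePrefix);
    }
    return targetPrefix + path.substring(sourcePrefix.length());
  }

  // The stats file content is not rewritten, so the copy source is the
  // original path itself rather than a path under a staging directory.
  static Map.Entry<String, String> statsCopyPair(
      String statsPath, String sourcePrefix, String targetPrefix) {
    String targetPath = rewritePrefix(statsPath, sourcePrefix, targetPrefix);
    return new SimpleImmutableEntry<>(statsPath, targetPath);
  }

  public static void main(String[] args) {
    Map.Entry<String, String> pair =
        statsCopyPair("/warehouse/src/metadata/a.stats", "/warehouse/src", "/warehouse/dst");
    // prints "/warehouse/src/metadata/a.stats -> /warehouse/dst/metadata/a.stats"
    System.out.println(pair.getKey() + " -> " + pair.getValue());
  }
}
```

Files whose content is rewritten (for example metadata files that embed absolute paths) would still need a staged copy as the source; only files copied byte-for-byte can use the original location directly.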



