ajantha-bhat commented on code in PR #4674:
URL: https://github.com/apache/iceberg/pull/4674#discussion_r862473883


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteReachableFilesSparkAction.java:
##########
@@ -125,8 +126,9 @@ private Result doExecute() {
 
   private Dataset<Row> buildReachableFileDF(TableMetadata metadata) {
     Table staticTable = newStaticTable(metadata, io);
-    return withFileType(buildValidContentFileDF(staticTable), CONTENT_FILE)
-        .union(withFileType(buildManifestFileDF(staticTable), MANIFEST))
+    Dataset<Row> allManifests = loadMetadataTable(staticTable, ALL_MANIFESTS);
+    return withFileType(buildValidContentFileDF(staticTable, allManifests), CONTENT_FILE)
+        .union(withFileType(buildManifestFileDF(allManifests), MANIFEST))

Review Comment:
   @RussellSpitzer : I see what you meant now.
   
   The only way I can see to solve this is to collect the dataset as a list of rows, build a new dataset from those rows, and use that dataset in both locations. That way the manifest list is read only once. WDYT? Any other solution?
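   The collect-and-rebuild idea above could be sketched roughly as follows. This is a hypothetical helper (`materializeOnce` is not part of the PR); it just illustrates materializing the `ALL_MANIFESTS` dataset once on the driver and rebuilding a local `Dataset<Row>` that both downstream callers can reuse without re-reading the manifest list:
   
   ```java
   import java.util.List;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   // Hypothetical sketch: trigger the manifest-list read exactly once by
   // collecting the rows to the driver, then rebuild a Dataset backed by
   // those in-memory rows so later unions do not re-scan the source.
   public class CollectOnceSketch {
     static Dataset<Row> materializeOnce(SparkSession spark, Dataset<Row> allManifests) {
       // Single read: rows are now held locally on the driver.
       List<Row> rows = allManifests.collectAsList();
       // Rebuild with the same schema; subsequent uses read from memory only.
       return spark.createDataFrame(rows, allManifests.schema());
     }
   }
   ```
   
   One caveat with this approach is that `collectAsList()` pulls everything to the driver, so it only works if the manifest metadata is small enough to fit there.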



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

