[PR] Core, Spark: Exclude non live content file in RewriteTablePathUtil [iceberg]

via GitHub Sat, 25 Jan 2025 21:12:31 -0800


dramaticlly opened a new pull request, #12006:
URL: https://github.com/apache/iceberg/pull/12006

Instead of scanning all entries in each manifest to identify the list of
content files to copy, scan only the live ones. This is essential to prevent
rewrite table path from carrying over files that were already removed by
snapshot expiration in the source table.

[Existing
logic](https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java#L473-L475)
fetches added, existing, and deleted entries from each manifest to collect the
list of content files to copy, and relies on a reducer to deduplicate them by
file name.

However, we want to avoid the scenario where a content file appears only with
deleted status in an older manifest. Snapshot expiration might have already
removed the snapshot that referenced the file, and deleted the file itself as
part of that expiration.

A concrete example to help with the explanation:
```
assume we have 3 snapshots of overwrite operation
1. 8729031490038117099 (dataSeq=1, added d2.parquet)
2. 6024975807438659167 (dataSeq=2, removed d2.parquet and added d3.parquet)
3. 4358334817990999907 (dataSeq=3, removed d3.parquet and added d4.parquet)
```

Expiring the first snapshot `8729031490038117099` removes d2.parquet from
disk, but the second snapshot `6024975807438659167` might still have a data
manifest entry with deleted status (status=2) for d2.parquet.

However, d2.parquet should not be included in the list of files for path
rewrite, since it no longer exists on disk.
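
The example above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in this PR: `LiveEntryFilter`, `Entry`, `Status`, `allFiles`, `liveFiles`, and `entriesAfterExpiration` are hypothetical names, and the entries are hard-coded to match what the manifests of the two remaining snapshots might contain after the first snapshot is expired.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LiveEntryFilter {
  // Hypothetical enum; ordinals follow the manifest entry status ids
  // (0 = EXISTING, 1 = ADDED, 2 = DELETED).
  enum Status { EXISTING, ADDED, DELETED }

  // Hypothetical stand-in for a manifest entry: a status plus a file path.
  static final class Entry {
    final Status status;
    final String path;

    Entry(Status status, String path) {
      this.status = status;
      this.path = path;
    }
  }

  // Collects every referenced path regardless of status, deduplicated by name.
  static Set<String> allFiles(List<Entry> entries) {
    Set<String> paths = new LinkedHashSet<>();
    for (Entry e : entries) {
      paths.add(e.path);
    }
    return paths;
  }

  // Collects only paths that have at least one live (ADDED or EXISTING) entry.
  static Set<String> liveFiles(List<Entry> entries) {
    Set<String> paths = new LinkedHashSet<>();
    for (Entry e : entries) {
      if (e.status != Status.DELETED) {
        paths.add(e.path);
      }
    }
    return paths;
  }

  // Assumed manifest contents of snapshots 2 and 3 once snapshot 1 is expired.
  static List<Entry> entriesAfterExpiration() {
    return Arrays.asList(
        new Entry(Status.DELETED, "d2.parquet"), // snapshot 2: file already gone from disk
        new Entry(Status.ADDED, "d3.parquet"),   // snapshot 2
        new Entry(Status.DELETED, "d3.parquet"), // snapshot 3
        new Entry(Status.ADDED, "d4.parquet"));  // snapshot 3

  }

  public static void main(String[] args) {
    List<Entry> entries = entriesAfterExpiration();
    System.out.println("all entries:  " + allFiles(entries));
    System.out.println("live entries: " + liveFiles(entries));
  }
}
```

In this sketch, scanning all entries yields d2.parquet, d3.parquet, and d4.parquet, while scanning only live entries yields d3.parquet and d4.parquet. d3.parquet is still picked up because the second snapshot holds a live entry for it, even though the third snapshot marks it deleted.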

CC @szehon-ho

