haizhou-zhao opened a new issue, #11103: URL: https://github.com/apache/iceberg/issues/11103
### Apache Iceberg version

1.6.1 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

## Background

When using Spark (or a similar execution engine) on top of REST Catalog to commit a new snapshot, there are 3 critical timestamps:

1. When Spark invokes `SnapshotProducer` to produce a new snapshot, referred to as `snapshotCreationTs`: [ref](https://github.com/apache/iceberg/blob/d17a7f1/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L386)
2. When REST Catalog receives the update request from Spark and commits the new metadata, referred to as `metadataCommitTs`
3. When the commit has finished and Spark refreshes table state, referred to as `tableAccessTs`: [ref](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64)

## Desired behavior

`spark.sql("SELECT * from ${db}.${table}.metadata_log_entries")` means the user is looking for `metadataCommitTs`, while `spark.sql("SELECT * from ${db}.${table}.snapshots")` means the user is looking for `snapshotCreationTs`.

## Issue 1

Traditionally, with Hadoop and Hive Catalog, Spark (using the Iceberg client) is responsible for both generating new snapshots and committing the new metadata files. Those catalogs can therefore enforce that `snapshotCreationTs` and `metadataCommitTs` take on exactly the same value for every new snapshot committed.

With REST Catalog, however, Spark only controls new snapshot generation while the REST Catalog server controls the metadata commit. Depending on the REST Catalog implementation (which this repo does not control), `snapshotCreationTs` and `metadataCommitTs` may or may not be the same.

The current implementation in some cases assumes that the two timestamps take on exactly the same value. For example, this integration test: [ref](https://github.com/apache/iceberg/blob/2240154/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L446).
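To make the assumption concrete, here is a minimal, dependency-free sketch in plain Python of how the two timestamps can diverge. It is not Iceberg code: `CommitResult`, `commit_client_side`, `commit_via_rest`, `timestamps_match`, and the `reuse_client_ts` flag are all hypothetical names introduced for illustration, and the millisecond values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitResult:
    # Stamped by the client when the snapshot is produced (hypothetical model).
    snapshot_creation_ts: int
    # Stamped by whichever party writes the new metadata (hypothetical model).
    metadata_commit_ts: int


def commit_client_side(now_ms: int) -> CommitResult:
    """Hadoop/Hive-style commit: one party does both steps and can reuse one timestamp."""
    return CommitResult(snapshot_creation_ts=now_ms, metadata_commit_ts=now_ms)


def commit_via_rest(client_now_ms: int, server_now_ms: int, reuse_client_ts: bool) -> CommitResult:
    """REST-style commit: the server decides which timestamp the metadata carries.

    `reuse_client_ts` stands in for an implementation detail of the server;
    it is an assumption of this sketch, not a real configuration option.
    """
    commit_ts = client_now_ms if reuse_client_ts else server_now_ms
    return CommitResult(snapshot_creation_ts=client_now_ms, metadata_commit_ts=commit_ts)


def timestamps_match(result: CommitResult) -> bool:
    """The equality that, per this issue, some existing code takes for granted."""
    return result.snapshot_creation_ts == result.metadata_commit_ts


hive_like = commit_client_side(1_000)
rest_same = commit_via_rest(1_000, 1_250, reuse_client_ts=True)
rest_diff = commit_via_rest(1_000, 1_250, reuse_client_ts=False)
print(timestamps_match(hive_like), timestamps_match(rest_same), timestamps_match(rest_diff))
```

In this model only the last case breaks the equality, which is the situation a REST Catalog server is free to produce if it stamps the commit with its own clock.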
It would be best to clarify whether a valid REST Catalog implementation should always enforce that `snapshotCreationTs` and `metadataCommitTs` take on the same value. For the current reference implementation (RESTCatalogAdapter on top of JdbcCatalog), the answer is yes (they take on the same value). However, should this be enforced (and is it feasible to enforce) for all REST Catalog implementations?

## Issue 2

Currently, when loading a table from REST Catalog using LoadTableResponse, the `lastUpdatedMillis` attribute of the metadata (which may hold either `metadataCommitTs` or `snapshotCreationTs` on the REST Catalog side, depending on implementation details) is incorrectly replaced by `tableAccessTs` ([ref1](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64), [ref2](https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L938)). Because Spark depends on `lastUpdatedMillis` to generate the latest `metadataCommitTs` in `metadata_log_entries` ([ref](https://github.com/apache/iceberg/blob/8a70fe0/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L66)), `metadata_log_entries` will always show a wrong timestamp when REST Catalog is used.

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org