haizhou-zhao opened a new issue, #11103:
URL: https://github.com/apache/iceberg/issues/11103

   ### Apache Iceberg version
   
   1.6.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   ## Background
   When Spark (or a similar execution engine) commits a new snapshot through a 
REST Catalog, three timestamps are involved:
   
   1. When Spark invokes `SnapshotProducer` to produce a new snapshot - 
referred to as `snapshotCreationTs`: 
[ref](https://github.com/apache/iceberg/blob/d17a7f1/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L386)
   2. When the REST Catalog receives the update request from Spark and commits 
the new metadata - referred to as `metadataCommitTs`
   3. When the commit has finished and Spark refreshes the table state 
- referred to as `tableAccessTs`: 
[ref](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64)
   
   ## Desired behavior
   A user running `spark.sql("SELECT * from ${db}.${table}.metadata_log_entries")` 
expects to see `metadataCommitTs`, 
   
   while a user running `spark.sql("SELECT * from ${db}.${table}.snapshots")` 
expects to see `snapshotCreationTs`.
   
   
   ## Issue 1
   Traditionally, with the Hadoop and Hive catalogs, Spark (through the Iceberg 
client) is responsible for both generating new snapshots and committing the new 
metadata files, so those catalogs can enforce that `snapshotCreationTs` and 
`metadataCommitTs` take exactly the same value for every committed snapshot. 
With a REST Catalog, however, Spark only controls snapshot generation while the 
REST Catalog server controls the metadata commit. Depending on the REST Catalog 
implementation (which this repo does not control), `snapshotCreationTs` and 
`metadataCommitTs` may or may not be the same.
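
To make the difference between the two commit paths concrete, here is a minimal sketch. It is a simplified model, not Iceberg code: `CommitTimestamps`, `clientSideCommit`, and `restCommit` are hypothetical names, and the clocks are passed in as plain suppliers.

```java
import java.util.function.LongSupplier;

// Hypothetical model of the two commit paths; none of these types exist in Iceberg.
final class CommitTimestamps {
    final long snapshotCreationTs;
    final long metadataCommitTs;

    CommitTimestamps(long snapshotCreationTs, long metadataCommitTs) {
        this.snapshotCreationTs = snapshotCreationTs;
        this.metadataCommitTs = metadataCommitTs;
    }

    // Hadoop/Hive style: one process produces the snapshot and writes the
    // metadata, so it can reuse a single timestamp for both.
    static CommitTimestamps clientSideCommit(LongSupplier clientClock) {
        long ts = clientClock.getAsLong();
        return new CommitTimestamps(ts, ts);
    }

    // REST style: the client stamps the snapshot, and the server may stamp
    // the metadata commit with its own clock when the request arrives.
    static CommitTimestamps restCommit(LongSupplier clientClock, LongSupplier serverClock) {
        long snapshotTs = clientClock.getAsLong();
        long commitTs = serverClock.getAsLong();
        return new CommitTimestamps(snapshotTs, commitTs);
    }
}
```

In this model, `restCommit(() -> 1000L, () -> 1250L)` yields two different values, which is the situation an equality assertion between the two timestamps would not tolerate.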
   
   The current implementation assumes in some places that the two timestamps 
take exactly the same value. For example, this integration test: 
[ref](https://github.com/apache/iceberg/blob/2240154/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L446).
   
   It would be best to clarify whether a valid REST Catalog implementation 
must always enforce that `snapshotCreationTs` and `metadataCommitTs` take the 
same value. The current reference implementation (RESTCatalogAdapter on top of 
JdbcCatalog) does so: the two take the same value. But should this be enforced 
for all REST Catalog implementations, and is that even feasible?
   
   ## Issue 2
   Currently, when a table is loaded from a REST Catalog through 
`LoadTableResponse`, the `lastUpdatedMillis` attribute of the metadata (which, 
depending on the REST Catalog's implementation, may hold either 
`metadataCommitTs` or `snapshotCreationTs`) is incorrectly replaced by 
`tableAccessTs` 
([ref1](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64),
 
[ref2](https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L938)).
 Because Spark depends on `lastUpdatedMillis` to produce the latest 
`metadataCommitTs` in `metadata_log_entries` 
([ref](https://github.com/apache/iceberg/blob/8a70fe0/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L66)),
 the latest entry in `metadata_log_entries` always carries a wrong timestamp 
when a REST Catalog is used.
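
The effect described above can be sketched with a small stand-in model. This is not the actual `TableMetadata` or `LoadTableResponse` code: `MetadataModel`, `rebuildRestamping`, and `rebuildPreserving` are hypothetical names used only to contrast a rebuild that re-stamps the timestamp with one that keeps the server-provided value.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for table metadata; only the field relevant here is modeled.
final class MetadataModel {
    final long lastUpdatedMillis;

    MetadataModel(long lastUpdatedMillis) {
        this.lastUpdatedMillis = lastUpdatedMillis;
    }

    // Models the reported behavior: rebuilding the metadata on load discards
    // the server-provided value and stamps the client's current time instead.
    static MetadataModel rebuildRestamping(MetadataModel fromServer, LongSupplier clientClock) {
        return new MetadataModel(clientClock.getAsLong());
    }

    // Models the behavior the issue implies is expected: the value sent by
    // the catalog survives the rebuild unchanged.
    static MetadataModel rebuildPreserving(MetadataModel fromServer) {
        return new MetadataModel(fromServer.lastUpdatedMillis);
    }
}
```

With a server value of `1000` and a client clock returning `5000`, the first method reports `5000` (the `tableAccessTs`), while the second keeps `1000`. Whether preserving the value is the right fix in Iceberg itself is a separate question for the actual builder code.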
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

