stephensunyj opened a new issue, #15741:
URL: https://github.com/apache/iceberg/issues/15741
### Apache Iceberg version
1.5.0
### Query engine
Spark
### Please describe the bug 🐞
When querying Iceberg from Spark in Scala, running two successive `spark.sql`
queries against two different branches of the same Iceberg table returns the
snapshot of the branch that was queried first for both queries.
### Steps to reproduce
`getTableBranchLatestVersion` is a helper function I wrote to look up the latest
snapshot ID for a given branch name, so that the resolved version for the branch
can be plugged into the query
`SELECT * FROM $fullTableName VERSION AS OF $resolvedVersion`.
```scala
import org.apache.spark.sql.SparkSession

// Assumes `spark`, `fullTableName`, `branchName1` and `branchName2` are already defined.

// Look up the snapshot ID that the given branch currently points to,
// using the table's `refs` metadata table.
def getTableBranchLatestVersion(
    spark: SparkSession,
    fullTableName: String,
    branchName: String): Long = {
  val refs = spark.sql(
    s"""
       SELECT snapshot_id
       FROM $fullTableName.refs
       WHERE name = '$branchName'
     """)
    .collect()
  if (refs.isEmpty) {
    throw new NoSuchElementException(
      s"Branch name $branchName was not found in $fullTableName.refs")
  }
  refs(0).getAs[Long]("snapshot_id")
}

// First query: time travel to the head snapshot of branch 1.
val resolvedVersion1 =
  getTableBranchLatestVersion(spark, fullTableName, branchName1).toString
val df1 = spark.sql(
  s"""
     SELECT *
     FROM $fullTableName
     VERSION AS OF $resolvedVersion1
   """)

// Second query: time travel to the head snapshot of branch 2.
val resolvedVersion2 =
  getTableBranchLatestVersion(spark, fullTableName, branchName2).toString
val df2 = spark.sql(
  s"""
     SELECT *
     FROM $fullTableName
     VERSION AS OF $resolvedVersion2
   """)
```
`df2` contains exactly the same data as `df1`. It looks to me like either Spark
or Iceberg is assuming that the two snapshots are identical and returning the
cached data for `df1` instead of correctly reading the second snapshot.
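
One way to narrow this down might be to compare the snapshot-ID query with a
query that names the branch directly, since Iceberg's Spark documentation
describes `VERSION AS OF '<branch-name>'` for reading the head of a branch. The
sketch below only builds the two query strings; `BranchQueries`, `byBranch` and
`bySnapshot` are hypothetical names introduced for illustration, and it assumes
the branch name contains no single quotes. It is a diagnostic aid, not a
confirmed workaround.

```scala
object BranchQueries {
  // Time travel by naming the branch itself (string literal after VERSION AS OF).
  // Assumption: branchName contains no single quotes.
  def byBranch(fullTableName: String, branchName: String): String =
    s"SELECT * FROM $fullTableName VERSION AS OF '$branchName'"

  // Time travel by explicit snapshot ID, as in the reproduction steps.
  def bySnapshot(fullTableName: String, snapshotId: Long): String =
    s"SELECT * FROM $fullTableName VERSION AS OF $snapshotId"
}
```

With the session from the reproduction steps, the result of
`spark.sql(BranchQueries.byBranch(fullTableName, branchName2))` could be compared
against `df2`. If the two differ, that would suggest the snapshot-ID path is the
one returning stale data.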
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]