[I] Running 2 queries on the same table but different branch in Spark results in first branch's data returned for both queries [iceberg]

via GitHub Mon, 23 Mar 2026 10:04:37 -0700


stephensunyj opened a new issue, #15741:
URL: https://github.com/apache/iceberg/issues/15741


   ### Apache Iceberg version
   
   1.5.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   When querying Iceberg through Spark in Scala, running two successive 
`spark.sql` queries against two different branches of the same Iceberg table 
returns the snapshot of the branch that was queried first for both queries. 
   
   ### Steps to reproduce
   `getTableBranchLatestVersion` is a helper function I wrote to look up the 
latest snapshot ID of a given branch, so that the resolved version can be 
plugged into the `SELECT * FROM $fullTableName VERSION AS OF 
$resolvedVersion` query. 
   
   
   ```scala
   import org.apache.spark.sql.SparkSession

   // Resolve the snapshot ID a branch currently points to, using the
   // table's `refs` metadata table.
   def getTableBranchLatestVersion(
       spark: SparkSession,
       fullTableName: String,
       branchName: String): Long = {
     val refs = spark.sql(
       s"""
          |SELECT snapshot_id
          |FROM $fullTableName.refs
          |WHERE name = '$branchName'
          |""".stripMargin)
       .collect()

     if (refs.isEmpty) {
       throw new NoSuchElementException(
         s"Branch name $branchName was not found in $fullTableName.refs")
     }

     refs(0).getAs[Long]("snapshot_id")
   }

   // `spark`, `fullTableName`, `branchName1` and `branchName2` are defined elsewhere.
   // Query the first branch at its resolved snapshot.
   val resolvedVersion1 =
     getTableBranchLatestVersion(spark, fullTableName, branchName1).toString
   val df1 = spark.sql(
     s"""
        |SELECT *
        |FROM $fullTableName
        |VERSION AS OF $resolvedVersion1
        |""".stripMargin)

   // Query the second branch the same way.
   val resolvedVersion2 =
     getTableBranchLatestVersion(spark, fullTableName, branchName2).toString
   val df2 = spark.sql(
     s"""
        |SELECT *
        |FROM $fullTableName
        |VERSION AS OF $resolvedVersion2
        |""".stripMargin)
   ```
   
   `df2` contains exactly the data returned in `df1`. It looks to me like 
either Spark or Iceberg is treating the two snapshots as identical and 
returning the cached data for `df1` instead of reading the second 
snapshot ID. 
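   
   To help narrow down whether the SQL `VERSION AS OF` path is what reuses the 
first result, the same reads could be issued through other routes and 
compared. The following is only a minimal sketch under assumptions: the object 
and method names (`BranchReadCheck`, `branchQuery`, `readSnapshot`, 
`readBranch`) are hypothetical, and it relies on the `snapshot-id` and 
`branch` read options and the `VERSION AS OF '<branch>'` form described in 
the Iceberg Spark documentation. I have not confirmed that any of these 
routes avoids the behaviour above. 
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}

   // Hypothetical helpers for comparing read paths; not part of Iceberg.
   object BranchReadCheck {

     // Builds a SQL query that names the branch directly instead of
     // resolving its snapshot ID first.
     def branchQuery(fullTableName: String, branchName: String): String =
       s"SELECT * FROM $fullTableName VERSION AS OF '$branchName'"

     // Reads a specific snapshot through the DataFrameReader, using the
     // "snapshot-id" read option instead of SQL time travel.
     def readSnapshot(
         spark: SparkSession,
         fullTableName: String,
         snapshotId: Long): DataFrame =
       spark.read
         .format("iceberg")
         .option("snapshot-id", snapshotId)
         .load(fullTableName)

     // Reads the head of a branch through the DataFrameReader, using the
     // "branch" read option.
     def readBranch(
         spark: SparkSession,
         fullTableName: String,
         branchName: String): DataFrame =
       spark.read
         .format("iceberg")
         .option("branch", branchName)
         .load(fullTableName)
   }
   ```
   
   For example, `BranchReadCheck.readBranch(spark, fullTableName, branchName2)` 
could be compared against `df2`; if the two differ, that would point at the 
SQL time-travel path rather than at the table data itself. 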
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

