One more thing: a work-around is to redefine the view. That discards the original logical plan and table and returns the expected result in Spark 3.
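For illustration, here is a minimal sketch of that workaround, assuming a path-based Iceberg table at /tmp/t1 as in Dongjoon's example below (requires a running Spark session with the Iceberg runtime on the classpath):

```scala
// Spark 3: the temp view captures a Table instance at creation time,
// so it keeps serving the snapshot that was current at that moment.
spark.read.format("iceberg").load("/tmp/t1").createOrReplaceTempView("t2")

// ... concurrent writes to /tmp/t1 happen here;
// `select * from t2` still shows the old snapshot ...

// Workaround: redefine the view. This reloads the table and replaces the
// stale logical plan, so the view reflects the latest snapshot.
spark.read.format("iceberg").load("/tmp/t1").createOrReplaceTempView("t2")
spark.sql("select * from t2").show()
```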
On Mon, Jul 13, 2020 at 11:53 AM Ryan Blue <rb...@netflix.com> wrote:

> Dongjoon,
>
> Thanks for raising this issue. I did some digging, and the problem is that
> in Spark 3.0, the logical plan saves a Table instance with the current
> state from when it was loaded -- when the `createOrReplaceTempView` call
> happened. That never gets refreshed, which is why you get stale data. In
> 2.4, there is no "table" in Spark, so it gets reloaded every time.
>
> In the SQL path, this problem is avoided because the same table instance
> is returned to Spark when the table is loaded from the catalog, thanks to
> table caching. The write will use the same table instance and keep it up to
> date, and a call to `REFRESH TABLE` would also update it.
>
> I'm not sure of the right way to avoid this behavior. We purposely don't
> refresh the table each time Spark plans a scan because we want the version
> of a table that is used to be consistent. A query that performs a
> self-join, for example, should always use the same version of the table. On
> the other hand, when a table reference is held in a logical plan like this,
> there is no mechanism to update it. Maybe each table instance should keep a
> timer and refresh after some interval?
>
> Either way, I agree that this is something we should add to the
> documentation. I don't think that this should fail the release, but we
> should fix it by the next one. Does that sound reasonable to you?
>
> On Sun, Jul 12, 2020 at 6:16 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> I verified the hash/sign/build/UT and manual testing with Apache Spark
>> 2.4.6 (hadoop-2.7) and 3.0.0 (hadoop-3.2) on an Apache Hive 2.3.7 metastore.
>> (BTW, for spark-3.0.0-bin-hadoop3.2, I used Ryan's example command with
>> `spark.sql.warehouse.dir` instead of `spark.warehouse.path`.)
>>
>> 1. Iceberg 0.9 + Spark 2.4.6 works as expected.
>> 2. Iceberg 0.9 + Spark 3.0.0 mostly works as expected, but views behave
>> differently from Iceberg 0.9 + Spark 2.4.6.
>>
>> The following is the example.
>>
>> ...
>> scala> spark.read.format("iceberg").load("/tmp/t1").createOrReplaceTempView("t2")
>>
>> scala> sql("select * from t2").show
>> +---+---+
>> |  a|  b|
>> +---+---+
>> +---+---+
>>
>> scala> Seq(("a", 1), ("b", 2)).toDF("a", "b").write.format("iceberg").mode("overwrite").save("/tmp/t1")
>>
>> scala> sql("select * from t2").show  // Iceberg 0.9 with Spark 2.4.6 shows the updated result correctly here.
>> +---+---+
>> |  a|  b|
>> +---+---+
>> +---+---+
>>
>> scala> spark.read.format("iceberg").load("/tmp/t1").show
>> +---+---+
>> |  a|  b|
>> +---+---+
>> |  a|  1|
>> |  b|  2|
>> +---+---+
>>
>> So far, I'm not sure which part (Iceberg or Spark) introduces this difference,
>> but it would be nice if we had some documentation about the difference between
>> versions if this is designed like this.
>>
>> Thanks,
>> Dongjoon.
>>
>> On Sun, Jul 12, 2020 at 12:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> +1 (binding)
>>>
>>> Verified sigs/sums/license/build/test
>>>
>>> I did have an issue with the test metastore for the spark3 tests on the
>>> first run, but couldn't replicate it in subsequent tests.
>>>
>>> -Dan
>>>
>>> On Fri, Jul 10, 2020 at 10:42 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Verified checksums, ran tests, staged convenience binaries.
>>>>
>>>> I also ran a few tests using Spark 3.0.0 and Spark 2.4.5 and the
>>>> runtime Jars.
>>>> For anyone that would like to use spark-sql or spark-shell,
>>>> here are the commands that I used:
>>>>
>>>> ~/Apps/spark-3.0.0-bin-hadoop2.7/bin/spark-sql \
>>>>   --repositories https://repository.apache.org/content/repositories/orgapacheiceberg-1008 \
>>>>   --packages org.apache.iceberg:iceberg-spark3-runtime:0.9.0 \
>>>>   --conf spark.warehouse.path=$PWD/spark-warehouse \
>>>>   --conf spark.hadoop.hive.metastore.uris=thrift://localhost:42745 \
>>>>   --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>>>>   --conf spark.sql.catalog.spark_catalog.type=hive \
>>>>   --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
>>>>   --conf spark.sql.catalog.hive_prod.type=hive \
>>>>   --conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
>>>>   --conf spark.sql.catalog.hadoop_prod.type=hadoop \
>>>>   --conf spark.sql.catalog.hadoop_prod.warehouse=$PWD/hadoop-warehouse
>>>>
>>>> ~/Apps/spark-2.4.5-bin-hadoop2.7/bin/spark-shell \
>>>>   --repositories https://repository.apache.org/content/repositories/orgapacheiceberg-1008 \
>>>>   --packages org.apache.iceberg:iceberg-spark-runtime:0.9.0 \
>>>>   --conf spark.hadoop.hive.metastore.uris=thrift://localhost:42745 \
>>>>   --conf spark.warehouse.path=$PWD/spark-warehouse
>>>>
>>>> The Spark 3 command sets up 3 catalogs:
>>>>
>>>> - A wrapper around the built-in catalog, spark_catalog, that adds
>>>>   support for Iceberg tables
>>>> - A hive_prod Iceberg catalog that uses the same metastore as the
>>>>   session catalog
>>>> - A hadoop_prod Iceberg catalog that stores tables in a
>>>>   hadoop-warehouse folder
>>>>
>>>> Everything worked great, except for a minor issue with CTAS, #1194
>>>> <https://github.com/apache/iceberg/pull/1194>. I’m okay to release
>>>> with that issue, but we can always build a new RC if anyone thinks it is a
>>>> blocker.
>>>>
>>>> If you’d like to run tests in a downstream project, you can use the
>>>> staged binary artifacts by adding this to your gradle build:
>>>>
>>>> repositories {
>>>>   maven {
>>>>     name 'stagedIceberg'
>>>>     url 'https://repository.apache.org/content/repositories/orgapacheiceberg-1008/'
>>>>   }
>>>> }
>>>>
>>>> ext {
>>>>   icebergVersion = '0.9.0'
>>>> }
>>>>
>>>> On Fri, Jul 10, 2020 at 9:20 AM Ryan Murray <rym...@dremio.com> wrote:
>>>>
>>>>> 1. Verify the signature: OK
>>>>> 2. Verify the checksum: OK
>>>>> 3. Untar the archive tarball: OK
>>>>> 4. Run RAT checks to validate license headers: RAT checks passed
>>>>> 5. Build and test the project: all unit tests passed.
>>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> I did see that my build took >12 minutes and used 100% of all 8
>>>>> cores & 32GB of memory (openjdk-8, ubuntu 18.04), which I hadn't noticed
>>>>> before.
>>>>>
>>>>> On Fri, Jul 10, 2020 at 4:37 AM OpenInx <open...@gmail.com> wrote:
>>>>>
>>>>>> I followed the verify guide here (
>>>>>> https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E):
>>>>>>
>>>>>> 1. Verify the signature: OK
>>>>>> 2. Verify the checksum: OK
>>>>>> 3. Untar the archive tarball: OK
>>>>>> 4. Run RAT checks to validate license headers: RAT checks passed
>>>>>> 5. Build and test the project: all unit tests passed.
>>>>>>
>>>>>> +1 (non-binding).
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 9:46 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I propose the following RC to be released as the official Apache
>>>>>>> Iceberg 0.9.0 release.
>>>>>>>
>>>>>>> The commit id is 4e66b4c10603e762129bc398146e02d21689e6dd
>>>>>>> * This corresponds to the tag: apache-iceberg-0.9.0-rc5
>>>>>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.9.0-rc5
>>>>>>> * https://github.com/apache/iceberg/tree/4e66b4c1
>>>>>>>
>>>>>>> The release tarball, signature, and checksums are here:
>>>>>>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
>>>>>>>
>>>>>>> You can find the KEYS file here:
>>>>>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>>>>
>>>>>>> Convenience binary artifacts are staged in Nexus. The Maven
>>>>>>> repository URL is:
>>>>>>> * https://repository.apache.org/content/repositories/orgapacheiceberg-1008/
>>>>>>>
>>>>>>> This release includes support for Spark 3 and vectorized reads for
>>>>>>> flat schemas in Spark.
>>>>>>>
>>>>>>> Please download, verify, and test.
>>>>>>>
>>>>>>> Please vote in the next 72 hours.
>>>>>>>
>>>>>>> [ ] +1 Release this as Apache Iceberg 0.9.0
>>>>>>> [ ] +0
>>>>>>> [ ] -1 Do not release this because...
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix