[ https://issues.apache.org/jira/browse/SPARK-25717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinhua Fu updated SPARK-25717:
------------------------------

Description:

Consider the following scenario:
{code:java}
spark.range(100).createTempView("temp")
(0 until 3).foreach { _ =>
  spark.sql("drop table if exists tableA")
  spark.sql("create table if not exists tableA(a int) partitioned by (p int) location 'file:/e:/study/warehouse/tableA'")
  spark.sql("insert overwrite table tableA partition(p=1) select * from temp")
  spark.sql("select count(1) from tableA where p=1").show
}
{code}
We expect the count to be 100 every time, but the actual results are as follows:
{code:java}
+--------+
|count(1)|
+--------+
|     100|
+--------+

+--------+
|count(1)|
+--------+
|     200|
+--------+

+--------+
|count(1)|
+--------+
|     300|
+--------+
{code}
When Spark executes an `insert overwrite` command, it first looks up the existing partition and then deletes that partition's data from the file system. But when an external, partitioned table is dropped and recreated, the `drop table` command removes all of its partitions from the metastore while leaving the data files untouched. The recreated table therefore has no registered partitions for `insert overwrite` to clean up, so the historical data is preserved and the query results are incorrect. A possible workaround is sketched after the issue summary below.

> Insert overwrite a recreated external and partitioned table may result in incorrect query results
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25717
>                 URL: https://issues.apache.org/jira/browse/SPARK-25717
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Jinhua Fu
>            Priority: Major
>
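A possible workaround, until this is addressed in Spark itself, is to clear the external location explicitly when recreating the table, so that no stale partition data can survive the drop/create cycle. The sketch below reuses the repro above with the same `spark` session and table location; the explicit file-system cleanup is my own suggestion, not Spark's proposed fix.
{code:java}
import org.apache.hadoop.fs.Path

// Workaround sketch (assumption): delete leftover data files after dropping the
// external table, so the recreated table starts from an empty location.
val tableLocation = new Path("file:/e:/study/warehouse/tableA")
val fs = tableLocation.getFileSystem(spark.sparkContext.hadoopConfiguration)

spark.range(100).createTempView("temp")
(0 until 3).foreach { _ =>
  spark.sql("drop table if exists tableA")
  // `drop table` only removes the metastore entries for an external table,
  // so remove the orphaned data files ourselves.
  if (fs.exists(tableLocation)) fs.delete(tableLocation, true)
  spark.sql("create table if not exists tableA(a int) partitioned by (p int) location 'file:/e:/study/warehouse/tableA'")
  spark.sql("insert overwrite table tableA partition(p=1) select * from temp")
  spark.sql("select count(1) from tableA where p=1").show  // prints 100 on every iteration
}
{code}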