[ https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374040#comment-16374040 ]

Xiaoju Wu commented on SPARK-9278:
----------------------------------

The issue still seems to exist. Here is the test:

{code}
val data = Seq(
  (7, "test1", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)

import spark.implicits._

val table = "default.tbl"
spark
  .createDataset(data)
  .toDF("col1", "col2", "col3")
  .write
  .partitionBy("col1")
  .saveAsTable(table)

val data2 = Seq(
  (7, "test2", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)
spark
  .createDataset(data2)
  .toDF("col1", "col2", "col3")
  .write
  .insertInto(table)

sql("select * from " + table).show()
{code}

{noformat}
+---------+----+----+
|     col2|col3|col1|
+---------+----+----+
|test#test| 0.0|   8|
|    test1| 1.0|   7|
|    test3| 0.0|   9|
|        8|null|   0|
|        9|null|   0|
|        7|null|   1|
+---------+----+----+
{noformat}

No exception was thrown, since insertInto was not combined with partitionBy, but the data were inserted incorrectly. The issue is related to column order: partitionBy("col1") makes saveAsTable move the partition column to the end of the table schema (col2, col3, col1), while insertInto matches columns by position, so the second dataset's columns land in the wrong slots. If I instead partition by col3, which is already the last column, it works:

{code}
val data = Seq(
  (7, "test1", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)

import spark.implicits._

val table = "default.tbl"
spark
  .createDataset(data)
  .toDF("col1", "col2", "col3")
  .write
  .partitionBy("col3")
  .saveAsTable(table)

val data2 = Seq(
  (7, "test2", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)
spark
  .createDataset(data2)
  .toDF("col1", "col2", "col3")
  .write
  .insertInto(table)

sql("select * from " + table).show()
{code}

{noformat}
+----+---------+----+
|col1|     col2|col3|
+----+---------+----+
|   8|test#test| 0.0|
|   9|    test3| 0.0|
|   8|test#test| 0.0|
|   9|    test3| 0.0|
|   7|    test1| 1.0|
|   7|    test2| 1.0|
+----+---------+----+
{noformat}

> DataFrameWriter.insertInto inserts incorrect data
> -------------------------------------------------
>
>                 Key: SPARK-9278
>                 URL: https://issues.apache.org/jira/browse/SPARK-9278
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>        Environment: Linux, S3, Hive Metastore
>           Reporter: Steve Lindemann
>           Assignee: Cheng Lian
>           Priority: Critical
>
> After creating a partitioned Hive
> table (stored as Parquet) via the DataFrameWriter.createTable command,
> subsequent attempts to insert additional data into new partitions of this
> table result in inserting incorrect data rows. Reordering the columns in the
> data to be written seems to avoid this issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
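A minimal sketch of the usual workaround for the comment above (assuming a running SparkSession named spark and the data2/table values from the snippets; the df2 name is hypothetical): since insertInto resolves columns by position against the table schema, explicitly reorder the DataFrame's columns to the table's stored column order before writing.

{code}
import org.apache.spark.sql.functions.col

// Columns in the order the table actually stores them,
// e.g. Array("col2", "col3", "col1") after partitionBy("col1"),
// because saveAsTable moves partition columns to the end.
val tableCols = spark.table(table).columns

val df2 = spark.createDataset(data2).toDF("col1", "col2", "col3")

// Select the columns in the table's order so positional
// resolution in insertInto maps each column correctly.
df2.select(tableCols.map(col): _*)
  .write
  .insertInto(table)
{code}

This keeps insertInto's positional semantics but makes the caller's column order match the table, so the mis-mapped rows shown above do not occur.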