[ 
https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maria Rebelka updated SPARK-28411:
----------------------------------
    Description: 
The call df.write.mode("overwrite").insertInto("table") behaves inconsistently 
between Scala and Python. In Python, insertInto ignores the "mode" setting and 
appends by default. The expected behaviour is obtained only by changing the call 
to df.write.insertInto("table", overwrite=True).

This is native Spark syntax and is expected to behave the same across languages. 
Other write methods, such as saveAsTable and write.parquet, do seem to respect 
"mode".
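
Until the behaviour is aligned, the keyword form mentioned above can be wrapped in a small helper so that call sites do not depend on mode(). This is only a minimal sketch: overwrite_into is a hypothetical name, and it assumes df is a PySpark DataFrame whose write.insertInto(tableName, overwrite=...) behaves as described in this report.

```python
def overwrite_into(df, table_name):
    """Replace the contents of an existing table with the rows of df.

    Hypothetical helper: it passes overwrite=True directly to insertInto
    instead of relying on df.write.mode("overwrite"), which this report
    found to be ignored in Python.
    """
    df.write.insertInto(table_name, overwrite=True)
```

With such a helper, overwrite_into(df, "spark_overwrite_issue") would stand in for the df.write.mode("overwrite").insertInto(...) call in the Python reproduction.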

To reproduce in Python ("overwrite" is ignored):
{code:python}
df = spark.createDataFrame(sc.parallelize([(1, 2), (3, 4)]), ['i', 'j'])

# create the table and load the data
df.write.saveAsTable("spark_overwrite_issue")

# insert with overwrite; expected result: 2 rows
df.write.mode("overwrite").insertInto("spark_overwrite_issue")

spark.sql("select * from spark_overwrite_issue").count()
# actual result: 4 rows; the insert appended instead of overwriting{code}
To reproduce in Scala (works as expected):
{code:scala}
val df = Seq((1, 2), (3, 4)).toDF("i", "j")

df.write.mode("overwrite").insertInto("spark_overwrite_issue")

spark.sql("select * from spark_overwrite_issue").count()
// result: 2 rows{code}
Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks)
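
One way such a discrepancy could arise, offered purely as an assumption and not confirmed by this report, is a Python-side overwrite keyword with a falsy default that resets the save mode before delegating. The toy classes below are hypothetical and are not PySpark code; they only illustrate how a default argument could silently discard an earlier mode("overwrite") call and reproduce the 2-row versus 4-row counts seen above.

```python
class ToyWriter:
    """Hypothetical stand-in for a DataFrame writer; not PySpark code."""

    def __init__(self, table, rows):
        self._table = table    # list acting as the target table
        self._rows = rows      # rows to be written
        self._mode = "append"  # default save mode in this toy model

    def mode(self, save_mode):
        self._mode = save_mode
        return self

    def insert_into_respecting_mode(self):
        # Behaviour reported for Scala: the previously set mode is used.
        self._write()

    def insert_into_with_default(self, overwrite=False):
        # Assumed Python-side behaviour: the keyword default replaces
        # whatever mode() set earlier.
        self._mode = "overwrite" if overwrite else "append"
        self._write()

    def _write(self):
        if self._mode == "overwrite":
            self._table.clear()
        self._table.extend(self._rows)


rows = [(1, 2), (3, 4)]

# Each "table" starts with the two rows loaded by the initial save.
scala_like = list(rows)
ToyWriter(scala_like, rows).mode("overwrite").insert_into_respecting_mode()

python_like = list(rows)
ToyWriter(python_like, rows).mode("overwrite").insert_into_with_default()

print(len(scala_like), len(python_like))  # prints: 2 4
```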


> insertInto with overwrite inconsistent behaviour Python/Scala
> -------------------------------------------------------------
>
>                 Key: SPARK-28411
>                 URL: https://issues.apache.org/jira/browse/SPARK-28411
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.2.1, 2.4.0
>            Reporter: Maria Rebelka
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
