Artem Kupchinskiy created SPARK-48458:
-----------------------------------------

             Summary: Dynamic partition override mode might be ignored in 
certain scenarios causing data loss
                 Key: SPARK-48458
                 URL: https://issues.apache.org/jira/browse/SPARK-48458
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.1, 2.4.8, 4.0.0
            Reporter: Artem Kupchinskiy


If an active Spark session is stopped in the middle of an insert into a file
system, the session config responsible for partition overwrite behavior might
not be respected. The failure scenario is basically the following:
 # The Spark context is stopped just before [getting the partition override mode setting|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L69]
 # Due to the [fallback config usage in case of a stopped Spark context|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L121], this mode evaluates to static (the default mode in the default SQLConf used as a fallback), as sketched below
 # The data is then cleared entirely [here|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L131], which is effectively data loss from the perspective of a user who intended to overwrite the data only partially
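
The fallback can be observed directly, without the insert path at all. A minimal sketch (local mode; `SQLConf` and `PARTITION_OVERWRITE_MODE` are Spark internals from `org.apache.spark.sql.internal`, used here only to make the fallback visible):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

object PartitionOverwriteFallbackDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()

    // While the context is alive, SQLConf.get resolves to the active
    // session's conf, so the mode is DYNAMIC.
    println(SQLConf.get.getConf(SQLConf.PARTITION_OVERWRITE_MODE))

    spark.stop()

    // Once the context is stopped, SQLConf.get falls back to the default
    // SQLConf, so the mode silently reverts to STATIC, which is the value
    // the insert command would pick up at this point.
    println(SQLConf.get.getConf(SQLConf.PARTITION_OVERWRITE_MODE))
  }
}
{code}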

This 
[gist|https://gist.github.com/akupchinskiy/b5f31781d59e5c0e9b172e7de40132cd] 
reproduces the behavior. On my local machine, it takes 1-3 iterations for the 
pre-created data to be cleared entirely.
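
For convenience, a condensed sketch of the kind of race the gist exercises (my own paraphrase, not the gist itself; the output path and sleep interval are made up, see the gist for the actual code):
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical output path and timing; the linked gist is the actual repro.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()
import spark.implicits._

val path = "/tmp/spark-48458-demo"

// Pre-create two partitions, p=a and p=b.
Seq((1, "a"), (2, "b")).toDF("value", "p")
  .write.partitionBy("p").mode("overwrite").parquet(path)

// Stop the session from another thread while the insert below runs.
new Thread(() => {
  Thread.sleep(scala.util.Random.nextInt(100))
  spark.stop()
}).start()

// This write intends to replace only partition p=a. If the stop lands just
// before the mode is read, the command sees the static default and clears
// the whole path first, deleting p=b as well.
Seq((3, "a")).toDF("value", "p")
  .write.partitionBy("p").mode("overwrite").parquet(path)
{code}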

A mitigation for this bug is to use the explicit write option 
`partitionOverwriteMode` instead of relying on the session configuration.
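
A minimal sketch of that mitigation, with a hypothetical DataFrame `df` partitioned by a column `p` and a made-up output path:
{code:scala}
// Passing the mode as a per-write option pins it for this write, so a
// stopped session cannot silently flip it back to static via the conf
// fallback.
df.write
  .partitionBy("p")
  .option("partitionOverwriteMode", "dynamic")
  .mode("overwrite")
  .parquet("/tmp/output")
{code}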


