Artem Kupchinskiy created SPARK-48458:
-----------------------------------------
Summary: Dynamic partition override mode might be ignored in certain scenarios, causing data loss
Key: SPARK-48458
URL: https://issues.apache.org/jira/browse/SPARK-48458
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.1, 2.4.8, 4.0.0
Reporter: Artem Kupchinskiy

If an active Spark session is stopped in the middle of an insert into a file system, the session config that controls partition overwrite behavior might not be respected. The failure scenario is basically the following:

# The Spark context is stopped just before [getting the partition override mode setting|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L69].
# Due to the [fallback config used when the Spark context is stopped|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L121], the mode evaluates to static (the default mode in the default SQLConf used as the fallback).
# The data is cleared entirely [here|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L131], which is effectively data loss from the perspective of a user who intended to overwrite the data only partially.

This [gist|https://gist.github.com/akupchinskiy/b5f31781d59e5c0e9b172e7de40132cd] reproduces the behavior; a condensed sketch of the same race is shown below. On my local machine, it takes 1-3 iterations for pre-created data to be cleared entirely.

A mitigation for this bug is to use the explicit write option `partitionOverwriteMode` instead of relying on the session configuration, as shown in the second sketch below.
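For reference, a condensed sketch of the race described in the steps above, assuming a local path-based Parquet table (the path, schema, and sleep timing are illustrative assumptions, not taken from the gist):

{code:scala}
import org.apache.spark.sql.SparkSession

object PartitionOverwriteRace {
  def main(args: Array[String]): Unit = {
    val path = "/tmp/partition_overwrite_race"  // illustrative path
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()
    import spark.implicits._

    // Seed two partitions; a dynamic overwrite of pt=1 must preserve pt=2.
    Seq((1, 1), (2, 2)).toDF("value", "pt")
      .write.partitionBy("pt").mode("overwrite").parquet(path)

    // Stop the session concurrently with the insert below. If the context
    // stops before InsertIntoHadoopFsRelationCommand reads
    // spark.sql.sources.partitionOverwriteMode, the fallback SQLConf
    // reports STATIC and the whole table root is truncated, not just pt=1.
    new Thread(() => { Thread.sleep(50); spark.stop() }).start()

    try {
      Seq((10, 1)).toDF("value", "pt")
        .write.partitionBy("pt").mode("overwrite").parquet(path)
    } catch {
      case e: Exception => println(s"insert failed: ${e.getMessage}")
    }
    // After a lost race, partition pt=2 may be gone under `path`.
  }
}
{code}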
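And a sketch of the mitigation, again with illustrative names: the per-write `partitionOverwriteMode` option takes precedence over the session configuration, so the overwrite stays dynamic even when the session-level config can no longer be consulted.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((10, 1)).toDF("value", "pt")

// Pin the overwrite mode on the write itself instead of relying on
// spark.sql.sources.partitionOverwriteMode in the session conf.
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("pt")
  .parquet("/tmp/partition_overwrite_race")  // illustrative path
{code}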