[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598511#comment-17598511 ]
Sean R. Owen commented on SPARK-40284:
--------------------------------------

You have a race condition where two requests try to delete and then write. I don't think this is a Spark issue.

> Spark concurrent overwrite-mode writes to the same HDFS path succeed for all requests
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-40284
>                 URL: https://issues.apache.org/jira/browse/SPARK-40284
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 3.0.1
>            Reporter: Liu
>            Priority: Major
>
> We use Spark as a service: the same Spark service handles multiple requests, and I have a problem with this.
> When multiple requests overwrite the same directory at the same time, the results of both overwrite requests can be written successfully. I don't think this meets the definition of an overwrite.
> First I ran write SQL 1, then write SQL 2, and found that both results were present in the end, which I think is unreasonable:
> {code:java}
> sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time))
>
> -- write sql1
> sparkSession.sql("select 1 as id, sleep(40000) as time").write.mode(SaveMode.Overwrite).parquet("path")
>
> -- write sql2
> sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path")
> {code}
> Reading the Spark source, I saw that all of this logic lives in the InsertIntoHadoopFsRelationCommand class.
>
> When the target directory already exists, Spark deletes it and writes into the _temporary directory it creates for its own job. But when multiple requests write concurrently, the data from all of them ends up appended. For example, with the write SQL above, the following sequence occurs:
> 1. Write SQL 1 executes; Spark creates the _temporary directory for SQL 1 and continues.
> 2. Write SQL 2 executes; Spark deletes the entire target directory (including SQL 1's _temporary) and creates its own _temporary.
> 3. SQL 2 writes its data.
> 4. SQL 1 finishes its computation. Its _temporary/0/attempt_id directory no longer exists, so the task fails. The task is then retried, but the _temporary directory is not deleted on retry, so SQL 1's result is appended to the target directory.
>
> Given the above write process, could Spark check the directory before the write task, or otherwise avoid this kind of problem?
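Since the comment above treats this as an application-level race rather than a Spark bug, one workaround is to serialize overwrites in the service itself. Below is a minimal sketch, assuming both requests run through the same driver JVM as in this report; SerializedOverwrite and overwriteParquet are illustrative names, not Spark APIs.

{code:java}
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper (not a Spark API): serialize overwrite jobs per
// target path so one request's delete-then-write cannot interleave with
// another's. This only works when all requests share one driver JVM.
object SerializedOverwrite {
  private val locks = new ConcurrentHashMap[String, Object]()

  def overwriteParquet(df: DataFrame, path: String): Unit = {
    // One lock object per distinct output path.
    val lock = locks.computeIfAbsent(path, _ => new Object)
    lock.synchronized {
      df.write.mode(SaveMode.Overwrite).parquet(path)
    }
  }
}

// Usage with the two queries from the report:
// SerializedOverwrite.overwriteParquet(
//   sparkSession.sql("select 1 as id, sleep(40000) as time"), "path")
// SerializedOverwrite.overwriteParquet(
//   sparkSession.sql("select 2 as id, 1 as time"), "path")
{code}

If the requests come from separate driver processes, a JVM-level lock is not enough; each job would instead need to write to its own staging directory and publish the result with a single filesystem rename, or coordinate through an external lock.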