[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598511#comment-17598511 ]
Sean R. Owen commented on SPARK-40284:
--------------------------------------

You have a race condition where two requests try to delete and then write. I don't think this is a Spark issue.

> Spark concurrent overwrite-mode writes to the same HDFS path succeed for all requests
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-40284
>                 URL: https://issues.apache.org/jira/browse/SPARK-40284
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 3.0.1
>            Reporter: Liu
>            Priority: Major
>
> We use Spark as a service: the same Spark service handles multiple requests, and I have a problem with this.
> When multiple requests overwrite the same directory at the same time, the results of both overwrite requests can be written successfully. I don't think this meets the definition of an overwrite.
> First I ran write SQL 1, then write SQL 2, and found that both results were present in the end, which I think is unreasonable:
> {code:java}
> sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time))
>
> -- write sql1
> sparkSession.sql("select 1 as id, sleep(40000) as time").write.mode(SaveMode.Overwrite).parquet("path")
>
> -- write sql2
> sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path")
> {code}
> Reading the Spark source, I saw that all of this logic lives in the InsertIntoHadoopFsRelationCommand class.
>
> When the target directory already exists, Spark deletes it and writes into the _temporary directory it creates for its own job. But when multiple requests write concurrently, the data from all of them ends up appended. For example, with the write SQL above, the following sequence occurs:
> 1. Write SQL 1 executes; Spark creates the _temporary directory for SQL 1 and continues.
> 2. Write SQL 2 executes; Spark deletes the entire target directory (including SQL 1's _temporary) and creates its own _temporary.
> 3. SQL 2 writes its data.
> 4. SQL 1 finishes its computation. Its _temporary/0/attempt_id directory no longer exists, so the task fails. The task is then retried, but the _temporary directory is not deleted on retry, so SQL 1's result is appended to the target directory.
>
> Given the above write process, could Spark check the directory before the write task, or otherwise avoid this kind of problem?
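Since the comment above treats this as an application-level race rather than a Spark bug, one workaround is to serialize overwrites in the service itself. Below is a minimal sketch, assuming both requests run through the same driver JVM as in this report; SerializedOverwrite and overwriteParquet are illustrative names, not Spark APIs.

{code:java}
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper (not a Spark API): serialize overwrite jobs per
// target path so one request's delete-then-write cannot interleave with
// another's. This only works when all requests share one driver JVM.
object SerializedOverwrite {
  private val locks = new ConcurrentHashMap[String, Object]()

  def overwriteParquet(df: DataFrame, path: String): Unit = {
    // One lock object per distinct output path.
    val lock = locks.computeIfAbsent(path, _ => new Object)
    lock.synchronized {
      df.write.mode(SaveMode.Overwrite).parquet(path)
    }
  }
}

// Usage with the two queries from the report:
// SerializedOverwrite.overwriteParquet(
//   sparkSession.sql("select 1 as id, sleep(40000) as time"), "path")
// SerializedOverwrite.overwriteParquet(
//   sparkSession.sql("select 2 as id, 1 as time"), "path")
{code}

If the requests come from separate driver processes, a JVM-level lock is not enough; each job would instead need to write to its own staging directory and publish the result with a single filesystem rename, or coordinate through an external lock.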