[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liu updated SPARK-40284:
------------------------
    Description: 
We use Spark as a service: the same Spark service has to handle multiple 
requests, and I have run into a problem with this.

When multiple requests overwrite the same directory at the same time, the 
results of both overwrite requests can end up written successfully. I think 
this does not meet the semantics of an overwrite write.

I ran write SQL1 first, then write SQL2, and found that the data from both 
writes was present at the end, which seems unreasonable:
{code:java}
import org.apache.spark.sql.SaveMode

sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time))

// write sql1
sparkSession.sql("select 1 as id, sleep(40000) as time")
  .write.mode(SaveMode.Overwrite).parquet("path")

// write sql2
sparkSession.sql("select 2 as id, 1 as time")
  .write.mode(SaveMode.Overwrite).parquet("path")
{code}
Reading the Spark source, I found that all of this logic lives in the 
InsertIntoHadoopFsRelationCommand class.
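A simplified sketch of that flow as I understand it (this is my paraphrase, 
not the actual source; overwriteSketch is a name I made up):
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Simplified paraphrase of the Overwrite branch in
// InsertIntoHadoopFsRelationCommand (not the actual source). If the
// target directory exists it is deleted up front; the write job then
// stages task output under <outputPath>/_temporary and moves the
// files into outputPath at commit time.
def overwriteSketch(fs: FileSystem, outputPath: Path, mode: SaveMode): Unit = {
  if (mode == SaveMode.Overwrite && fs.exists(outputPath)) {
    // Nothing coordinates with other writers here, which is what
    // allows the race described below.
    fs.delete(outputPath, true) // recursive delete
  }
  // FileFormatWriter.write(...) would run here, creating
  // <outputPath>/_temporary for the in-flight task attempts.
}
{code}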

 

When the target directory already exists, Spark deletes it outright and then 
writes to the _temporary directory it creates underneath. When multiple 
requests write concurrently, however, the data from all of them ends up 
appended to the directory. With the write SQL above, for example, the 
following sequence occurs:

1. Write SQL1 starts, creates the _temporary directory, and proceeds.

2. Write SQL2 starts, deletes the entire target directory (including SQL1's 
staging data), and creates its own _temporary directory.

3. SQL2 writes its data.

4. SQL1's tasks go to commit. Their _temporary/0/attempt_id directories no 
longer exist, so the tasks fail. The tasks are retried, but the retry does 
not delete the target directory, so the execution result of SQL1 is 
effectively appended to it (illustrated below).
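To make step 4 concrete, here is a small helper I would use to inspect the 
staging layout while a job is in flight (dumpStaging is my own illustrative 
code; the attempt IDs in the comment are just examples):
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Illustrative helper (not part of Spark): recursively list everything
// under the output path while a job is running. In the race above,
// SQL2's delete removes the <path>/_temporary/0/... directories that
// SQL1's still-running tasks expect to commit into.
def dumpStaging(spark: SparkSession, out: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val files = fs.listFiles(new Path(out), true) // recursive
  while (files.hasNext) println(files.next().getPath)
  // Typical in-flight layout (names illustrative):
  //   <out>/_temporary/0/_temporary/attempt_.../part-00000-....parquet
}
{code}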

 

Given this process, could the write path check the directory before the 
write task commits, or use some other mechanism to avoid this kind of 
problem?
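Until Spark guards against this itself, one workaround on the service side is 
to serialize overwrites of the same path. A minimal sketch, assuming all 
writes to a given path go through the same JVM (PathLockingWriter is a 
hypothetical helper of ours, not a Spark API):
{code:java}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical service-side workaround: take a per-path lock before
// writing, so two Overwrite jobs never interleave on the same
// directory. This only helps when every writer goes through this
// helper in a single JVM; cross-process writers would need an
// external lock (e.g. ZooKeeper).
object PathLockingWriter {
  private val locks = new ConcurrentHashMap[String, ReentrantLock]()

  def overwriteParquet(df: DataFrame, path: String): Unit = {
    val lock = locks.computeIfAbsent(path, _ => new ReentrantLock())
    lock.lock()
    try df.write.mode(SaveMode.Overwrite).parquet(path)
    finally lock.unlock()
  }
}
{code}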
> Spark concurrent overwrite mode writing to the same HDFS directory: data from all requests is written successfully
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40284
>                 URL: https://issues.apache.org/jira/browse/SPARK-40284
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 3.0.1
>            Reporter: Liu
>            Priority: Major


