[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-11-27 Thread jeanlyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790135#comment-17790135
 ] 

jeanlyn commented on SPARK-43106:
-

I think we encountered a similar problem; we work around it by setting the 
parameter *spark.sql.hive.convertInsertingPartitionedTable=false*.
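
A minimal sketch of applying that workaround at session construction (the app 
name and master are illustrative; note the flag only affects inserts into 
partitioned ORC/Parquet Hive tables that Spark would otherwise convert to its 
native writer):

{code:scala}
import org.apache.spark.sql.SparkSession

// With the conversion disabled, inserts into partitioned Hive tables go
// through Hive's SerDe path instead of Spark's native datasource writer,
// which is the code path that deletes the old data before the job runs.
val spark = SparkSession.builder()
  .master("local[*]")  // illustrative
  .appName("insert-overwrite-workaround")
  .enableHiveSupport()
  .config("spark.sql.hive.convertInsertingPartitionedTable", "false")
  .getOrCreate()
{code}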

> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vaibhav Beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, 
> Spark behaves as follows:
> 1) It first cleans up all the data from the actual table path.
> 2) It then launches a job that performs the actual insert.
>  
> There are two major issues with this approach:
> 1) If the insert job launched in step 2 fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 takes a very long time to complete, the 
> table data is unavailable to other readers for the entire duration of the 
> job.
> The behavior is the same for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch and 
> not as part of the job commit operation? This issue does not exist in Hive, 
> where the data is presumably cleaned up as part of the job commit operation. 
> As part of SPARK-19183, we added a new hook in the commit protocol for this 
> exact purpose, but its default behavior seems to still be to delete the 
> table data before the job launch.






[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-06-05 Thread Vaibhav Beriwala (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729193#comment-17729193
 ] 

Vaibhav Beriwala commented on SPARK-43106:
--

[~dongjoon] Did you get a chance to take a look at this?

Any feedback from you would be really helpful. As [~itskals] mentioned, we at 
Uber are willing to work on this - we are just looking for any gotchas as to 
why the idea proposed in SPARK-19183 was not taken forward.
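
For context, here is a hedged sketch of the direction SPARK-19183 opened up. 
*deleteWithJob* and *commitJob* are the real FileCommitProtocol hooks 
(*deleteWithJob* is the one SPARK-19183 added, and its default still deletes 
immediately); the class name and the deferred-delete bookkeeping are 
illustrative only:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Hypothetical protocol: record requested deletes instead of performing them
// eagerly, then apply them only once the job commits. A real implementation
// would need to keep the staged new output out of the deleted paths.
class DeferredDeleteCommitProtocol(jobId: String, outputPath: String)
    extends HadoopMapReduceCommitProtocol(jobId, outputPath) {

  @transient private var pendingDeletes: Seq[(FileSystem, Path, Boolean)] = Nil

  override def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    pendingDeletes :+= ((fs, path, recursive))  // defer instead of deleting now
    true
  }

  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    // Only now, with every task finished, remove the old data. Naively skip
    // hidden entries (e.g. _temporary) so the staged new output survives.
    pendingDeletes.foreach { case (fs, dir, recursive) =>
      fs.listStatus(dir)
        .filterNot(s => s.getPath.getName.startsWith("_") || s.getPath.getName.startsWith("."))
        .foreach(s => fs.delete(s.getPath, recursive))
    }
    super.commitJob(jobContext, taskCommits)
  }
}
{code}

A protocol like this could presumably be wired in via the internal 
*spark.sql.sources.commitProtocolClass* config; whether commit-time deletion 
can be made safe on non-atomic filesystems is exactly the open question here.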







[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-29 Thread Vaibhav Beriwala (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717917#comment-17717917
 ] 

Vaibhav Beriwala commented on SPARK-43106:
--

[~dongjoon] Thank you for taking a look at this issue.

1) This is not specific to Spark-3.3.2; the issue exists on the master 
branch as well.

2) As [~itskals] mentioned above, at Uber we mostly use HDFS as the storage 
backend, but the same issue would exist for cloud object storage as well.

3) Running an *_INSERT OVERWRITE TABLE_* query over any unpartitioned table 
reproduces the issue quickly: you will notice that Spark first cleans up the 
table output path and then launches the job that computes the new data. See 
the sketch below.
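
A hedged repro sketch (the table name, failing UDF, and local master are mine, 
not from this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark-43106-repro")
  .getOrCreate()

spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1), (2)")

// A UDF that always throws, so the insert job fails only after the old
// files have already been deleted.
spark.udf.register("boom", (i: Int) => {
  if (i >= 0) throw new RuntimeException("boom")
  i
})

try {
  spark.sql("INSERT OVERWRITE TABLE t SELECT boom(CAST(id AS INT)) FROM range(5)")
} catch {
  case e: Exception => println(s"insert failed as expected: ${e.getMessage}")
}

// The original two rows are gone (an empty table, or a missing-path error,
// depending on the version) even though the INSERT failed.
spark.sql("SELECT COUNT(*) FROM t").show()
{code}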

 

Some code pointers on this:

1) See the class _*InsertIntoHadoopFsRelationCommand*_ and its *_run_* method.

2) Inside *_run_*, we first call {_}*deleteMatchingPartitions*{_} (this cleans 
up the table data) and only later call {_}*FileFormatWriter.write*{_} (this 
triggers the actual job), as the toy model below illustrates.
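
A toy model of that ordering (plain Scala, not the actual Spark source) makes 
the failure window explicit:

{code:scala}
// Mirrors the shape of InsertIntoHadoopFsRelationCommand.run: the eager
// delete runs before the job, so a job failure strands the table empty.
object OverwriteOrdering extends App {
  var tableData: Seq[Int] = Seq(1, 2)

  def deleteMatchingPartitions(): Unit = tableData = Nil  // step 1: eager delete
  def runInsertJob(): Seq[Int] =
    throw new RuntimeException("task failed")             // step 2: job fails

  deleteMatchingPartitions()
  try tableData = runInsertJob()
  catch { case e: Exception => println(s"job failed: ${e.getMessage}") }

  println(s"table now holds: $tableData")  // table now holds: List()
}
{code}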







[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-28 Thread kalyan s (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717882#comment-17717882
 ] 

kalyan s commented on SPARK-43106:
--

Thank you for the response, [~dongjoon].

Most of our workloads have been running on 2.4, while we have made good 
progress moving workloads to 3.x this year.

We notice this in a few long-running workloads on static 
partitions/unpartitioned datasets.

While HDFS has been our primary storage backend, moving to object stores on 
GCP has made this problem more pronounced, due to the inherent slowness of 
writing to them.

[~vaibhavb], can you share some test code to help here?







[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717855#comment-17717855
 ] 

Dongjoon Hyun commented on SPARK-43106:
---

Thank you for reporting. To narrow down your issue, let me ask for more 
information, [~itskals].
# Is this specific to Apache Spark 3.3.2? Could you try other Apache 
Spark versions, like Apache Spark 3.4.0 or Apache Spark 3.3.1?
# What storage backend are you using now, HDFS or S3?
# Could you provide us a reproducible example?







[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-27 Thread kalyan s (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717461#comment-17717461
 ] 

kalyan s commented on SPARK-43106:
--

[~cloud_fan], [~dongjoon], [~gurwls223]:

Input from you folks would help us.

I (kalyan) and [~vaibhavb] are from Uber, and we see our workloads getting 
into this problem fairly regularly.

While we are willing to work on it, we would like to know if there were any 
gotchas in taking the idea proposed in SPARK-19183 forward.



