[jira] [Comment Edited] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867688#comment-17867688 ] Anika Kelhanka edited comment on SPARK-44884 at 7/22/24 8:57 AM: -

*Issue:*
* This issue happens specifically when {{partitionOverwriteMode = dynamic}} (Insert Overwrite - [SPARK-20236|https://issues.apache.org/jira/browse/SPARK-20236]).
* The "_SUCCESS" file is created for Spark versions <= 3.0.2, given {{"spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"="true"}}.
* The "_SUCCESS" file is not created for Spark versions > 3.0.2, even when {{"spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"="true"}}.

*Analysis (RCA):*
* In Spark versions prior to 3.0.2, the _SUCCESS marker file is created at the root path when the Spark job succeeds. This is the expected behavior.
* What changed: after the change for [SPARK-29302|https://issues.apache.org/jira/browse/SPARK-29302] (dynamic partition overwrite with speculation enabled) was merged, the _SUCCESS marker file stopped being created at the root location when a Spark job writes in dynamic partition overwrite mode.
* The change for [SPARK-29302|https://issues.apache.org/jira/browse/SPARK-29302] sets {{committerOutputPath = ${stagingDir}}}, which previously held the root directory path, in [this code block|https://github.com/apache/spark/pull/29000/files#diff-15b529afe19e971b138fc604909bcab2e42484babdcea937f41d18cb22d9401dR167-R175].
* The {{committerOutputPath}} parameter is passed on to the Hadoop committer, which creates the _SUCCESS marker file at the path it specifies. Thus, the _SUCCESS marker is now created inside the stagingDir.
* Once the Hadoop committer has finished writing, the Spark commit protocol copies all the data files to the root path, but NOT the _SUCCESS marker, before deleting the ${stagingDir}.
* The stagingDir is then deleted, along with the _SUCCESS marker file.

*Proposed Fix:* The gap in this logic can be mended by adding a step that copies the _SUCCESS file to the final location as well, before the stagingDir is deleted. Also, ensure that when {{"spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"="false"}}, the Hadoop output committers do not create the _SUCCESS marker file in the stagingDir at all. I am working on a fix for the same.
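The proposed fix above can be sketched as follows. This is a minimal, self-contained simulation in plain Scala using java.nio.file, not the actual Spark commit protocol code; the `commit` function and file names are illustrative assumptions, and only the `_SUCCESS` marker name comes from Hadoop:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.jdk.CollectionConverters._

val SuccessMarker = "_SUCCESS"

// Sketch of the commit step with the proposed fix applied: data files are
// moved from staging to the final path (current behaviour), and the
// _SUCCESS marker is carried over too instead of being deleted with staging.
def commit(stagingDir: Path, finalDir: Path): Unit = {
  Files.createDirectories(finalDir)
  val staged = Files.list(stagingDir).iterator().asScala.toList
  // Current behaviour: move only the data files to the final location.
  staged.filterNot(_.getFileName.toString == SuccessMarker).foreach { f =>
    Files.move(f, finalDir.resolve(f.getFileName.toString),
      StandardCopyOption.REPLACE_EXISTING)
  }
  // Proposed fix: also move the _SUCCESS marker, if the committer wrote one.
  val marker = stagingDir.resolve(SuccessMarker)
  if (Files.exists(marker))
    Files.move(marker, finalDir.resolve(SuccessMarker),
      StandardCopyOption.REPLACE_EXISTING)
  // Staging directory is empty now and can be deleted.
  Files.delete(stagingDir)
}
```

With this extra step, the marker survives the staging-directory cleanup and ends up at the table root, matching the pre-3.0.2 behaviour.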
[jira] [Comment Edited] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759065#comment-17759065 ] Dipayan Dev edited comment on SPARK-44884 at 8/25/23 2:22 PM: --

Right, the behaviour is the same in Spark 2 and 3. However, in Spark 2.x, after renaming the temporary subdir, it writes the _SUCCESS file to the root path, but not in Spark 3.x when that param is passed. I see this part of the code ([Hadoop Committer|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L433]) is unchanged in the latest hadoop-mapreduce, so the _partitionOverwriteMode_ option is probably broken somewhere when passed from the latest Spark DataFrameWriter. In Spark 2.x, the _SUCCESS file gets updated every time you do an insert overwrite.
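For context, the committer behaviour referenced above can be paraphrased as below. This is a simplified, self-contained sketch in plain Scala, not the actual Hadoop source; only the `_SUCCESS` file name and the `mapreduce.fileoutputcommitter.marksuccessfuljobs` property it models come from Hadoop, the function itself is illustrative:

```scala
import java.nio.file.{Files, Path}

// Simplified paraphrase of FileOutputCommitter.commitJob's marker step:
// when mapreduce.fileoutputcommitter.marksuccessfuljobs is true, an empty
// _SUCCESS file is created in the committer's output path. If Spark hands
// the committer the staging directory instead of the table root, the marker
// lands in staging and is lost when staging is cleaned up.
def maybeCreateSuccessMarker(outputPath: Path,
                             markSuccessfulJobs: Boolean): Option[Path] =
  if (markSuccessfulJobs) {
    val marker = outputPath.resolve("_SUCCESS")
    Files.createFile(marker) // zero-byte marker file
    Some(marker)
  } else None
```

The key point is that the committer only knows the output path it was given; it has no notion of "the table root" beyond that parameter.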
> Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode
> is dynamic
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Dipayan Dev
> Priority: Critical
> Attachments: image-2023-08-20-18-46-53-342.png, image-2023-08-25-13-01-42-137.png
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0
> (tested with 3.4.1 as well).
> Code to reproduce the issue:
>
> {code:java}
> scala> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", "gs://test_bucket/table").mode("overwrite").partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1")
> {code}
>
> The above code succeeds and creates the external Hive table, but {*}there is no
> SUCCESS file generated{*}.
> Content of the bucket after table creation:
> !image-2023-08-25-13-01-42-137.png|width=500,height=130!
> The same code, when run with Spark 2.4.0 (with or without an external path),
> generates the SUCCESS file.
> {code:java}
> scala> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1")
> {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757052#comment-17757052 ] Dipayan Dev edited comment on SPARK-44884 at 8/25/23 7:34 AM: --

There is no reason to disable this feature in Spark 3.3.0. There can be lots of downstream applications that depend on the _SUCCESS file, and this behaviour change wasn't mentioned anywhere in the release notes. Any workaround for this? [~ste...@apache.org]
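Until a fix lands, one possible workaround (an assumption on my part, not an official recommendation) is to create the marker yourself after the write returns successfully. On GCS or HDFS this would go through the Hadoop FileSystem API, e.g. `fs.create(new Path(tableRoot, "_SUCCESS")).close()` on an `org.apache.hadoop.fs.FileSystem`; the sketch below uses java.nio.file so it stays runnable against a local path:

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical workaround: after df.write...saveAsTable(...) returns without
// throwing, write the zero-byte _SUCCESS marker at the table root ourselves.
// Idempotent: re-running after an insert overwrite leaves the marker in place.
def writeSuccessMarker(tableRoot: String): Path = {
  val root = Paths.get(tableRoot)
  Files.createDirectories(root)
  val marker = root.resolve("_SUCCESS")
  if (!Files.exists(marker)) Files.createFile(marker)
  marker
}
```

Note this only signals that the driver-side call returned, which is weaker than the committer-written marker, so treat it as a stopgap.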