[jira] [Comment Edited] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867688#comment-17867688 ] Anika Kelhanka edited comment on SPARK-44884 at 7/22/24 8:57 AM: -

*Issue:*
* This issue happens specifically when {{partitionOverwriteMode = dynamic}} (Insert Overwrite - [SPARK-20236|https://issues.apache.org/jira/browse/SPARK-20236]).
* The "_SUCCESS" file is created for Spark versions <= 3.0.2, given {{"spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"="true"}}.
* The "_SUCCESS" file is not created for Spark versions > 3.0.2, even when {{"spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"="true"}}.

*Analysis (RCA):*
* In Spark versions prior to 3.0.2, the _SUCCESS marker file is created at the root path when the Spark job succeeds. This is the expected behavior.
* What changed: after the change for [SPARK-29302|https://issues.apache.org/jira/browse/SPARK-29302] (dynamic partition overwrite with speculation enabled) was merged, the _SUCCESS marker file stopped being created at the root location when a Spark job writes in dynamic partition overwrite mode.
* The change for [SPARK-29302|https://issues.apache.org/jira/browse/SPARK-29302] sets {{committerOutputPath = ${stagingDir}}}, which previously held the root directory path, in [this code block|https://github.com/apache/spark/pull/29000/files#diff-15b529afe19e971b138fc604909bcab2e42484babdcea937f41d18cb22d9401dR167-R175].
* The {{committerOutputPath}} parameter is passed on to the Hadoop committer, which creates the _SUCCESS marker file at the path it specifies. Thus, the _SUCCESS marker is now created inside the stagingDir.
* Once the Hadoop committer has finished writing, the Spark commit protocol copies all the data files to the root path, but NOT the _SUCCESS marker, before deleting the ${stagingDir}.
* The stagingDir is then deleted, along with the _SUCCESS marker file.

*Proposed Fix:* The gap in this logic can be mended by adding a step that copies the _SUCCESS file to the final location as well, before the stagingDir is deleted. Also, ensure that when {{"spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"="false"}}, the Hadoop output committers do not create the _SUCCESS marker file in the stagingDir at all. I am working on a fix for the same.
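The proposed fix above can be sketched as follows. This is a minimal, self-contained simulation in plain Scala using java.nio.file, not the actual Spark commit protocol code; the `commit` function and file names are illustrative assumptions, and only the `_SUCCESS` marker name comes from Hadoop:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.jdk.CollectionConverters._

val SuccessMarker = "_SUCCESS"

// Sketch of the commit step with the proposed fix applied: data files are
// moved from staging to the final path (current behaviour), and the
// _SUCCESS marker is carried over too instead of being deleted with staging.
def commit(stagingDir: Path, finalDir: Path): Unit = {
  Files.createDirectories(finalDir)
  val staged = Files.list(stagingDir).iterator().asScala.toList
  // Current behaviour: move only the data files to the final location.
  staged.filterNot(_.getFileName.toString == SuccessMarker).foreach { f =>
    Files.move(f, finalDir.resolve(f.getFileName.toString),
      StandardCopyOption.REPLACE_EXISTING)
  }
  // Proposed fix: also move the _SUCCESS marker, if the committer wrote one.
  val marker = stagingDir.resolve(SuccessMarker)
  if (Files.exists(marker))
    Files.move(marker, finalDir.resolve(SuccessMarker),
      StandardCopyOption.REPLACE_EXISTING)
  // Staging directory is empty now and can be deleted.
  Files.delete(stagingDir)
}
```

With this extra step, the marker survives the staging-directory cleanup and ends up at the table root, matching the pre-3.0.2 behaviour.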
[jira] [Comment Edited] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759065#comment-17759065 ] Dipayan Dev edited comment on SPARK-44884 at 8/25/23 2:22 PM: --

Right, the behaviour is the same in Spark 2 and 3. However, in Spark 2.x, after renaming the temporary subdir, it writes the _SUCCESS file to the root path, but not in Spark 3.x when that param is passed. I see this part of the code ([Hadoop Committer|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L433]) is unchanged in the latest hadoop-mapreduce, so the _partitionOverwriteMode_ option is probably broken somewhere when passed from the latest Spark DataFrameWriter. In Spark 2.x, the _SUCCESS file gets updated every time you do an insert overwrite.
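For context, the committer behaviour referenced above can be paraphrased as below. This is a simplified, self-contained sketch in plain Scala, not the actual Hadoop source; only the `_SUCCESS` file name and the `mapreduce.fileoutputcommitter.marksuccessfuljobs` property it models come from Hadoop, the function itself is illustrative:

```scala
import java.nio.file.{Files, Path}

// Simplified paraphrase of FileOutputCommitter.commitJob's marker step:
// when mapreduce.fileoutputcommitter.marksuccessfuljobs is true, an empty
// _SUCCESS file is created in the committer's output path. If Spark hands
// the committer the staging directory instead of the table root, the marker
// lands in staging and is lost when staging is cleaned up.
def maybeCreateSuccessMarker(outputPath: Path,
                             markSuccessfulJobs: Boolean): Option[Path] =
  if (markSuccessfulJobs) {
    val marker = outputPath.resolve("_SUCCESS")
    Files.createFile(marker) // zero-byte marker file
    Some(marker)
  } else None
```

The key point is that the committer only knows the output path it was given; it has no notion of "the table root" beyond that parameter.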
> Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode
> is dynamic
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Dipayan Dev
> Priority: Critical
> Attachments: image-2023-08-20-18-46-53-342.png, image-2023-08-25-13-01-42-137.png
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0
> (tested with 3.4.1 as well).
> Code to reproduce the issue:
>
> {code:java}
> scala> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", "gs://test_bucket/table").mode("overwrite").partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1")
> {code}
>
> The above code succeeds and creates the external Hive table, but {*}there is no
> SUCCESS file generated{*}.
> Content of the bucket after table creation:
> !image-2023-08-25-13-01-42-137.png|width=500,height=130!
> The same code, when run with Spark 2.4.0 (with or without an external path),
> generates the SUCCESS file.
> {code:java}
> scala> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1")
> {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757052#comment-17757052 ] Dipayan Dev edited comment on SPARK-44884 at 8/25/23 7:34 AM: --

There is no reason to disable this feature in Spark 3.3.0. There can be lots of downstream applications that depend on the _SUCCESS file, and this behaviour change wasn't mentioned anywhere in the release notes. Any workaround for this? [~ste...@apache.org]
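Until a fix lands, one possible workaround (an assumption on my part, not an official recommendation) is to create the marker yourself after the write returns successfully. On GCS or HDFS this would go through the Hadoop FileSystem API, e.g. `fs.create(new Path(tableRoot, "_SUCCESS")).close()` on an `org.apache.hadoop.fs.FileSystem`; the sketch below uses java.nio.file so it stays runnable against a local path:

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical workaround: after df.write...saveAsTable(...) returns without
// throwing, write the zero-byte _SUCCESS marker at the table root ourselves.
// Idempotent: re-running after an insert overwrite leaves the marker in place.
def writeSuccessMarker(tableRoot: String): Path = {
  val root = Paths.get(tableRoot)
  Files.createDirectories(root)
  val marker = root.resolve("_SUCCESS")
  if (!Files.exists(marker)) Files.createFile(marker)
  marker
}
```

Note this only signals that the driver-side call returned, which is weaker than the committer-written marker, so treat it as a stopgap.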