[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-30602: - Description: 白月山火禾In a large deployment of a Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When doing Spark on YARN for a large-scale deployment, people usually enable Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically compared to the size of shuffled data (# mappers and reducers grows linearly with the size of shuffled data, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amount of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as we see an increasing # of Spark applications processing an increasing amount of data. In addition, because Spark external shuffle service is a shared service in a multi-tenancy cluster, the inefficiency with one Spark application could propagate to other applications as well. In this ticket, we propose a solution to improve Spark shuffle efficiency in above mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of mappers and blocks get pre-merged and move towards reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., extending Spark’s existing shuffle netty protocol, and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle in Spark without incurring the dependency or overhead of either specialized storage layer or external infrastructure pieces. Link to dev mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html was: In a large deployment of a Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When doing Spark on YARN for a large-scale deployment, people usually enable Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically compared to the size of shuffled data (# mappers and reducers grows linearly with the size of shuffled data, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amount of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as we see an increasing # of Spark applications processing an increasing amount of data. In addition, because Spark external shuffle service is a shared service in a multi-tenancy cluster, the inefficiency with one Spark application could propagate to other applications as well. 
In this ticket, we propose a solution to improve Spark shuffle efficiency in above mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of mappers and blocks get pre-merged and move towards reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., extending Spark’s existing shuffle netty protocol, and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle in Spark without incurring the dependency or overhead of either specialized storage layer or external infrastructure pieces. Link to dev mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-
[jira] [Resolved] (SPARK-35083) Support remote scheduler pool file
[ https://issues.apache.org/jira/browse/SPARK-35083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35083. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32184 [https://github.com/apache/spark/pull/32184] > Support remote scheduler pool file > -- > > Key: SPARK-35083 > URL: https://issues.apache.org/jira/browse/SPARK-35083 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > Make `spark.scheduler.allocation.file` support remote files. When using Spark > as a server (e.g. SparkThriftServer), it's hard for users to specify a local path > as the scheduler pool file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35083) Support remote scheduler pool file
[ https://issues.apache.org/jira/browse/SPARK-35083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35083: - Assignee: ulysses you > Support remote scheduler pool file > -- > > Key: SPARK-35083 > URL: https://issues.apache.org/jira/browse/SPARK-35083 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > Make `spark.scheduler.allocation.file` support remote files. When using Spark > as a server (e.g. SparkThriftServer), it's hard for users to specify a local path > as the scheduler pool file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
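Editor's note: a minimal sketch of what the requested behaviour would look like once the change above is in, with the pool file referenced by a remote URI instead of a local path. The HDFS location and the "etl" pool name are hypothetical, purely for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: point the fair scheduler at a pool file stored on HDFS
// (hypothetical path) instead of a file on the server's local disk.
val spark = SparkSession.builder()
  .appName("remote-fair-scheduler-pools")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "hdfs://namenode:8020/spark/conf/fairscheduler.xml")
  .getOrCreate()

// Jobs submitted from this thread go to the (hypothetical) "etl" pool
// defined in the remote XML file.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
{code}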
[jira] [Updated] (SPARK-34494) move JSON data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34494: Summary: move JSON data source options from Python and Scala into a single page. (was: Move other data source options than Parquet from Python and Scala into a single page.) > move JSON data source options from Python and Scala into a single page. > --- > > Key: SPARK-34494 > URL: https://issues.apache.org/jira/browse/SPARK-34494 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34494: Summary: Move JSON data source options from Python and Scala into a single page. (was: move JSON data source options from Python and Scala into a single page.) > Move JSON data source options from Python and Scala into a single page. > --- > > Key: SPARK-34494 > URL: https://issues.apache.org/jira/browse/SPARK-34494 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
[ https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35104. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32203 [https://github.com/apache/spark/pull/32203] > Fix ugly indentation of multiple JSON records in a single split file > generated by JacksonGenerator when pretty option is true > - > > Key: SPARK-35104 > URL: https://issues.apache.org/jira/browse/SPARK-35104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > When writing multiple JSON records into a single split file with pretty > option true, indentation will be broken except for the first JSON record. > {code:java} > // Run in the Spark Shell. > // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master. > // Or set spark.default.parallelism for the previous releases. > spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1) > val df = Seq("a", "b", "c").toDF > df.write.option("pretty", "true").json("/path/to/output") > # Run in a Shell > $ cat /path/to/output/*.json > { > "value" : "a" > } > { > "value" : "b" > } > { > "value" : "c" > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34494: Assignee: (was: Apache Spark) > Move JSON data source options from Python and Scala into a single page. > --- > > Key: SPARK-34494 > URL: https://issues.apache.org/jira/browse/SPARK-34494 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322718#comment-17322718 ] Apache Spark commented on SPARK-34494: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/32204 > Move JSON data source options from Python and Scala into a single page. > --- > > Key: SPARK-34494 > URL: https://issues.apache.org/jira/browse/SPARK-34494 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34494: Assignee: Apache Spark > Move JSON data source options from Python and Scala into a single page. > --- > > Key: SPARK-34494 > URL: https://issues.apache.org/jira/browse/SPARK-34494 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34995: Assignee: Haejoon Lee > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34995. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32197 [https://github.com/apache/spark/pull/32197] > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.2.0 > > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands
Kousuke Saruta created SPARK-35105: -- Summary: Support multiple paths for ADD FILE/JAR/ARCHIVE commands Key: SPARK-35105 URL: https://issues.apache.org/jira/browse/SPARK-35105 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path arguments. It would be great if those commands could take multiple paths, as Hive does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
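Editor's note: a hedged sketch of what the requested syntax could look like, assuming it mirrors Hive's space-separated form. Today each command accepts a single path, and all file names below are hypothetical.

{code:scala}
// Proposed (not yet supported) usage sketch: multiple paths per command,
// mirroring Hive. All paths here are hypothetical.
spark.sql("ADD JAR /tmp/libA.jar /tmp/libB.jar")
spark.sql("ADD FILE /tmp/app.conf /tmp/lookup.csv")
spark.sql("ADD ARCHIVE /tmp/deps1.zip /tmp/deps2.zip")
{code}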
[jira] [Commented] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands
[ https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323618#comment-17323618 ] Apache Spark commented on SPARK-35105: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32205 > Support multiple paths for ADD FILE/JAR/ARCHIVE commands > > > Key: SPARK-35105 > URL: https://issues.apache.org/jira/browse/SPARK-35105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path > arguments. > It's great if those commands can take multiple paths like Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands
[ https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35105: Assignee: Apache Spark (was: Kousuke Saruta) > Support multiple paths for ADD FILE/JAR/ARCHIVE commands > > > Key: SPARK-35105 > URL: https://issues.apache.org/jira/browse/SPARK-35105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path > arguments. > It's great if those commands can take multiple paths like Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands
[ https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35105: Assignee: Kousuke Saruta (was: Apache Spark) > Support multiple paths for ADD FILE/JAR/ARCHIVE commands > > > Key: SPARK-35105 > URL: https://issues.apache.org/jira/browse/SPARK-35105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path > arguments. > It's great if those commands can take multiple paths like Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-30602: - Description: In a large deployment of a Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When doing Spark on YARN for a large-scale deployment, people usually enable Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically compared to the size of shuffled data (# mappers and reducers grows linearly with the size of shuffled data, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amount of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as we see an increasing # of Spark applications processing an increasing amount of data. In addition, because Spark external shuffle service is a shared service in a multi-tenancy cluster, the inefficiency with one Spark application could propagate to other applications as well. In this ticket, we propose a solution to improve Spark shuffle efficiency in above mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of mappers and blocks get pre-merged and move towards reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., extending Spark’s existing shuffle netty protocol, and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle in Spark without incurring the dependency or overhead of either specialized storage layer or external infrastructure pieces. Link to dev mailing list discussion: [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] was: 白月山火禾In a large deployment of a Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When doing Spark on YARN for a large-scale deployment, people usually enable Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically compared to the size of shuffled data (# mappers and reducers grows linearly with the size of shuffled data, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amount of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as we see an increasing # of Spark applications processing an increasing amount of data. In addition, because Spark external shuffle service is a shared service in a multi-tenancy cluster, the inefficiency with one Spark application could propagate to other applications as well. 
In this ticket, we propose a solution to improve Spark shuffle efficiency in above mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of mappers and blocks get pre-merged and move towards reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., extending Spark’s existing shuffle netty protocol, and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle in Spark without incurring the dependency or overhead of either specialized storage layer or external infrastructure pieces. Link to dev mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Labels: release-notes > Attachments: Screen Shot 202
[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323773#comment-17323773 ] Ste Millington commented on SPARK-19842: Is there any progress on this, has anything landed in Spark 3? Feels like there is potential for some massive query speedups here if hints are allowed to specify unique and/or referential constraints. > Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Major > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. > The attached document covers constraint definition, metastore storage, > constraint validation, and maintenance. The document shows many examples of > query performance improvements that utilize referential integrity constraints > and can be implemented in Spark. > Link to the google doc: > [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
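Editor's note: to make the optimization opportunity concrete, here is a hypothetical star-schema query of the kind the attached document targets. The table and column names are invented for illustration.

{code:scala}
// Hypothetical tables, for illustration only. With an informational
// constraint saying every fact_sales.store_id is a non-null foreign key
// referencing the unique dim_store.id, the join below neither adds nor
// drops rows and no dim_store column is referenced, so an optimizer that
// trusts the constraint could eliminate the join entirely.
spark.sql("""
  SELECT f.product_id, SUM(f.amount) AS total
  FROM fact_sales f
  JOIN dim_store s ON f.store_id = s.id
  GROUP BY f.product_id
""")
{code}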
[jira] [Created] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
Erik Krogen created SPARK-35106: --- Summary: HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used Key: SPARK-35106 URL: https://issues.apache.org/jira/browse/SPARK-35106 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 3.1.1 Reporter: Erik Krogen Recently when evaluating the code in {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under the {{dynamicPartitionOverwrite == true}} scenario: {code:language=scala} # BLOCK 1 if (dynamicPartitionOverwrite) { val absPartitionPaths = filesToMove.values.map(new Path(_).getParent).toSet logDebug(s"Clean up absolute partition directories for overwriting: $absPartitionPaths") absPartitionPaths.foreach(fs.delete(_, true)) } # BLOCK 2 for ((src, dst) <- filesToMove) { fs.rename(new Path(src), new Path(dst)) } # BLOCK 3 if (dynamicPartitionOverwrite) { val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) logDebug(s"Clean up default partition directories for overwriting: $partitionPaths") for (part <- partitionPaths) { val finalPartPath = new Path(path, part) if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) { // According to the official hadoop FileSystem API spec, delete op should assume // the destination is no longer present regardless of return value, thus we do not // need to double check if finalPartPath exists before rename. // Also in our case, based on the spec, delete returns false only when finalPartPath // does not exist. When this happens, we need to take action if parent of finalPartPath // also does not exist(e.g. the scenario described on SPARK-23815), because // FileSystem API spec on rename op says the rename dest(finalPartPath) must have // a parent that exists, otherwise we may get unexpected result on the rename. fs.mkdirs(finalPartPath.getParent) } fs.rename(new Path(stagingDir, part), finalPartPath) } } {code} Assuming {{dynamicPartitionOverwrite == true}}, we have the following sequence of events: # Block 1 deletes all parent directories of {{filesToMove.values}} # Block 2 attempts to rename all {{filesToMove.keys}} to {{filesToMove.values}} # Block 3 does directory-level renames to place files into their final locations All renames in Block 2 will always fail, since all parent directories of {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS scenario, the contract of {{fs.rename}} is to return {{false}} under such a failure scenario, as opposed to throwing an exception. There is a separate issue here that Block 2 should probably be checking for those {{false}} return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", albeit with a bunch of failed renames in the middle. Really, we should only run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and consolidate Blocks 1 and 3 to run in the {{true}} case. We discovered this issue when testing against a {{FileSystem}} implementation which was throwing an exception for this failed rename scenario instead of returning false, escalating the silent/ignored rename failures into actual failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
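Editor's note: a hedged, self-contained sketch of the restructuring proposed above, not the actual patch. The method name and signature are invented, and the identifiers mirror the snippet in the description: the per-file renames run only in the non-overwrite case and surface {{false}} return values, while the overwrite case relies solely on the directory-level delete-and-rename.

{code:scala}
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper illustrating the suggested split: Block 2 only runs
// when dynamicPartitionOverwrite is false, and failed renames are no longer
// silently ignored.
def moveTaskOutputs(
    fs: FileSystem,
    filesToMove: Map[String, String],
    dynamicPartitionOverwrite: Boolean): Unit = {
  if (!dynamicPartitionOverwrite) {
    for ((src, dst) <- filesToMove) {
      if (!fs.rename(new Path(src), new Path(dst))) {
        throw new IOException(s"Failed to rename $src to $dst when committing files")
      }
    }
  }
  // When dynamicPartitionOverwrite is true, whole partition directories are
  // moved from the staging dir into place (Block 3 in the description), so
  // no per-file renames are needed here.
}
{code}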
[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35106: Assignee: Apache Spark > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Apache Spark >Priority: Major > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35106: Assignee: (was: Apache Spark) > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323924#comment-17323924 ] Apache Spark commented on SPARK-35106: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/32207 > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. 
> We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime
[ https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323943#comment-17323943 ] Xiao Li commented on SPARK-35097: - Can you help fix this, [~angerszhuuu] ? > Add column name to SparkUpgradeException about ancient datetime > --- > > Key: SPARK-35097 > URL: https://issues.apache.org/jira/browse/SPARK-35097 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > The error message: > {code:java} > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps > before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files > may be written by Spark 2.x or legacy versions of Hive, which uses a legacy > hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian > calendar. See more details in SPARK-31404. You can set > spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the > datetime values w.r.t. the calendar difference during reading. Or set > spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the > datetime values as it is. > {code} > doesn't have any clues of which column causes the issue. Need to improve the > message and add column name to it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
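Editor's note: for anyone hitting this before the message is improved, the two workarounds named in the quoted error text look like this from a Spark session. The config name is taken verbatim from the message; pick one mode or the other.

{code:scala}
// Rebase ancient dates/timestamps written by Spark 2.x or legacy Hive:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

// ...or read the stored values as-is, without rebasing:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
{code}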
[jira] [Commented] (SPARK-35103) Improve type coercion rule performance
[ https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323984#comment-17323984 ] Apache Spark commented on SPARK-35103: -- User 'sigmod' has created a pull request for this issue: https://github.com/apache/spark/pull/32208 > Improve type coercion rule performance > -- > > Key: SPARK-35103 > URL: https://issues.apache.org/jira/browse/SPARK-35103 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Priority: Major > > Reduce the time spent on type coercion rules by running them together > one-tree-node-at-a-time in a combined rule. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35103) Improve type coercion rule performance
[ https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35103: Assignee: Apache Spark > Improve type coercion rule performance > -- > > Key: SPARK-35103 > URL: https://issues.apache.org/jira/browse/SPARK-35103 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Assignee: Apache Spark >Priority: Major > > Reduce the time spent on type coercion rules by running them together > one-tree-node-at-a-time in a combined rule. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35103) Improve type coercion rule performance
[ https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35103: Assignee: (was: Apache Spark) > Improve type coercion rule performance > -- > > Key: SPARK-35103 > URL: https://issues.apache.org/jira/browse/SPARK-35103 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Priority: Major > > Reduce the time spent on type coercion rules by running them together > one-tree-node-at-a-time in a combined rule. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35103) Improve type coercion rule performance
[ https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323986#comment-17323986 ] Apache Spark commented on SPARK-35103: -- User 'sigmod' has created a pull request for this issue: https://github.com/apache/spark/pull/32208 > Improve type coercion rule performance > -- > > Key: SPARK-35103 > URL: https://issues.apache.org/jira/browse/SPARK-35103 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yingyi Bu >Priority: Major > > Reduce the time spent on type coercion rules by running them together > one-tree-node-at-a-time in a combined rule. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime
[ https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323998#comment-17323998 ] angerszhu commented on SPARK-35097: --- [~smilegator] Sure, working on this. > Add column name to SparkUpgradeException about ancient datetime > --- > > Key: SPARK-35097 > URL: https://issues.apache.org/jira/browse/SPARK-35097 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > The error message: > {code:java} > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps > before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files > may be written by Spark 2.x or legacy versions of Hive, which uses a legacy > hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian > calendar. See more details in SPARK-31404. You can set > spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the > datetime values w.r.t. the calendar difference during reading. Or set > spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the > datetime values as it is. > {code} > doesn't have any clues of which column causes the issue. Need to improve the > message and add column name to it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals
Max Gekk created SPARK-35107: Summary: Parse unit-to-unit interval literals to ANSI intervals Key: SPARK-35107 URL: https://issues.apache.org/jira/browse/SPARK-35107 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either YearMonthIntervalType or DayTimeIntervalType by default. But when spark.sql.legacy.interval.enabled is set to true, parse them to CalendarInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
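Editor's note: illustrative only, based on the description above — how the new parsing could be observed from the Spark shell once this lands. The resulting types and the effect of the legacy flag are as described in the ticket, not verified here.

{code:scala}
// With the default settings, a unit-to-unit literal should parse to one of
// the new ANSI interval types (year-month in this case).
spark.sql("SELECT INTERVAL '1-1' YEAR TO MONTH").printSchema()

// With the legacy flag on, the same literal should come back as the old
// CalendarInterval type.
spark.sql("SET spark.sql.legacy.interval.enabled=true")
spark.sql("SELECT INTERVAL '1-1' YEAR TO MONTH").printSchema()
{code}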
[jira] [Commented] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-35107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324034#comment-17324034 ] Apache Spark commented on SPARK-35107: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/32209 > Parse unit-to-unit interval literals to ANSI intervals > -- > > Key: SPARK-35107 > URL: https://issues.apache.org/jira/browse/SPARK-35107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either > YearMonthIntervalType or DayTimeIntervalType by default. But when > spark.sql.legacy.interval.enabled is set to true, parse them to > CalendarInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-35107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35107: Assignee: Apache Spark (was: Max Gekk) > Parse unit-to-unit interval literals to ANSI intervals > -- > > Key: SPARK-35107 > URL: https://issues.apache.org/jira/browse/SPARK-35107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either > YearMonthIntervalType or DayTimeIntervalType by default. But when > spark.sql.legacy.interval.enabled is set to true, parse them to > CalendarInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-35107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35107: Assignee: Max Gekk (was: Apache Spark) > Parse unit-to-unit interval literals to ANSI intervals > -- > > Key: SPARK-35107 > URL: https://issues.apache.org/jira/browse/SPARK-35107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either > YearMonthIntervalType or DayTimeIntervalType by default. But when > spark.sql.legacy.interval.enabled is set to true, parse them to > CalendarInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)
Robert Joseph Evans created SPARK-35108: --- Summary: Pickle produces incorrect key labels for GenericRowWithSchema (data corruption) Key: SPARK-35108 URL: https://issues.apache.org/jira/browse/SPARK-35108 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2, 3.0.1 Reporter: Robert Joseph Evans Attachments: test.py, test.sh I think this also shows up for all versions of Spark that pickle the data when doing a collect from Python. When you do a collect in Python, the JVM will do a collect and convert the UnsafeRows into GenericRowWithSchema instances before it sends them to the Pickler. The Pickler, by default, will try to dedupe objects using hashCode and .equals for the object, but .equals and .hashCode for GenericRowWithSchema only look at the data, not the schema, even though the keys from the schema are written out when the row is pickled. This can result in a kind of data corruption in a few cases where a row has the same number of elements as a struct within the row does, or as a sub-struct within another struct. If the data happens to be the same, the keys for the resulting row or struct can be wrong. My repro case is a bit convoluted, but it does happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)
[ https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-35108: Attachment: test.sh test.py > Pickle produces incorrect key labels for GenericRowWithSchema (data > corruption) > --- > > Key: SPARK-35108 > URL: https://issues.apache.org/jira/browse/SPARK-35108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.0.2 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: test.py, test.sh > > > I think this also shows up for all versions of Spark that pickle the data > when doing a collect from python. > When you do a collect in python java will do a collect and convert the > UnsafeRows into GenericRowWithSchema instances before it sends them to the > Pickler. The Pickler, by default, will try to dedupe objects using hashCode > and .equals for the object. But .equals and .hashCode for > GenericRowWithSchema only looks at the data, not the schema. But when we > pickle the row the keys from the schema are written out. > This can result in data corruption, sort of, in a few cases where a row has > the same number of elements as a struct within the row does, or a sub-struct > within another struct. > If the data happens to be the same, the keys for the resulting row or struct > can be wrong. > My repro case is a bit convoluted, but it does happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)
[ https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324088#comment-17324088 ] Robert Joseph Evans commented on SPARK-35108: - If you have SPARK_HOME set when you run test.sh on a system with 6 cores or more it should reproduce the issue. I was able to mitigate the issue by adding .equals and .hashCode to GenericRowWithSchema so it took into account the schema. But we could also try to turn off the dedupe or value compare dedupe (Pickler has options to disable these things). I am not sure what the proper fix for this would be because the code for all of these is shared with other code paths. > Pickle produces incorrect key labels for GenericRowWithSchema (data > corruption) > --- > > Key: SPARK-35108 > URL: https://issues.apache.org/jira/browse/SPARK-35108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.0.2 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: test.py, test.sh > > > I think this also shows up for all versions of Spark that pickle the data > when doing a collect from python. > When you do a collect in python java will do a collect and convert the > UnsafeRows into GenericRowWithSchema instances before it sends them to the > Pickler. The Pickler, by default, will try to dedupe objects using hashCode > and .equals for the object. But .equals and .hashCode for > GenericRowWithSchema only looks at the data, not the schema. But when we > pickle the row the keys from the schema are written out. > This can result in data corruption, sort of, in a few cases where a row has > the same number of elements as a struct within the row does, or a sub-struct > within another struct. > If the data happens to be the same, the keys for the resulting row or struct > can be wrong. > My repro case is a bit convoluted, but it does happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
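Editor's note: the dedupe behaviour described above is easy to see from a Spark shell — two {{GenericRowWithSchema}} instances that hold the same values but carry different schemas compare as equal and hash identically, which is what lets the Pickler's memoization reuse the wrong field names. A minimal demonstration (column names invented):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schemaA = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
val schemaB = StructType(Seq(StructField("x", IntegerType), StructField("y", IntegerType)))

val rowA = new GenericRowWithSchema(Array[Any](1, 2), schemaA)
val rowB = new GenericRowWithSchema(Array[Any](1, 2), schemaB)

// Both print true: equality and hashing only consider the values, so a
// value-based dedupe cannot tell the two schemas apart.
println(rowA == rowB)
println(rowA.hashCode == rowB.hashCode)
{code}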
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324131#comment-17324131 ] Shane Knapp commented on SPARK-34738: - alright, my canary build w/skipping the PV integration test passed w/the docker driver:https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone/20/ i'll put together a PR for this over the weekend (it's a one-liner) and once we merge i can get the remaining workers upgraded early next week. > Upgrade Minikube and kubernetes and move to docker 'virtualization' layer > - > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > Attachments: integration-tests.log > > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. > > Added by Shane: > we also need to move from the kvm2 virtualization layer to docker. docker is > a recommended driver w/the latest versions of minikube, and this will allow > devs to more easily recreate the minikube/k8s env on their local workstations > and run the integration tests in an identical environment as jenkins. > the TL;DR is that upgrading to docker works, except that the PV integration > tests are failing due to a couple of possible reasons: > 1) the 'spark-kubernetes-driver' isn't properly being loaded > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517) > 2) during the PV test run, the error message 'Given path > (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in > the logs. however, the mk cluster *does* mount successfully to the local > bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount > and read/write successfully to it > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548) > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324165#comment-17324165 ] Attila Zsolt Piros commented on SPARK-34738: @shane I can modify my PR to skip the PV tests by simply removing the k8s label from them: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala#L125 and https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L567. The tests are selected via that label; see the *-Dtest.include.tags=k8s* in the mvn command: {{+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn install -f /home/jenkins/workspace/SparkPullRequestBuilder-K8s/pom.xml -pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.2 -Pkubernetes -Pkubernetes-integration-tests -Djava.version=8 -Dspark.kubernetes.test.sparkTgz=/home/jenkins/workspace/SparkPullRequestBuilder-K8s/spark-3.2.0-SNAPSHOT-bin-20210408-a823d0e7c36.tgz -Dspark.kubernetes.test.imageTag=N/A -Dspark.kubernetes.test.imageRepo=docker.io/kubespark -Dspark.kubernetes.test.deployMode=minikube -Dtest.include.tags=k8s -Dspark.kubernetes.test.jvmImage=spark -Dspark.kubernetes.test.pythonImage=spark-py -Dspark.kubernetes.test.rImage=spark-r -Dlog4j.logger.org.apache.spark=DEBUG}} I can use a different, brand-new label. This way, if you can make the PR build controlled by text in the PR title (like it is done for the Maven version), we can switch the test on only for those PRs where we try to fix the problem. Regarding: > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. I am happy to help; I just need access to the logs. As I said earlier, it's probably enough if the PR builder already uses Docker for testing and the PV tests are running. That is what I was referring to: > This way I can run experiments via the PR builder and see the logs with my > own eyes. That way, in my PR (or the next one), I can temporarily log out what I need. But to be on the safe side, you can give me that ssh access too. The only problem is time: I can do the analysis part only at the end of next week. > Upgrade Minikube and kubernetes and move to docker 'virtualization' layer > - > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > Attachments: integration-tests.log > > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. > > Added by Shane: > we also need to move from the kvm2 virtualization layer to docker.
docker is > a recommended driver w/the latest versions of minikube, and this will allow > devs to more easily recreate the minikube/k8s env on their local workstations > and run the integration tests in an identical environment as jenkins. > the TL;DR is that upgrading to docker works, except that the PV integration > tests are failing due to a couple of possible reasons: > 1) the 'spark-kubernetes-driver' isn't properly being loaded > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517) > 2) during the PV test run, the error message 'Given path > (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in > the logs. however, the mk cluster *does* mount successfully to the local > bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount > and read/write successfully to it > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548) > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324165#comment-17324165 ] Attila Zsolt Piros edited comment on SPARK-34738 at 4/17/21, 3:56 AM: -- @shane I can modify my PR to skip the PV tests by simply removing the k8s label from them: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala#L125 and https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L567. The tests are selected via that label; see the *-Dtest.include.tags=k8s* in the mvn command: {{+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn install -f /home/jenkins/workspace/SparkPullRequestBuilder-K8s/pom.xml -pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.2 -Pkubernetes -Pkubernetes-integration-tests -Djava.version=8 -Dspark.kubernetes.test.sparkTgz=/home/jenkins/workspace/SparkPullRequestBuilder-K8s/spark-3.2.0-SNAPSHOT-bin-20210408-a823d0e7c36.tgz -Dspark.kubernetes.test.imageTag=N/A -Dspark.kubernetes.test.imageRepo=docker.io/kubespark -Dspark.kubernetes.test.deployMode=minikube -Dtest.include.tags=k8s -Dspark.kubernetes.test.jvmImage=spark -Dspark.kubernetes.test.pythonImage=spark-py -Dspark.kubernetes.test.rImage=spark-r -Dlog4j.logger.org.apache.spark=DEBUG}} I can use a different, brand-new label. This way, if you can make the PR build controlled by text in the PR title (like it is done for the Maven version), we can switch the test on only for those PRs where we try to fix the problem. I mean the section "Running Different Test Permutations on Jenkins" at https://spark.apache.org/developer-tools.html. Regarding: > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. I am happy to help; I just need access to the logs. As I said earlier, it's probably enough if the PR builder already uses Docker for testing and the PV tests are running. That is what I was referring to: > This way I can run experiments via the PR builder and see the logs with my > own eyes. That way, in my PR (or the next one), I can temporarily log out what I need. But to be on the safe side, you can give me that ssh access too. The only problem is time: I can do the analysis part only at the end of next week. 
was (Author: attilapiros): @shane I can modify my PR to skip the PV tests by just simply remove the k8s label from it https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala#L125 and https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L567 As the tests are selected via that label, see the *-Dtest.include.tags=k8s* in the mvn command: {{+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn install -f /home/jenkins/workspace/SparkPullRequestBuilder-K8s/pom.xml -pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.2 -Pkubernetes -Pkubernetes-integration-tests -Djava.version=8 -Dspark.kubernetes.test.sparkTgz=/home/jenkins/workspace/SparkPullRequestBuilder-K8s/spark-3.2.0-SNAPSHOT-bin-20210408-a823d0e7c36.tgz -Dspark.kubernetes.test.imageTag=N/A -Dspark.kubernetes.test.imageRepo=docker.io/kubespark -Dspark.kubernetes.test.deployMode=minikube -Dtest.include.tags=k8s -Dspark.kubernetes.test.jvmImage=spark -Dspark.kubernetes.test.pythonImage=spark-py -Dspark.kubernetes.test.rImage=spark-r -Dlog4j.logger.org.apache.spark=DEBUG}} I can use a different brand new label. This way if you can make the PR building controlled by a text in the the PR title like it is done by maven version we can switch on the test only for those PRs where we try to fix the problem. Regarding: > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. I am happy to help I just need access to logs. As I told earlier it's probably enough if the PR builder already uses Docker for testing and PV tests are running. That was I referred to: > This way I can run experiments via the PR builder and see the logs with my > own eyes. That way in my PR (or the next one) I can temporary can log out what I need. But to be the safe side you can give me that ssh access too. The only problem is time I can do the analizes part only at the end of next week. > Upgrade Minikube and kubernetes and move to docker 'virtualization' layer > ---
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324167#comment-17324167 ] Attila Zsolt Piros commented on SPARK-34738: My new commit: https://github.com/apache/spark/pull/31829/commits/1f594aa40061da96545d29d28ca2b2a410655c34 So the new tag is "persistentVolume". This PR on its own should switch off the "PVs with local storage" unit test, but to be on the safe side, please run a test with it. > Upgrade Minikube and kubernetes and move to docker 'virtualization' layer > - > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > Attachments: integration-tests.log > > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. > > Added by Shane: > we also need to move from the kvm2 virtualization layer to docker. docker is > a recommended driver w/the latest versions of minikube, and this will allow > devs to more easily recreate the minikube/k8s env on their local workstations > and run the integration tests in an identical environment as jenkins. > the TL;DR is that upgrading to docker works, except that the PV integration > tests are failing due to a couple of possible reasons: > 1) the 'spark-kubernetes-driver' isn't properly being loaded > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517) > 2) during the PV test run, the error message 'Given path > (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in > the logs. however, the mk cluster *does* mount successfully to the local > bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount > and read/write successfully to it > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548) > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324167#comment-17324167 ] Attila Zsolt Piros edited comment on SPARK-34738 at 4/17/21, 4:27 AM: -- My new commit: https://github.com/apache/spark/pull/31829/commits/1f594aa40061da96545d29d28ca2b2a410655c34 So the new tag is "persistentVolume". This PR on its own should switch off the "PVs with local storage" unit test, but to be on the safe side, please run a test with it. Regarding controlling the build by the PR title, we have to set one flag for the dev-run-integration-tests.sh. It is the "--include-tags"; see: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L88-L91 was (Author: attilapiros): My new commit: https://github.com/apache/spark/pull/31829/commits/1f594aa40061da96545d29d28ca2b2a410655c34 So the new tag is "persistentVolume". This PR on its own should switch off the "PVs with local storage" unit test, but to be the safe side please run a test with it. > Upgrade Minikube and kubernetes and move to docker 'virtualization' layer > - > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > Attachments: integration-tests.log > > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. > > Added by Shane: > we also need to move from the kvm2 virtualization layer to docker. docker is > a recommended driver w/the latest versions of minikube, and this will allow > devs to more easily recreate the minikube/k8s env on their local workstations > and run the integration tests in an identical environment as jenkins. > the TL;DR is that upgrading to docker works, except that the PV integration > tests are failing due to a couple of possible reasons: > 1) the 'spark-kubernetes-driver' isn't properly being loaded > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517) > 2) during the PV test run, the error message 'Given path > (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in > the logs. however, the mk cluster *does* mount successfully to the local > bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount > and read/write successfully to it > (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548) > i could really use some help, and if it's useful, i can create some local > accounts manually and allow ssh access for a couple of people to assist me. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
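To make the tagging mechanism discussed in the two comments above concrete, here is a hedged ScalaTest sketch (the suite, object, and test names are invented, not taken from the PR) of how a test that carries only a dedicated "persistentVolume" tag drops out of runs driven by -Dtest.include.tags=k8s and comes back only when that tag is added to the include tags (for example via --include-tags in dev-run-integration-tests.sh):

{code:scala}
// Illustrative sketch of an include-tag based opt-in: the test below carries only
// the "persistentVolume" tag, so a build that selects tests with
// -Dtest.include.tags=k8s skips it, and adding "persistentVolume" to the include
// tags brings it back.
import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

object PersistentVolumeTag extends Tag("persistentVolume")

class PvTaggedSuiteSketch extends AnyFunSuite {
  test("PVs with local storage (sketch)", PersistentVolumeTag) {
    // Placeholder body; the real suite mounts a local PV and round-trips a file.
    assert(true)
  }
}
{code}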
[jira] [Commented] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324173#comment-17324173 ] Apache Spark commented on SPARK-32634: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/32210 > Introduce sort-based fallback mechanism for shuffled hash join > --- > > Key: SPARK-32634 > URL: https://issues.apache.org/jira/browse/SPARK-32634 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > A major pain point for spark users to stay away from using shuffled hash join > is out of memory issue. Shuffled hash join tends to have OOM issue because it > allocates in-memory hashed relation (`UnsafeHashedRelation` or > `LongHashedRelation`) for build side, and there's no recovery (e.g. > fallback/spill) once the size of hashed relation grows and cannot fit in > memory. On the other hand, shuffled hash join is more CPU and IO efficient > than sort merge join when joining one large table and a small table (but > small table is too large to be broadcasted), as SHJ does not sort the large > table, but SMJ needs to do that. > To improve the reliability of shuffled hash join, a fallback mechanism can be > introduced to avoid shuffled hash join OOM issue completely. Similarly we > already have a fallback to sort-based aggregation for hash aggregate. The > idea is: > (1).Build hashed relation as current, but monitor the hashed relation size > when inserting each build side row. If size of hashed relation being always > smaller than a configurable threshold, go to (2.1), else go to (2.2). > (2.1).Current shuffled hash join logic: reading stream side rows and probing > hashed relation. > (2.2).Fall back to sort merge join: Sort stream side rows, and sort build > side rows (iterate rows already in hashed relation (e.g. through > `BytesToBytesMap.destructiveIterator`), then iterate rest of un-read build > side rows). Then doing sort merge join for stream + build side rows. > > Note: > (1).the fallback is dynamic and happened per task, which means task 0 can > incur the fallback e.g. if it has a big build side, but task 1,2 don't need > to incur the fallback depending on the size of hashed relation. > (2).there's no major code change for SHJ and SMJ. Major change is around > HashedRelation to introduce some new methods, e.g. > `HashedRelation.destructiveValues()` to return an Iterator of build side rows > in hashed relation and cleaning up hashed relation along the way. > (3).we have run this feature by default in our internal fork more than 2 > years, and we benefit a lot from it with users can choose to use SHJ, and we > don't need to worry about SHJ reliability (see > https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal > from our side, I tweak here to make it less intrusive and more acceptable, > e.g. not introducing a separate join operator, but doing the fallback > automatically inside SHJ operator itself). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
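For readers skimming the proposal, the following is a hedged, self-contained sketch of the control flow it describes; every type and helper below (the in-memory HashMap, estimateBytes, the simplified merge step) is a stand-in invented for illustration, not Spark's HashedRelation or the actual SHJ/SMJ operators:

{code:scala}
object ShjFallbackSketch {

  def joinWithFallback[V](
      buildSide: Iterator[(Int, V)],
      streamSide: Seq[(Int, V)],
      maxHashBytes: Long)(estimateBytes: V => Long): Seq[(V, V)] = {

    val hashed = scala.collection.mutable.HashMap.empty[Int, List[V]]
    var bytes = 0L

    // (1) Build the hashed relation, monitoring its size as each build row is inserted.
    while (buildSide.hasNext && bytes <= maxHashBytes) {
      val (k, v) = buildSide.next()
      hashed(k) = v :: hashed.getOrElse(k, Nil)
      bytes += estimateBytes(v)
    }

    if (!buildSide.hasNext && bytes <= maxHashBytes) {
      // (2.1) Build side stayed under the threshold: plain shuffled hash join,
      // i.e. stream rows probe the hashed relation.
      for ((k, s) <- streamSide; b <- hashed.getOrElse(k, Nil)) yield (s, b)
    } else {
      // (2.2) Fallback to a sort-based join: rows already sitting in the hashed
      // relation plus the not-yet-read build rows are sorted by key together with
      // the stream side, then merged.
      val buildRows =
        (hashed.iterator.flatMap { case (k, vs) => vs.map(k -> _) } ++ buildSide)
          .toSeq.sortBy(_._1)
      val streamRows = streamSide.sortBy(_._1)
      // Simplified equi-join standing in for a real streaming sort-merge join.
      streamRows.flatMap { case (k, s) => buildRows.collect { case (`k`, b) => (s, b) } }
    }
  }
}
{code}

As in note (1) of the description, the decision is local: each call (each task) falls back only if the build side it sees outgrows the threshold.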
[jira] [Assigned] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32634: Assignee: (was: Apache Spark) > Introduce sort-based fallback mechanism for shuffled hash join > --- > > Key: SPARK-32634 > URL: https://issues.apache.org/jira/browse/SPARK-32634 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > A major pain point for spark users to stay away from using shuffled hash join > is out of memory issue. Shuffled hash join tends to have OOM issue because it > allocates in-memory hashed relation (`UnsafeHashedRelation` or > `LongHashedRelation`) for build side, and there's no recovery (e.g. > fallback/spill) once the size of hashed relation grows and cannot fit in > memory. On the other hand, shuffled hash join is more CPU and IO efficient > than sort merge join when joining one large table and a small table (but > small table is too large to be broadcasted), as SHJ does not sort the large > table, but SMJ needs to do that. > To improve the reliability of shuffled hash join, a fallback mechanism can be > introduced to avoid shuffled hash join OOM issue completely. Similarly we > already have a fallback to sort-based aggregation for hash aggregate. The > idea is: > (1).Build hashed relation as current, but monitor the hashed relation size > when inserting each build side row. If size of hashed relation being always > smaller than a configurable threshold, go to (2.1), else go to (2.2). > (2.1).Current shuffled hash join logic: reading stream side rows and probing > hashed relation. > (2.2).Fall back to sort merge join: Sort stream side rows, and sort build > side rows (iterate rows already in hashed relation (e.g. through > `BytesToBytesMap.destructiveIterator`), then iterate rest of un-read build > side rows). Then doing sort merge join for stream + build side rows. > > Note: > (1).the fallback is dynamic and happened per task, which means task 0 can > incur the fallback e.g. if it has a big build side, but task 1,2 don't need > to incur the fallback depending on the size of hashed relation. > (2).there's no major code change for SHJ and SMJ. Major change is around > HashedRelation to introduce some new methods, e.g. > `HashedRelation.destructiveValues()` to return an Iterator of build side rows > in hashed relation and cleaning up hashed relation along the way. > (3).we have run this feature by default in our internal fork more than 2 > years, and we benefit a lot from it with users can choose to use SHJ, and we > don't need to worry about SHJ reliability (see > https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal > from our side, I tweak here to make it less intrusive and more acceptable, > e.g. not introducing a separate join operator, but doing the fallback > automatically inside SHJ operator itself). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32634: Assignee: Apache Spark > Introduce sort-based fallback mechanism for shuffled hash join > --- > > Key: SPARK-32634 > URL: https://issues.apache.org/jira/browse/SPARK-32634 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > A major pain point for spark users to stay away from using shuffled hash join > is out of memory issue. Shuffled hash join tends to have OOM issue because it > allocates in-memory hashed relation (`UnsafeHashedRelation` or > `LongHashedRelation`) for build side, and there's no recovery (e.g. > fallback/spill) once the size of hashed relation grows and cannot fit in > memory. On the other hand, shuffled hash join is more CPU and IO efficient > than sort merge join when joining one large table and a small table (but > small table is too large to be broadcasted), as SHJ does not sort the large > table, but SMJ needs to do that. > To improve the reliability of shuffled hash join, a fallback mechanism can be > introduced to avoid shuffled hash join OOM issue completely. Similarly we > already have a fallback to sort-based aggregation for hash aggregate. The > idea is: > (1).Build hashed relation as current, but monitor the hashed relation size > when inserting each build side row. If size of hashed relation being always > smaller than a configurable threshold, go to (2.1), else go to (2.2). > (2.1).Current shuffled hash join logic: reading stream side rows and probing > hashed relation. > (2.2).Fall back to sort merge join: Sort stream side rows, and sort build > side rows (iterate rows already in hashed relation (e.g. through > `BytesToBytesMap.destructiveIterator`), then iterate rest of un-read build > side rows). Then doing sort merge join for stream + build side rows. > > Note: > (1).the fallback is dynamic and happened per task, which means task 0 can > incur the fallback e.g. if it has a big build side, but task 1,2 don't need > to incur the fallback depending on the size of hashed relation. > (2).there's no major code change for SHJ and SMJ. Major change is around > HashedRelation to introduce some new methods, e.g. > `HashedRelation.destructiveValues()` to return an Iterator of build side rows > in hashed relation and cleaning up hashed relation along the way. > (3).we have run this feature by default in our internal fork more than 2 > years, and we benefit a lot from it with users can choose to use SHJ, and we > don't need to worry about SHJ reliability (see > https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal > from our side, I tweak here to make it less intrusive and more acceptable, > e.g. not introducing a separate join operator, but doing the fallback > automatically inside SHJ operator itself). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin
Cheng Su created SPARK-35109: Summary: Fix minor exception messages of HashedRelation and HashedJoin Key: SPARK-35109 URL: https://issues.apache.org/jira/browse/SPARK-35109 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Cheng Su It seems that we missed classifying one `SparkOutOfMemoryError` in `HashedRelation`. Add the error classification for it. In addition, clean up two error definitions in `HashJoin`, as they are not used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin
[ https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35109: Assignee: Apache Spark > Fix minor exception messages of HashedRelation and HashedJoin > - > > Key: SPARK-35109 > URL: https://issues.apache.org/jira/browse/SPARK-35109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > > It seems that we miss classifying one `SparkOutOfMemoryError` in > `HashedRelation`. Add the error classification for it. In addition, clean up > two errors definition of `HashJoin` as they are not used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin
[ https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324180#comment-17324180 ] Apache Spark commented on SPARK-35109: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/32211 > Fix minor exception messages of HashedRelation and HashedJoin > - > > Key: SPARK-35109 > URL: https://issues.apache.org/jira/browse/SPARK-35109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Trivial > > It seems that we miss classifying one `SparkOutOfMemoryError` in > `HashedRelation`. Add the error classification for it. In addition, clean up > two errors definition of `HashJoin` as they are not used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin
[ https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35109: Assignee: (was: Apache Spark) > Fix minor exception messages of HashedRelation and HashedJoin > - > > Key: SPARK-35109 > URL: https://issues.apache.org/jira/browse/SPARK-35109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Trivial > > It seems that we miss classifying one `SparkOutOfMemoryError` in > `HashedRelation`. Add the error classification for it. In addition, clean up > two errors definition of `HashJoin` as they are not used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35109) Fix minor exception messages of HashedRelation and HashJoin
[ https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-35109: - Summary: Fix minor exception messages of HashedRelation and HashJoin (was: Fix minor exception messages of HashedRelation and HashedJoin) > Fix minor exception messages of HashedRelation and HashJoin > --- > > Key: SPARK-35109 > URL: https://issues.apache.org/jira/browse/SPARK-35109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Trivial > > It seems that we miss classifying one `SparkOutOfMemoryError` in > `HashedRelation`. Add the error classification for it. In addition, clean up > two errors definition of `HashJoin` as they are not used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org