[spark] branch master updated: [SPARK-33084][TEST][FOLLOWUP] Fix a flaky test in SparkContextSuite
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8a637c8 [SPARK-33084][TEST][FOLLOWUP] Fix a flaky test in SparkContextSuite 8a637c8 is described below commit 8a637c80c67fdffaaf873e047a3a133c4e2bc16c Author: Gengliang Wang AuthorDate: Wed Jan 5 15:51:09 2022 +0800 [SPARK-33084][TEST][FOLLOWUP] Fix a flaky test in SparkContextSuite ### What changes were proposed in this pull request? The test case `SPARK-33084: Add jar support Ivy URI -- transitive=true will download dependency jars` in `SparkContextSuite` has become flaky: - https://github.com/gengliangwang/spark/runs/4698825652?check_suite_focus=true - https://github.com/gengliangwang/spark/runs/4698331067?check_suite_focus=true - https://github.com/AngersZh/spark/runs/4697626841?check_suite_focus=true The reason is that some of the events in `LogAppender` are null, which leads to an NPE: ``` [info] Cause: java.lang.NullPointerException: [info] at org.apache.spark.SparkContextSuite.$anonfun$new$128(SparkContextSuite.scala:1077) [info] at org.apache.spark.SparkContextSuite.$anonfun$new$128$adapted(SparkContextSuite.scala:1077) ``` This PR fixes the issue to unblock PR builders. ### Why are the changes needed? Fix a flaky test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just tests. Closes #35098 from gengliangwang/fixFlakyTest. 
Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- core/src/test/scala/org/apache/spark/SparkFunSuite.scala | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala b/core/src/test/scala/org/apache/spark/SparkFunSuite.scala index 5be0c96..273ffeb 100644 --- a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkFunSuite.scala @@ -266,23 +266,25 @@ abstract class SparkFunSuite class LogAppender(msg: String = "", maxEvents: Int = 1000) extends AbstractAppender("logAppender", null, null) { -val loggingEvents = new ArrayBuffer[LogEvent]() +private val _loggingEvents = new ArrayBuffer[LogEvent]() private var _threshold: Level = Level.INFO override def append(loggingEvent: LogEvent): Unit = loggingEvent.synchronized { val copyEvent = loggingEvent.toImmutable if (copyEvent.getLevel.isMoreSpecificThan(_threshold)) { -if (loggingEvents.size >= maxEvents) { +if (_loggingEvents.size >= maxEvents) { val loggingInfo = if (msg == "") "." else s" while logging $msg." throw new IllegalStateException( s"Number of events reached the limit of $maxEvents$loggingInfo") } -loggingEvents.append(copyEvent) +_loggingEvents.append(copyEvent) } } def setThreshold(threshold: Level): Unit = { _threshold = threshold } + +def loggingEvents: ArrayBuffer[LogEvent] = _loggingEvents.filterNot(_ == null) } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
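The fix above keeps the mutable event buffer private and filters null entries out at read time. A minimal Java sketch of the same defensive pattern (hypothetical class `EventBuffer`, not the actual Scala `LogAppender`):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: keep the mutable buffer private and filter out null entries
// when callers read it, so gaps left by concurrent writers cannot cause
// a NullPointerException downstream.
class EventBuffer<E> {
    private final List<E> events = new ArrayList<>();

    synchronized void append(E event) {
        events.add(event);
    }

    // Readers get a defensive, null-free snapshot.
    synchronized List<E> snapshot() {
        List<E> result = new ArrayList<>();
        for (E e : events) {
            if (e != null) {
                result.add(e);
            }
        }
        return result;
    }
}

public class Main {
    public static void main(String[] args) {
        EventBuffer<String> buf = new EventBuffer<>();
        buf.append("warn: first");
        buf.append(null); // simulates a gap observed under concurrent logging
        buf.append("warn: second");
        System.out.println(buf.snapshot().size()); // prints 2
    }
}
```

Exposing only the filtered view (as the patch does via the `loggingEvents` accessor) keeps every existing caller NPE-safe without changing call sites.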
[spark] branch master updated: [SPARK-33701][SHUFFLE] Adaptive shuffle merge finalization for push-based shuffle
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f6128a6 [SPARK-33701][SHUFFLE] Adaptive shuffle merge finalization for push-based shuffle f6128a6 is described below commit f6128a6f4215dc45a19209d799dd9bf98fab6d8a Author: Venkata krishnan Sowrirajan AuthorDate: Wed Jan 5 01:47:01 2022 -0600 [SPARK-33701][SHUFFLE] Adaptive shuffle merge finalization for push-based shuffle ### What changes were proposed in this pull request? SPARK-32920 implemented a simple approach to finalization for push-based shuffle. Shuffle merge finalization is the final operation at the end of the stage: once all tasks are completed, every external shuffle service is asked to complete the shuffle merge for the stage, and once that request is completed no more shuffle pushes are accepted. With that approach, `DAGScheduler` waits for a fixed time of 10s (`spark.shuffle.push.finalize.timeout`) to allow some time [...] In this PR, instead of waiting a fixed amount of time before shuffle merge finalization, finalization is controlled adaptively: once a minimum threshold ratio of map tasks (`spark.shuffle.push.minPushRatio`) have completed their shuffle pushes, shuffle merge finalization is scheduled. Additionally, if the total shuffle data generated is smaller than a minimum threshold size (`spark.shuffle.push.minShuffleSizeToWait`), shuffle merge finalization is scheduled immediately. ### Why are the changes needed? This is a performance improvement to the existing functionality. ### Does this PR introduce _any_ user-facing change? Yes, it adds the user-facing configs `spark.shuffle.push.minPushRatio` and `spark.shuffle.push.minShuffleSizeToWait`. ### How was this patch tested? 
Added unit tests in `DAGSchedulerSuite`, `ShuffleBlockPusherSuite` Lead-authored-by: Min Shen Co-authored-by: Venkata krishnan Sowrirajan Closes #33896 from venkata91/SPARK-33701. Lead-authored-by: Venkata krishnan Sowrirajan Co-authored-by: Min Shen Signed-off-by: Mridul Muralidharan gmail.com> --- .../main/scala/org/apache/spark/Dependency.scala | 35 ++- .../scala/org/apache/spark/MapOutputTracker.scala | 6 +- .../src/main/scala/org/apache/spark/SparkEnv.scala | 3 + .../executor/CoarseGrainedExecutorBackend.scala| 6 + .../org/apache/spark/internal/config/package.scala | 27 ++ .../org/apache/spark/scheduler/DAGScheduler.scala | 278 + .../apache/spark/scheduler/DAGSchedulerEvent.scala | 4 + .../cluster/CoarseGrainedClusterMessage.scala | 3 + .../cluster/CoarseGrainedSchedulerBackend.scala| 3 + .../apache/spark/shuffle/ShuffleBlockPusher.scala | 39 ++- .../apache/spark/scheduler/DAGSchedulerSuite.scala | 340 +++-- .../spark/shuffle/ShuffleBlockPusherSuite.scala| 101 +- docs/configuration.md | 16 + 13 files changed, 772 insertions(+), 89 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/Dependency.scala b/core/src/main/scala/org/apache/spark/Dependency.scala index 1b4e7ba..8e348ee 100644 --- a/core/src/main/scala/org/apache/spark/Dependency.scala +++ b/core/src/main/scala/org/apache/spark/Dependency.scala @@ -17,8 +17,12 @@ package org.apache.spark +import java.util.concurrent.ScheduledFuture + import scala.reflect.ClassTag +import org.roaringbitmap.RoaringBitmap + import org.apache.spark.annotation.DeveloperApi import org.apache.spark.internal.Logging import org.apache.spark.rdd.RDD @@ -131,9 +135,7 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( def shuffleMergeId: Int = _shuffleMergeId def setMergerLocs(mergerLocs: Seq[BlockManagerId]): Unit = { -if (mergerLocs != null) { - this.mergerLocs = mergerLocs -} +this.mergerLocs = mergerLocs } def getMergerLocs: Seq[BlockManagerId] = mergerLocs @@ -160,6 +162,8 @@ class 
ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( _shuffleMergedFinalized = false mergerLocs = Nil _shuffleMergeId += 1 +finalizeTask = None +shufflePushCompleted.clear() } private def canShuffleMergeBeEnabled(): Boolean = { @@ -169,11 +173,34 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( if (isPushShuffleEnabled && rdd.isBarrier()) { logWarning("Push-based shuffle is currently not supported for barrier stages") } -isPushShuffleEnabled && +isPushShuffleEnabled && numPartitions > 0 && // TODO: SPARK-35547: Push based shuffle is currently unsupported for Barrier stages !rdd.isBarrier() } + @transient private[this] val shufflePushCompleted = new RoaringBitmap() + + /** + * Mark a given map task as push
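The adaptive finalization rule described in the commit message can be sketched as a small decision function. This is a hypothetical helper, not Spark's actual `DAGScheduler` logic; only the two config names come from the patch:

```java
// Sketch of the adaptive shuffle merge finalization rule: finalize early
// once enough map tasks have completed their block pushes, or immediately
// when the total shuffle output is small.
public class MergeFinalization {
    static boolean shouldFinalize(int pushCompletedMapTasks,
                                  int totalMapTasks,
                                  long totalShuffleBytes,
                                  double minPushRatio,         // spark.shuffle.push.minPushRatio
                                  long minShuffleSizeToWait) { // spark.shuffle.push.minShuffleSizeToWait
        if (totalShuffleBytes < minShuffleSizeToWait) {
            return true; // tiny shuffle: no point waiting for pushes
        }
        // Enough map tasks have finished pushing their shuffle blocks.
        return pushCompletedMapTasks >= Math.ceil(minPushRatio * totalMapTasks);
    }

    public static void main(String[] args) {
        // 80 of 100 tasks pushed, ratio threshold 0.75 -> schedule finalization
        System.out.println(shouldFinalize(80, 100, 1L << 30, 0.75, 1L << 20)); // true
        // only 10 of 100 pushed and the shuffle is large -> keep waiting
        System.out.println(shouldFinalize(10, 100, 1L << 30, 0.75, 1L << 20)); // false
    }
}
```

The real implementation additionally tracks per-partition push completion (the `RoaringBitmap` visible in the diff) so that duplicate completion messages from the same map task are not double-counted.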
[spark] branch master updated (27d5575 -> 0b9f120)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 27d5575 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message add 0b9f120 [SPARK-37750][SQL][FOLLOWUP] Change SQLConf parameter name No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala | 4 ++-- .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala| 2 +- 2 files changed, 3 insertions(+), 3 deletions(-)
[spark] branch branch-3.1 updated: [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 3bcd036 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message 3bcd036 is described below commit 3bcd036367afc550065f6049d6059b3b92729267 Author: Chilaka Ramakrishna AuthorDate: Tue Jan 4 23:30:20 2022 -0800 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message ### What changes were proposed in this pull request? The error message is not correct, so this PR updates it. ### Why are the changes needed? The exception message shown when the password is left empty in HTTP mode of the Hive thrift server is incorrect; the text is updated to reflect it. Please check the JIRA issue: https://issues.apache.org/jira/browse/SPARK-37807 ### Does this PR introduce _any_ user-facing change? Yes, the exception message in HiveServer2 is changed. ### How was this patch tested? This was tested manually. Closes #35097 from RamakrishnaChilaka/feature/error_string_fix. 
Authored-by: Chilaka Ramakrishna Signed-off-by: Dongjoon Hyun (cherry picked from commit 27d5575f13fe69459d7fa72cee11d4166c9e1a10) Signed-off-by: Dongjoon Hyun --- .../main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java index 0cfa84d..be06206 100644 --- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java +++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java @@ -493,7 +493,7 @@ public class ThriftHttpServlet extends TServlet { // Password must be present if (creds[1] == null || creds[1].isEmpty()) { throw new HttpAuthenticationException("Authorization header received " + - "from the client does not contain username."); + "from the client does not contain password."); } return creds[1]; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
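The patch corrects the message thrown when the password half of a Basic auth payload is missing. A self-contained Java sketch of that validation (hypothetical `BasicAuth` helper, not the actual `ThriftHttpServlet` code; only the two error strings come from the patch):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: a Basic auth payload decodes to "username:password", and each
// missing part should raise its own, correctly worded message.
public class BasicAuth {
    static String[] decode(String base64Payload) {
        String decoded = new String(Base64.getDecoder().decode(base64Payload),
                StandardCharsets.UTF_8);
        // Split on the first ':' only; passwords may themselves contain colons.
        String[] creds = decoded.split(":", 2);
        if (creds.length < 1 || creds[0].isEmpty()) {
            throw new IllegalArgumentException(
                "Authorization header received from the client does not contain username.");
        }
        if (creds.length < 2 || creds[1].isEmpty()) {
            // This is the message the patch fixes: it must say "password".
            throw new IllegalArgumentException(
                "Authorization header received from the client does not contain password.");
        }
        return creds;
    }

    public static void main(String[] args) {
        String payload = Base64.getEncoder()
                .encodeToString("alice:s3cret".getBytes(StandardCharsets.UTF_8));
        String[] creds = decode(payload);
        System.out.println(creds[0] + " / " + creds[1]); // prints alice / s3cret
    }
}
```

Before the fix, an empty password produced the misleading "does not contain username." text, which is what SPARK-37807 reported.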
[spark] branch branch-3.2 updated: [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 731e056 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message 731e056 is described below commit 731e05672983f75ba70b10207774d29ab4ad8602 Author: Chilaka Ramakrishna AuthorDate: Tue Jan 4 23:30:20 2022 -0800 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message ### What changes were proposed in this pull request? The error message is not correct, so this PR updates it. ### Why are the changes needed? The exception message shown when the password is left empty in HTTP mode of the Hive thrift server is incorrect; the text is updated to reflect it. Please check the JIRA issue: https://issues.apache.org/jira/browse/SPARK-37807 ### Does this PR introduce _any_ user-facing change? Yes, the exception message in HiveServer2 is changed. ### How was this patch tested? This was tested manually. Closes #35097 from RamakrishnaChilaka/feature/error_string_fix. 
Authored-by: Chilaka Ramakrishna Signed-off-by: Dongjoon Hyun (cherry picked from commit 27d5575f13fe69459d7fa72cee11d4166c9e1a10) Signed-off-by: Dongjoon Hyun --- .../main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java index 0e2f374..f0f5cdc 100644 --- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java +++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java @@ -492,7 +492,7 @@ public class ThriftHttpServlet extends TServlet { // Password must be present if (creds[1] == null || creds[1].isEmpty()) { throw new HttpAuthenticationException("Authorization header received " + - "from the client does not contain username."); + "from the client does not contain password."); } return creds[1]; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (4a2ba5b -> 27d5575)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 4a2ba5b [SPARK-37813][SQL][TESTS] Enable vectorization for nested column in ORC scan benchmark add 27d5575 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message No new revisions were added by this update. Summary of changes: .../main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (9a9b379 -> 4a2ba5b)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 9a9b379 [SPARK-37786][SQL] StreamingQueryListener support use SQLConf.get to get corresponding SessionState's SQLConf add 4a2ba5b [SPARK-37813][SQL][TESTS] Enable vectorization for nested column in ORC scan benchmark No new revisions were added by this update. Summary of changes: .../benchmarks/OrcReadBenchmark-jdk11-results.txt | 198 ++--- .../benchmarks/OrcReadBenchmark-jdk17-results.txt | 168 +++-- sql/hive/benchmarks/OrcReadBenchmark-results.txt | 188 +-- .../spark/sql/hive/orc/OrcReadBenchmark.scala | 8 +- 4 files changed, 349 insertions(+), 213 deletions(-)
[spark] branch master updated (93c614b -> 9a9b379)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 93c614b [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS` add 9a9b379 [SPARK-37786][SQL] StreamingQueryListener support use SQLConf.get to get corresponding SessionState's SQLConf No new revisions were added by this update. Summary of changes: project/MimaExcludes.scala | 4 +++- .../spark/sql/internal/BaseSessionStateBuilder.scala | 3 ++- .../spark/sql/streaming/StreamingQueryManager.scala | 16 ++-- .../streaming/StreamingQueryListenersConfSuite.scala | 20 +++- 4 files changed, 34 insertions(+), 9 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS`
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 9b6be1a [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS` 9b6be1a is described below commit 9b6be1a6c004e50ffdf59f7fa1986adeb03e45cd Author: Jiaan Geng AuthorDate: Wed Jan 5 12:57:21 2022 +0800 [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS` ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/30943 added support for the syntax `(IGNORE | RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`, but did not update the documentation. The screen snapshot before this PR: ![screenshot-20211231-174803](https://user-images.githubusercontent.com/8486025/147816336-debca074-0b84-48e8-9ed2-cb13f562cf12.png) This PR adds documentation for the syntax `(IGNORE | RESPECT) NULLS`. The screen snapshots after this PR: ![image](https://user-images.githubusercontent.com/8486025/148141568-506e9232-a3c4-4a25-a5c6-65a5d5a2e066.png) ![image](https://user-images.githubusercontent.com/8486025/148061495-b7198417-9d4c-4c03-9060-385271ea9a46.png) ### Why are the changes needed? Add documentation for the syntax `(IGNORE | RESPECT) NULLS`. ### Does this PR introduce _any_ user-facing change? No, this just updates docs. ### How was this patch tested? Manual check. Closes #35079 from beliefer/SPARK-30789-docs. 
Authored-by: Jiaan Geng Signed-off-by: Wenchen Fan (cherry picked from commit 93c614bf1e6aba092d82bcd8616b5ea31eb191a2) Signed-off-by: Wenchen Fan --- docs/sql-ref-syntax-qry-select-window.md | 37 ++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/docs/sql-ref-syntax-qry-select-window.md b/docs/sql-ref-syntax-qry-select-window.md index a1c2b18..6e65778 100644 --- a/docs/sql-ref-syntax-qry-select-window.md +++ b/docs/sql-ref-syntax-qry-select-window.md @@ -26,7 +26,7 @@ Window functions operate on a group of rows, referred to as a window, and calcul ### Syntax ```sql -window_function OVER +window_function [ nulls_option ] OVER ( [ { PARTITION | DISTRIBUTE } BY partition_col_name = partition_col_val ( [ , ... ] ) ] { ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] [ window_frame ] ) @@ -42,7 +42,7 @@ window_function OVER * Analytic Functions - **Syntax:** `CUME_DIST | LAG | LEAD` + **Syntax:** `CUME_DIST | LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE` * Aggregate Functions @@ -50,6 +50,16 @@ window_function OVER Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions. +* **nulls_option** +

Specifies whether or not to skip null values when evaluating the window function. `RESPECT NULLS` means not skipping null values, while `IGNORE NULLS` means skipping. If not specified, the default is `RESPECT NULLS`. +

**Syntax:** + +`{ IGNORE | RESPECT } NULLS` + +**Note:** Only `LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE` can be used with `IGNORE NULLS`. + * **window_frame** Specifies which row to start the window on and where to end it. 
@@ -184,6 +194,29 @@ SELECT name, salary, | Jane| Marketing| 29000|29000|35000| | Jeff| Marketing| 35000|29000|0| +-+---+--+-+-+ + +SELECT id, v, +LEAD(v, 0) IGNORE NULLS OVER w lead, +LAG(v, 0) IGNORE NULLS OVER w lag, +NTH_VALUE(v, 2) IGNORE NULLS OVER w nth_value, +FIRST_VALUE(v) IGNORE NULLS OVER w first_value, +LAST_VALUE(v) IGNORE NULLS OVER w last_value +FROM test_ignore_null +WINDOW w AS (ORDER BY id) +ORDER BY id; ++--++++-+---+--+ +|id| v|lead| lag|nth_value|first_value|last_value| ++--++++-+---+--+ +| 0|NULL|NULL|NULL| NULL| NULL| NULL| +| 1| x| x| x| NULL| x| x| +| 2|NULL|NULL|NULL| NULL| x| x| +| 3|NULL|NULL|NULL| NULL| x| x| +| 4| y| y| y|y| x| y| +| 5|NULL|NULL|NULL|y| x| y| +| 6| z| z| z|y| x| z| +| 7| v| v| v|y| x| v| +| 8|NULL|NULL|NULL|y| x| v| ++--++++-+---+--+ ``` ### Related Statements - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3c6c690 -> 93c614b)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3c6c690 [MINOR][DOCS] Make code blocks pretty in README.md add 93c614b [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS` No new revisions were added by this update. Summary of changes: docs/sql-ref-syntax-qry-select-window.md | 37 ++-- 1 file changed, 35 insertions(+), 2 deletions(-)
[spark] branch master updated (639d6f4 -> 3c6c690)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 639d6f4 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv add 3c6c690 [MINOR][DOCS] Make code blocks pretty in README.md No new revisions were added by this update. Summary of changes: README.md | 32 1 file changed, 24 insertions(+), 8 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
This is an automated email from the ASF dual-hosted git repository. wuyi pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new be441e8 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv be441e8 is described below commit be441e84069acc711ea848c69ae5bd55a7c93531 Author: yi.wu AuthorDate: Wed Jan 5 10:48:16 2022 +0800 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv ### What changes were proposed in this pull request? This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`. ### Why are the changes needed? `System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. [...] ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested manually. Closes #35069 from Ngone51/fix-workerwatcher-exit. 
Authored-by: yi.wu Signed-off-by: yi.wu (cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366) Signed-off-by: yi.wu --- .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala index efffc9f..b7a5728 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala @@ -54,8 +54,12 @@ private[spark] class WorkerWatcher( if (isTesting) { isShutDown = true } else if (isChildProcessStopping.compareAndSet(false, true)) { - // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock - System.exit(-1) + // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock. + // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid + // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`. + new Thread("WorkerWatcher-exit-executor") { +override def run(): Unit = System.exit(-1) + }.start() } override def receive: PartialFunction[Any, Unit] = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
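The pattern in the patch (hand the blocking shutdown work to a dedicated thread so the RPC dispatcher thread stays free) can be sketched as follows. This is a simplified Java illustration, not the actual `WorkerWatcher` code: a plain `Runnable` stands in for `System.exit(-1)` so the sketch can run without terminating the JVM.

```java
// Sketch: never call a blocking shutdown path (like System.exit, which
// runs shutdown hooks) from an RPC dispatcher thread; hand it to a
// dedicated thread so the dispatcher can keep servicing the callbacks
// those hooks may wait on.
public class ShutdownInThread {
    static Thread scheduleShutdown(Runnable shutdownAction) {
        // In the real fix the action is System.exit(-1), and the thread is
        // named "WorkerWatcher-exit-executor" as in the diff above.
        Thread t = new Thread(shutdownAction, "WorkerWatcher-exit-executor");
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        final boolean[] ran = {false};
        Thread t = scheduleShutdown(() -> ran[0] = true);
        t.join();
        System.out.println(ran[0]); // prints true
    }
}
```

The key design point is the same as in SPARK-14180: `System.exit` triggers the shutdown hook that runs `executor.stop`, so calling it from an RpcEnv thread can deadlock when the hook waits on work only that thread can perform.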
[spark] branch branch-3.1 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
This is an automated email from the ASF dual-hosted git repository. wuyi pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 70d4fb1 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv 70d4fb1 is described below commit 70d4fb15c8b7d02a4e7fe3d1c82cfd35d23d053c Author: yi.wu AuthorDate: Wed Jan 5 10:48:16 2022 +0800 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv ### What changes were proposed in this pull request? This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`. ### Why are the changes needed? `System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. [...] ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested manually. Closes #35069 from Ngone51/fix-workerwatcher-exit. 
Authored-by: yi.wu Signed-off-by: yi.wu (cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366) Signed-off-by: yi.wu --- .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala index efffc9f..b7a5728 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala @@ -54,8 +54,12 @@ private[spark] class WorkerWatcher( if (isTesting) { isShutDown = true } else if (isChildProcessStopping.compareAndSet(false, true)) { - // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock - System.exit(-1) + // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock. + // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid + // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`. + new Thread("WorkerWatcher-exit-executor") { +override def run(): Unit = System.exit(-1) + }.start() } override def receive: PartialFunction[Any, Unit] = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
This is an automated email from the ASF dual-hosted git repository. wuyi pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 537de84 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv 537de84 is described below commit 537de8459e3f7b1cd02521efa24b2036a9019ba5 Author: yi.wu AuthorDate: Wed Jan 5 10:48:16 2022 +0800 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv ### What changes were proposed in this pull request? This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`. ### Why are the changes needed? `System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. [...] ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested manually. Closes #35069 from Ngone51/fix-workerwatcher-exit. 
Authored-by: yi.wu Signed-off-by: yi.wu (cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366) Signed-off-by: yi.wu --- .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala index efffc9f..b7a5728 100644 --- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala +++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala @@ -54,8 +54,12 @@ private[spark] class WorkerWatcher( if (isTesting) { isShutDown = true } else if (isChildProcessStopping.compareAndSet(false, true)) { - // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock - System.exit(-1) + // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock. + // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid + // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`. + new Thread("WorkerWatcher-exit-executor") { +override def run(): Unit = System.exit(-1) + }.start() } override def receive: PartialFunction[Any, Unit] = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3b88bc8 -> 639d6f4)
This is an automated email from the ASF dual-hosted git repository. wuyi pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3b88bc8 [SPARK-37792][CORE] Fix the check of custom configuration in SparkShellLoggingFilter add 639d6f4 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-)
[spark] branch master updated (98e1c77 -> 3b88bc8)
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 98e1c77 [SPARK-37803][SQL] Add ORC read benchmarks for structs add 3b88bc8 [SPARK-37792][CORE] Fix the check of custom configuration in SparkShellLoggingFilter No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/Logging.scala | 19 -- .../org/apache/spark/internal/LoggingSuite.scala | 23 +++--- .../scala/org/apache/spark/repl/ReplSuite.scala| 19 -- 3 files changed, 42 insertions(+), 19 deletions(-)
[spark] branch master updated: [SPARK-37803][SQL] Add ORC read benchmarks for structs
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
 new 98e1c77  [SPARK-37803][SQL] Add ORC read benchmarks for structs

98e1c77 is described below

commit 98e1c77710e44190112610e21d6f02de1b620611
Author: Bruce Robbins
AuthorDate: Tue Jan 4 15:55:11 2022 -0800

    [SPARK-37803][SQL] Add ORC read benchmarks for structs

    ### What changes were proposed in this pull request?
    Add Orc read benchmarks for structs and nested structs.

    ### Why are the changes needed?
    This PR will provide baseline benchmarks for PR #35090, which will hopefully make the deserialization of Orc structs more efficient.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New benchmark tests.

    Closes #35100 from bersprockets/orc_struct_benchmark.

    Authored-by: Bruce Robbins
    Signed-off-by: Dongjoon Hyun
---
 sql/hive/benchmarks/OrcReadBenchmark-results.txt   | 168 ++---
 .../spark/sql/hive/orc/OrcReadBenchmark.scala      |  82 ++
 2 files changed, 199 insertions(+), 51 deletions(-)

diff --git a/sql/hive/benchmarks/OrcReadBenchmark-results.txt b/sql/hive/benchmarks/OrcReadBenchmark-results.txt
index 24969ce..9ffd7a5 100644
--- a/sql/hive/benchmarks/OrcReadBenchmark-results.txt
+++ b/sql/hive/benchmarks/OrcReadBenchmark-results.txt
@@ -6,49 +6,49 @@ OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single TINYINT Column Scan:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Native ORC MR                               700           748         79       22.5         44.5      1.0X
-Native ORC Vectorized                       103           126         15      153.2          6.5      6.8X
-Hive built-in ORC                           952           978         26       16.5         60.5      0.7X
+Native ORC MR                               832          1153        453       18.9         52.9      1.0X
+Native ORC Vectorized                       148           189         24      106.5          9.4      5.6X
+Hive built-in ORC                           986          1028         59       15.9         62.7      0.8X

 OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single SMALLINT Column Scan:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Native ORC MR                               793           814         35       19.8         50.4      1.0X
-Native ORC Vectorized                       107           119         12      146.9          6.8      7.4X
-Hive built-in ORC                          1025          1025          1       15.3         65.1      0.8X
+Native ORC MR                               868           913         60       18.1         55.2      1.0X
+Native ORC Vectorized                       133           150         21      118.6          8.4      6.5X
+Hive built-in ORC                          1098          1102          6       14.3         69.8      0.8X

 OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single INT Column Scan:       Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Native ORC MR                               818           832         19       19.2         52.0      1.0X
-Native ORC Vectorized                       141           167         28      111.7          8.9      5.8X
-Hive built-in ORC                          1079          1089         15       14.6         68.6      0.8X
+Native ORC MR                               898           917         24       17.5         57.1      1.0X
+Native ORC Vectorized                       155           175         16      101.4          9.9      5.8X
+Hive built-in ORC                          1114          1126         17       14.1         70.8      0.8X

 OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @
[spark] branch master updated (eeef48fa -> 9c6f3fb)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from eeef48fa  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
 add 9c6f3fb   [SPARK-37806][K8S] Support minimum number of tasks per executor before being rolled

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/k8s/Config.scala | 10 ++
 .../scheduler/cluster/k8s/ExecutorRollPlugin.scala |  9 +++--
 .../cluster/k8s/ExecutorRollPluginSuite.scala      | 23 +-
 3 files changed, 39 insertions(+), 3 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 3aaf722  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

3aaf722 is described below

commit 3aaf722a0d6552733504c794f59d390c349dfa80
Author: Josh Rosen
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

    [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

    ### What changes were proposed in this pull request?
    This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator).

    The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row.

    The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes.

    This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping.

    ### Why are the changes needed?
    Fix a correctness issue.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784

    So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment).

    Closes #35066 from JoshRosen/SPARK-37784.

    Authored-by: Josh Rosen
    Signed-off-by: Josh Rosen
    (cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
    Signed-off-by: Josh Rosen
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 74c6ea5..90ebe40 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -330,7 +330,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"
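The decision the one-line fix changes can be modeled in a toy form: unwrap a UDT to its underlying SQL type before deciding whether a buffered value needs a defensive copy. The types below are simplified stand-ins for Spark's DataType hierarchy, not the real API.

```scala
// Toy model of the buffering decision described above (simplified,
// hypothetical type names -- not Spark's DataType classes).
sealed trait DT
case object StrDT extends DT
case object IntDT extends DT
case class ArrayDT(element: DT) extends DT
case class UdtDT(sqlType: DT) extends DT // a UDT wrapping its physical SQL type

object BufferedStateSketch {
  // Mirrors the role of UserDefinedType.sqlType(dataType) in the patch:
  // peel off UDT wrappers so the match sees the physical representation.
  def unwrap(dt: DT): DT = dt match {
    case UdtDT(underlying) => unwrap(underlying)
    case other => other
  }

  // Values backed by reusable mutable storage (strings, arrays, and in
  // Spark also structs and maps) must be copied before the input iterator
  // advances to the next row.
  def needsCopy(dt: DT): Boolean = unwrap(dt) match {
    case StrDT | _: ArrayDT => true
    case _ => false
  }
}
```

Before the fix, the equivalent of `needsCopy` matched on the UDT wrapper itself, fell through to the default branch, and skipped the copy even when the underlying representation was mutable.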
[spark] branch branch-3.1 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 5cc8b39  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

5cc8b39 is described below

commit 5cc8b397f1b01b5ee4a26e8e8540baf1e05c97a0
Author: Josh Rosen
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

    [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

    ### What changes were proposed in this pull request?
    This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator).

    The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row.

    The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes.

    This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping.

    ### Why are the changes needed?
    Fix a correctness issue.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784

    So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment).

    Closes #35066 from JoshRosen/SPARK-37784.

    Authored-by: Josh Rosen
    Signed-off-by: Josh Rosen
    (cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
    Signed-off-by: Josh Rosen
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 6e6b946..4092436 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -334,7 +334,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"
[spark] branch branch-3.2 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 45b7b7e  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

45b7b7e is described below

commit 45b7b7e1682946ef86b42fd85d58bdd471996d0c
Author: Josh Rosen
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

    [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

    ### What changes were proposed in this pull request?
    This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator).

    The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row.

    The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes.

    This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping.

    ### Why are the changes needed?
    Fix a correctness issue.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784

    So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment).

    Closes #35066 from JoshRosen/SPARK-37784.

    Authored-by: Josh Rosen
    Signed-off-by: Josh Rosen
    (cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
    Signed-off-by: Josh Rosen
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index b8b5a40..132bb25 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -347,7 +347,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"
[spark] branch master updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
 new eeef48fa  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

eeef48fa is described below

commit eeef48fac412a57382b02ba3f39456d96379b5f5
Author: Josh Rosen
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

    [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

    ### What changes were proposed in this pull request?
    This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator).

    The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row.

    The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes.

    This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping.

    ### Why are the changes needed?
    Fix a correctness issue.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784

    So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment).

    Closes #35066 from JoshRosen/SPARK-37784.

    Authored-by: Josh Rosen
    Signed-off-by: Josh Rosen
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 720dfb2..c982a7b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -348,7 +348,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"
[spark] branch branch-3.2 updated (c280f08 -> 5f9b92c)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git.

 from c280f08  [SPARK-37800][SQL] TreeNode.argString incorrectly formats arguments of type Set[_]
 add 5f9b92c  [SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

No new revisions were added by this update.

Summary of changes:
 .../datasources/orc/OrcArrayColumnVector.java      | 13 +---
 .../execution/datasources/orc/OrcColumnVector.java |  2 +-
 .../datasources/orc/OrcColumnVectorUtils.java      |  6 ++
 .../datasources/orc/OrcMapColumnVector.java        | 13 +---
 .../execution/datasources/orc/OrcQuerySuite.scala  | 23 ++
 5 files changed, 36 insertions(+), 21 deletions(-)
[spark] branch master updated (3a2da70 -> 08fd501)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 3a2da70  [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2
 add 08fd501  [SPARK-37750][SQL] ANSI mode: optionally return null result if element not exists in array/map

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |  9 ++-
 .../expressions/collectionOperations.scala         |  6 +-
 .../expressions/complexTypeExtractors.scala        | 10 +--
 .../spark/sql/errors/QueryExecutionErrors.scala    | 32 +++--
 .../org/apache/spark/sql/internal/SQLConf.scala    | 10 +++
 .../test/resources/sql-tests/inputs/ansi/array.sql | 16 +
 .../test/resources/sql-tests/inputs/ansi/map.sql   |  7 ++
 .../resources/sql-tests/results/ansi/array.sql.out | 84 +-
 .../resources/sql-tests/results/ansi/map.sql.out   | 29 +++-
 9 files changed, 182 insertions(+), 21 deletions(-)
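The SPARK-37750 change above can be illustrated with a small sketch of the two behaviors for element access under ANSI mode: fail on a missing key, or (with the new opt-in) return a null result, modeled here as `None`. The names below are illustrative stand-ins, not Spark's internal API.

```scala
// Illustrative sketch (hypothetical names, assuming only the Scala library)
// of ANSI-mode element access: either raise an error for a missing map key,
// or optionally return a null result instead.
object ElementAtSketch {
  def elementAt[K, V](m: Map[K, V], key: K, failOnError: Boolean): Option[V] =
    m.get(key) match {
      case some @ Some(_) => some
      case None if failOnError =>
        // ANSI default: a missing key is an error.
        throw new NoSuchElementException(s"Key $key does not exist")
      case None =>
        // The new opt-in behavior: return a null result (None here).
        None
    }
}
```

In Spark the choice is driven by a SQLConf flag (see the SQLConf.scala change in the summary above); the sketch passes it explicitly as `failOnError`.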
[spark] branch master updated (a8addd4 -> 3a2da70)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from a8addd4  [SPARK-37800][SQL] TreeNode.argString incorrectly formats arguments of type Set[_]
 add 3a2da70  [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2

No new revisions were added by this update.

Summary of changes:
 .../sql/connector/expressions/Expression.java      |  2 +-
 .../sql/connector/expressions/aggregate/Count.java |  3 --
 .../connector/expressions/aggregate/CountStar.java |  3 --
 .../{Count.java => GeneralAggregateFunc.java}      | 42 ++
 .../sql/connector/expressions/aggregate/Max.java   |  3 --
 .../sql/connector/expressions/aggregate/Min.java   |  3 --
 .../sql/connector/expressions/aggregate/Sum.java   |  3 --
 .../sql/connector/expressions/filter/Filter.java   |  3 --
 .../connector/read/SupportsPushDownAggregates.java | 21 +--
 .../sql/connector/expressions/expressions.scala    | 20 ---
 .../expressions/TransformExtractorSuite.scala      |  8 ++---
 .../spark/sql/execution/DataSourceScanExec.scala   |  4 +--
 .../execution/datasources/DataSourceStrategy.scala |  6 ++--
 .../datasources/v2/V2ScanRelationPushDown.scala    | 16 ++---
 .../datasources/v2/jdbc/JDBCScanBuilder.scala      |  2 +-
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala   |  8 +++--
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala    | 33 -
 17 files changed, 99 insertions(+), 81 deletions(-)
 copy sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/{Count.java => GeneralAggregateFunc.java} (51%)