[spark] branch master updated: [SPARK-33084][TEST][FOLLOWUP] Fix a flaky test in SparkContextSuite

2022-01-04 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8a637c8  [SPARK-33084][TEST][FOLLOWUP] Fix a flaky test in SparkContextSuite
8a637c8 is described below

commit 8a637c80c67fdffaaf873e047a3a133c4e2bc16c
Author: Gengliang Wang 
AuthorDate: Wed Jan 5 15:51:09 2022 +0800

[SPARK-33084][TEST][FOLLOWUP] Fix a flaky test in SparkContextSuite

### What changes were proposed in this pull request?

The test case `SPARK-33084: Add jar support Ivy URI -- transitive=true will download dependency jars` in `SparkContextSuite` has become flaky:

- https://github.com/gengliangwang/spark/runs/4698825652?check_suite_focus=true
- https://github.com/gengliangwang/spark/runs/4698331067?check_suite_focus=true
- https://github.com/AngersZh/spark/runs/4697626841?check_suite_focus=true

The reason is that some of the events in `LogAppender` are null, which causes an NPE:
```
[info]   Cause: java.lang.NullPointerException:
[info]   at org.apache.spark.SparkContextSuite.$anonfun$new$128(SparkContextSuite.scala:1077)
[info]   at org.apache.spark.SparkContextSuite.$anonfun$new$128$adapted(SparkContextSuite.scala:1077)
```
This PR fixes the issue to unblock the PR builders.

### Why are the changes needed?

Fix flaky test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just tests

Closes #35098 from gengliangwang/fixFlakyTest.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 core/src/test/scala/org/apache/spark/SparkFunSuite.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala b/core/src/test/scala/org/apache/spark/SparkFunSuite.scala
index 5be0c96..273ffeb 100644
--- a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SparkFunSuite.scala
@@ -266,23 +266,25 @@ abstract class SparkFunSuite
 
   class LogAppender(msg: String = "", maxEvents: Int = 1000)
     extends AbstractAppender("logAppender", null, null) {
-    val loggingEvents = new ArrayBuffer[LogEvent]()
+    private val _loggingEvents = new ArrayBuffer[LogEvent]()
     private var _threshold: Level = Level.INFO
 
     override def append(loggingEvent: LogEvent): Unit = loggingEvent.synchronized {
       val copyEvent = loggingEvent.toImmutable
       if (copyEvent.getLevel.isMoreSpecificThan(_threshold)) {
-        if (loggingEvents.size >= maxEvents) {
+        if (_loggingEvents.size >= maxEvents) {
           val loggingInfo = if (msg == "") "." else s" while logging $msg."
           throw new IllegalStateException(
             s"Number of events reached the limit of $maxEvents$loggingInfo")
         }
-        loggingEvents.append(copyEvent)
+        _loggingEvents.append(copyEvent)
       }
     }
 
     def setThreshold(threshold: Level): Unit = {
       _threshold = threshold
     }
+
+    def loggingEvents: ArrayBuffer[LogEvent] = _loggingEvents.filterNot(_ == null)
   }
 }
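
For reference, here is a minimal, self-contained sketch of the accessor pattern the patch adopts (illustrative names, not the Spark test code): keep the raw buffer private and expose only a null-filtered view, so readers never see an entry that has not been fully materialized yet.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-in for LogAppender's buffering: the raw buffer stays
// private, and readers go through a null-filtering accessor.
class SafeEventBuffer[T <: AnyRef] {
  private val _events = new ArrayBuffer[T]()

  def append(event: T): Unit = synchronized { _events.append(event) }

  // Readers only ever see fully materialized (non-null) entries.
  def events: ArrayBuffer[T] = _events.filterNot(_ == null)
}
```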

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-33701][SHUFFLE] Adaptive shuffle merge finalization for push-based shuffle

2022-01-04 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f6128a6  [SPARK-33701][SHUFFLE] Adaptive shuffle merge finalization for push-based shuffle
f6128a6 is described below

commit f6128a6f4215dc45a19209d799dd9bf98fab6d8a
Author: Venkata krishnan Sowrirajan 
AuthorDate: Wed Jan 5 01:47:01 2022 -0600

[SPARK-33701][SHUFFLE] Adaptive shuffle merge finalization for push-based shuffle

### What changes were proposed in this pull request?

As part of SPARK-32920, a simple approach to finalization for push-based shuffle was implemented. Shuffle merge finalization is the final operation of a stage: once all tasks are completed, all the external shuffle services are asked to complete the shuffle merge for the stage, and after this request completes no more shuffle pushes are accepted. With this approach, `DAGScheduler` waits for a fixed time of 10s (`spark.shuffle.push.finalize.timeout`) to allow some time [...]

In this PR, instead of waiting a fixed amount of time before shuffle merge finalization, finalization is controlled adaptively: once a minimum threshold fraction of map tasks (`spark.shuffle.push.minPushRatio`) has completed its shuffle pushes, shuffle merge finalization is scheduled. Additionally, if the total shuffle data generated is smaller than a minimum threshold size (`spark.shuffle.push.minShuffleSizeToWait`), shuffle merge finalization is scheduled immediately. A sketch of this decision follows.
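
A minimal sketch of that adaptive decision (written for this note, not the actual `DAGScheduler` code; the two threshold parameters mirror `spark.shuffle.push.minPushRatio` and `spark.shuffle.push.minShuffleSizeToWait`, and all other names are illustrative):

```scala
// Decide whether shuffle merge finalization can be scheduled now.
def shouldFinalizeMergeNow(
    pushCompletedMapTasks: Int,
    totalMapTasks: Int,
    totalShuffleBytes: Long,
    minPushRatio: Double,
    minShuffleSizeToWait: Long): Boolean = {
  // Small shuffles are finalized immediately; otherwise wait until enough
  // map tasks have finished pushing their shuffle blocks.
  totalShuffleBytes < minShuffleSizeToWait ||
    pushCompletedMapTasks.toDouble / totalMapTasks >= minPushRatio
}
```
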
### Why are the changes needed?

This is a performance improvement to the existing functionality

### Does this PR introduce _any_ user-facing change?

Yes, additional user-facing configs `spark.shuffle.push.minPushRatio` and `spark.shuffle.push.minShuffleSizeToWait`.

### How was this patch tested?

Added unit tests in `DAGSchedulerSuite`, `ShuffleBlockPusherSuite`

Lead-authored-by: Min Shen 
Co-authored-by: Venkata krishnan Sowrirajan 

Closes #33896 from venkata91/SPARK-33701.

Lead-authored-by: Venkata krishnan Sowrirajan 
Co-authored-by: Min Shen 
Signed-off-by: Mridul Muralidharan 
---
 .../main/scala/org/apache/spark/Dependency.scala   |  35 ++-
 .../scala/org/apache/spark/MapOutputTracker.scala  |   6 +-
 .../src/main/scala/org/apache/spark/SparkEnv.scala |   3 +
 .../executor/CoarseGrainedExecutorBackend.scala|   6 +
 .../org/apache/spark/internal/config/package.scala |  27 ++
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 278 +
 .../apache/spark/scheduler/DAGSchedulerEvent.scala |   4 +
 .../cluster/CoarseGrainedClusterMessage.scala  |   3 +
 .../cluster/CoarseGrainedSchedulerBackend.scala|   3 +
 .../apache/spark/shuffle/ShuffleBlockPusher.scala  |  39 ++-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 340 +++--
 .../spark/shuffle/ShuffleBlockPusherSuite.scala| 101 +-
 docs/configuration.md  |  16 +
 13 files changed, 772 insertions(+), 89 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/Dependency.scala b/core/src/main/scala/org/apache/spark/Dependency.scala
index 1b4e7ba..8e348ee 100644
--- a/core/src/main/scala/org/apache/spark/Dependency.scala
+++ b/core/src/main/scala/org/apache/spark/Dependency.scala
@@ -17,8 +17,12 @@
 
 package org.apache.spark
 
+import java.util.concurrent.ScheduledFuture
+
 import scala.reflect.ClassTag
 
+import org.roaringbitmap.RoaringBitmap
+
 import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
@@ -131,9 +135,7 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
   def shuffleMergeId: Int = _shuffleMergeId
 
   def setMergerLocs(mergerLocs: Seq[BlockManagerId]): Unit = {
-    if (mergerLocs != null) {
-      this.mergerLocs = mergerLocs
-    }
+    this.mergerLocs = mergerLocs
   }
 
   def getMergerLocs: Seq[BlockManagerId] = mergerLocs
@@ -160,6 +162,8 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
     _shuffleMergedFinalized = false
     mergerLocs = Nil
     _shuffleMergeId += 1
+    finalizeTask = None
+    shufflePushCompleted.clear()
   }
 
   private def canShuffleMergeBeEnabled(): Boolean = {
@@ -169,11 +173,34 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
     if (isPushShuffleEnabled && rdd.isBarrier()) {
       logWarning("Push-based shuffle is currently not supported for barrier stages")
     }
-    isPushShuffleEnabled &&
+    isPushShuffleEnabled && numPartitions > 0 &&
       // TODO: SPARK-35547: Push based shuffle is currently unsupported for Barrier stages
       !rdd.isBarrier()
   }
 
+  @transient private[this] val shufflePushCompleted = new RoaringBitmap()
+
+  /**
+   * Mark a given map task as push 
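
The (truncated) hunk above introduces a `RoaringBitmap` to record which map tasks have completed their shuffle pushes. A rough sketch of that bookkeeping, assuming completion feeds the min-push-ratio check (`org.roaringbitmap` is the library the diff imports; the object and method names here are made up):

```scala
import org.roaringbitmap.RoaringBitmap

object PushCompletionTracker {
  private val shufflePushCompleted = new RoaringBitmap()

  // Record one completed push; report whether the completed-push ratio has
  // crossed the finalization threshold.
  def markPushCompleted(mapIndex: Int, numMapTasks: Int, minPushRatio: Double): Boolean =
    shufflePushCompleted.synchronized {
      shufflePushCompleted.add(mapIndex)
      shufflePushCompleted.getCardinality.toDouble / numMapTasks >= minPushRatio
    }
}
```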

[spark] branch master updated (27d5575 -> 0b9f120)

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 27d5575  [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message
 add 0b9f120  [SPARK-37750][SQL][FOLLOWUP] Change SQLConf parameter name

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala | 4 ++--
 .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala| 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 3bcd036  [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message
3bcd036 is described below

commit 3bcd036367afc550065f6049d6059b3b92729267
Author: Chilaka Ramakrishna 
AuthorDate: Tue Jan 4 23:30:20 2022 -0800

[SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message

### What changes were proposed in this pull request?
The error message is not correct, so we update it.

### Why are the changes needed?
The exception message shown when the password is left empty in HTTP mode of the Hive thrift server is not correct. Updated the text to reflect it. Please check:

JIRA ISSUE: https://issues.apache.org/jira/browse/SPARK-37807

### Does this PR introduce _any_ user-facing change?
Yes, the exception message in HiveServer2 is changed.

### How was this patch tested?
This was tested manually

Closes #35097 from RamakrishnaChilaka/feature/error_string_fix.

Authored-by: Chilaka Ramakrishna 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 27d5575f13fe69459d7fa72cee11d4166c9e1a10)
Signed-off-by: Dongjoon Hyun 
---
 .../main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
index 0cfa84d..be06206 100644
--- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
+++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
@@ -493,7 +493,7 @@ public class ThriftHttpServlet extends TServlet {
     // Password must be present
     if (creds[1] == null || creds[1].isEmpty()) {
       throw new HttpAuthenticationException("Authorization header received " +
-          "from the client does not contain username.");
+          "from the client does not contain password.");
     }
     return creds[1];
   }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 731e056  [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message
731e056 is described below

commit 731e05672983f75ba70b10207774d29ab4ad8602
Author: Chilaka Ramakrishna 
AuthorDate: Tue Jan 4 23:30:20 2022 -0800

[SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message

### What changes were proposed in this pull request?
The error message is not correct, so we update it.

### Why are the changes needed?
The exception message shown when the password is left empty in HTTP mode of the Hive thrift server is not correct. Updated the text to reflect it. Please check:

JIRA ISSUE: https://issues.apache.org/jira/browse/SPARK-37807

### Does this PR introduce _any_ user-facing change?
Yes, the exception message in HiveServer2 is changed.

### How was this patch tested?
This was tested manually

Closes #35097 from RamakrishnaChilaka/feature/error_string_fix.

Authored-by: Chilaka Ramakrishna 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 27d5575f13fe69459d7fa72cee11d4166c9e1a10)
Signed-off-by: Dongjoon Hyun 
---
 .../main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
index 0e2f374..f0f5cdc 100644
--- a/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
+++ b/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java
@@ -492,7 +492,7 @@ public class ThriftHttpServlet extends TServlet {
     // Password must be present
     if (creds[1] == null || creds[1].isEmpty()) {
       throw new HttpAuthenticationException("Authorization header received " +
-          "from the client does not contain username.");
+          "from the client does not contain password.");
     }
     return creds[1];
   }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (4a2ba5b -> 27d5575)

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 4a2ba5b  [SPARK-37813][SQL][TESTS] Enable vectorization for nested column in ORC scan benchmark
 add 27d5575  [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message

No new revisions were added by this update.

Summary of changes:
 .../main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (9a9b379 -> 4a2ba5b)

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 9a9b379  [SPARK-37786][SQL] StreamingQueryListener support use SQLConf.get to get corresponding SessionState's SQLConf
 add 4a2ba5b  [SPARK-37813][SQL][TESTS] Enable vectorization for nested column in ORC scan benchmark

No new revisions were added by this update.

Summary of changes:
 .../benchmarks/OrcReadBenchmark-jdk11-results.txt  | 198 ++---
 .../benchmarks/OrcReadBenchmark-jdk17-results.txt  | 168 +++--
 sql/hive/benchmarks/OrcReadBenchmark-results.txt   | 188 +--
 .../spark/sql/hive/orc/OrcReadBenchmark.scala  |   8 +-
 4 files changed, 349 insertions(+), 213 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (93c614b -> 9a9b379)

2022-01-04 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 93c614b  [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS`
 add 9a9b379  [SPARK-37786][SQL] StreamingQueryListener support use SQLConf.get to get corresponding SessionState's SQLConf

No new revisions were added by this update.

Summary of changes:
 project/MimaExcludes.scala   |  4 +++-
 .../spark/sql/internal/BaseSessionStateBuilder.scala |  3 ++-
 .../spark/sql/streaming/StreamingQueryManager.scala  | 16 ++--
 .../streaming/StreamingQueryListenersConfSuite.scala | 20 +++-
 4 files changed, 34 insertions(+), 9 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS`

2022-01-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 9b6be1a  [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS`
9b6be1a is described below

commit 9b6be1a6c004e50ffdf59f7fa1986adeb03e45cd
Author: Jiaan Geng 
AuthorDate: Wed Jan 5 12:57:21 2022 +0800

[SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/30943 supports the syntax `(IGNORE | RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`, but did not update the documentation.
The screen snapshot before this PR:

![screenshot-20211231-174803](https://user-images.githubusercontent.com/8486025/147816336-debca074-0b84-48e8-9ed2-cb13f562cf12.png)

This PR adds documentation for the syntax `(IGNORE | RESPECT) NULLS`.

The screen snapshot after this PR:

![image](https://user-images.githubusercontent.com/8486025/148141568-506e9232-a3c4-4a25-a5c6-65a5d5a2e066.png)

![image](https://user-images.githubusercontent.com/8486025/148061495-b7198417-9d4c-4c03-9060-385271ea9a46.png)

### Why are the changes needed?
Add document for syntax `(IGNORE | RESPECT) NULLS`

### Does this PR introduce _any_ user-facing change?
'No'. Just update docs.

### How was this patch tested?
Manual check.

Closes #35079 from beliefer/SPARK-30789-docs.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 93c614bf1e6aba092d82bcd8616b5ea31eb191a2)
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-syntax-qry-select-window.md | 37 ++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/sql-ref-syntax-qry-select-window.md b/docs/sql-ref-syntax-qry-select-window.md
index a1c2b18..6e65778 100644
--- a/docs/sql-ref-syntax-qry-select-window.md
+++ b/docs/sql-ref-syntax-qry-select-window.md
@@ -26,7 +26,7 @@ Window functions operate on a group of rows, referred to as a window, and calcul
 ### Syntax
 
 ```sql
-window_function OVER
+window_function [ nulls_option ] OVER
 ( [  { PARTITION | DISTRIBUTE } BY partition_col_name = partition_col_val ( [ , ... ] ) ]
   { ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ]
   [ window_frame ] )
@@ -42,7 +42,7 @@ window_function OVER
 
 * Analytic Functions
 
-  **Syntax:** `CUME_DIST | LAG | LEAD`
+  **Syntax:** `CUME_DIST | LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE`
 
 * Aggregate Functions
 
@@ -50,6 +50,16 @@ window_function OVER
 
   Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions.
 
+* **nulls_option**
+
+    Specifies whether or not to skip null values when evaluating the window function. `RESPECT NULLS` means not skipping null values, while `IGNORE NULLS` means skipping. If not specified, the default is `RESPECT NULLS`.
+
+    **Syntax:**
+
+    `{ IGNORE | RESPECT } NULLS`
+
+    **Note:** Only `LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE` can be used with `IGNORE NULLS`.
+
 * **window_frame**
 
     Specifies which row to start the window on and where to end it.
@@ -184,6 +194,29 @@ SELECT name, salary,
 | Jane|  Marketing| 29000|29000|35000|
 | Jeff|  Marketing| 35000|29000|0|
 +-+---+--+-+-+
+
+SELECT id, v,
+LEAD(v, 0) IGNORE NULLS OVER w lead,
+LAG(v, 0) IGNORE NULLS OVER w lag,
+NTH_VALUE(v, 2) IGNORE NULLS OVER w nth_value,
+FIRST_VALUE(v) IGNORE NULLS OVER w first_value,
+LAST_VALUE(v) IGNORE NULLS OVER w last_value
+FROM test_ignore_null
+WINDOW w AS (ORDER BY id)
+ORDER BY id;
++--++++-+---+--+
+|id|   v|lead| lag|nth_value|first_value|last_value|
++--++++-+---+--+
+| 0|NULL|NULL|NULL| NULL|   NULL|  NULL|
+| 1|   x|   x|   x| NULL|  x| x|
+| 2|NULL|NULL|NULL| NULL|  x| x|
+| 3|NULL|NULL|NULL| NULL|  x| x|
+| 4|   y|   y|   y|y|  x| y|
+| 5|NULL|NULL|NULL|y|  x| y|
+| 6|   z|   z|   z|y|  x| z|
+| 7|   v|   v|   v|y|  x| v|
+| 8|NULL|NULL|NULL|y|  x| v|
++--++++-+---+--+
 ```
 
 ### Related Statements

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3c6c690 -> 93c614b)

2022-01-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3c6c690  [MINOR][DOCS] Make code blocks pretty in README.md
 add 93c614b  [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS`

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-qry-select-window.md | 37 ++--
 1 file changed, 35 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (639d6f4 -> 3c6c690)

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 639d6f4  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
 add 3c6c690  [MINOR][DOCS] Make code blocks pretty in README.md

No new revisions were added by this update.

Summary of changes:
 README.md | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

2022-01-04 Thread wuyi
This is an automated email from the ASF dual-hosted git repository.

wuyi pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new be441e8  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
be441e8 is described below

commit be441e84069acc711ea848c69ae5bd55a7c93531
Author: yi.wu 
AuthorDate: Wed Jan 5 10:48:16 2022 +0800

[SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of on a thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue as SPARK-14180. Note that since Spark recently upgraded to Hadoop 3, each hook now has a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupts the hook execution once it reaches the timeout. [...]
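
A bare-bones illustration of the pattern (assumed names, not the Spark code verbatim): call `System.exit` from a dedicated thread, so the calling RPC/event thread is already free when the shutdown hook runs `executor.stop`.

```scala
object ExitHelper {
  // Run System.exit off the calling (e.g. RpcEnv) thread so the shutdown
  // hook's executor.stop cannot deadlock against it.
  def exitAsync(exitCode: Int): Unit =
    new Thread("exit-executor") {
      override def run(): Unit = System.exit(exitCode)
    }.start()
}
```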

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu 
Signed-off-by: yi.wu 
(cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366)
Signed-off-by: yi.wu 
---
 .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
index efffc9f..b7a5728 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
@@ -54,8 +54,12 @@ private[spark] class WorkerWatcher(
     if (isTesting) {
       isShutDown = true
     } else if (isChildProcessStopping.compareAndSet(false, true)) {
-      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock
-      System.exit(-1)
+      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock.
+      // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid
+      // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`.
+      new Thread("WorkerWatcher-exit-executor") {
+        override def run(): Unit = System.exit(-1)
+      }.start()
     }
 
   override def receive: PartialFunction[Any, Unit] = {

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

2022-01-04 Thread wuyi
This is an automated email from the ASF dual-hosted git repository.

wuyi pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 70d4fb1  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
70d4fb1 is described below

commit 70d4fb15c8b7d02a4e7fe3d1c82cfd35d23d053c
Author: yi.wu 
AuthorDate: Wed Jan 5 10:48:16 2022 +0800

[SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of on a thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue as SPARK-14180. Note that since Spark recently upgraded to Hadoop 3, each hook now has a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupts the hook execution once it reaches the timeout. [...]

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu 
Signed-off-by: yi.wu 
(cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366)
Signed-off-by: yi.wu 
---
 .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
index efffc9f..b7a5728 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
@@ -54,8 +54,12 @@ private[spark] class WorkerWatcher(
     if (isTesting) {
       isShutDown = true
     } else if (isChildProcessStopping.compareAndSet(false, true)) {
-      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock
-      System.exit(-1)
+      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock.
+      // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid
+      // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`.
+      new Thread("WorkerWatcher-exit-executor") {
+        override def run(): Unit = System.exit(-1)
+      }.start()
     }
 
   override def receive: PartialFunction[Any, Unit] = {

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

2022-01-04 Thread wuyi
This is an automated email from the ASF dual-hosted git repository.

wuyi pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 537de84  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
537de84 is described below

commit 537de8459e3f7b1cd02521efa24b2036a9019ba5
Author: yi.wu 
AuthorDate: Wed Jan 5 10:48:16 2022 +0800

[SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

### What changes were proposed in this pull request?

This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of on a thread of `RpcEnv`.

### Why are the changes needed?

`System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue as SPARK-14180. Note that since Spark recently upgraded to Hadoop 3, each hook now has a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupts the hook execution once it reaches the timeout. [...]

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually.

Closes #35069 from Ngone51/fix-workerwatcher-exit.

Authored-by: yi.wu 
Signed-off-by: yi.wu 
(cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366)
Signed-off-by: yi.wu 
---
 .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
index efffc9f..b7a5728 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala
@@ -54,8 +54,12 @@ private[spark] class WorkerWatcher(
     if (isTesting) {
       isShutDown = true
     } else if (isChildProcessStopping.compareAndSet(false, true)) {
-      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock
-      System.exit(-1)
+      // SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock.
+      // Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid
+      // dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`.
+      new Thread("WorkerWatcher-exit-executor") {
+        override def run(): Unit = System.exit(-1)
+      }.start()
     }
 
   override def receive: PartialFunction[Any, Unit] = {

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3b88bc8 -> 639d6f4)

2022-01-04 Thread wuyi
This is an automated email from the ASF dual-hosted git repository.

wuyi pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3b88bc8  [SPARK-37792][CORE] Fix the check of custom configuration in SparkShellLoggingFilter
 add 639d6f4  [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (98e1c77 -> 3b88bc8)

2022-01-04 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 98e1c77  [SPARK-37803][SQL] Add ORC read benchmarks for structs
 add 3b88bc8  [SPARK-37792][CORE] Fix the check of custom configuration in SparkShellLoggingFilter

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/internal/Logging.scala  | 19 --
 .../org/apache/spark/internal/LoggingSuite.scala   | 23 +++---
 .../scala/org/apache/spark/repl/ReplSuite.scala| 19 --
 3 files changed, 42 insertions(+), 19 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-37803][SQL] Add ORC read benchmarks for structs

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 98e1c77  [SPARK-37803][SQL] Add ORC read benchmarks for structs
98e1c77 is described below

commit 98e1c77710e44190112610e21d6f02de1b620611
Author: Bruce Robbins 
AuthorDate: Tue Jan 4 15:55:11 2022 -0800

[SPARK-37803][SQL] Add ORC read benchmarks for structs

### What changes were proposed in this pull request?

Add Orc read benchmarks for structs and nested structs.

### Why are the changes needed?

This PR will provide baseline benchmarks for PR #35090, which will 
hopefully make the deserialization of Orc structs more efficient.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New benchmark tests.

Closes #35100 from bersprockets/orc_struct_benchmark.

Authored-by: Bruce Robbins 
Signed-off-by: Dongjoon Hyun 
---
 sql/hive/benchmarks/OrcReadBenchmark-results.txt   | 168 ++---
 .../spark/sql/hive/orc/OrcReadBenchmark.scala  |  82 ++
 2 files changed, 199 insertions(+), 51 deletions(-)

diff --git a/sql/hive/benchmarks/OrcReadBenchmark-results.txt b/sql/hive/benchmarks/OrcReadBenchmark-results.txt
index 24969ce..9ffd7a5 100644
--- a/sql/hive/benchmarks/OrcReadBenchmark-results.txt
+++ b/sql/hive/benchmarks/OrcReadBenchmark-results.txt
@@ -6,49 +6,49 @@ OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single TINYINT Column Scan:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Native ORC MR                                       700            748          79         22.5          44.5       1.0X
-Native ORC Vectorized                               103            126          15        153.2           6.5       6.8X
-Hive built-in ORC                                   952            978          26         16.5          60.5       0.7X
+Native ORC MR                                       832           1153         453         18.9          52.9       1.0X
+Native ORC Vectorized                               148            189          24        106.5           9.4       5.6X
+Hive built-in ORC                                   986           1028          59         15.9          62.7       0.8X
 
 OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single SMALLINT Column Scan:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Native ORC MR                                       793            814          35         19.8          50.4       1.0X
-Native ORC Vectorized                               107            119          12        146.9           6.8       7.4X
-Hive built-in ORC                                  1025           1025           1         15.3          65.1       0.8X
+Native ORC MR                                       868            913          60         18.1          55.2       1.0X
+Native ORC Vectorized                               133            150          21        118.6           8.4       6.5X
+Hive built-in ORC                                  1098           1102           6         14.3          69.8       0.8X
 
 OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single INT Column Scan:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Native ORC MR                                       818            832          19         19.2          52.0       1.0X
-Native ORC Vectorized                               141            167          28        111.7           8.9       5.8X
-Hive built-in ORC                                  1079           1089          15         14.6          68.6       0.8X
+Native ORC MR                                       898            917          24         17.5          57.1       1.0X
+Native ORC Vectorized                               155            175          16        101.4           9.9       5.8X
+Hive built-in ORC                                  1114           1126          17         14.1          70.8       0.8X
 
 OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1022-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 

[spark] branch master updated (eeef48fa -> 9c6f3fb)

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from eeef48fa [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
 add 9c6f3fb  [SPARK-37806][K8S] Support minimum number of tasks per executor before being rolled

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/k8s/Config.scala | 10 ++
 .../scheduler/cluster/k8s/ExecutorRollPlugin.scala |  9 +++--
 .../cluster/k8s/ExecutorRollPluginSuite.scala  | 23 +-
 3 files changed, 39 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

2022-01-04 Thread joshrosen
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 3aaf722  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
3aaf722 is described below

commit 3aaf722a0d6552733504c794f59d390c349dfa80
Author: Josh Rosen 
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

[SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

### What changes were proposed in this pull request?

This PR fixes a correctness issue in the CodeGenerator.addBufferedState() 
helper method (which is used by the SortMergeJoinExec operator).

The addBufferedState() method generates code for buffering values that come 
from a row in an operator's input iterator, performing any necessary copying so 
that the buffered values remain correct after the input iterator advances to 
the next row.

The current logic does not correctly handle UDTs: these fall through to the 
match statement's default branch, causing UDT values to be buffered without 
copying. This is problematic if the UDT's underlying SQL type is an array, map, 
struct, or string type (since those types require copying). Failing to copy 
values can lead to correctness issues or crashes.

This patch's fix is simple: when the dataType is a UDT, use its underlying 
sqlType for determining whether values need to be copied. I used an existing 
helper function to perform this type unwrapping.
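
A self-contained toy model of the fix's idea (illustrative types, not Spark's actual `DataType` hierarchy or CodeGenerator): unwrap a UDT to its underlying SQL type before deciding whether a buffered value needs a deep copy.

```scala
sealed trait DataType
case object StringType extends DataType
case object StructType extends DataType
case object ArrayType extends DataType
case object MapType extends DataType
case object IntegerType extends DataType
final case class UserDefinedType(sqlType: DataType) extends DataType

object BufferCopyCheck {
  // Recursively unwrap UDTs to the SQL type they are backed by.
  def unwrapUDT(dt: DataType): DataType = dt match {
    case UserDefinedType(underlying) => unwrapUDT(underlying)
    case other => other
  }

  // Strings, structs, arrays and maps are backed by mutable memory, so a
  // buffered value of those types (even behind a UDT) must be copied.
  def needsCopy(dt: DataType): Boolean = unwrapUDT(dt) match {
    case StringType | StructType | ArrayType | MapType => true
    case _ => false
  }
}
```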

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually tested this change by re-running a workload which failed with a 
segfault prior to this patch. See JIRA for more details: 
https://issues.apache.org/jira/browse/SPARK-37784

So far I have been unable to come up with a CI-runnable regression test 
which would have failed prior to this change (my only working reproduction runs 
in a pre-production environment and does not fail in my development 
environment).

Closes #35066 from JoshRosen/SPARK-37784.

Authored-by: Josh Rosen 
Signed-off-by: Josh Rosen 
(cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
Signed-off-by: Josh Rosen 
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 74c6ea5..90ebe40 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -330,7 +330,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

2022-01-04 Thread joshrosen
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 5cc8b39  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
5cc8b39 is described below

commit 5cc8b397f1b01b5ee4a26e8e8540baf1e05c97a0
Author: Josh Rosen 
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

[SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

### What changes were proposed in this pull request?

This PR fixes a correctness issue in the CodeGenerator.addBufferedState() 
helper method (which is used by the SortMergeJoinExec operator).

The addBufferedState() method generates code for buffering values that come 
from a row in an operator's input iterator, performing any necessary copying so 
that the buffered values remain correct after the input iterator advances to 
the next row.

The current logic does not correctly handle UDTs: these fall through to the 
match statement's default branch, causing UDT values to be buffered without 
copying. This is problematic if the UDT's underlying SQL type is an array, map, 
struct, or string type (since those types require copying). Failing to copy 
values can lead to correctness issues or crashes.

This patch's fix is simple: when the dataType is a UDT, use its underlying 
sqlType for determining whether values need to be copied. I used an existing 
helper function to perform this type unwrapping.

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually tested this change by re-running a workload which failed with a 
segfault prior to this patch. See JIRA for more details: 
https://issues.apache.org/jira/browse/SPARK-37784

So far I have been unable to come up with a CI-runnable regression test 
which would have failed prior to this change (my only working reproduction runs 
in a pre-production environment and does not fail in my development 
environment).

Closes #35066 from JoshRosen/SPARK-37784.

Authored-by: Josh Rosen 
Signed-off-by: Josh Rosen 
(cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
Signed-off-by: Josh Rosen 
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 6e6b946..4092436 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -334,7 +334,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

2022-01-04 Thread joshrosen
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 45b7b7e  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
45b7b7e is described below

commit 45b7b7e1682946ef86b42fd85d58bdd471996d0c
Author: Josh Rosen 
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

[SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

### What changes were proposed in this pull request?

This PR fixes a correctness issue in the CodeGenerator.addBufferedState() 
helper method (which is used by the SortMergeJoinExec operator).

The addBufferedState() method generates code for buffering values that come 
from a row in an operator's input iterator, performing any necessary copying so 
that the buffered values remain correct after the input iterator advances to 
the next row.

The current logic does not correctly handle UDTs: these fall through to the 
match statement's default branch, causing UDT values to be buffered without 
copying. This is problematic if the UDT's underlying SQL type is an array, map, 
struct, or string type (since those types require copying). Failing to copy 
values can lead to correctness issues or crashes.

This patch's fix is simple: when the dataType is a UDT, use its underlying 
sqlType for determining whether values need to be copied. I used an existing 
helper function to perform this type unwrapping.

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually tested this change by re-running a workload which failed with a 
segfault prior to this patch. See JIRA for more details: 
https://issues.apache.org/jira/browse/SPARK-37784

So far I have been unable to come up with a CI-runnable regression test 
which would have failed prior to this change (my only working reproduction runs 
in a pre-production environment and does not fail in my development 
environment).

Closes #35066 from JoshRosen/SPARK-37784.

Authored-by: Josh Rosen 
Signed-off-by: Josh Rosen 
(cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
Signed-off-by: Josh Rosen 
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index b8b5a40..132bb25 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -347,7 +347,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

2022-01-04 Thread joshrosen
This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eeef48fa [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
eeef48fa is described below

commit eeef48fac412a57382b02ba3f39456d96379b5f5
Author: Josh Rosen 
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

[SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

### What changes were proposed in this pull request?

This PR fixes a correctness issue in the CodeGenerator.addBufferedState() 
helper method (which is used by the SortMergeJoinExec operator).

The addBufferedState() method generates code for buffering values that come 
from a row in an operator's input iterator, performing any necessary copying so 
that the buffered values remain correct after the input iterator advances to 
the next row.

The current logic does not correctly handle UDTs: these fall through to the 
match statement's default branch, causing UDT values to be buffered without 
copying. This is problematic if the UDT's underlying SQL type is an array, map, 
struct, or string type (since those types require copying). Failing to copy 
values can lead to correctness issues or crashes.

This patch's fix is simple: when the dataType is a UDT, use its underlying 
sqlType for determining whether values need to be copied. I used an existing 
helper function to perform this type unwrapping.

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually tested this change by re-running a workload which failed with a 
segfault prior to this patch. See JIRA for more details: 
https://issues.apache.org/jira/browse/SPARK-37784

So far I have been unable to come up with a CI-runnable regression test 
which would have failed prior to this change (my only working reproduction runs 
in a pre-production environment and does not fail in my development 
environment).

Closes #35066 from JoshRosen/SPARK-37784.

Authored-by: Josh Rosen 
Signed-off-by: Josh Rosen 
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 720dfb2..c982a7b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -348,7 +348,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated (c280f08 -> 5f9b92c)

2022-01-04 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c280f08  [SPARK-37800][SQL] TreeNode.argString incorrectly formats arguments of type Set[_]
 add 5f9b92c  [SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

No new revisions were added by this update.

Summary of changes:
 .../datasources/orc/OrcArrayColumnVector.java  | 13 +---
 .../execution/datasources/orc/OrcColumnVector.java |  2 +-
 .../datasources/orc/OrcColumnVectorUtils.java  |  6 ++
 .../datasources/orc/OrcMapColumnVector.java| 13 +---
 .../execution/datasources/orc/OrcQuerySuite.scala  | 23 ++
 5 files changed, 36 insertions(+), 21 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3a2da70 -> 08fd501)

2022-01-04 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3a2da70  [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2
 add 08fd501  [SPARK-37750][SQL] ANSI mode: optionally return null result if element not exists in array/map

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |  9 ++-
 .../expressions/collectionOperations.scala |  6 +-
 .../expressions/complexTypeExtractors.scala| 10 +--
 .../spark/sql/errors/QueryExecutionErrors.scala| 32 +++--
 .../org/apache/spark/sql/internal/SQLConf.scala| 10 +++
 .../test/resources/sql-tests/inputs/ansi/array.sql | 16 +
 .../test/resources/sql-tests/inputs/ansi/map.sql   |  7 ++
 .../resources/sql-tests/results/ansi/array.sql.out | 84 +-
 .../resources/sql-tests/results/ansi/map.sql.out   | 29 +++-
 9 files changed, 182 insertions(+), 21 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (a8addd4 -> 3a2da70)

2022-01-04 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a8addd4  [SPARK-37800][SQL] TreeNode.argString incorrectly formats arguments of type Set[_]
 add 3a2da70  [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2

No new revisions were added by this update.

Summary of changes:
 .../sql/connector/expressions/Expression.java  |  2 +-
 .../sql/connector/expressions/aggregate/Count.java |  3 --
 .../connector/expressions/aggregate/CountStar.java |  3 --
 .../{Count.java => GeneralAggregateFunc.java}  | 42 ++
 .../sql/connector/expressions/aggregate/Max.java   |  3 --
 .../sql/connector/expressions/aggregate/Min.java   |  3 --
 .../sql/connector/expressions/aggregate/Sum.java   |  3 --
 .../sql/connector/expressions/filter/Filter.java   |  3 --
 .../connector/read/SupportsPushDownAggregates.java | 21 +--
 .../sql/connector/expressions/expressions.scala| 20 ---
 .../expressions/TransformExtractorSuite.scala  |  8 ++---
 .../spark/sql/execution/DataSourceScanExec.scala   |  4 +--
 .../execution/datasources/DataSourceStrategy.scala |  6 ++--
 .../datasources/v2/V2ScanRelationPushDown.scala| 16 ++---
 .../datasources/v2/jdbc/JDBCScanBuilder.scala  |  2 +-
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala   |  8 +++--
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 33 -
 17 files changed, 99 insertions(+), 81 deletions(-)
 copy sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/{Count.java => GeneralAggregateFunc.java} (51%)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org