[spark] branch master updated: [SPARK-39864][SQL] Lazily register ExecutionListenerBus

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 869fc2198a4 [SPARK-39864][SQL] Lazily register ExecutionListenerBus
869fc2198a4 is described below

commit 869fc2198a4bb51bc03dce36fb2b61a57fe3006e
Author: Josh Rosen 
AuthorDate: Wed Jul 27 14:25:54 2022 +0900

[SPARK-39864][SQL] Lazily register ExecutionListenerBus

### What changes were proposed in this pull request?

This PR modifies `ExecutionListenerManager` so that its 
`ExecutionListenerBus` SparkListener is lazily registered during the first 
`.register(QueryExecutionListener)` call (instead of eagerly registering it in the 
constructor).

### Why are the changes needed?

This addresses a ListenerBus performance problem in applications with large 
numbers of short-lived SparkSessions.

The `ExecutionListenerBus` SparkListener is unregistered by the 
ContextCleaner after its associated ExecutionListenerManager/SparkSession is 
garbage-collected (see #31839). If many sessions are rapidly created and 
destroyed but the driver GC doesn't run, this can result in a large number of 
unused ExecutionListenerBus listeners being registered on the shared 
ListenerBus queue. This can cause performance problems in the ListenerBus 
because each listener invocation has some overhead.

In one real-world application with a very large driver heap and high rate 
of SparkSession creation (hundreds per minute), I saw 5000 idle 
ExecutionListenerBus listeners, resulting in ~50ms median event processing 
times on the shared listener queue.

This patch avoids this problem by making the listener registration lazy: if 
a short-lived SparkSession never uses QueryExecutionListeners then we won't 
register the ExecutionListenerBus and won't incur these overheads.
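
    For context, a minimal usage sketch (not part of the commit): a `QueryExecutionListener`
    is attached through `SparkSession.listenerManager`, and with this patch the
    `ExecutionListenerBus` is only created and registered when `register(...)` is first
    called. Names such as `LazyListenerExample` are illustrative.

    ```
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    object LazyListenerExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("lazy-listener-demo").getOrCreate()

        // A session that never calls register() no longer adds an ExecutionListenerBus
        // to the shared ListenerBus queue after this change.
        val listener = new QueryExecutionListener {
          override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
            println(s"$funcName finished in ${durationNs / 1e6} ms")
          override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
            println(s"$funcName failed: ${exception.getMessage}")
        }

        // The first register() call is what now creates and registers the ExecutionListenerBus.
        spark.listenerManager.register(listener)
        spark.range(10).count()
        spark.stop()
      }
    }
    ```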

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new unit test.

Closes #37282 from JoshRosen/SPARK-39864.

Authored-by: Josh Rosen 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/util/QueryExecutionListener.scala  | 20 ++--
 .../sql/util/ExecutionListenerManagerSuite.scala | 15 +++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala
index 7ac06a5cd7e..45482f12f3c 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala
@@ -81,7 +81,10 @@ class ExecutionListenerManager private[sql](
 loadExtensions: Boolean)
   extends Logging {
 
-  private val listenerBus = new ExecutionListenerBus(this, session)
+  // SPARK-39864: lazily create the listener bus on the first register() call in order to
+  // avoid listener overheads when QueryExecutionListeners aren't used:
+  private val listenerBusInitializationLock = new Object()
+  @volatile private var listenerBus: Option[ExecutionListenerBus] = None
 
   if (loadExtensions) {
 val conf = session.sparkContext.conf
@@ -97,7 +100,12 @@ class ExecutionListenerManager private[sql](
*/
   @DeveloperApi
   def register(listener: QueryExecutionListener): Unit = {
-listenerBus.addListener(listener)
+listenerBusInitializationLock.synchronized {
+  if (listenerBus.isEmpty) {
+listenerBus = Some(new ExecutionListenerBus(this, session))
+  }
+}
+listenerBus.get.addListener(listener)
   }
 
   /**
@@ -105,7 +113,7 @@ class ExecutionListenerManager private[sql](
*/
   @DeveloperApi
   def unregister(listener: QueryExecutionListener): Unit = {
-listenerBus.removeListener(listener)
+listenerBus.foreach(_.removeListener(listener))
   }
 
   /**
@@ -113,12 +121,12 @@ class ExecutionListenerManager private[sql](
*/
   @DeveloperApi
   def clear(): Unit = {
-listenerBus.removeAllListeners()
+listenerBus.foreach(_.removeAllListeners())
   }
 
   /** Only exposed for testing. */
   private[sql] def listListeners(): Array[QueryExecutionListener] = {
-listenerBus.listeners.asScala.toArray
+listenerBus.map(_.listeners.asScala.toArray).getOrElse(Array.empty[QueryExecutionListener])
   }
 
   /**
@@ -127,7 +135,7 @@ class ExecutionListenerManager private[sql](
   private[sql] def clone(session: SparkSession, sqlConf: SQLConf): 
ExecutionListenerManager = {
 val newListenerManager =
   new ExecutionListenerManager(session, sqlConf, loadExtensions = false)
-listenerBus.listeners.asScala.foreach(newListenerManager.register)
+listenerBus.foreach(_.listeners.asScala.foreach(newListenerManager.register))
 newListenerManager
   }

[spark] branch master updated: [SPARK-39868][CORE][TESTS] StageFailed event should attach with the root cause

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 56f0233658e [SPARK-39868][CORE][TESTS] StageFailed event should attach 
with the root cause
56f0233658e is described below

commit 56f0233658e53425ff915e803284139defb4af42
Author: panbingkun 
AuthorDate: Wed Jul 27 14:21:34 2022 +0900

[SPARK-39868][CORE][TESTS] StageFailed event should attach with the root 
cause

### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/37245: the StageFailed event 
should carry the root cause.

### Why are the changes needed?
**It helps users understand the reason for the failure.**

By carefully investigating the issue 
https://issues.apache.org/jira/browse/SPARK-39622,
I found the root cause of the test failure: the StageFailed event does not attach 
the failure reason from the executor.
When OutputCommitCoordinator executes 'taskCompleted', the 'reason' is ignored.

Scenario 1: receive TaskSetFailed (Success)
> InsertIntoHadoopFsRelationCommand
> FileFormatWriter.write
> _**handleTaskSetFailed**_ (**attach root cause**)
> abortStage
> failJobAndIndependentStages
> SparkListenerJobEnd

Scenario 2: receive StageFailed (Fail)
> InsertIntoHadoopFsRelationCommand
> FileFormatWriter.write
> _**handleStageFailed**_ (**don't attach root cause**)
> abortStage
> failJobAndIndependentStages
> SparkListenerJobEnd
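
    Hedged sketch (not the actual Spark source) of the shape of the enriched failure
    message once the task-failure `reason` is appended:

    ```
    object StageFailedMessage {
      // Illustrative only: the stage-failure message now carries the task-failure reason,
      // so the root cause is visible in the job-end event and in test assertions.
      def authorizedCommitterFailureMessage(
          attemptNumber: Int,
          stage: Int,
          partition: Int,
          reason: String): String =
        s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, partition=$partition) " +
          s"failed; but task commit success, data duplication may happen. reason=$reason"
    }
    ```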

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually run UTs & pass GitHub Actions

Closes #37292 from panbingkun/SPARK-39868.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/scheduler/OutputCommitCoordinator.scala   |  3 ++-
 .../OutputCommitCoordinatorIntegrationSuite.scala  |  3 ++-
 .../sql/execution/datasources/parquet/ParquetIOSuite.scala | 14 ++
 3 files changed, 6 insertions(+), 14 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala 
b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala
index a33c2bb93bc..cd5d6b8f9c9 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala
@@ -160,7 +160,8 @@ private[spark] class OutputCommitCoordinator(
 if (stageState.authorizedCommitters(partition) == taskId) {
   sc.foreach(_.dagScheduler.stageFailed(stage, s"Authorized committer " +
 s"(attemptNumber=$attemptNumber, stage=$stage, partition=$partition) failed; " +
-s"but task commit success, data duplication may happen."))
+s"but task commit success, data duplication may happen. " +
+s"reason=$reason"))
 }
 }
   }
diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala
 
b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala
index 66b13be4f7a..45da750768f 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala
@@ -51,7 +51,8 @@ class OutputCommitCoordinatorIntegrationSuite
 sc.parallelize(1 to 4, 
2).map(_.toString).saveAsTextFile(tempDir.getAbsolutePath + "/out")
   }
 }.getCause.getMessage
-assert(e.endsWith("failed; but task commit success, data duplication may happen."))
+assert(e.contains("failed; but task commit success, data duplication may happen.") &&
+  e.contains("Intentional exception"))
   }
 }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
index a06ddc1b9e9..5a8f4563756 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@@ -1202,15 +1202,7 @@ class ParquetIOSuite extends QueryTest with ParquetTest 
with SharedSparkSession
 val m1 = intercept[SparkException] {
   
spark.range(1).coalesce(1).write.options(extraOptions).parquet(dir.getCanonicalPath)
 }.getCause.getMessage
-// SPARK-39622: The current case must handle the `TaskSetFailed` event before SPARK-39195
-// due to `maxTaskFailures` is 1 when local mode. After SPARK-39195, it may handle to one
-// of the `TaskSetFailed` event and `StageFailed` event, and the execution order of t

[spark] branch master updated: [SPARK-39882][BUILD] Upgrade rocksdbjni to 7.4.3

2022-07-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e79b9f05360 [SPARK-39882][BUILD] Upgrade rocksdbjni to 7.4.3
e79b9f05360 is described below

commit e79b9f053605369a60f15c606f1a4bc0b4f31329
Author: yangjie01 
AuthorDate: Tue Jul 26 21:52:23 2022 -0700

[SPARK-39882][BUILD] Upgrade rocksdbjni to 7.4.3

### What changes were proposed in this pull request?
This PR aims to upgrade RocksDB JNI library from 7.3.1 to 7.4.3.

### Why are the changes needed?
This version brings some bug fixes; the release notes are as follows:

- https://github.com/facebook/rocksdb/releases/tag/v7.4.3

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA
- The benchmark results:

**Before 7.3.1**

```
[INFO] Running org.apache.spark.util.kvstore.RocksDBBenchmark
                                          count  mean    min     max      95th
dbClose                                   4      0.346   0.266   0.539    0.539
dbCreation                                4      78.389  3.583   301.763  301.763
naturalIndexCreateIterator                1024   0.005   0.002   1.469    0.007
naturalIndexDescendingCreateIterator      1024   0.007   0.006   0.064    0.008
naturalIndexDescendingIteration           1024   0.005   0.004   0.024    0.007
naturalIndexIteration                     1024   0.006   0.004   0.053    0.008
randomDeleteIndexed                       1024   0.026   0.019   0.293    0.035
randomDeletesNoIndex                      1024   0.015   0.012   0.053    0.017
randomUpdatesIndexed                      1024   0.081   0.033   30.752   0.083
randomUpdatesNoIndex                      1024   0.037   0.034   0.502    0.041
randomWritesIndexed                       1024   0.116   0.035   52.128   0.121
randomWritesNoIndex                       1024   0.042   0.036   1.869    0.046
refIndexCreateIterator                    1024   0.004   0.004   0.019    0.006
refIndexDescendingCreateIterator          1024   0.003   0.002   0.026    0.004
refIndexDescendingIteration               1024   0.006   0.005   0.042    0.008
refIndexIteration                         1024   0.007   0.005   0.314    0.009
sequentialDeleteIndexed                   1024   0.021   0.018   0.104    0.026
sequentialDeleteNoIndex                   1024   0.015   0.012   0.053    0.019
sequentialUpdatesIndexed                  1024   0.044   0.038   0.802    0.053
sequentialUpdatesNoIndex                  1024   0.045   0.032   1.552    0.054
sequentialWritesIndexed                   1024   0.048   0.040   1.862    0.055
sequentialWritesNoIndex                   1024   0.044   0.032   2.929    0.048

```

**After 7.4.3**

```
[INFO] Running org.apache.spark.util.kvstore.RocksDBBenchmark
                                          count  mean    min     max      95th
dbClose                                   4      0.364   0.305   0.525    0.525
dbCreation                                4      80.553  3.635   310.735  310.735
naturalIndexCreateIterator                1024   0.006   0.002   1.569    0.007
naturalIndexDescendingCreateIterator      1024   0.006   0.005   0.073    0.008
naturalIndexDescendingIteration           1024   0.006   0.004   0.269    0.008
naturalIndexIteration                     1024   0.006   0.004   0.059    0.009
randomDeleteIndexed                       1024   0.027   0.020   0.339    0.036
randomDeletesNoIndex                      1024   0.015   0.012   0.037    0.018
randomUpdatesIndexed                      1024   0.084   0.032   32.286   0.090
randomUpdatesNoIndex                      1024   0.023   0.019   0.584    0.026
randomWritesIndexed                       1024   0.122   0.035   53.989   0.129
randomWritesNoIndex                       1024   0.027   0.022   1.547    0.031
refIndexCreateIterator                    1024   0.005   0.005   0.027    0.007
refIndexDescendingCreateIterator          1024   0.003   0.003   0.035    0.005
refIndexDescendingIteration               1024   0.006   0.005   0.051    0.008
refIndexIteration                         1024   0.007   0.005   0.096    0.009
sequentialDeleteIndexed                   1024   0.023   0.018   1.396    0.027
sequentialDeleteNoIndex                   1024   0.015   0.01

[spark] branch master updated: [SPARK-37888][SQL][TESTS][FOLLOWUP] Don't check the `Created By` field in `DESCRIBE TABLE` tests

2022-07-26 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8c164c075d8 [SPARK-37888][SQL][TESTS][FOLLOWUP] Don't check the 
`Created By` field in `DESCRIBE TABLE` tests
8c164c075d8 is described below

commit 8c164c075d8ae41216e7ddf93a5f29b7900b4a95
Author: Max Gekk 
AuthorDate: Wed Jul 27 08:27:05 2022 +0500

[SPARK-37888][SQL][TESTS][FOLLOWUP] Don't check the `Created By` field in 
`DESCRIBE TABLE` tests

### What changes were proposed in this pull request?
In this PR, I propose to not check the `Created By` field in tests that check 
the output of the `DESCRIBE TABLE` command.

### Why are the changes needed?
The `Created By` field depends on the current Spark version, for instance 
`Spark 3.4.0-SNAPSHOT`, so tests that check the field also depend on the Spark 
version. The changes remove this dependency and avoid having to update the tests 
whenever the Spark version is bumped.
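
    A minimal sketch of the idea (mirroring the test change below): compare only
    version-independent rows of the `DESCRIBE TABLE` output.

    ```
    // Sketch, assuming `descriptionDf` holds the DESCRIBE TABLE result, as in the suite:
    // keep only rows whose values do not depend on the running Spark version.
    val versionIndependentRows =
      descriptionDf.filter("!(col_name in ('Created Time', 'Created By'))")
    ```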

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite"
```

Closes #37299 from MaxGekk/unify-describe-table-tests-followup.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala | 3 +--
 .../apache/spark/sql/hive/execution/command/DescribeTableSuite.scala   | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala
index f8e53fee723..da4eab13afb 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala
@@ -189,7 +189,7 @@ class DescribeTableSuite extends DescribeTableSuiteBase 
with CommandSuiteBase {
 ("data_type", StringType),
 ("comment", StringType)))
   QueryTest.checkAnswer(
-descriptionDf.filter("col_name != 'Created Time'"),
+descriptionDf.filter("!(col_name in ('Created Time', 'Created By'))"),
 Seq(
   Row("data", "string", null),
   Row("id", "bigint", null),
@@ -202,7 +202,6 @@ class DescribeTableSuite extends DescribeTableSuiteBase 
with CommandSuiteBase {
   Row("Database", "ns", ""),
   Row("Table", "table", ""),
   Row("Last Access", "UNKNOWN", ""),
-  Row("Created By", "Spark 3.4.0-SNAPSHOT", ""),
   Row("Type", "EXTERNAL", ""),
   Row("Provider", getProvider(), ""),
   Row("Comment", "this is a test table", ""),
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala
index 00adb377f04..c12d236f4b6 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala
@@ -58,7 +58,7 @@ class DescribeTableSuite extends v1.DescribeTableSuiteBase 
with CommandSuiteBase
 ("comment", StringType)))
   QueryTest.checkAnswer(
 // Filter out 'Table Properties' to don't check `transient_lastDdlTime`
-descriptionDf.filter("col_name != 'Created Time' and col_name != 
'Table Properties'"),
+descriptionDf.filter("!(col_name in ('Created Time', 'Table 
Properties', 'Created By'))"),
 Seq(
   Row("data", "string", null),
   Row("id", "bigint", null),
@@ -72,7 +72,6 @@ class DescribeTableSuite extends v1.DescribeTableSuiteBase 
with CommandSuiteBase
   Row("Table", "table", ""),
   Row(TableCatalog.PROP_OWNER.capitalize, Utils.getCurrentUserName(), 
""),
   Row("Last Access", "UNKNOWN", ""),
-  Row("Created By", "Spark 3.4.0-SNAPSHOT", ""),
   Row("Type", "EXTERNAL", ""),
   Row("Provider", getProvider(), ""),
   Row("Comment", "this is a test table", ""),





[spark] branch master updated: [SPARK-39865][SQL] Show proper error messages on the overflow errors of table insert

2022-07-26 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d5dbe7d4e9e [SPARK-39865][SQL] Show proper error messages on the 
overflow errors of table insert
d5dbe7d4e9e is described below

commit d5dbe7d4e9e0e46a514c363efaac15f37d07857c
Author: Gengliang Wang 
AuthorDate: Tue Jul 26 20:26:53 2022 -0700

[SPARK-39865][SQL] Show proper error messages on the overflow errors of 
table insert

### What changes were proposed in this pull request?

In Spark 3.3, the error message of ANSI CAST was improved. However, table 
insertion uses the same CAST expression:
```
> create table tiny(i tinyint);
> insert into tiny values (1000);

org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of 
the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` 
to tolerate overflow and return NULL instead. If necessary set 
"spark.sql.ansi.enabled" to "false" to bypass this error.
```

Showing the hint `If necessary set "spark.sql.ansi.enabled" to "false" to bypass 
this error` doesn't help at all. This PR fixes the error message. After the 
changes, the error message for this example becomes:
```
org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] 
Fail to insert a value of "INT" type into the "TINYINT" type column `i` due to 
an overflow. Use `try_cast` on the input value to tolerate overflow and return 
NULL instead.
```
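
    As a quick illustration of the suggested workaround (a sketch assuming an active
    `spark` session, e.g. in `spark-shell`): `try_cast` returns NULL on overflow instead
    of raising an error.

    ```
    // 1000 does not fit into TINYINT, so try_cast yields NULL instead of throwing
    // under ANSI mode; a plain CAST would raise the overflow error shown above.
    spark.sql("SELECT try_cast(1000 AS TINYINT) AS v").show()
    ```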
### Why are the changes needed?

Show proper error messages on the overflow errors of table insert. The 
current message is super confusing.

### Does this PR introduce _any_ user-facing change?

Yes, after the changes it shows proper error messages on overflow errors during 
table insert.

### How was this patch tested?

Unit test

Closes #37283 from gengliangwang/insertionOverflow.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 core/src/main/resources/error/error-classes.json   |  6 +++
 .../catalyst/analysis/TableOutputResolver.scala| 23 ++-
 .../spark/sql/catalyst/expressions/Cast.scala  | 44 ++
 .../spark/sql/errors/QueryExecutionErrors.scala| 15 
 .../sql/errors/QueryExecutionAnsiErrorsSuite.scala | 21 ++-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 20 +-
 6 files changed, 117 insertions(+), 12 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 29ca280719e..9d35b1a1a69 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -59,6 +59,12 @@
 ],
 "sqlState" : "22005"
   },
+  "CAST_OVERFLOW_IN_TABLE_INSERT" : {
+"message" : [
+  "Fail to insert a value of  type into the  type 
column  due to an overflow. Use `try_cast` on the input value to 
tolerate overflow and return NULL instead."
+],
+"sqlState" : "22005"
+  },
   "CONCURRENT_QUERY" : {
 "message" : [
   "Another instance of this query was just started by a concurrent 
session."
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala
index aca99b001d2..b9e3c380216 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala
@@ -26,7 +26,7 @@ import 
org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
-import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+import org.apache.spark.sql.types.{ArrayType, DataType, DecimalType, 
IntegralType, MapType, StructType}
 
 object TableOutputResolver {
   def resolveOutputColumns(
@@ -220,6 +220,21 @@ object TableOutputResolver {
 }
   }
 
+  private def containsIntegralOrDecimalType(dt: DataType): Boolean = dt match {
+case _: IntegralType | _: DecimalType => true
+case a: ArrayType => containsIntegralOrDecimalType(a.elementType)
+case m: MapType =>
+  containsIntegralOrDecimalType(m.keyType) || containsIntegralOrDecimalType(m.valueType)
+case s: StructType =>
+  s.fields.exists(sf => containsIntegralOrDecimalType(sf.dataType))
+case _ => false
+  }
+
+  private def canCauseCastOverflow(cast: Cast): Boolean = {
+containsIntegralOrDecimalType(cast.dataType) &&
+  !Cast.canUpCast(cast.child.dataType, cast.dataType)
+  }
+
   private 

[spark] branch master updated: [SPARK-39319][FOLLOW-UP][SQL] Make TreeNode.context as lazy val

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 90d83479a16 [SPARK-39319][FOLLOW-UP][SQL] Make TreeNode.context as 
lazy val
90d83479a16 is described below

commit 90d83479a16cb594aa1ee6c6a8219dbb7d859752
Author: Gengliang Wang 
AuthorDate: Wed Jul 27 10:57:31 2022 +0900

[SPARK-39319][FOLLOW-UP][SQL] Make TreeNode.context as lazy val

### What changes were proposed in this pull request?

- Make TreeNode.context as lazy val
- Code clean up in SQLQueryContext

### Why are the changes needed?

Making TreeNode.context a lazy val saves memory, since the context is only 
needed by certain expressions under ANSI SQL mode.
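
    A tiny illustration (not the Spark source) of why `lazy val` helps here: the value
    is only materialized on first access, so nodes that never need a query context pay
    no allocation cost. Names below are illustrative.

    ```
    object LazyValDemo extends App {
      class Node {
        // Built only if `context` is ever accessed; most nodes never access it
        // outside ANSI error paths.
        lazy val context: String = {
          println("materializing context")
          "ctx"
        }
      }

      val n = new Node()   // nothing materialized yet
      println(n.context)   // first access triggers construction
    }
    ```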

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #37307 from gengliangwang/lazyVal.

Authored-by: Gengliang Wang 
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/catalyst/trees/SQLQueryContext.scala | 15 +--
 .../org/apache/spark/sql/catalyst/trees/TreeNode.scala|  2 +-
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala
index 8f75079fcf9..a8806dbad4d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala
@@ -42,9 +42,7 @@ case class SQLQueryContext(
*/
   lazy val summary: String = {
 // If the query context is missing or incorrect, simply return an empty 
string.
-if (sqlText.isEmpty || originStartIndex.isEmpty || originStopIndex.isEmpty ||
-  originStartIndex.get < 0 || originStopIndex.get >= sqlText.get.length ||
-  originStartIndex.get > originStopIndex.get) {
+if (!isValid) {
   ""
 } else {
   val positionContext = if (line.isDefined && startPosition.isDefined) {
@@ -119,12 +117,17 @@ case class SQLQueryContext(
 
   /** Gets the textual fragment of a SQL query. */
   override lazy val fragment: String = {
-if (sqlText.isEmpty || originStartIndex.isEmpty || originStopIndex.isEmpty ||
-  originStartIndex.get < 0 || originStopIndex.get >= sqlText.get.length ||
-  originStartIndex.get > originStopIndex.get) {
+if (!isValid) {
   ""
 } else {
   sqlText.get.substring(originStartIndex.get, originStopIndex.get)
 }
   }
+
+  private def isValid: Boolean = {
+sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined &&
+  originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length &&
+  originStartIndex.get <= originStopIndex.get
+
+  }
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
index b8cfdcdbe7f..8f5858d2f4d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
@@ -66,7 +66,7 @@ case class Origin(
 objectType: Option[String] = None,
 objectName: Option[String] = None) {
 
-  val context: SQLQueryContext = SQLQueryContext(
+  lazy val context: SQLQueryContext = SQLQueryContext(
 line, startPosition, startIndex, stopIndex, sqlText, objectType, objectName)
 }
 





[spark] branch master updated (d31505eb305 -> 5e43ae29ef2)

2022-07-26 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d31505eb305 [SPARK-39870][PYTHON][TESTS] Add flag to 'run-tests.py' to 
support retaining the output
 add 5e43ae29ef2 [SPARK-39875][SQL] Change `protected` method in final 
class to `private` or `package-visible`

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/spark/unsafe/types/UTF8String.java   | 2 +-
 .../sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java| 4 ++--
 .../sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)
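
For illustration only (not from the commit): in a `final` class no subclass can exist,
so `protected` grants nothing beyond `private`, and the narrower modifier states the
intent. The class and method names below are illustrative.

```
// `protected` is redundant here because the class cannot be extended.
final class KeyValueBatchSketch {
  private def grow(): Unit = ()   // was conceptually: protected def grow()
  def append(): Unit = grow()
}
```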





[spark] branch master updated (01d41e7de41 -> d31505eb305)

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 01d41e7de41 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum 
memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
 add d31505eb305 [SPARK-39870][PYTHON][TESTS] Add flag to 'run-tests.py' to 
support retaining the output

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 45 -
 1 file changed, 36 insertions(+), 9 deletions(-)





[spark] branch branch-3.0 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new f7a94093a67 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum 
memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
f7a94093a67 is described below

commit f7a94093a67956b94eaf99bdbb29c4351736d110
Author: yangjie01 
AuthorDate: Wed Jul 27 09:26:47 2022 +0900

[SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in 
`BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

### What changes were proposed in this pull request?
This PR changes `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and 
`HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce the maximum 
memory usage of the tests.
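
    For reference, a hedged sketch (e.g. in `spark-shell`) of the pattern being changed:
    the master string has the form `local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]`,
    so lowering the last field together with the executor memory caps the test footprint.

    ```
    import org.apache.spark.sql.SparkSession

    // Illustrative only; mirrors the test setup touched by this commit.
    val spark = SparkSession.builder()
      .master("local-cluster[2,1,512]")          // 2 workers, 1 core each, 512 MB per worker
      .config("spark.executor.memory", "512m")   // keep executor memory within the worker limit
      .appName("testing")
      .getOrCreate()
    ```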

### Why are the changes needed?
Reduce the maximum memory usage of test cases.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Should monitor CI

Closes #37298 from LuciferYang/reduce-local-cluster-memory.

Authored-by: yangjie01 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17)
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/joins/BroadcastJoinSuite.scala   |  4 +-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 43 +++---
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
index ef0a596f211..a7fe4d1c792 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins
 import scala.reflect.ClassTag
 
 import org.apache.spark.AccumulatorSuite
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.expressions.{BitwiseAnd, BitwiseOr, Cast, 
Literal, ShiftLeft}
 import org.apache.spark.sql.catalyst.plans.logical.BROADCAST
@@ -51,7 +52,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with 
SQLTestUtils
   override def beforeAll(): Unit = {
 super.beforeAll()
 spark = SparkSession.builder()
-  .master("local-cluster[2,1,1024]")
+  .master("local-cluster[2,1,512]")
+  .config(EXECUTOR_MEMORY.key, "512m")
   .appName("testing")
   .getOrCreate()
   }
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
index f9ea4e31487..b2ee82db390 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
@@ -28,6 +28,7 @@ import org.scalatest.Assertions._
 
 import org.apache.spark._
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.internal.config.UI.UI_ENABLED
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
@@ -66,7 +67,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"),
   "--name", "TemporaryHiveUDFTest",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -83,7 +85,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest1",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -100,7 +103,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest2",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -119,7 +123,8 @@ clas

[spark] branch branch-3.1 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 3768ee1e775 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum 
memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
3768ee1e775 is described below

commit 3768ee1e775d42920bb2583f5fcb5f15927688ad
Author: yangjie01 
AuthorDate: Wed Jul 27 09:26:47 2022 +0900

[SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in 
`BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

### What changes were proposed in this pull request?
This PR changes `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and 
`HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce the maximum 
memory usage of the tests.

### Why are the changes needed?
Reduce the maximum memory usage of test cases.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Should monitor CI

Closes #37298 from LuciferYang/reduce-local-cluster-memory.

Authored-by: yangjie01 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17)
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/joins/BroadcastJoinSuite.scala   |  4 +-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 43 +++---
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
index 98a1089709b..6883c8d1411 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins
 import scala.reflect.ClassTag
 
 import org.apache.spark.AccumulatorSuite
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft}
 import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, 
BuildSide}
@@ -54,7 +55,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with 
SQLTestUtils
   override def beforeAll(): Unit = {
 super.beforeAll()
 spark = SparkSession.builder()
-  .master("local-cluster[2,1,1024]")
+  .master("local-cluster[2,1,512]")
+  .config(EXECUTOR_MEMORY.key, "512m")
   .appName("testing")
   .getOrCreate()
   }
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
index 426d93b3506..862d4a71ca1 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
@@ -29,6 +29,7 @@ import org.scalatest.matchers.must.Matchers
 
 import org.apache.spark._
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.internal.config.UI.UI_ENABLED
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
@@ -67,7 +68,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"),
   "--name", "TemporaryHiveUDFTest",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -84,7 +86,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest1",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -101,7 +104,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest2",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-opti

[spark] branch branch-3.2 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 421918dd95f [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum 
memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
421918dd95f is described below

commit 421918dd95fde01baebe33415b7bb4ac2c8e0ec6
Author: yangjie01 
AuthorDate: Wed Jul 27 09:26:47 2022 +0900

[SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in 
`BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

### What changes were proposed in this pull request?
This PR changes `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and 
`HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce the maximum 
memory usage of the tests.

### Why are the changes needed?
Reduce the maximum memory usage of test cases.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Should monitor CI

Closes #37298 from LuciferYang/reduce-local-cluster-memory.

Authored-by: yangjie01 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17)
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/joins/BroadcastJoinSuite.scala   |  4 +-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 43 +++---
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
index a8b4856261d..d67e2d6ef4f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins
 import scala.reflect.ClassTag
 
 import org.apache.spark.AccumulatorSuite
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft}
 import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, 
BuildSide}
@@ -56,7 +57,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with 
SQLTestUtils
   override def beforeAll(): Unit = {
 super.beforeAll()
 spark = SparkSession.builder()
-  .master("local-cluster[2,1,1024]")
+  .master("local-cluster[2,1,512]")
+  .config(EXECUTOR_MEMORY.key, "512m")
   .appName("testing")
   .getOrCreate()
   }
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
index 90752e70e1b..66a915b4792 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
@@ -33,6 +33,7 @@ import org.scalatest.time.SpanSugar._
 import org.apache.spark._
 import org.apache.spark.deploy.SparkSubmitTestUtils
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.internal.config.UI.UI_ENABLED
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
@@ -73,7 +74,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"),
   "--name", "TemporaryHiveUDFTest",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -90,7 +92,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest1",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -107,7 +110,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest2",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.re

[spark] branch branch-3.3 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 9fdd097aa6c [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum 
memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
9fdd097aa6c is described below

commit 9fdd097aa6c05e7ecfd33dccad876a00d96b6ddf
Author: yangjie01 
AuthorDate: Wed Jul 27 09:26:47 2022 +0900

[SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in 
`BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

### What changes were proposed in this pull request?
This PR changes `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and 
`HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce the maximum 
memory usage of the tests.

### Why are the changes needed?
Reduce the maximum memory usage of test cases.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Should monitor CI

Closes #37298 from LuciferYang/reduce-local-cluster-memory.

Authored-by: yangjie01 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17)
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/joins/BroadcastJoinSuite.scala   |  4 +-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 43 +++---
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
index 256e9426202..2d553d2b84f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins
 import scala.reflect.ClassTag
 
 import org.apache.spark.AccumulatorSuite
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft}
 import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, 
BuildSide}
@@ -56,7 +57,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with 
SQLTestUtils
   override def beforeAll(): Unit = {
 super.beforeAll()
 spark = SparkSession.builder()
-  .master("local-cluster[2,1,1024]")
+  .master("local-cluster[2,1,512]")
+  .config(EXECUTOR_MEMORY.key, "512m")
   .appName("testing")
   .getOrCreate()
   }
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
index 170cf4898f3..fc8d6e61a0d 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
@@ -33,6 +33,7 @@ import org.scalatest.time.SpanSugar._
 import org.apache.spark._
 import org.apache.spark.deploy.SparkSubmitTestUtils
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.internal.config.UI.UI_ENABLED
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
@@ -73,7 +74,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"),
   "--name", "TemporaryHiveUDFTest",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -90,7 +92,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest1",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -107,7 +110,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest2",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.re

[spark] branch master updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 01d41e7de41 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum 
memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
01d41e7de41 is described below

commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17
Author: yangjie01 
AuthorDate: Wed Jul 27 09:26:47 2022 +0900

[SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in 
`BroadcastJoinSuite*` and `HiveSparkSubmitSuite`

### What changes were proposed in this pull request?
This PR changes `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and 
`HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce the maximum 
memory usage of the tests.

### Why are the changes needed?
Reduce the maximum memory usage of test cases.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Should monitor CI

Closes #37298 from LuciferYang/reduce-local-cluster-memory.

Authored-by: yangjie01 
Signed-off-by: Hyukjin Kwon 
---
 .../sql/execution/joins/BroadcastJoinSuite.scala   |  4 +-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 43 +++---
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
index 3cc43e2dd41..6333808b420 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins
 import scala.reflect.ClassTag
 
 import org.apache.spark.AccumulatorSuite
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft}
 import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, 
BuildSide}
@@ -56,7 +57,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with 
SQLTestUtils
   override def beforeAll(): Unit = {
 super.beforeAll()
 spark = SparkSession.builder()
-  .master("local-cluster[2,1,1024]")
+  .master("local-cluster[2,1,512]")
+  .config(EXECUTOR_MEMORY.key, "512m")
   .appName("testing")
   .getOrCreate()
   }
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
index 18202307fc4..fa4d1b78d1c 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
@@ -33,6 +33,7 @@ import org.scalatest.time.SpanSugar._
 import org.apache.spark._
 import org.apache.spark.deploy.SparkSubmitTestUtils
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.EXECUTOR_MEMORY
 import org.apache.spark.internal.config.UI.UI_ENABLED
 import org.apache.spark.sql.{QueryTest, Row, SparkSession}
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
@@ -74,7 +75,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"),
   "--name", "TemporaryHiveUDFTest",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -91,7 +93,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest1",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -108,7 +111,8 @@ class HiveSparkSubmitSuite
 val args = Seq(
   "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"),
   "--name", "PermanentHiveUDFTest2",
-  "--master", "local-cluster[2,1,1024]",
+  "--master", "local-cluster[2,1,512]",
+  "--conf", s"${EXECUTOR_MEMORY.key}=512m",
   "--conf", "spark.ui.enabled=false",
   "--conf", "spark.master.rest.enabled=false",
   "--driver-java-options", "-Dderby.system.durability=test",
@@ -127,7 +131,8 @@ class Hiv

[spark] branch master updated: [SPARK-39823][SQL][PYTHON] Rename Dataset.as as Dataset.to and add DataFrame.to in PySpark

2022-07-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 91b95056806 [SPARK-39823][SQL][PYTHON] Rename Dataset.as as Dataset.to 
and add DataFrame.to in PySpark
91b95056806 is described below

commit 91b950568066830ecd7a4581ab5bf4dbdbbeb474
Author: Ruifeng Zheng 
AuthorDate: Wed Jul 27 08:11:18 2022 +0800

[SPARK-39823][SQL][PYTHON] Rename Dataset.as as Dataset.to and add 
DataFrame.to in PySpark

### What changes were proposed in this pull request?

1, rename `Dataset.as(StructType)` to `Dataset.to(StructType)`, since `as` is a 
keyword in Python and we don't want the Scala and Python APIs to use different names;
2, add `DataFrame.to(StructType)` in Python (a usage sketch follows below)
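
    A hedged Scala usage sketch of the renamed API (assuming an active `spark` session,
    e.g. in `spark-shell` on a build that includes this change):

    ```
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val df = spark.createDataFrame(Seq(("a", 1L))).toDF("i", "j")
    val target = StructType(Seq(StructField("j", StringType), StructField("i", StringType)))

    // Columns are reordered to match `target`, and `j` is cast from bigint to string.
    val reconciled = df.to(target)
    reconciled.printSchema()
    ```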

### Why are the changes needed?
for function parity

### Does this PR introduce _any_ user-facing change?
yes, new api

### How was this patch tested?
added UT

Closes #37233 from zhengruifeng/py_ds_as.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../source/reference/pyspark.sql/dataframe.rst |  1 +
 python/pyspark/sql/dataframe.py| 50 +
 python/pyspark/sql/tests/test_dataframe.py | 36 ++-
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  4 +-
 ...emaSuite.scala => DataFrameToSchemaSuite.scala} | 52 +++---
 5 files changed, 114 insertions(+), 29 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql/dataframe.rst 
b/python/docs/source/reference/pyspark.sql/dataframe.rst
index 5b6e704ba48..8cf083e5dd4 100644
--- a/python/docs/source/reference/pyspark.sql/dataframe.rst
+++ b/python/docs/source/reference/pyspark.sql/dataframe.rst
@@ -102,6 +102,7 @@ DataFrame
 DataFrame.summary
 DataFrame.tail
 DataFrame.take
+DataFrame.to
 DataFrame.toDF
 DataFrame.toJSON
 DataFrame.toLocalIterator
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index efebd05c08d..481dafa310d 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1422,6 +1422,56 @@ class DataFrame(PandasMapOpsMixin, 
PandasConversionMixin):
 jc = self._jdf.colRegex(colName)
 return Column(jc)
 
+def to(self, schema: StructType) -> "DataFrame":
+"""
+Returns a new :class:`DataFrame` where each row is reconciled to match 
the specified
+schema.
+
+Notes
+-
+1, Reorder columns and/or inner fields by name to match the specified 
schema.
+
+2, Project away columns and/or inner fields that are not needed by the 
specified schema.
+Missing columns and/or inner fields (present in the specified schema 
but not input
+DataFrame) lead to failures.
+
+3, Cast the columns and/or inner fields to match the data types in the 
specified schema,
+if the types are compatible, e.g., numeric to numeric (error if 
overflows), but not string
+to int.
+
+4, Carry over the metadata from the specified schema, while the 
columns and/or inner fields
+still keep their own metadata if not overwritten by the specified 
schema.
+
+5, Fail if the nullability is not compatible. For example, the column 
and/or inner field
+is nullable but the specified schema requires them to be not nullable.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+schema : :class:`StructType`
+Specified schema.
+
+Examples
+
+>>> df = spark.createDataFrame([("a", 1)], ["i", "j"])
+>>> df.schema
+StructType([StructField('i', StringType(), True), StructField('j', 
LongType(), True)])
+>>> schema = StructType([StructField("j", StringType()), 
StructField("i", StringType())])
+>>> df2 = df.to(schema)
+>>> df2.schema
+StructType([StructField('j', StringType(), True), StructField('i', 
StringType(), True)])
+>>> df2.show()
++---+---+
+|  j|  i|
++---+---+
+|  1|  a|
++---+---+
+"""
+assert schema is not None
+jschema = self._jdf.sparkSession().parseDataType(schema.json())
+return DataFrame(self._jdf.to(jschema), self.sparkSession)
+
 def alias(self, alias: str) -> "DataFrame":
 """Returns a new :class:`DataFrame` with an alias set.
 
diff --git a/python/pyspark/sql/tests/test_dataframe.py 
b/python/pyspark/sql/tests/test_dataframe.py
index ac6b6f68aed..7c7d3d1e51c 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -25,11 +25,12 @@ import unittest
 from typing import cast
 
 from pyspark.sql import SparkSession, Row
-from pyspark.sql.functions import col, lit, count, sum, mean
+from pyspark.sql.functi

[spark] branch master updated: [SPARK-39884][K8S] `KubernetesExecutorBackend` should handle `IPv6` hostname

2022-07-26 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 07ccaa6f4c7 [SPARK-39884][K8S] `KubernetesExecutorBackend` should 
handle `IPv6` hostname
07ccaa6f4c7 is described below

commit 07ccaa6f4c7c37339ee10c3d7b337059fc7468a2
Author: Dongjoon Hyun 
AuthorDate: Tue Jul 26 17:00:33 2022 -0700

[SPARK-39884][K8S] `KubernetesExecutorBackend` should handle `IPv6` hostname

### What changes were proposed in this pull request?

This PR aims to fix `KubernetesExecutorBackend` to handle IPv6 hostnames.

### Why are the changes needed?

`entrypoint.sh` uses `SPARK_EXECUTOR_POD_IP` for `--hostname`.


https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L97

However, an IPv6 `status.podIP` has no `[]`, so we need to handle both IPv6 forms (with 
and without `[]`).
```yaml
- name: SPARK_EXECUTOR_POD_IP
  fieldPath: status.podIP
  value: -Djava.net.preferIPv6Addresses=true
  podIP: 2600:1f14:552:fb00:4c14::3
```
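
For illustration, a minimal sketch of the bracket normalization described above (this is 
not Spark's actual `Utils.addBracketsIfNeeded`, just the idea):
```scala
// Hedged sketch: wrap a bare IPv6 address in brackets; leave hostnames,
// IPv4 addresses, and already-bracketed values unchanged.
def addBracketsIfNeeded(host: String): String =
  if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host

// addBracketsIfNeeded("2600:1f14:552:fb00:4c14::3") == "[2600:1f14:552:fb00:4c14::3]"
// addBracketsIfNeeded("10.0.0.1")                  == "10.0.0.1"
```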

### Does this PR introduce _any_ user-facing change?

No. Previously, executor startup failed immediately:
```
22/07/26 19:20:46 INFO Executor: Starting executor ID 5 on host 
2600:1f14:552:fb02:2dc2::1
22/07/26 19:20:46 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to assertion failed: Expected hostname 
or IPv6 IP enclosed in [] but got 2600:1f14:552:fb02:2dc2::1
```

### How was this patch tested?

Pass the CIs.

Closes #37306 from dongjoon-hyun/SPARK-39884.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala
index fbf485cfa2f..bec54a11366 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala
@@ -158,7 +158,8 @@ private[spark] object KubernetesExecutorBackend extends 
Logging {
   bindAddress = value
   argv = tail
 case ("--hostname") :: value :: tail =>
-  hostname = value
+  // entrypoint.sh sets SPARK_EXECUTOR_POD_IP without '[]'
+  hostname = Utils.addBracketsIfNeeded(value)
   argv = tail
 case ("--cores") :: value :: tail =>
   cores = value.toInt





[spark] branch master updated: [SPARK-39319][CORE][SQL] Make query contexts as a part of `SparkThrowable`

2022-07-26 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e9eb28e27d1 [SPARK-39319][CORE][SQL] Make query contexts as a part of 
`SparkThrowable`
e9eb28e27d1 is described below

commit e9eb28e27d10497c8b36774609823f4bbd2c8500
Author: Max Gekk 
AuthorDate: Tue Jul 26 20:17:09 2022 +0500

[SPARK-39319][CORE][SQL] Make query contexts as a part of `SparkThrowable`

### What changes were proposed in this pull request?
In the PR, I propose to add a new interface, `QueryContext`, to Spark core, and to allow 
obtaining an instance of `QueryContext` from Spark's exceptions of the type 
`SparkThrowable`. `QueryContext` should help users figure out where an error occurred 
while executing queries in Spark SQL.

This PR also adds `SQLQueryContext` as one implementation of `QueryContext`, attached to 
Spark SQL's `Origin`, which contains the context of TreeNodes plus a textual summary of 
the error. The `context` value in `Origin` carries all the structural information needed 
to link an error to the fragment of the SQL query that caused it.

All Spark exceptions are modified to accept an optional `QueryContext` and a pre-built 
text summary; SQL expressions, in turn, initialize the new context and pass it to the 
exceptions they raise.

Closes #36702

### Why are the changes needed?
This enriches the information carried by error messages. With the change, it becomes 
possible to produce a pretty-printed, structured error message like:
```sql
> SELECT * FROM v1;

{
  "errorClass" : [ "DIVIDE_BY_ZERO" ],
  "parameters" : [ {
    "name" : "config",
    "value" : "spark.sql.ansi.enabled"
  } ],
  "sqlState" : "42000",
  "context" : {
    "objectType" : "VIEW",
    "objectName" : "default.v1",
    "indexStart" : 36,
    "indexEnd" : 41,
    "fragment" : "1 / 0"
  }
}
```
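
A hedged, illustrative sketch of what such a context could expose to callers (field names 
follow the JSON example above; the `QueryContext.java` added in this commit is the 
authoritative definition):
```scala
// Illustrative only; not the actual org.apache.spark.QueryContext interface.
trait IllustrativeQueryContext {
  def objectType: String   // e.g. "VIEW"
  def objectName: String   // e.g. "default.v1"
  def indexStart: Int      // start offset of the failing fragment in the query text
  def indexEnd: Int        // end offset of the failing fragment
  def fragment: String     // e.g. "1 / 0"
}
```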

### Does this PR introduce _any_ user-facing change?
Yes. The PR changes Spark's exceptions by replacing the type of `queryContext` from 
`String` to `Option[QueryContext]`. User code that accesses `queryContext` can therefore 
break.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "test:testOnly *DecimalExpressionSuite"
$ build/sbt "test:testOnly *TreeNodeSuite"
```
and affected test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

Authored-by: Max Gekk 
Co-authored-by: Gengliang Wang 

Closes #37209 from MaxGekk/query-context-in-sparkthrowable.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../main/java/org/apache/spark/QueryContext.java   |  48 
 .../main/java/org/apache/spark/SparkThrowable.java |   2 +
 .../apache/spark/memory/SparkOutOfMemoryError.java |   3 +-
 .../main/scala/org/apache/spark/ErrorInfo.scala|  16 ++-
 .../scala/org/apache/spark/SparkException.scala|  58 +
 .../spark/sql/catalyst/expressions/Cast.scala  |  20 ++--
 .../sql/catalyst/expressions/Expression.scala  |   6 +-
 .../catalyst/expressions/aggregate/Average.scala   |  14 +--
 .../sql/catalyst/expressions/aggregate/Sum.scala   |  18 +--
 .../sql/catalyst/expressions/arithmetic.scala  |  27 ++---
 .../expressions/collectionOperations.scala |   8 +-
 .../expressions/complexTypeExtractors.scala|  13 ++-
 .../catalyst/expressions/decimalExpressions.scala  |  25 ++--
 .../catalyst/expressions/intervalExpressions.scala |  22 ++--
 .../sql/catalyst/expressions/mathExpressions.scala |   2 +-
 .../catalyst/expressions/stringExpressions.scala   |  10 +-
 .../spark/sql/catalyst/trees/SQLQueryContext.scala | 130 +
 .../apache/spark/sql/catalyst/trees/TreeNode.scala |  99 ++--
 .../spark/sql/catalyst/util/DateTimeUtils.scala|  22 ++--
 .../spark/sql/catalyst/util/IntervalUtils.scala|   2 +-
 .../apache/spark/sql/catalyst/util/MathUtils.scala |  38 +++---
 .../spark/sql/catalyst/util/UTF8StringUtils.scala  |  25 ++--
 .../apache/spark/sql/errors/QueryErrorsBase.scala  |   5 +
 .../spark/sql/errors/QueryExecutionErrors.scala|  74 +++-
 .../scala/org/apache/spark/sql/types/Decimal.scala |   7 +-
 .../expressions/DecimalExpressionSuite.scala   |   4 +-
 .../spark/sql/catalyst/trees/TreeNodeSuite.scala   |  15 ++-
 27 files changed, 445 insertions(+), 268 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/QueryContext.java 
b/core/src/main/java/org/apache/spark/QueryContext.java
new file mode 100644
index 000..de5b29d0295
--- /dev/null
+++ b/core/src/main/java/org/apache/spark/QueryContext.java
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under

[spark] branch master updated: [SPARK-39874][SQL][TESTS] Add System.gc at beforeEach in BroadcastJoinSuite*

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 329bb1430b1 [SPARK-39874][SQL][TESTS] Add System.gc at beforeEach in 
BroadcastJoinSuite*
329bb1430b1 is described below

commit 329bb1430b175cc6a6c4769cfc99ec07cc306a6f
Author: Hyukjin Kwon 
AuthorDate: Tue Jul 26 22:20:14 2022 +0900

[SPARK-39874][SQL][TESTS] Add System.gc at beforeEach in BroadcastJoinSuite*

### What changes were proposed in this pull request?

This PR is similar to https://github.com/apache/spark/pull/37291: it calls 
`System.gc()` before each test.

### Why are the changes needed?

To deflake it. See 
https://github.com/MaxGekk/spark/runs/7516270590?check_suite_focus=true

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

The CI should be monitored after merging.

Closes #37297 from HyukjinKwon/SPARK-39874.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala| 5 +
 1 file changed, 5 insertions(+)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
index 256e9426202..3cc43e2dd41 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
@@ -70,6 +70,11 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with 
SQLTestUtils
 }
   }
 
+  override def beforeEach(): Unit = {
+super.beforeEach()
+System.gc()
+  }
+
   /**
* Test whether the specified broadcast join updates the peak execution 
memory accumulator.
*/





[spark] branch master updated: [SPARK-39869][SQL][TESTS] Fix flaky hive - slow tests because of out-of-memory

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c4c623a3a89 [SPARK-39869][SQL][TESTS] Fix flaky hive - slow tests 
because of out-of-memory
c4c623a3a89 is described below

commit c4c623a3a890267b2f9f8d472c8be30fc5db53e1
Author: Hyukjin Kwon 
AuthorDate: Tue Jul 26 18:37:20 2022 +0900

[SPARK-39869][SQL][TESTS] Fix flaky hive - slow tests because of 
out-of-memory

### What changes were proposed in this pull request?

This PR adds some manual `System.gc` calls. Calling `System.gc()` does not guarantee that 
garbage collection actually runs, and it admittedly looks like a hack, but it has worked 
in my experience so far, and the same workaround has been applied in a few places before.

### Why are the changes needed?

To deflake the tests.

### Does this PR introduce _any_ user-facing change?

No, dev and test-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #37291 from HyukjinKwon/SPARK-39869.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala | 5 +
 .../test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala  | 1 +
 2 files changed, 6 insertions(+)

diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
index ecbe87b163d..debae3ad520 100644
--- 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
+++ 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
@@ -1388,6 +1388,11 @@ abstract class HiveThriftServer2TestBase extends 
SparkFunSuite with BeforeAndAft
""".stripMargin)
   }
 
+  override def beforeEach(): Unit = {
+super.beforeEach()
+System.gc()
+  }
+
   override protected def beforeAll(): Unit = {
 super.beforeAll()
 diagnosisBuffer.clear()
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
index 170cf4898f3..18202307fc4 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
@@ -63,6 +63,7 @@ class HiveSparkSubmitSuite
 
   override def beforeEach(): Unit = {
 super.beforeEach()
+System.gc()
   }
 
   test("temporary Hive UDF: define a UDF and use it") {





[spark] branch master updated (55c3347c48f -> de9a4b0747a)

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 55c3347c48f [SPARK-38864][SQL] Add unpivot / melt to Dataset
 add de9a4b0747a [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at 
GitHub Actions

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/TPCDSQuerySuite.scala |  6 --
 .../org/apache/spark/sql/TPCDSQueryTestSuite.scala | 23 --
 2 files changed, 17 insertions(+), 12 deletions(-)





[spark] branch branch-3.3 updated: [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions

2022-07-26 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new c9d56758a8c [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at 
GitHub Actions
c9d56758a8c is described below

commit c9d56758a8c28a44161f63eb5c8763ab92616a56
Author: Hyukjin Kwon 
AuthorDate: Tue Jul 26 18:25:50 2022 +0900

[SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions

### What changes were proposed in this pull request?

This PR reverts 
https://github.com/apache/spark/commit/7358253755762f9bfe6cedc1a50ec14616cfeace,
https://github.com/apache/spark/commit/ae1f6a26ed39b297ace8d6c9420b72a3c01a3291 
and 
https://github.com/apache/spark/commit/72b55ccf8327c00e173ab6130fdb428ad0d5aacc 
because they did not help fix the TPC-DS build.

In addition, this PR skips the problematic query in GitHub Actions to avoid 
OOM.

### Why are the changes needed?

To make the build pass.

### Does this PR introduce _any_ user-facing change?

No, dev and test-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #37289 from HyukjinKwon/SPARK-39856-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit de9a4b0747a4127e320f80f5e1bf431429da70a9)
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/sql/TPCDSQuerySuite.scala |  6 --
 .../org/apache/spark/sql/TPCDSQueryTestSuite.scala | 23 --
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala
index 22e1b838f3f..8c4d25a7eb9 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala
@@ -29,7 +29,8 @@ import org.apache.spark.tags.ExtendedSQLTest
 @ExtendedSQLTest
 class TPCDSQuerySuite extends BenchmarkQueryTest with TPCDSBase {
 
-  tpcdsQueries.foreach { name =>
+  // q72 is skipped due to GitHub Actions' memory limit.
+  tpcdsQueries.filterNot(sys.env.contains("GITHUB_ACTIONS") && _ == 
"q72").foreach { name =>
 val queryString = resourceToString(s"tpcds/$name.sql",
   classLoader = Thread.currentThread().getContextClassLoader)
 test(name) {
@@ -39,7 +40,8 @@ class TPCDSQuerySuite extends BenchmarkQueryTest with 
TPCDSBase {
 }
   }
 
-  tpcdsQueriesV2_7_0.foreach { name =>
+  // q72 is skipped due to GitHub Actions' memory limit.
+  tpcdsQueriesV2_7_0.filterNot(sys.env.contains("GITHUB_ACTIONS") && _ == 
"q72").foreach { name =>
 val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql",
   classLoader = Thread.currentThread().getContextClassLoader)
 test(s"$name-v2.7") {
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala
index 9affe827bc1..8019fc98a52 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala
@@ -62,7 +62,7 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase 
with SQLQueryTestHelp
 
   // To make output results deterministic
   override protected def sparkConf: SparkConf = super.sparkConf
-.set(SQLConf.SHUFFLE_PARTITIONS.key, 32.toString)
+.set(SQLConf.SHUFFLE_PARTITIONS.key, "1")
 
   protected override def createSparkSession: TestSparkSession = {
 new TestSparkSession(new SparkContext("local[1]", 
this.getClass.getSimpleName, sparkConf))
@@ -105,6 +105,7 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase 
with SQLQueryTestHelp
   query: String,
   goldenFile: File,
   conf: Map[String, String]): Unit = {
+val shouldSortResults = sortMergeJoinConf != conf  // Sort for other joins
 withSQLConf(conf.toSeq: _*) {
   try {
 val (schema, output) = handleExceptions(getNormalizedResult(spark, 
query))
@@ -142,15 +143,17 @@ class TPCDSQueryTestSuite extends QueryTest with 
TPCDSBase with SQLQueryTestHelp
 assertResult(expectedSchema, s"Schema did not match\n$queryString") {
   schema
 }
-// Truncate precisions because they can be vary per how the shuffle is 
performed.
-val expectSorted = expectedOutput.split("\n").sorted.map(_.trim)
-  .mkString("\n").replaceAll("\\s+$", "")
-  .replaceAll("""([0-9]+.[0-9]{10})([0-9]*)""", "$1")
-val outputSorted = output.sorted.map(_.trim).mkString("\n")
-  .replaceAll("\\s+$", "")
-  .replaceAll("""([0-9]+.[0-9]{10})([0-9]*)""", "$1")
-assertResult(expectSorted, s"Result did not match\n$queryString") {
-  outputS

[spark] branch master updated: [SPARK-38864][SQL] Add unpivot / melt to Dataset

2022-07-26 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 55c3347c48f [SPARK-38864][SQL] Add unpivot / melt to Dataset
55c3347c48f is described below

commit 55c3347c48f93a9c5c5c2fb00b30f838eb081b7f
Author: Enrico Minack 
AuthorDate: Tue Jul 26 15:50:03 2022 +0800

[SPARK-38864][SQL] Add unpivot / melt to Dataset

### What changes were proposed in this pull request?
This proposes a Scala implementation of the `melt` (aka. `unpivot`) 
operation.

### Why are the changes needed?
The Scala Dataset API provides the `pivot` operation, but not its reverse operation, 
`unpivot` / `melt`. The `melt` operation is part of the [Pandas 
API](https://pandas.pydata.org/docs/reference/api/pandas.melt.html), which is why it is 
provided by the PySpark Pandas API, implemented purely in Python.

[It should be implemented in 
Scala](https://github.com/apache/spark/pull/26912#pullrequestreview-332975715) 
to make this operation available to Scala / Java, SQL, PySpark, and to reuse 
the implementation in PySpark Pandas APIs.

The `melt` / `unpivot` operation exists in other systems like 
[BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator),
 
[T-SQL](https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver15#unpivot-example),
 [Oracle](https://www.oracletutorial.com/oracle-basics/oracle-unpivot/).

It supports expressions for id and value columns, including `*` expansion and structs, so 
this also fixes / includes SPARK-39292.
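
For readers unfamiliar with the operation, the sketch below emulates unpivot/melt with 
the pre-existing `stack` expression; it only illustrates the semantics, not the signature 
of the new `Dataset` API:
```scala
// Hedged sketch: turn a wide table (id | col1 | col2) into a long one
// (id | variable | value) using the existing stack() generator.
import org.apache.spark.sql.SparkSession

object UnpivotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("unpivot-sketch").getOrCreate()
    import spark.implicits._

    val wide = Seq((1, 10L, 20L), (2, 30L, 40L)).toDF("id", "col1", "col2")
    val long = wide.selectExpr(
      "id",
      "stack(2, 'col1', col1, 'col2', col2) AS (variable, value)")
    long.show()   // one output row per (id, value column) pair
    spark.stop()
  }
}
```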

### Does this PR introduce _any_ user-facing change?
It adds `melt` to the `Dataset` API (Scala and Java).

### How was this patch tested?
It is tested in the `DatasetMeltSuite` and `JavaDatasetSuite`.

Closes #36150 from EnricoMi/branch-melt.

Authored-by: Enrico Minack 
Signed-off-by: Wenchen Fan 
---
 core/src/main/resources/error/error-classes.json   |  12 +
 .../spark/sql/catalyst/analysis/Analyzer.scala |  41 ++
 .../sql/catalyst/analysis/AnsiTypeCoercion.scala   |   1 +
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   8 +
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  16 +
 .../plans/logical/basicLogicalOperators.scala  |  39 ++
 .../sql/catalyst/rules/RuleIdCollection.scala  |   1 +
 .../spark/sql/catalyst/trees/TreePatterns.scala|   1 +
 .../spark/sql/errors/QueryCompilationErrors.scala  |  18 +
 .../main/scala/org/apache/spark/sql/Dataset.scala  | 138 +-
 .../spark/sql/RelationalGroupedDataset.scala   |  18 +
 .../org/apache/spark/sql/DatasetUnpivotSuite.scala | 543 +
 .../spark/sql/errors/QueryErrorsSuiteBase.scala|   3 +-
 13 files changed, 837 insertions(+), 2 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index e2a99c1a62e..29ca280719e 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -375,6 +375,18 @@
   "Unable to acquire  bytes of memory, got "
 ]
   },
+  "UNPIVOT_REQUIRES_VALUE_COLUMNS" : {
+"message" : [
+  "At least one value column needs to be specified for UNPIVOT, all 
columns specified as ids"
+],
+"sqlState" : "42000"
+  },
+  "UNPIVOT_VALUE_DATA_TYPE_MISMATCH" : {
+"message" : [
+  "Unpivot value columns must share a least common type, some types do 
not: []"
+],
+"sqlState" : "42000"
+  },
   "UNRECOGNIZED_SQL_TYPE" : {
 "message" : [
   "Unrecognized SQL type "
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index f40c260eb6f..a6108c2a3d3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -293,6 +293,7 @@ class Analyzer(override val catalogManager: CatalogManager)
   ResolveUpCast ::
   ResolveGroupingAnalytics ::
   ResolvePivot ::
+  ResolveUnpivot ::
   ResolveOrdinalInOrderByAndGroupBy ::
   ResolveAggAliasInGroupBy ::
   ResolveMissingReferences ::
@@ -514,6 +515,10 @@ class Analyzer(override val catalogManager: CatalogManager)
 if child.resolved && groupByOpt.isDefined && 
hasUnresolvedAlias(groupByOpt.get) =>
 Pivot(Some(assignAliases(groupByOpt.get)), pivotColumn, pivotValues, 
aggregates, child)
 
+  case up: Unpivot if up.child.resolved &&
+(hasUnresolvedAlias(up.ids) || hasUnresolvedAlias(up.values)) =>
+up.copy(ids = assignAliases(up.ids), values = assignAliases(up.values))
+
   case Project(p

[spark] branch master updated (72b55ccf832 -> 5c0d5515956)

2022-07-26 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 72b55ccf832 [SPARK-39856][SQL][TESTS][FOLLOW-UP] Increase the number 
of partitions in TPC-DS build to avoid out-of-memory
 add 5c0d5515956 [SPARK-39808][SQL] Support aggregate function MODE

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../sql/catalyst/expressions/aggregate/Mode.scala  | 94 ++
 .../expressions/aggregate/interfaces.scala | 67 +++
 .../expressions/aggregate/percentiles.scala| 60 +-
 .../sql-functions/sql-expression-schema.md |  1 +
 .../test/resources/sql-tests/inputs/group-by.sql   |  4 +
 .../resources/sql-tests/results/group-by.sql.out   | 19 +
 7 files changed, 188 insertions(+), 58 deletions(-)
 create mode 100644 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala
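
Only the changed files are listed above, so here is a hedged usage sketch, assuming the 
new aggregate is registered under the SQL name `mode` per the `FunctionRegistry` change:
```scala
// Hedged sketch: exercise the new MODE aggregate through SQL.
import org.apache.spark.sql.SparkSession

object ModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("mode-sketch").getOrCreate()
    // Expected result: 10, the most frequent non-null value.
    spark.sql("SELECT mode(col) FROM VALUES (0), (10), (10), (NULL) AS tab(col)").show()
    spark.stop()
  }
}
```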

