[spark] branch master updated: [SPARK-39864][SQL] Lazily register ExecutionListenerBus
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 869fc2198a4 [SPARK-39864][SQL] Lazily register ExecutionListenerBus
869fc2198a4 is described below

commit 869fc2198a4bb51bc03dce36fb2b61a57fe3006e
Author: Josh Rosen
AuthorDate: Wed Jul 27 14:25:54 2022 +0900

[SPARK-39864][SQL] Lazily register ExecutionListenerBus

### What changes were proposed in this pull request?

This PR modifies `ExecutionListenerManager` so that its `ExecutionListenerBus` SparkListener is lazily registered during the first `.register(QueryExecutionListener)` call (instead of being eagerly registered in the constructor).

### Why are the changes needed?

This addresses a ListenerBus performance problem in applications with large numbers of short-lived SparkSessions.

The `ExecutionListenerBus` SparkListener is unregistered by the ContextCleaner after its associated ExecutionListenerManager/SparkSession is garbage-collected (see #31839). If many sessions are rapidly created and destroyed but the driver GC doesn't run, this can result in a large number of unused ExecutionListenerBus listeners being registered on the shared ListenerBus queue. This can cause performance problems in the ListenerBus because each listener invocation has some overhead.

In one real-world application with a very large driver heap and a high rate of SparkSession creation (hundreds per minute), I saw 5000 idle ExecutionListenerBus listeners, resulting in ~50ms median event processing times on the shared listener queue.

This patch avoids the problem by making the listener registration lazy: if a short-lived SparkSession never uses QueryExecutionListeners, then we won't register the ExecutionListenerBus and won't incur these overheads.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new unit test.

Closes #37282 from JoshRosen/SPARK-39864.
Authored-by: Josh Rosen Signed-off-by: Hyukjin Kwon --- .../spark/sql/util/QueryExecutionListener.scala | 20 ++-- .../sql/util/ExecutionListenerManagerSuite.scala | 15 +++ 2 files changed, 29 insertions(+), 6 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala index 7ac06a5cd7e..45482f12f3c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala @@ -81,7 +81,10 @@ class ExecutionListenerManager private[sql]( loadExtensions: Boolean) extends Logging { - private val listenerBus = new ExecutionListenerBus(this, session) + // SPARK-39864: lazily create the listener bus on the first register() call in order to + // avoid listener overheads when QueryExecutionListeners aren't used: + private val listenerBusInitializationLock = new Object() + @volatile private var listenerBus: Option[ExecutionListenerBus] = None if (loadExtensions) { val conf = session.sparkContext.conf @@ -97,7 +100,12 @@ class ExecutionListenerManager private[sql]( */ @DeveloperApi def register(listener: QueryExecutionListener): Unit = { -listenerBus.addListener(listener) +listenerBusInitializationLock.synchronized { + if (listenerBus.isEmpty) { +listenerBus = Some(new ExecutionListenerBus(this, session)) + } +} +listenerBus.get.addListener(listener) } /** @@ -105,7 +113,7 @@ class ExecutionListenerManager private[sql]( */ @DeveloperApi def unregister(listener: QueryExecutionListener): Unit = { -listenerBus.removeListener(listener) +listenerBus.foreach(_.removeListener(listener)) } /** @@ -113,12 +121,12 @@ class ExecutionListenerManager private[sql]( */ @DeveloperApi def clear(): Unit = { -listenerBus.removeAllListeners() +listenerBus.foreach(_.removeAllListeners()) } /** Only exposed for testing. */ private[sql] def listListeners(): Array[QueryExecutionListener] = { -listenerBus.listeners.asScala.toArray + listenerBus.map(_.listeners.asScala.toArray).getOrElse(Array.empty[QueryExecutionListener]) } /** @@ -127,7 +135,7 @@ class ExecutionListenerManager private[sql]( private[sql] def clone(session: SparkSession, sqlConf: SQLConf): ExecutionListenerManager = { val newListenerManager = new ExecutionListenerManager(session, sqlConf, loadExtensions = false) -listenerBus.listeners.asScala.foreach(newListenerManager.register) + listenerBus.foreach(_.listeners.asScala.foreach(newListenerManager.register)) newListenerManager }
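For context, a minimal sketch of how a `QueryExecutionListener` is attached to a session (illustrative only, assuming a local session named `spark`); with this change, the underlying `ExecutionListenerBus` is only hooked into the shared listener bus once the first such registration happens:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[2]").appName("listener-demo").getOrCreate()

// A trivial listener that just logs the outcome of each query execution.
val listener = new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName finished in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
}

// After SPARK-39864, this first register() call is what creates and registers the
// session's ExecutionListenerBus; sessions that never call it pay no listener cost.
spark.listenerManager.register(listener)
```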
[spark] branch master updated: [SPARK-39868][CORE][TESTS] StageFailed event should attach with the root cause
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 56f0233658e [SPARK-39868][CORE][TESTS] StageFailed event should attach with the root cause
56f0233658e is described below

commit 56f0233658e53425ff915e803284139defb4af42
Author: panbingkun
AuthorDate: Wed Jul 27 14:21:34 2022 +0900

[SPARK-39868][CORE][TESTS] StageFailed event should attach with the root cause

### What changes were proposed in this pull request?

This PR follows https://github.com/apache/spark/pull/37245: the StageFailed event should attach the root cause.

### Why are the changes needed?

**It gives users a good way to know the reason for a failure.**

By carefully investigating the issue https://issues.apache.org/jira/browse/SPARK-39622, I found the root cause of the test failure: StageFailed doesn't attach the failure reason from the executor. When OutputCommitCoordinator executes 'taskCompleted', the 'reason' is ignored.

Scenario 1: receive TaskSetFailed (Success)
> InsertIntoHadoopFsRelationCommand > FileFormatWriter.write > _**handleTaskSetFailed**_ (**attach root cause**) > abortStage > failJobAndIndependentStages > SparkListenerJobEnd

Scenario 2: receive StageFailed (Fail)
> InsertIntoHadoopFsRelationCommand > FileFormatWriter.write > _**handleStageFailed**_ (**don't attach root cause**) > abortStage > failJobAndIndependentStages > SparkListenerJobEnd

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual run of UT & pass GitHub Actions

Closes #37292 from panbingkun/SPARK-39868.
Authored-by: panbingkun
Signed-off-by: Hyukjin Kwon
---
 .../apache/spark/scheduler/OutputCommitCoordinator.scala | 3 ++-
 .../OutputCommitCoordinatorIntegrationSuite.scala | 3 ++-
 .../sql/execution/datasources/parquet/ParquetIOSuite.scala | 14 ++
 3 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala
index a33c2bb93bc..cd5d6b8f9c9 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala
@@ -160,7 +160,8 @@ private[spark] class OutputCommitCoordinator(
 if (stageState.authorizedCommitters(partition) == taskId) { sc.foreach(_.dagScheduler.stageFailed(stage, s"Authorized committer " + s"(attemptNumber=$attemptNumber, stage=$stage, partition=$partition) failed; " + -s"but task commit success, data duplication may happen. 
" + +s"reason=$reason")) } } } diff --git a/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala index 66b13be4f7a..45da750768f 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorIntegrationSuite.scala @@ -51,7 +51,8 @@ class OutputCommitCoordinatorIntegrationSuite sc.parallelize(1 to 4, 2).map(_.toString).saveAsTextFile(tempDir.getAbsolutePath + "/out") } }.getCause.getMessage -assert(e.endsWith("failed; but task commit success, data duplication may happen.")) +assert(e.contains("failed; but task commit success, data duplication may happen.") && + e.contains("Intentional exception")) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala index a06ddc1b9e9..5a8f4563756 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala @@ -1202,15 +1202,7 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession val m1 = intercept[SparkException] { spark.range(1).coalesce(1).write.options(extraOptions).parquet(dir.getCanonicalPath) }.getCause.getMessage -// SPARK-39622: The current case must handle the `TaskSetFailed` event before SPARK-39195 -// due to `maxTaskFailures` is 1 when local mode. After SPARK-39195, it may handle to one -// of the `TaskSetFailed` event and `StageFailed` event, and the execution order of t
[spark] branch master updated: [SPARK-39882][BUILD] Upgrade rocksdbjni to 7.4.3
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e79b9f05360 [SPARK-39882][BUILD] Upgrade rocksdbjni to 7.4.3 e79b9f05360 is described below commit e79b9f053605369a60f15c606f1a4bc0b4f31329 Author: yangjie01 AuthorDate: Tue Jul 26 21:52:23 2022 -0700 [SPARK-39882][BUILD] Upgrade rocksdbjni to 7.4.3 ### What changes were proposed in this pull request? This PR aims to upgrade RocksDB JNI library from 7.3.1 to 7.4.3. ### Why are the changes needed? This version brings some bug fix, the release note as follows: - https://github.com/facebook/rocksdb/releases/tag/v7.4.3 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA - The benchmark result : **Before 7.3.1** ``` [INFO] Running org.apache.spark.util.kvstore.RocksDBBenchmark count meanmin max 95th dbClose 4 0.346 0.266 0.539 0.539 dbCreation 4 78.389 3.583 301.763 301.763 naturalIndexCreateIterator 10240.005 0.002 1.469 0.007 naturalIndexDescendingCreateIterator10240.007 0.006 0.064 0.008 naturalIndexDescendingIteration 10240.005 0.004 0.024 0.007 naturalIndexIteration 10240.006 0.004 0.053 0.008 randomDeleteIndexed 10240.026 0.019 0.293 0.035 randomDeletesNoIndex10240.015 0.012 0.053 0.017 randomUpdatesIndexed10240.081 0.033 30.752 0.083 randomUpdatesNoIndex10240.037 0.034 0.502 0.041 randomWritesIndexed 10240.116 0.035 52.128 0.121 randomWritesNoIndex 10240.042 0.036 1.869 0.046 refIndexCreateIterator 10240.004 0.004 0.019 0.006 refIndexDescendingCreateIterator10240.003 0.002 0.026 0.004 refIndexDescendingIteration 10240.006 0.005 0.042 0.008 refIndexIteration 10240.007 0.005 0.314 0.009 sequentialDeleteIndexed 10240.021 0.018 0.104 0.026 sequentialDeleteNoIndex 10240.015 0.012 0.053 0.019 sequentialUpdatesIndexed10240.044 0.038 0.802 0.053 sequentialUpdatesNoIndex10240.045 0.032 1.552 0.054 sequentialWritesIndexed 10240.048 0.040 1.862 0.055 sequentialWritesNoIndex 10240.044 0.032 2.929 0.048 ``` **After 7.4.3** ``` [INFO] Running org.apache.spark.util.kvstore.RocksDBBenchmark count meanmin max 95th dbClose 4 0.364 0.305 0.525 0.525 dbCreation 4 80.553 3.635 310.735 310.735 naturalIndexCreateIterator 10240.006 0.002 1.569 0.007 naturalIndexDescendingCreateIterator10240.006 0.005 0.073 0.008 naturalIndexDescendingIteration 10240.006 0.004 0.269 0.008 naturalIndexIteration 10240.006 0.004 0.059 0.009 randomDeleteIndexed 10240.027 0.020 0.339 0.036 randomDeletesNoIndex10240.015 0.012 0.037 0.018 randomUpdatesIndexed10240.084 0.032 32.286 0.090 randomUpdatesNoIndex10240.023 0.019 0.584 0.026 randomWritesIndexed 10240.122 0.035 53.989 0.129 randomWritesNoIndex 10240.027 0.022 1.547 0.031 refIndexCreateIterator 10240.005 0.005 0.027 0.007 refIndexDescendingCreateIterator10240.003 0.003 0.035 0.005 refIndexDescendingIteration 10240.006 0.005 0.051 0.008 refIndexIteration 10240.007 0.005 0.096 0.009 sequentialDeleteIndexed 10240.023 0.018 1.396 0.027 sequentialDeleteNoIndex 10240.015 0.01
[spark] branch master updated: [SPARK-37888][SQL][TESTS][FOLLOWUP] Don't check the `Created By` field in `DESCRIBE TABLE` tests
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8c164c075d8 [SPARK-37888][SQL][TESTS][FOLLOWUP] Don't check the `Created By` field in `DESCRIBE TABLE` tests 8c164c075d8 is described below commit 8c164c075d8ae41216e7ddf93a5f29b7900b4a95 Author: Max Gekk AuthorDate: Wed Jul 27 08:27:05 2022 +0500 [SPARK-37888][SQL][TESTS][FOLLOWUP] Don't check the `Created By` field in `DESCRIBE TABLE` tests ### What changes were proposed in this pull request? In the PR, I propose to do not check the field `Created By` in tests that check output of the `DESCRIBE TABLE` command. ### Why are the changes needed? The field `Created By` depends on the current Spark version, for instance `Spark 3.4.0-SNAPSHOT`. Apparently, the tests that check the field depend on Spark version. The changes are needed to avoid dependency from Spark version, and to don't change the tests when bumping Spark version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the modified tests: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DescribeTableSuite" ``` Closes #37299 from MaxGekk/unify-describe-table-tests-followup. Authored-by: Max Gekk Signed-off-by: Max Gekk --- .../org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala | 3 +-- .../apache/spark/sql/hive/execution/command/DescribeTableSuite.scala | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala index f8e53fee723..da4eab13afb 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala @@ -189,7 +189,7 @@ class DescribeTableSuite extends DescribeTableSuiteBase with CommandSuiteBase { ("data_type", StringType), ("comment", StringType))) QueryTest.checkAnswer( -descriptionDf.filter("col_name != 'Created Time'"), +descriptionDf.filter("!(col_name in ('Created Time', 'Created By'))"), Seq( Row("data", "string", null), Row("id", "bigint", null), @@ -202,7 +202,6 @@ class DescribeTableSuite extends DescribeTableSuiteBase with CommandSuiteBase { Row("Database", "ns", ""), Row("Table", "table", ""), Row("Last Access", "UNKNOWN", ""), - Row("Created By", "Spark 3.4.0-SNAPSHOT", ""), Row("Type", "EXTERNAL", ""), Row("Provider", getProvider(), ""), Row("Comment", "this is a test table", ""), diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala index 00adb377f04..c12d236f4b6 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/DescribeTableSuite.scala @@ -58,7 +58,7 @@ class DescribeTableSuite extends v1.DescribeTableSuiteBase with CommandSuiteBase ("comment", StringType))) QueryTest.checkAnswer( // Filter out 'Table Properties' to don't check `transient_lastDdlTime` -descriptionDf.filter("col_name != 'Created Time' and col_name != 'Table Properties'"), +descriptionDf.filter("!(col_name in ('Created Time', 'Table 
Properties', 'Created By'))"), Seq( Row("data", "string", null), Row("id", "bigint", null), @@ -72,7 +72,6 @@ class DescribeTableSuite extends v1.DescribeTableSuiteBase with CommandSuiteBase Row("Table", "table", ""), Row(TableCatalog.PROP_OWNER.capitalize, Utils.getCurrentUserName(), ""), Row("Last Access", "UNKNOWN", ""), - Row("Created By", "Spark 3.4.0-SNAPSHOT", ""), Row("Type", "EXTERNAL", ""), Row("Provider", getProvider(), ""), Row("Comment", "this is a test table", ""), - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
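A hedged sketch of the idea behind this change (table and column names are illustrative, and an active session named `spark` is assumed): rows of `DESCRIBE TABLE` output whose values depend on the running Spark version or on timestamps are filtered out before the remaining rows are compared against a fixed expectation, so the expected rows no longer need to change on every version bump.

```scala
// Illustrative only: drop version- and time-dependent rows from DESCRIBE output
// before checking the remaining rows against a fixed expectation.
val description = spark.sql("DESCRIBE TABLE EXTENDED ns.tbl")
val stableRows = description.filter(
  "!(col_name in ('Created Time', 'Created By', 'Table Properties'))")
stableRows.show(truncate = false)
```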
[spark] branch master updated: [SPARK-39865][SQL] Show proper error messages on the overflow errors of table insert
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d5dbe7d4e9e [SPARK-39865][SQL] Show proper error messages on the overflow errors of table insert d5dbe7d4e9e is described below commit d5dbe7d4e9e0e46a514c363efaac15f37d07857c Author: Gengliang Wang AuthorDate: Tue Jul 26 20:26:53 2022 -0700 [SPARK-39865][SQL] Show proper error messages on the overflow errors of table insert ### What changes were proposed in this pull request? In Spark 3.3, the error message of ANSI CAST is improved. However, the table insertion is using the same CAST expression: ``` > create table tiny(i tinyint); > insert into tiny values (1000); org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. ``` Showing the hint of `If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error` doesn't help at all. This PR is to fix the error message. After changes, the error message of this example will become: ``` org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to insert a value of "INT" type into the "TINYINT" type column `i` due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead. ``` ### Why are the changes needed? Show proper error messages on the overflow errors of table insert. The current message is super confusing. ### Does this PR introduce _any_ user-facing change? Yes, after changes it show proper error messages on the overflow errors of table insert. ### How was this patch tested? Unit test Closes #37283 from gengliangwang/insertionOverflow. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- core/src/main/resources/error/error-classes.json | 6 +++ .../catalyst/analysis/TableOutputResolver.scala| 23 ++- .../spark/sql/catalyst/expressions/Cast.scala | 44 ++ .../spark/sql/errors/QueryExecutionErrors.scala| 15 .../sql/errors/QueryExecutionAnsiErrorsSuite.scala | 21 ++- .../org/apache/spark/sql/sources/InsertSuite.scala | 20 +- 6 files changed, 117 insertions(+), 12 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index 29ca280719e..9d35b1a1a69 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -59,6 +59,12 @@ ], "sqlState" : "22005" }, + "CAST_OVERFLOW_IN_TABLE_INSERT" : { +"message" : [ + "Fail to insert a value of type into the type column due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead." +], +"sqlState" : "22005" + }, "CONCURRENT_QUERY" : { "message" : [ "Another instance of this query was just started by a concurrent session." 
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala index aca99b001d2..b9e3c380216 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala @@ -26,7 +26,7 @@ import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._ import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy -import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType} +import org.apache.spark.sql.types.{ArrayType, DataType, DecimalType, IntegralType, MapType, StructType} object TableOutputResolver { def resolveOutputColumns( @@ -220,6 +220,21 @@ object TableOutputResolver { } } + private def containsIntegralOrDecimalType(dt: DataType): Boolean = dt match { +case _: IntegralType | _: DecimalType => true +case a: ArrayType => containsIntegralOrDecimalType(a.elementType) +case m: MapType => + containsIntegralOrDecimalType(m.keyType) || containsIntegralOrDecimalType(m.valueType) +case s: StructType => + s.fields.exists(sf => containsIntegralOrDecimalType(sf.dataType)) +case _ => false + } + + private def canCauseCastOverflow(cast: Cast): Boolean = { +containsIntegralOrDecimalType(cast.dataType) && + !Cast.canUpCast(cast.child.dataType, cast.dataType) + } + private
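A small usage sketch of the behavior described above (illustrative only, assuming an ANSI-enabled SparkSession named `spark`); `try_cast` is the workaround the new error message suggests:

```scala
// With spark.sql.ansi.enabled=true, inserting an out-of-range value now fails with
// CAST_OVERFLOW_IN_TABLE_INSERT instead of the generic CAST_OVERFLOW hint.
spark.sql("CREATE TABLE tiny(i TINYINT) USING parquet")
// spark.sql("INSERT INTO tiny VALUES (1000)")  // SparkArithmeticException: CAST_OVERFLOW_IN_TABLE_INSERT

// The suggested workaround: try_cast tolerates the overflow and stores NULL instead.
spark.sql("INSERT INTO tiny SELECT try_cast(1000 AS TINYINT)")
```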
[spark] branch master updated: [SPARK-39319][FOLLOW-UP][SQL] Make TreeNode.context as lazy val
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 90d83479a16 [SPARK-39319][FOLLOW-UP][SQL] Make TreeNode.context as lazy val 90d83479a16 is described below commit 90d83479a16cb594aa1ee6c6a8219dbb7d859752 Author: Gengliang Wang AuthorDate: Wed Jul 27 10:57:31 2022 +0900 [SPARK-39319][FOLLOW-UP][SQL] Make TreeNode.context as lazy val ### What changes were proposed in this pull request? - Make TreeNode.context as lazy val - Code clean up in SQLQueryContext ### Why are the changes needed? Making TreeNode.context as lazy val can save the memory usage, which is only called on certain expressions under ANSI SQL mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #37307 from gengliangwang/lazyVal. Authored-by: Gengliang Wang Signed-off-by: Hyukjin Kwon --- .../apache/spark/sql/catalyst/trees/SQLQueryContext.scala | 15 +-- .../org/apache/spark/sql/catalyst/trees/TreeNode.scala| 2 +- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala index 8f75079fcf9..a8806dbad4d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/SQLQueryContext.scala @@ -42,9 +42,7 @@ case class SQLQueryContext( */ lazy val summary: String = { // If the query context is missing or incorrect, simply return an empty string. -if (sqlText.isEmpty || originStartIndex.isEmpty || originStopIndex.isEmpty || - originStartIndex.get < 0 || originStopIndex.get >= sqlText.get.length || - originStartIndex.get > originStopIndex.get) { +if (!isValid) { "" } else { val positionContext = if (line.isDefined && startPosition.isDefined) { @@ -119,12 +117,17 @@ case class SQLQueryContext( /** Gets the textual fragment of a SQL query. */ override lazy val fragment: String = { -if (sqlText.isEmpty || originStartIndex.isEmpty || originStopIndex.isEmpty || - originStartIndex.get < 0 || originStopIndex.get >= sqlText.get.length || - originStartIndex.get > originStopIndex.get) { +if (!isValid) { "" } else { sqlText.get.substring(originStartIndex.get, originStopIndex.get) } } + + private def isValid: Boolean = { +sqlText.isDefined && originStartIndex.isDefined && originStopIndex.isDefined && + originStartIndex.get >= 0 && originStopIndex.get < sqlText.get.length && + originStartIndex.get <= originStopIndex.get + + } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index b8cfdcdbe7f..8f5858d2f4d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -66,7 +66,7 @@ case class Origin( objectType: Option[String] = None, objectName: Option[String] = None) { - val context: SQLQueryContext = SQLQueryContext( + lazy val context: SQLQueryContext = SQLQueryContext( line, startPosition, startIndex, stopIndex, sqlText, objectType, objectName) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
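A toy sketch (not Spark code) of the trade-off this change exploits: a `val` computes and stores its value when every instance is constructed, while a `lazy val` defers that work until something actually reads the field, which matches a query context that is only consulted by certain expressions under ANSI SQL mode.

```scala
// Toy illustration of val vs. lazy val initialization cost.
case class Eager(line: Option[Int]) {
  val context: String = s"expensive summary for $line"      // built for every instance
}

case class Deferred(line: Option[Int]) {
  lazy val context: String = s"expensive summary for $line" // built on first access only
}

val nodes = (1 to 1000).map(i => Deferred(Some(i))) // no summaries materialized yet
val one = nodes.head.context                        // only this single summary is computed
```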
[spark] branch master updated (d31505eb305 -> 5e43ae29ef2)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from d31505eb305 [SPARK-39870][PYTHON][TESTS] Add flag to 'run-tests.py' to support retaining the output add 5e43ae29ef2 [SPARK-39875][SQL] Change `protected` method in final class to `private` or `package-visible` No new revisions were added by this update. Summary of changes: .../src/main/java/org/apache/spark/unsafe/types/UTF8String.java | 2 +- .../sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java| 4 ++-- .../sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
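A short illustration (in Scala, mirroring the Java change; names are made up for the example) of why `protected` members in a `final` class are misleading: the class cannot be subclassed, so `protected` grants no additional access and `private` states the intent directly.

```scala
// A final class has no subclasses, so `protected` here would never widen access.
final class KeyValueBatch(capacity: Int) {
  private var numRows: Int = 0          // clearer than `protected`: no subclass can ever see it
  def append(): Unit = { numRows += 1 }
  def size: Int = numRows
}
```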
[spark] branch master updated (01d41e7de41 -> d31505eb305)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 01d41e7de41 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` add d31505eb305 [SPARK-39870][PYTHON][TESTS] Add flag to 'run-tests.py' to support retaining the output No new revisions were added by this update. Summary of changes: python/run-tests.py | 45 - 1 file changed, 36 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new f7a94093a67 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` f7a94093a67 is described below commit f7a94093a67956b94eaf99bdbb29c4351736d110 Author: yangjie01 AuthorDate: Wed Jul 27 09:26:47 2022 +0900 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This pr change `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce test maximum memory usage. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 Signed-off-by: Hyukjin Kwon (cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17) Signed-off-by: Hyukjin Kwon --- .../sql/execution/joins/BroadcastJoinSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 43 +++--- 2 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala index ef0a596f211..a7fe4d1c792 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins import scala.reflect.ClassTag import org.apache.spark.AccumulatorSuite +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.expressions.{BitwiseAnd, BitwiseOr, Cast, Literal, ShiftLeft} import org.apache.spark.sql.catalyst.plans.logical.BROADCAST @@ -51,7 +52,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with SQLTestUtils override def beforeAll(): Unit = { super.beforeAll() spark = SparkSession.builder() - .master("local-cluster[2,1,1024]") + .master("local-cluster[2,1,512]") + .config(EXECUTOR_MEMORY.key, "512m") .appName("testing") .getOrCreate() } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala index f9ea4e31487..b2ee82db390 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala @@ -28,6 +28,7 @@ import org.scalatest.Assertions._ import org.apache.spark._ import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.internal.config.UI.UI_ENABLED import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} @@ -66,7 +67,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"), "--name", "TemporaryHiveUDFTest", - "--master", "local-cluster[2,1,1024]", + "--master", 
"local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -83,7 +85,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest1", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -100,7 +103,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest2", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -119,7 +123,8 @@ clas
[spark] branch branch-3.1 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 3768ee1e775 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` 3768ee1e775 is described below commit 3768ee1e775d42920bb2583f5fcb5f15927688ad Author: yangjie01 AuthorDate: Wed Jul 27 09:26:47 2022 +0900 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This pr change `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce test maximum memory usage. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 Signed-off-by: Hyukjin Kwon (cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17) Signed-off-by: Hyukjin Kwon --- .../sql/execution/joins/BroadcastJoinSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 43 +++--- 2 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala index 98a1089709b..6883c8d1411 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins import scala.reflect.ClassTag import org.apache.spark.AccumulatorSuite +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft} import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide} @@ -54,7 +55,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with SQLTestUtils override def beforeAll(): Unit = { super.beforeAll() spark = SparkSession.builder() - .master("local-cluster[2,1,1024]") + .master("local-cluster[2,1,512]") + .config(EXECUTOR_MEMORY.key, "512m") .appName("testing") .getOrCreate() } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala index 426d93b3506..862d4a71ca1 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala @@ -29,6 +29,7 @@ import org.scalatest.matchers.must.Matchers import org.apache.spark._ import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.internal.config.UI.UI_ENABLED import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} @@ -67,7 +68,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"), "--name", "TemporaryHiveUDFTest", - 
"--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -84,7 +86,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest1", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -101,7 +104,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest2", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-opti
[spark] branch branch-3.2 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 421918dd95f [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` 421918dd95f is described below commit 421918dd95fde01baebe33415b7bb4ac2c8e0ec6 Author: yangjie01 AuthorDate: Wed Jul 27 09:26:47 2022 +0900 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This pr change `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce test maximum memory usage. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 Signed-off-by: Hyukjin Kwon (cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17) Signed-off-by: Hyukjin Kwon --- .../sql/execution/joins/BroadcastJoinSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 43 +++--- 2 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala index a8b4856261d..d67e2d6ef4f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins import scala.reflect.ClassTag import org.apache.spark.AccumulatorSuite +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft} import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide} @@ -56,7 +57,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with SQLTestUtils override def beforeAll(): Unit = { super.beforeAll() spark = SparkSession.builder() - .master("local-cluster[2,1,1024]") + .master("local-cluster[2,1,512]") + .config(EXECUTOR_MEMORY.key, "512m") .appName("testing") .getOrCreate() } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala index 90752e70e1b..66a915b4792 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala @@ -33,6 +33,7 @@ import org.scalatest.time.SpanSugar._ import org.apache.spark._ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.internal.config.UI.UI_ENABLED import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} @@ -73,7 +74,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", 
TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"), "--name", "TemporaryHiveUDFTest", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -90,7 +92,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest1", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -107,7 +110,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest2", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.re
[spark] branch branch-3.3 updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 9fdd097aa6c [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` 9fdd097aa6c is described below commit 9fdd097aa6c05e7ecfd33dccad876a00d96b6ddf Author: yangjie01 AuthorDate: Wed Jul 27 09:26:47 2022 +0900 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This pr change `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce test maximum memory usage. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 Signed-off-by: Hyukjin Kwon (cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17) Signed-off-by: Hyukjin Kwon --- .../sql/execution/joins/BroadcastJoinSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 43 +++--- 2 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala index 256e9426202..2d553d2b84f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins import scala.reflect.ClassTag import org.apache.spark.AccumulatorSuite +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft} import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide} @@ -56,7 +57,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with SQLTestUtils override def beforeAll(): Unit = { super.beforeAll() spark = SparkSession.builder() - .master("local-cluster[2,1,1024]") + .master("local-cluster[2,1,512]") + .config(EXECUTOR_MEMORY.key, "512m") .appName("testing") .getOrCreate() } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala index 170cf4898f3..fc8d6e61a0d 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala @@ -33,6 +33,7 @@ import org.scalatest.time.SpanSugar._ import org.apache.spark._ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.internal.config.UI.UI_ENABLED import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} @@ -73,7 +74,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", 
TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"), "--name", "TemporaryHiveUDFTest", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -90,7 +92,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest1", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -107,7 +110,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest2", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.re
[spark] branch master updated: [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 01d41e7de41 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` 01d41e7de41 is described below commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17 Author: yangjie01 AuthorDate: Wed Jul 27 09:26:47 2022 +0900 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This pr change `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce test maximum memory usage. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 Signed-off-by: Hyukjin Kwon --- .../sql/execution/joins/BroadcastJoinSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 43 +++--- 2 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala index 3cc43e2dd41..6333808b420 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.joins import scala.reflect.ClassTag import org.apache.spark.AccumulatorSuite +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.sql.{Dataset, QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BitwiseAnd, BitwiseOr, Cast, Expression, Literal, ShiftLeft} import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide} @@ -56,7 +57,8 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with SQLTestUtils override def beforeAll(): Unit = { super.beforeAll() spark = SparkSession.builder() - .master("local-cluster[2,1,1024]") + .master("local-cluster[2,1,512]") + .config(EXECUTOR_MEMORY.key, "512m") .appName("testing") .getOrCreate() } diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala index 18202307fc4..fa4d1b78d1c 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala @@ -33,6 +33,7 @@ import org.scalatest.time.SpanSugar._ import org.apache.spark._ import org.apache.spark.deploy.SparkSubmitTestUtils import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.EXECUTOR_MEMORY import org.apache.spark.internal.config.UI.UI_ENABLED import org.apache.spark.sql.{QueryTest, Row, SparkSession} import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} @@ -74,7 +75,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", TemporaryHiveUDFTest.getClass.getName.stripSuffix("$"), "--name", "TemporaryHiveUDFTest", - "--master", "local-cluster[2,1,1024]", + "--master", 
"local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -91,7 +93,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest1.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest1", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -108,7 +111,8 @@ class HiveSparkSubmitSuite val args = Seq( "--class", PermanentHiveUDFTest2.getClass.getName.stripSuffix("$"), "--name", "PermanentHiveUDFTest2", - "--master", "local-cluster[2,1,1024]", + "--master", "local-cluster[2,1,512]", + "--conf", s"${EXECUTOR_MEMORY.key}=512m", "--conf", "spark.ui.enabled=false", "--conf", "spark.master.rest.enabled=false", "--driver-java-options", "-Dderby.system.durability=test", @@ -127,7 +131,8 @@ class Hiv
[spark] branch master updated: [SPARK-39823][SQL][PYTHON] Rename Dataset.as as Dataset.to and add DataFrame.to in PySpark
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 91b95056806 [SPARK-39823][SQL][PYTHON] Rename Dataset.as as Dataset.to and add DataFrame.to in PySpark 91b95056806 is described below commit 91b950568066830ecd7a4581ab5bf4dbdbbeb474 Author: Ruifeng Zheng AuthorDate: Wed Jul 27 08:11:18 2022 +0800 [SPARK-39823][SQL][PYTHON] Rename Dataset.as as Dataset.to and add DataFrame.to in PySpark ### What changes were proposed in this pull request? 1, rename `Dataset.as(StructType)` to `Dataset.to(StructType)`, since `as` is a keyword in python, we dont want to use a different name; 2, Add `DataFrame.to(StructType)` in Python ### Why are the changes needed? for function parity ### Does this PR introduce _any_ user-facing change? yes, new api ### How was this patch tested? added UT Closes #37233 from zhengruifeng/py_ds_as. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- .../source/reference/pyspark.sql/dataframe.rst | 1 + python/pyspark/sql/dataframe.py| 50 + python/pyspark/sql/tests/test_dataframe.py | 36 ++- .../main/scala/org/apache/spark/sql/Dataset.scala | 4 +- ...emaSuite.scala => DataFrameToSchemaSuite.scala} | 52 +++--- 5 files changed, 114 insertions(+), 29 deletions(-) diff --git a/python/docs/source/reference/pyspark.sql/dataframe.rst b/python/docs/source/reference/pyspark.sql/dataframe.rst index 5b6e704ba48..8cf083e5dd4 100644 --- a/python/docs/source/reference/pyspark.sql/dataframe.rst +++ b/python/docs/source/reference/pyspark.sql/dataframe.rst @@ -102,6 +102,7 @@ DataFrame DataFrame.summary DataFrame.tail DataFrame.take +DataFrame.to DataFrame.toDF DataFrame.toJSON DataFrame.toLocalIterator diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index efebd05c08d..481dafa310d 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -1422,6 +1422,56 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin): jc = self._jdf.colRegex(colName) return Column(jc) +def to(self, schema: StructType) -> "DataFrame": +""" +Returns a new :class:`DataFrame` where each row is reconciled to match the specified +schema. + +Notes +- +1, Reorder columns and/or inner fields by name to match the specified schema. + +2, Project away columns and/or inner fields that are not needed by the specified schema. +Missing columns and/or inner fields (present in the specified schema but not input +DataFrame) lead to failures. + +3, Cast the columns and/or inner fields to match the data types in the specified schema, +if the types are compatible, e.g., numeric to numeric (error if overflows), but not string +to int. + +4, Carry over the metadata from the specified schema, while the columns and/or inner fields +still keep their own metadata if not overwritten by the specified schema. + +5, Fail if the nullability is not compatible. For example, the column and/or inner field +is nullable but the specified schema requires them to be not nullable. + +.. versionadded:: 3.4.0 + +Parameters +-- +schema : :class:`StructType` +Specified schema. 
+ +Examples + +>>> df = spark.createDataFrame([("a", 1)], ["i", "j"]) +>>> df.schema +StructType([StructField('i', StringType(), True), StructField('j', LongType(), True)]) +>>> schema = StructType([StructField("j", StringType()), StructField("i", StringType())]) +>>> df2 = df.to(schema) +>>> df2.schema +StructType([StructField('j', StringType(), True), StructField('i', StringType(), True)]) +>>> df2.show() ++---+---+ +| j| i| ++---+---+ +| 1| a| ++---+---+ +""" +assert schema is not None +jschema = self._jdf.sparkSession().parseDataType(schema.json()) +return DataFrame(self._jdf.to(jschema), self.sparkSession) + def alias(self, alias: str) -> "DataFrame": """Returns a new :class:`DataFrame` with an alias set. diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index ac6b6f68aed..7c7d3d1e51c 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -25,11 +25,12 @@ import unittest from typing import cast from pyspark.sql import SparkSession, Row -from pyspark.sql.functions import col, lit, count, sum, mean +from pyspark.sql.functi
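For the Scala side of the rename, a brief hedged usage sketch of `Dataset.to(schema)` (formerly `as(schema)`), mirroring the Python doctest above and assuming an active SparkSession named `spark`:

```scala
import spark.implicits._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = Seq(("a", 1L)).toDF("i", "j")
val target = StructType(Seq(StructField("j", StringType), StructField("i", StringType)))

// Columns are reordered and cast (where compatible) to match the target schema.
val reconciled = df.to(target)
reconciled.printSchema()
```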
[spark] branch master updated: [SPARK-39884][K8S] `KubernetesExecutorBackend` should handle `IPv6` hostname
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 07ccaa6f4c7 [SPARK-39884][K8S] `KubernetesExecutorBackend` should handle `IPv6` hostname 07ccaa6f4c7 is described below commit 07ccaa6f4c7c37339ee10c3d7b337059fc7468a2 Author: Dongjoon Hyun AuthorDate: Tue Jul 26 17:00:33 2022 -0700 [SPARK-39884][K8S] `KubernetesExecutorBackend` should handle `IPv6` hostname ### What changes were proposed in this pull request? This PR aims to fix `KubernetesExecutorBackend` to handle IPv6 hostname. ### Why are the changes needed? `entrypoint.sh` uses `SPARK_EXECUTOR_POD_IP` for `--hostname`. https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L97 However, IPv6 `status.podIP` has no `[]`. We need to handle both IPv6 cases (`[]` or no `[]`). ```yaml - name: SPARK_EXECUTOR_POD_IP fieldPath: status.podIP value: -Djava.net.preferIPv6Addresses=true podIP: 2600:1f14:552:fb00:4c14::3 ``` ### Does this PR introduce _any_ user-facing change? No, previously, it fails immediately. ``` 22/07/26 19:20:46 INFO Executor: Starting executor ID 5 on host 2600:1f14:552:fb02:2dc2::1 22/07/26 19:20:46 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to assertion failed: Expected hostname or IPv6 IP enclosed in [] but got 2600:1f14:552:fb02:2dc2::1 ``` ### How was this patch tested? Pass the CIs. Closes #37306 from dongjoon-hyun/SPARK-39884. Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala index fbf485cfa2f..bec54a11366 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBackend.scala @@ -158,7 +158,8 @@ private[spark] object KubernetesExecutorBackend extends Logging { bindAddress = value argv = tail case ("--hostname") :: value :: tail => - hostname = value + // entrypoint.sh sets SPARK_EXECUTOR_POD_IP without '[]' + hostname = Utils.addBracketsIfNeeded(value) argv = tail case ("--cores") :: value :: tail => cores = value.toInt - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
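A hedged sketch of the normalization the fix relies on; the real helper is `Utils.addBracketsIfNeeded`, and this standalone version only illustrates the intended behavior under stated assumptions, not the actual implementation:

```scala
// Wrap a bare IPv6 literal in brackets so it can be used as a host name / URI authority;
// leave IPv4 addresses, plain hostnames, and already-bracketed addresses untouched.
def addBracketsIfNeeded(addr: String): String =
  if (addr.contains(":") && !addr.startsWith("[")) s"[$addr]" else addr

assert(addBracketsIfNeeded("2600:1f14:552:fb02:2dc2::1") == "[2600:1f14:552:fb02:2dc2::1]")
assert(addBracketsIfNeeded("[2600:1f14:552:fb02:2dc2::1]") == "[2600:1f14:552:fb02:2dc2::1]")
assert(addBracketsIfNeeded("10.0.0.1") == "10.0.0.1")
```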
[spark] branch master updated: [SPARK-39319][CORE][SQL] Make query contexts as a part of `SparkThrowable`
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e9eb28e27d1 [SPARK-39319][CORE][SQL] Make query contexts as a part of `SparkThrowable` e9eb28e27d1 is described below commit e9eb28e27d10497c8b36774609823f4bbd2c8500 Author: Max Gekk AuthorDate: Tue Jul 26 20:17:09 2022 +0500 [SPARK-39319][CORE][SQL] Make query contexts as a part of `SparkThrowable` ### What changes were proposed in this pull request? In this PR, I propose to add a new interface, `QueryContext`, to Spark core, and to allow getting an instance of `QueryContext` from Spark's exceptions of type `SparkThrowable`. For instance, `QueryContext` should help users figure out where an error occurs while executing queries in Spark SQL. This PR also adds `SqlQueryContext`, one implementation of `QueryContext`, to Spark SQL's `Origin`, which carries the context of TreeNodes plus a textual summary of the error. The `context` value in `Origin` holds all the structural info needed to link an error to the fragment of the SQL query it came from. All of Spark's exceptions are modified to accept the optional `QueryContext` and a pre-built text summary, and SQL expressions initialize and pass the new context to exceptions. Closes #36702 ### Why are the changes needed? Going forward, this enriches the information carried by error messages. With the change, it is possible to produce a pretty-printed error message like ```sql > SELECT * FROM v1; { "errorClass" : [ "DIVIDE_BY_ZERO" ], "parameters" : [ { "name" : "config", "value" : "spark.sql.ansi.enabled" } ], "sqlState" : "42000", "context" : { "objectType" : "VIEW", "objectName" : "default.v1", "indexStart" : 36, "indexEnd" : 41, "fragment" : "1 / 0" } } ``` ### Does this PR introduce _any_ user-facing change? Yes. The PR changes Spark's exceptions by replacing the type of `queryContext` from `String` to `Option[QueryContext]`, so user code that uses `queryContext` can fail. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "test:testOnly *DecimalExpressionSuite" $ build/sbt "test:testOnly *TreeNodeSuite" ``` and the affected test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Authored-by: Max Gekk Co-authored-by: Gengliang Wang Closes #37209 from MaxGekk/query-context-in-sparkthrowable.
Authored-by: Max Gekk Signed-off-by: Max Gekk --- .../main/java/org/apache/spark/QueryContext.java | 48 .../main/java/org/apache/spark/SparkThrowable.java | 2 + .../apache/spark/memory/SparkOutOfMemoryError.java | 3 +- .../main/scala/org/apache/spark/ErrorInfo.scala| 16 ++- .../scala/org/apache/spark/SparkException.scala| 58 + .../spark/sql/catalyst/expressions/Cast.scala | 20 ++-- .../sql/catalyst/expressions/Expression.scala | 6 +- .../catalyst/expressions/aggregate/Average.scala | 14 +-- .../sql/catalyst/expressions/aggregate/Sum.scala | 18 +-- .../sql/catalyst/expressions/arithmetic.scala | 27 ++--- .../expressions/collectionOperations.scala | 8 +- .../expressions/complexTypeExtractors.scala| 13 ++- .../catalyst/expressions/decimalExpressions.scala | 25 ++-- .../catalyst/expressions/intervalExpressions.scala | 22 ++-- .../sql/catalyst/expressions/mathExpressions.scala | 2 +- .../catalyst/expressions/stringExpressions.scala | 10 +- .../spark/sql/catalyst/trees/SQLQueryContext.scala | 130 + .../apache/spark/sql/catalyst/trees/TreeNode.scala | 99 ++-- .../spark/sql/catalyst/util/DateTimeUtils.scala| 22 ++-- .../spark/sql/catalyst/util/IntervalUtils.scala| 2 +- .../apache/spark/sql/catalyst/util/MathUtils.scala | 38 +++--- .../spark/sql/catalyst/util/UTF8StringUtils.scala | 25 ++-- .../apache/spark/sql/errors/QueryErrorsBase.scala | 5 + .../spark/sql/errors/QueryExecutionErrors.scala| 74 +++- .../scala/org/apache/spark/sql/types/Decimal.scala | 7 +- .../expressions/DecimalExpressionSuite.scala | 4 +- .../spark/sql/catalyst/trees/TreeNodeSuite.scala | 15 ++- 27 files changed, 445 insertions(+), 268 deletions(-) diff --git a/core/src/main/java/org/apache/spark/QueryContext.java b/core/src/main/java/org/apache/spark/QueryContext.java new file mode 100644 index 000..de5b29d0295 --- /dev/null +++ b/core/src/main/java/org/apache/spark/QueryContext.java @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under
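The new `QueryContext.java` is truncated above, but the fields in the example error message (object type and name, start/end indexes, and the SQL fragment) suggest roughly the following shape. This is a hedged Scala sketch for orientation only; the member names below are assumptions, and the Java interface added by the commit is the authoritative definition.

```scala
// Rough sketch of what a query context carries, inferred from the example
// error message above; field names here are assumptions, not the real API.
trait QueryContextSketch {
  def objectType: String // e.g. "VIEW"
  def objectName: String // e.g. "default.v1"
  def startIndex: Int    // offset where the failing fragment starts in the query text
  def stopIndex: Int     // offset where the failing fragment ends
  def fragment: String   // e.g. "1 / 0"
}

// A SparkThrowable-style exception could then carry the context optionally,
// which is what the PR description means by "the optional QueryContext".
// (This case class is purely illustrative, not part of the commit.)
case class DivideByZeroExampleError(
    message: String,
    context: Option[QueryContextSketch]) extends Exception(message)
```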
[spark] branch master updated: [SPARK-39874][SQL][TESTS] Add System.gc at beforeEach in BroadcastJoinSuite*
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 329bb1430b1 [SPARK-39874][SQL][TESTS] Add System.gc at beforeEach in BroadcastJoinSuite* 329bb1430b1 is described below commit 329bb1430b175cc6a6c4769cfc99ec07cc306a6f Author: Hyukjin Kwon AuthorDate: Tue Jul 26 22:20:14 2022 +0900 [SPARK-39874][SQL][TESTS] Add System.gc at beforeEach in BroadcastJoinSuite* ### What changes were proposed in this pull request? This PR is similar to https://github.com/apache/spark/pull/37291. Call `System.gc()`. ### Why are the changes needed? To deflake the suite. See https://github.com/MaxGekk/spark/runs/7516270590?check_suite_focus=true ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? The CI should be monitored. Closes #37297 from HyukjinKwon/SPARK-39874. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- .../org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala| 5 + 1 file changed, 5 insertions(+) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala index 256e9426202..3cc43e2dd41 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala @@ -70,6 +70,11 @@ abstract class BroadcastJoinSuiteBase extends QueryTest with SQLTestUtils } } + override def beforeEach(): Unit = { +super.beforeEach() +System.gc() + } + /** * Test whether the specified broadcast join updates the peak execution memory accumulator. */
[spark] branch master updated: [SPARK-39869][SQL][TESTS] Fix flaky hive - slow tests because of out-of-memory
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c4c623a3a89 [SPARK-39869][SQL][TESTS] Fix flaky hive - slow tests because of out-of-memory c4c623a3a89 is described below commit c4c623a3a890267b2f9f8d472c8be30fc5db53e1 Author: Hyukjin Kwon AuthorDate: Tue Jul 26 18:37:20 2022 +0900 [SPARK-39869][SQL][TESTS] Fix flaky hive - slow tests because of out-of-memory ### What changes were proposed in this pull request? This PR adds some manual `System.gc` calls. This doesn't guarantee a garbage collection and may sound like a hack, but it has worked in my experience so far, and the same workaround has been used in a few places before. ### Why are the changes needed? To deflake the tests. ### Does this PR introduce _any_ user-facing change? No, dev and test-only. ### How was this patch tested? CI in this PR should test it out. Closes #37291 from HyukjinKwon/SPARK-39869. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- .../apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala | 5 + .../test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala | 1 + 2 files changed, 6 insertions(+) diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala index ecbe87b163d..debae3ad520 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala @@ -1388,6 +1388,11 @@ abstract class HiveThriftServer2TestBase extends SparkFunSuite with BeforeAndAft """.stripMargin) } + override def beforeEach(): Unit = { +super.beforeEach() +System.gc() + } + override protected def beforeAll(): Unit = { super.beforeAll() diagnosisBuffer.clear() diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala index 170cf4898f3..18202307fc4 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala @@ -63,6 +63,7 @@ class HiveSparkSubmitSuite override def beforeEach(): Unit = { super.beforeEach() +System.gc() } test("temporary Hive UDF: define a UDF and use it") {
[spark] branch master updated (55c3347c48f -> de9a4b0747a)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 55c3347c48f [SPARK-38864][SQL] Add unpivot / melt to Dataset add de9a4b0747a [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/TPCDSQuerySuite.scala | 6 -- .../org/apache/spark/sql/TPCDSQueryTestSuite.scala | 23 -- 2 files changed, 17 insertions(+), 12 deletions(-)
[spark] branch branch-3.3 updated: [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new c9d56758a8c [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions c9d56758a8c is described below commit c9d56758a8c28a44161f63eb5c8763ab92616a56 Author: Hyukjin Kwon AuthorDate: Tue Jul 26 18:25:50 2022 +0900 [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions ### What changes were proposed in this pull request? This PR reverts https://github.com/apache/spark/commit/7358253755762f9bfe6cedc1a50ec14616cfeace, https://github.com/apache/spark/commit/ae1f6a26ed39b297ace8d6c9420b72a3c01a3291 and https://github.com/apache/spark/commit/72b55ccf8327c00e173ab6130fdb428ad0d5aacc because they do not help fixing the TPC-DS build. In addition, this PR skips the problematic query in GitHub Actions to avoid OOM. ### Why are the changes needed? To make the build pass. ### Does this PR introduce _any_ user-facing change? No, dev and test-only. ### How was this patch tested? CI in this PR should test it out. Closes #37289 from HyukjinKwon/SPARK-39856-followup. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit de9a4b0747a4127e320f80f5e1bf431429da70a9) Signed-off-by: Hyukjin Kwon --- .../org/apache/spark/sql/TPCDSQuerySuite.scala | 6 -- .../org/apache/spark/sql/TPCDSQueryTestSuite.scala | 23 -- 2 files changed, 17 insertions(+), 12 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala index 22e1b838f3f..8c4d25a7eb9 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala @@ -29,7 +29,8 @@ import org.apache.spark.tags.ExtendedSQLTest @ExtendedSQLTest class TPCDSQuerySuite extends BenchmarkQueryTest with TPCDSBase { - tpcdsQueries.foreach { name => + // q72 is skipped due to GitHub Actions' memory limit. + tpcdsQueries.filterNot(sys.env.contains("GITHUB_ACTIONS") && _ == "q72").foreach { name => val queryString = resourceToString(s"tpcds/$name.sql", classLoader = Thread.currentThread().getContextClassLoader) test(name) { @@ -39,7 +40,8 @@ class TPCDSQuerySuite extends BenchmarkQueryTest with TPCDSBase { } } - tpcdsQueriesV2_7_0.foreach { name => + // q72 is skipped due to GitHub Actions' memory limit. 
+ tpcdsQueriesV2_7_0.filterNot(sys.env.contains("GITHUB_ACTIONS") && _ == "q72").foreach { name => val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql", classLoader = Thread.currentThread().getContextClassLoader) test(s"$name-v2.7") { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala index 9affe827bc1..8019fc98a52 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryTestSuite.scala @@ -62,7 +62,7 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase with SQLQueryTestHelp // To make output results deterministic override protected def sparkConf: SparkConf = super.sparkConf -.set(SQLConf.SHUFFLE_PARTITIONS.key, 32.toString) +.set(SQLConf.SHUFFLE_PARTITIONS.key, "1") protected override def createSparkSession: TestSparkSession = { new TestSparkSession(new SparkContext("local[1]", this.getClass.getSimpleName, sparkConf)) @@ -105,6 +105,7 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase with SQLQueryTestHelp query: String, goldenFile: File, conf: Map[String, String]): Unit = { +val shouldSortResults = sortMergeJoinConf != conf // Sort for other joins withSQLConf(conf.toSeq: _*) { try { val (schema, output) = handleExceptions(getNormalizedResult(spark, query)) @@ -142,15 +143,17 @@ class TPCDSQueryTestSuite extends QueryTest with TPCDSBase with SQLQueryTestHelp assertResult(expectedSchema, s"Schema did not match\n$queryString") { schema } -// Truncate precisions because they can be vary per how the shuffle is performed. -val expectSorted = expectedOutput.split("\n").sorted.map(_.trim) - .mkString("\n").replaceAll("\\s+$", "") - .replaceAll("""([0-9]+.[0-9]{10})([0-9]*)""", "$1") -val outputSorted = output.sorted.map(_.trim).mkString("\n") - .replaceAll("\\s+$", "") - .replaceAll("""([0-9]+.[0-9]{10})([0-9]*)""", "$1") -assertResult(expectSorted, s"Result did not match\n$queryString") { - outputS
[spark] branch master updated: [SPARK-38864][SQL] Add unpivot / melt to Dataset
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 55c3347c48f [SPARK-38864][SQL] Add unpivot / melt to Dataset 55c3347c48f is described below commit 55c3347c48f93a9c5c5c2fb00b30f838eb081b7f Author: Enrico Minack AuthorDate: Tue Jul 26 15:50:03 2022 +0800 [SPARK-38864][SQL] Add unpivot / melt to Dataset ### What changes were proposed in this pull request? This proposes a Scala implementation of the `melt` (aka. `unpivot`) operation. ### Why are the changes needed? The Scala Dataset API provides the `pivot` operation, but not its reverse operation `unpivot` or `melt`. The `melt` operation is part of the [Pandas API](https://pandas.pydata.org/docs/reference/api/pandas.melt.html), which is why this method is provided by PySpark Pandas API, implemented purely in Python. [It should be implemented in Scala](https://github.com/apache/spark/pull/26912#pullrequestreview-332975715) to make this operation available to Scala / Java, SQL, PySpark, and to reuse the implementation in PySpark Pandas APIs. The `melt` / `unpivot` operation exists in other systems like [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator), [T-SQL](https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver15#unpivot-example), [Oracle](https://www.oracletutorial.com/oracle-basics/oracle-unpivot/). It supports expressions for ids and value columns including `*` expansion and structs. So this also fixes / includes SPARK-39292. ### Does this PR introduce _any_ user-facing change? It adds `melt` to the `Dataset` API (Scala and Java). ### How was this patch tested? It is tested in the `DatasetMeltSuite` and `JavaDatasetSuite`. Closes #36150 from EnricoMi/branch-melt. 
Authored-by: Enrico Minack Signed-off-by: Wenchen Fan --- core/src/main/resources/error/error-classes.json | 12 + .../spark/sql/catalyst/analysis/Analyzer.scala | 41 ++ .../sql/catalyst/analysis/AnsiTypeCoercion.scala | 1 + .../sql/catalyst/analysis/CheckAnalysis.scala | 8 + .../spark/sql/catalyst/analysis/TypeCoercion.scala | 16 + .../plans/logical/basicLogicalOperators.scala | 39 ++ .../sql/catalyst/rules/RuleIdCollection.scala | 1 + .../spark/sql/catalyst/trees/TreePatterns.scala| 1 + .../spark/sql/errors/QueryCompilationErrors.scala | 18 + .../main/scala/org/apache/spark/sql/Dataset.scala | 138 +- .../spark/sql/RelationalGroupedDataset.scala | 18 + .../org/apache/spark/sql/DatasetUnpivotSuite.scala | 543 + .../spark/sql/errors/QueryErrorsSuiteBase.scala| 3 +- 13 files changed, 837 insertions(+), 2 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index e2a99c1a62e..29ca280719e 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -375,6 +375,18 @@ "Unable to acquire bytes of memory, got " ] }, + "UNPIVOT_REQUIRES_VALUE_COLUMNS" : { +"message" : [ + "At least one value column needs to be specified for UNPIVOT, all columns specified as ids" +], +"sqlState" : "42000" + }, + "UNPIVOT_VALUE_DATA_TYPE_MISMATCH" : { +"message" : [ + "Unpivot value columns must share a least common type, some types do not: []" +], +"sqlState" : "42000" + }, "UNRECOGNIZED_SQL_TYPE" : { "message" : [ "Unrecognized SQL type " diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index f40c260eb6f..a6108c2a3d3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -293,6 +293,7 @@ class Analyzer(override val catalogManager: CatalogManager) ResolveUpCast :: ResolveGroupingAnalytics :: ResolvePivot :: + ResolveUnpivot :: ResolveOrdinalInOrderByAndGroupBy :: ResolveAggAliasInGroupBy :: ResolveMissingReferences :: @@ -514,6 +515,10 @@ class Analyzer(override val catalogManager: CatalogManager) if child.resolved && groupByOpt.isDefined && hasUnresolvedAlias(groupByOpt.get) => Pivot(Some(assignAliases(groupByOpt.get)), pivotColumn, pivotValues, aggregates, child) + case up: Unpivot if up.child.resolved && +(hasUnresolvedAlias(up.ids) || hasUnresolvedAlias(up.values)) => +up.copy(ids = assignAliases(up.ids), values = assignAliases(up.values)) + case Project(p
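As a quick orientation for what unpivot/melt does to a DataFrame, here is a small usage sketch. The method name and argument layout are assumptions for illustration; the exact API added by this commit is defined in `Dataset.scala` and exercised in `DatasetUnpivotSuite`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Wide input: one row per id, one column per quarter.
val wide = Seq((1, 10, 20), (2, 30, 40)).toDF("id", "q1", "q2")

// Unpivot/melt: keep "id" as the identifier, turn q1/q2 into (quarter, sales) rows.
// The signature shown here is assumed for illustration, not copied from the commit.
val long = wide.unpivot(
  Array($"id"),        // id columns, repeated on every output row
  Array($"q1", $"q2"), // value columns, each becomes one output row per input row
  "quarter",           // name of the variable column
  "sales")             // name of the value column

long.show()
// +---+-------+-----+
// | id|quarter|sales|
// +---+-------+-----+
// |  1|     q1|   10|
// |  1|     q2|   20|
// |  2|     q1|   30|
// |  2|     q2|   40|
// +---+-------+-----+
```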
[spark] branch master updated (72b55ccf832 -> 5c0d5515956)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 72b55ccf832 [SPARK-39856][SQL][TESTS][FOLLOW-UP] Increase the number of partitions in TPC-DS build to avoid out-of-memory add 5c0d5515956 [SPARK-39808][SQL] Support aggregate function MODE No new revisions were added by this update. Summary of changes: .../sql/catalyst/analysis/FunctionRegistry.scala | 1 + .../sql/catalyst/expressions/aggregate/Mode.scala | 94 ++ .../expressions/aggregate/interfaces.scala | 67 +++ .../expressions/aggregate/percentiles.scala| 60 +- .../sql-functions/sql-expression-schema.md | 1 + .../test/resources/sql-tests/inputs/group-by.sql | 4 + .../resources/sql-tests/results/group-by.sql.out | 19 + 7 files changed, 188 insertions(+), 58 deletions(-) create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala
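The change summary above only lists touched files, so as a reminder of the semantics: a MODE aggregate returns the most frequent value within each group. Below is a minimal usage sketch, assuming the function is callable as `mode` in SQL (it is registered in `FunctionRegistry` and tested via the new `group-by.sql` cases, though the exact call form here is illustrative).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Two teams; 'sql' is the most frequent dept in both groups, so mode(dept) = 'sql'.
spark.range(0, 6)
  .selectExpr("id % 2 AS team", "CASE WHEN id < 4 THEN 'sql' ELSE 'core' END AS dept")
  .createOrReplaceTempView("members")

spark.sql("SELECT team, mode(dept) AS most_common_dept FROM members GROUP BY team").show()
// +----+----------------+
// |team|most_common_dept|
// +----+----------------+
// |   0|             sql|
// |   1|             sql|
// +----+----------------+
```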