[spark] branch master updated: [SPARK-37702][SQL] Support ANSI Aggregate Function: regr_syy
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6604929fe6d [SPARK-37702][SQL] Support ANSI Aggregate Function: regr_syy
6604929fe6d is described below

commit 6604929fe6dee7216a24b4d0c96d478cae13e2b9
Author: Jiaan Geng
AuthorDate: Thu May 12 13:30:26 2022 +0800

    [SPARK-37702][SQL] Support ANSI Aggregate Function: regr_syy

    ### What changes were proposed in this pull request?
    This PR adds support for the ANSI aggregate function `regr_syy`.
    The mainstream databases that support `regr_syy` are shown below:

    **Teradata** https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/ZsXbiMrja5EYTft42VoiTQ
    **Snowflake** https://docs.snowflake.com/en/sql-reference/functions/regr_syy.html
    **Oracle** https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
    **DB2** https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
    **H2** http://www.h2database.com/html/functions-aggregate.html#regr_syy
    **Postgresql** https://www.postgresql.org/docs/8.4/functions-aggregate.html
    **Sybase** https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-syy-function.html
    **Exasol** https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

    ### Why are the changes needed?
    `regr_syy` belongs to the ANSI SQL standard set of linear-regression aggregates and is widely supported by mainstream databases (see the links above).

    ### Does this PR introduce _any_ user-facing change?
    Yes. New feature.

    ### How was this patch tested?
    New tests.

    Closes #36292 from beliefer/SPARK-37702.

    Authored-by: Jiaan Geng
    Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../expressions/aggregate/CentralMomentAgg.scala   |  4 +-
 .../expressions/aggregate/linearRegression.scala   | 33 +++-
 .../sql-functions/sql-expression-schema.md         |  1 +
 .../test/resources/sql-tests/inputs/group-by.sql   |  6 +++
 .../inputs/postgreSQL/aggregates_part1.sql         | 23
 .../inputs/udf/postgreSQL/udf-aggregates_part1.sql | 23
 .../resources/sql-tests/results/group-by.sql.out   | 35 +++-
 .../results/postgreSQL/aggregates_part1.sql.out    | 62 +-
 .../udf/postgreSQL/udf-aggregates_part1.sql.out    | 62 +-
 10 files changed, 220 insertions(+), 30 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index f7af5b35a3b..a56ef175b5e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -503,6 +503,7 @@ object FunctionRegistry {
     expression[RegrR2]("regr_r2"),
     expression[RegrSXX]("regr_sxx"),
     expression[RegrSXY]("regr_sxy"),
+    expression[RegrSYY]("regr_syy"),

     // string functions
     expression[Ascii]("ascii"),
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
index a40c5e4815f..f9bfe77ce5a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
@@ -264,7 +264,7 @@ case class VarianceSamp(
     copy(child = newChild)
 }

-case class RegrSXXReplacement(child: Expression)
+case class RegrReplacement(child: Expression)
   extends CentralMomentAgg(child, !SQLConf.get.legacyStatisticalAggregate) {

   override protected def momentOrder = 2
@@ -273,7 +273,7 @@ case class RegrSXXReplacement(child: Expression)
     If(n === 0.0, Literal.create(null, DoubleType), m2)
   }

-  override protected def withNewChildInternal(newChild: Expression): RegrSXXReplacement =
+  override protected def withNewChildInternal(newChild: Expression): RegrReplacement =
     copy(child = newChild)
 }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
index 568c186f06d..0372f94031e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
@@ -174,7 +174,7 @@ case class
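For reference, a minimal usage sketch of the new aggregate (not part of the commit above; it assumes a running `SparkSession` named `spark` on a build that includes this change, and the inline `VALUES` data is made up):

```python
# Hedged sketch: regr_syy(y, x) aggregates the sum of squares of the
# dependent variable y over rows where both y and x are non-null.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, 3), (2, 4) AS t(y, x)"
).show()
# For y = 1, 2, 2 the result is sum((y - avg(y))^2) = 2/3 ~= 0.6667
```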
[spark] branch branch-3.2 updated: [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 6f9e3034ada [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow
6f9e3034ada is described below

commit 6f9e3034ada72f372dafe93152e01ad5cb323989
Author: Vitalii Li
AuthorDate: Thu May 12 08:13:51 2022 +0300

    [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow

    ### What changes were proposed in this pull request?
    This PR removes an extra curly bracket from the debug string for the Decimal type in SQL. This is a backport from the master branch.
    Commit: https://github.com/apache/spark/commit/165ce4eb7d6d75201beb1bff879efa99fde24f94

    ### Why are the changes needed?
    Fixes a typo in the error messages of decimal overflow.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    By running tests:
    ```
    $ build/sbt "sql/testOnly"
    ```

    Closes #36458 from vli-databricks/SPARK-39060-3.2.

    Authored-by: Vitalii Li
    Signed-off-by: Max Gekk
---
 .../src/main/scala/org/apache/spark/sql/types/Decimal.scala    | 4 ++--
 .../sql-tests/results/ansi/decimalArithmeticOperations.sql.out | 8
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
index 46814297231..bc5fba8d0d8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
@@ -227,9 +227,9 @@ final class Decimal extends Ordered[Decimal] with Serializable {

   def toDebugString: String = {
     if (decimalVal.ne(null)) {
-      s"Decimal(expanded,$decimalVal,$precision,$scale})"
+      s"Decimal(expanded, $decimalVal, $precision, $scale)"
     } else {
-      s"Decimal(compact,$longVal,$precision,$scale})"
+      s"Decimal(compact, $longVal, $precision, $scale)"
     }
   }

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out
index 2f3513e734f..c65742e4d8b 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out
@@ -76,7 +76,7 @@ select (5e36BD + 0.1) + 5e36BD
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,10.1,39,1}) cannot be represented as Decimal(38, 1).
+Decimal(expanded, 10.1, 39, 1) cannot be represented as Decimal(38, 1).


 -- !query
@@ -85,7 +85,7 @@ select (-4e36BD - 0.1) - 7e36BD
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,-11.1,39,1}) cannot be represented as Decimal(38, 1).
+Decimal(expanded, -11.1, 39, 1) cannot be represented as Decimal(38, 1).


 -- !query
@@ -94,7 +94,7 @@ select 12345678901234567890.0 * 12345678901234567890.0
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,152415787532388367501905199875019052100,39,0}) cannot be represented as Decimal(38, 2).
+Decimal(expanded, 152415787532388367501905199875019052100, 39, 0) cannot be represented as Decimal(38, 2).


 -- !query
@@ -103,7 +103,7 @@ select 1e35BD / 0.1
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,1000000000000000000000000000000000000,37,0}) cannot be represented as Decimal(38, 6).
+Decimal(expanded, 1000000000000000000000000000000000000, 37, 0) cannot be represented as Decimal(38, 6).
 -- !query
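A hedged repro sketch of the corrected message (assumes an existing SparkSession `spark`; ANSI mode matches the `ansi/` golden file above, where the overflow throws instead of returning null):

```python
# Minimal sketch: trigger a decimal overflow and inspect the error text.
spark.conf.set("spark.sql.ansi.enabled", True)
try:
    spark.sql("SELECT (5e36BD + 0.1) + 5e36BD").collect()
except Exception as e:
    # Expect: "Decimal(expanded, 10.1, 39, 1) cannot be represented as
    # Decimal(38, 1)." -- with no stray "}" before the closing parenthesis.
    print(e)
```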
[spark] branch branch-3.3 updated: [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 0baa5c7d2b7 [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
0baa5c7d2b7 is described below

commit 0baa5c7d2b71f379fad6a8a0b72f427acf70f4e4
Author: Wenchen Fan
AuthorDate: Wed May 11 21:58:14 2022 -0700

    [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject

    This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958

    In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be of any shape.

    Fixes perf regression

    No

    new tests

    Closes #36510 from cloud-fan/bug.

    Lead-authored-by: Wenchen Fan
    Co-authored-by: Wenchen Fan
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d)
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala      | 14 --
 .../sql/catalyst/optimizer/CollapseProjectSuite.scala | 18 +++---
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 753d81e4003..759a7044f15 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -997,12 +997,14 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
     }
   }

-  @scala.annotation.tailrec
-  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = expr match {
-    case a: Alias => isExtractOnly(a.child, ref)
-    case e: ExtractValue => isExtractOnly(e.children.head, ref)
-    case a: Attribute => a.semanticEquals(ref)
-    case _ => false
+  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = {
+    def hasRefInNonExtractValue(e: Expression): Boolean = e match {
+      case a: Attribute => a.semanticEquals(ref)
+      // The first child of `ExtractValue` is the complex type to be extracted.
+      case e: ExtractValue if e.children.head.semanticEquals(ref) => false
+      case _ => e.children.exists(hasRefInNonExtractValue)
+    }
+    !hasRefInNonExtractValue(expr)
   }

   /**
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
index c1d13d14b05..f6c3209726b 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
@@ -30,7 +30,8 @@ class CollapseProjectSuite extends PlanTest {
   object Optimize extends RuleExecutor[LogicalPlan] {
     val batches =
       Batch("Subqueries", FixedPoint(10), EliminateSubqueryAliases) ::
-      Batch("CollapseProject", Once, CollapseProject) :: Nil
+      Batch("CollapseProject", Once, CollapseProject) ::
+      Batch("SimplifyExtractValueOps", Once, SimplifyExtractValueOps) :: Nil
   }

   val testRelation = LocalRelation('a.int, 'b.int)
@@ -123,12 +124,23 @@ class CollapseProjectSuite extends PlanTest {

   test("SPARK-36718: do not collapse project if non-cheap expressions will be repeated") {
     val query = testRelation
-      .select(('a + 1).as('a_plus_1))
-      .select(('a_plus_1 + 'a_plus_1).as('a_2_plus_2))
+      .select(($"a" + 1).as("a_plus_1"))
+      .select(($"a_plus_1" + $"a_plus_1").as("a_2_plus_2"))
       .analyze
     val optimized = Optimize.execute(query)
     comparePlans(optimized, query)
+
+    // CreateStruct is an exception if it's only referenced by ExtractValue.
+    val query2 = testRelation
+      .select(namedStruct("a", $"a", "a_plus_1", $"a" + 1).as("struct"))
+      .select(($"struct".getField("a") + $"struct".getField("a_plus_1")).as("add"))
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val expected2 = testRelation
+      .select(($"a" + ($"a" + 1)).as("add"))
+      .analyze
+    comparePlans(optimized2, expected2)
   }

   test("preserve top-level alias metadata while collapsing projects") {
[spark] branch master updated: [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 547f032d04b [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
547f032d04b is described below

commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d
Author: Wenchen Fan
AuthorDate: Wed May 11 21:58:14 2022 -0700

    [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject

    ### What changes were proposed in this pull request?
    This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958

    In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be of any shape.

    ### Why are the changes needed?
    Fixes perf regression

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    new tests

    Closes #36510 from cloud-fan/bug.

    Lead-authored-by: Wenchen Fan
    Co-authored-by: Wenchen Fan
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala      | 14 --
 .../sql/catalyst/optimizer/CollapseProjectSuite.scala | 18 +++---
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 3a36e506f4e..9215609f154 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1014,12 +1014,14 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
     }
   }

-  @scala.annotation.tailrec
-  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = expr match {
-    case a: Alias => isExtractOnly(a.child, ref)
-    case e: ExtractValue => isExtractOnly(e.children.head, ref)
-    case a: Attribute => a.semanticEquals(ref)
-    case _ => false
+  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = {
+    def hasRefInNonExtractValue(e: Expression): Boolean = e match {
+      case a: Attribute => a.semanticEquals(ref)
+      // The first child of `ExtractValue` is the complex type to be extracted.
+      case e: ExtractValue if e.children.head.semanticEquals(ref) => false
+      case _ => e.children.exists(hasRefInNonExtractValue)
+    }
+    !hasRefInNonExtractValue(expr)
   }

   /**
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
index 93646b6f1bc..dd075837d51 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
@@ -30,7 +30,8 @@ class CollapseProjectSuite extends PlanTest {
   object Optimize extends RuleExecutor[LogicalPlan] {
     val batches =
       Batch("Subqueries", FixedPoint(10), EliminateSubqueryAliases) ::
-      Batch("CollapseProject", Once, CollapseProject) :: Nil
+      Batch("CollapseProject", Once, CollapseProject) ::
+      Batch("SimplifyExtractValueOps", Once, SimplifyExtractValueOps) :: Nil
   }

   val testRelation = LocalRelation($"a".int, $"b".int)
@@ -125,12 +126,23 @@ class CollapseProjectSuite extends PlanTest {

   test("SPARK-36718: do not collapse project if non-cheap expressions will be repeated") {
     val query = testRelation
-      .select(($"a" + 1).as(Symbol("a_plus_1")))
-      .select(($"a_plus_1" + $"a_plus_1").as(Symbol("a_2_plus_2")))
+      .select(($"a" + 1).as("a_plus_1"))
+      .select(($"a_plus_1" + $"a_plus_1").as("a_2_plus_2"))
       .analyze
     val optimized = Optimize.execute(query)
     comparePlans(optimized, query)
+
+    // CreateStruct is an exception if it's only referenced by ExtractValue.
+    val query2 = testRelation
+      .select(namedStruct("a", $"a", "a_plus_1", $"a" + 1).as("struct"))
+      .select(($"struct".getField("a") + $"struct".getField("a_plus_1")).as("add"))
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val expected2 = testRelation
+      .select(($"a" + ($"a" + 1)).as("add"))
+      .analyze
+    comparePlans(optimized2, expected2)
   }

   test("preserve top-level alias metadata while collapsing projects") {
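A hedged sketch of the behavior this fix restores, from the user side (assumes an existing SparkSession `spark`; the column names are illustrative): the struct column below is consumed only through field extraction, so `CollapseProject` should merge the two selects into a single Project instead of leaving the struct to be built and re-read.

```python
from pyspark.sql import functions as F

df = spark.range(5).select(
    F.struct(F.col("id").alias("a"), (F.col("id") + 1).alias("a_plus_1")).alias("s")
)
out = df.select((F.col("s.a") + F.col("s.a_plus_1")).alias("add"))
# The optimized plan should show one Project over Range, mirroring the
# `query2`/`expected2` pair in the test above.
out.explain(True)
```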
[spark] branch branch-3.3 updated: [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 70becf29070 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
70becf29070 is described below

commit 70becf290700e88c7be248e4277421dd17f3af4b
Author: Xinrong Meng
AuthorDate: Thu May 12 12:22:11 2022 +0900

    [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion

    ### What changes were proposed in this pull request?
    Access the JVM through the passed-in GatewayClient during type conversion.

    ### Why are the changes needed?
    In customized type converters, we may utilize the passed-in GatewayClient to access the JVM, rather than relying on `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access the JVM.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing tests.

    Closes #36504 from xinrong-databricks/gateway_client_jvm.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 92fcf214c107358c1a70566b644cec2d35c096c0)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/types.py | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 2a41508d634..123fd628980 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -44,7 +44,7 @@ from typing import (
 )

 from py4j.protocol import register_input_converter
-from py4j.java_gateway import JavaClass, JavaGateway, JavaObject
+from py4j.java_gateway import GatewayClient, JavaClass, JavaObject

 from pyspark.serializers import CloudPickleSerializer

@@ -1929,7 +1929,7 @@ class DateConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.date)

-    def convert(self, obj: datetime.date, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.date, gateway_client: GatewayClient) -> JavaObject:
         Date = JavaClass("java.sql.Date", gateway_client)
         return Date.valueOf(obj.strftime("%Y-%m-%d"))

@@ -1938,7 +1938,7 @@ class DatetimeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.datetime)

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         Timestamp = JavaClass("java.sql.Timestamp", gateway_client)
         seconds = (
             calendar.timegm(obj.utctimetuple()) if obj.tzinfo else time.mktime(obj.timetuple())
@@ -1958,27 +1958,25 @@ class DatetimeNTZConverter:
             and is_timestamp_ntz_preferred()
         )

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         seconds = calendar.timegm(obj.utctimetuple())
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.DateTimeUtils.microsToLocalDateTime(
-            int(seconds) * 1000000 + obj.microsecond
+        DateTimeUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.DateTimeUtils",
+            gateway_client,
         )
+        return DateTimeUtils.microsToLocalDateTime(int(seconds) * 1000000 + obj.microsecond)


 class DayTimeIntervalTypeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.timedelta)

-    def convert(self, obj: datetime.timedelta, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.IntervalUtils.microsToDuration(
+    def convert(self, obj: datetime.timedelta, gateway_client: GatewayClient) -> JavaObject:
+        IntervalUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.IntervalUtils",
+            gateway_client,
+        )
+        return IntervalUtils.microsToDuration(
             (math.floor(obj.total_seconds()) * 1000000) + obj.microseconds
         )
[spark] branch master updated: [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 92fcf214c10 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
92fcf214c10 is described below

commit 92fcf214c107358c1a70566b644cec2d35c096c0
Author: Xinrong Meng
AuthorDate: Thu May 12 12:22:11 2022 +0900

    [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion

    ### What changes were proposed in this pull request?
    Access the JVM through the passed-in GatewayClient during type conversion.

    ### Why are the changes needed?
    In customized type converters, we may utilize the passed-in GatewayClient to access the JVM, rather than relying on `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access the JVM.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing tests.

    Closes #36504 from xinrong-databricks/gateway_client_jvm.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/types.py | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 2a41508d634..123fd628980 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -44,7 +44,7 @@ from typing import (
 )

 from py4j.protocol import register_input_converter
-from py4j.java_gateway import JavaClass, JavaGateway, JavaObject
+from py4j.java_gateway import GatewayClient, JavaClass, JavaObject

 from pyspark.serializers import CloudPickleSerializer

@@ -1929,7 +1929,7 @@ class DateConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.date)

-    def convert(self, obj: datetime.date, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.date, gateway_client: GatewayClient) -> JavaObject:
         Date = JavaClass("java.sql.Date", gateway_client)
         return Date.valueOf(obj.strftime("%Y-%m-%d"))

@@ -1938,7 +1938,7 @@ class DatetimeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.datetime)

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         Timestamp = JavaClass("java.sql.Timestamp", gateway_client)
         seconds = (
             calendar.timegm(obj.utctimetuple()) if obj.tzinfo else time.mktime(obj.timetuple())
@@ -1958,27 +1958,25 @@ class DatetimeNTZConverter:
             and is_timestamp_ntz_preferred()
         )

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         seconds = calendar.timegm(obj.utctimetuple())
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.DateTimeUtils.microsToLocalDateTime(
-            int(seconds) * 1000000 + obj.microsecond
+        DateTimeUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.DateTimeUtils",
+            gateway_client,
        )
+        return DateTimeUtils.microsToLocalDateTime(int(seconds) * 1000000 + obj.microsecond)


 class DayTimeIntervalTypeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.timedelta)

-    def convert(self, obj: datetime.timedelta, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.IntervalUtils.microsToDuration(
+    def convert(self, obj: datetime.timedelta, gateway_client: GatewayClient) -> JavaObject:
+        IntervalUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.IntervalUtils",
+            gateway_client,
+        )
+        return IntervalUtils.microsToDuration(
             (math.floor(obj.total_seconds()) * 1000000) + obj.microseconds
         )
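A minimal custom-converter sketch in the same style (hypothetical converter, not part of this commit): it resolves JVM classes through the `gateway_client` that Py4J passes in, rather than through `SparkContext._jvm`, exactly as the converters above now do.

```python
import decimal

from py4j.java_gateway import JavaClass
from py4j.protocol import register_input_converter


class PyDecimalConverter:
    def can_convert(self, obj):
        return isinstance(obj, decimal.Decimal)

    def convert(self, obj, gateway_client):
        # JavaClass is callable: invoking it calls the Java constructor.
        BigDecimal = JavaClass("java.math.BigDecimal", gateway_client)
        return BigDecimal(str(obj))


# Registered converters are consulted when Python arguments cross into the JVM.
register_input_converter(PyDecimalConverter())
```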
[spark] branch branch-3.3 updated: [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new ebe4252e415 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
ebe4252e415 is described below

commit ebe4252e415e6afdf888e21d0b89ab744fd2dac7
Author: Wenchen Fan
AuthorDate: Thu May 12 11:18:18 2022 +0800

    [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode

    ### What changes were proposed in this pull request?
    This is a bug in the command legacy mode, as it does not fully restore the legacy behavior: the legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No change by default; if people turn on legacy mode, the SHOW DATABASES command won't quote the database names.

    ### How was this patch tested?
    New tests.

    Closes #36508 from cloud-fan/regression.

    Authored-by: Wenchen Fan
    Signed-off-by: Wenchen Fan
    (cherry picked from commit 3094e495095635f6c9e83f4646d3321c2a9311f4)
    Signed-off-by: Wenchen Fan
---
 .../execution/datasources/v2/ShowNamespacesExec.scala    | 11 ++-
 .../sql/execution/command/ShowNamespacesSuiteBase.scala  | 17 +
 .../sql/execution/command/v1/ShowNamespacesSuite.scala   | 13 +
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
index 9dafbd79a52..c55c7b9f985 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
@@ -42,8 +42,17 @@ case class ShowNamespacesExec(
       catalog.listNamespaces()
     }

+    // Please refer to the rule `KeepLegacyOutputs` for details about legacy command.
+    // The legacy SHOW DATABASES command does not quote the database names.
+    val isLegacy = output.head.name == "databaseName"
+    val namespaceNames = if (isLegacy && namespaces.forall(_.length == 1)) {
+      namespaces.map(_.head)
+    } else {
+      namespaces.map(_.quoted)
+    }
+
     val rows = new ArrayBuffer[InternalRow]()
-    namespaces.map(_.quoted).map { ns =>
+    namespaceNames.map { ns =>
       if (pattern.map(StringUtils.filterPattern(Seq(ns), _).nonEmpty).getOrElse(true)) {
         rows += toCatalystRow(ns)
       }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
index b3693845c3b..80e545f6e3c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
@@ -42,6 +42,9 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
   protected def builtinTopNamespaces: Seq[String] = Seq.empty

   protected def isCasePreserving: Boolean = true
+  protected def createNamespaceWithSpecialName(ns: String): Unit = {
+    sql(s"CREATE NAMESPACE $catalog.`$ns`")
+  }

   test("default namespace") {
     withSQLConf(SQLConf.DEFAULT_CATALOG.key -> catalog) {
@@ -124,6 +127,20 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
     }
   }

+  test("SPARK-39149: keep the legacy no-quote behavior") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        withNamespace(s"$catalog.`123`") {
+          createNamespaceWithSpecialName("123")
+          val res = if (legacy) "123" else "`123`"
+          checkAnswer(
+            sql(s"SHOW NAMESPACES IN $catalog"),
+            (res +: builtinTopNamespaces).map(Row(_)))
+        }
+      }
+    }
+  }
+
   test("case sensitivity of the pattern string") {
     Seq(true, false).foreach { caseSensitive =>
       withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
index a1b32e42ae2..b65a9acb656 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
@@ -18,7 +18,9 @@ package
[spark] branch master updated: [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3094e495095 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
3094e495095 is described below

commit 3094e495095635f6c9e83f4646d3321c2a9311f4
Author: Wenchen Fan
AuthorDate: Thu May 12 11:18:18 2022 +0800

    [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode

    ### What changes were proposed in this pull request?
    This is a bug in the command legacy mode, as it does not fully restore the legacy behavior: the legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No change by default; if people turn on legacy mode, the SHOW DATABASES command won't quote the database names.

    ### How was this patch tested?
    New tests.

    Closes #36508 from cloud-fan/regression.

    Authored-by: Wenchen Fan
    Signed-off-by: Wenchen Fan
---
 .../execution/datasources/v2/ShowNamespacesExec.scala    | 11 ++-
 .../sql/execution/command/ShowNamespacesSuiteBase.scala  | 17 +
 .../sql/execution/command/v1/ShowNamespacesSuite.scala   | 13 +
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
index 9dafbd79a52..c55c7b9f985 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
@@ -42,8 +42,17 @@ case class ShowNamespacesExec(
       catalog.listNamespaces()
     }

+    // Please refer to the rule `KeepLegacyOutputs` for details about legacy command.
+    // The legacy SHOW DATABASES command does not quote the database names.
+    val isLegacy = output.head.name == "databaseName"
+    val namespaceNames = if (isLegacy && namespaces.forall(_.length == 1)) {
+      namespaces.map(_.head)
+    } else {
+      namespaces.map(_.quoted)
+    }
+
     val rows = new ArrayBuffer[InternalRow]()
-    namespaces.map(_.quoted).map { ns =>
+    namespaceNames.map { ns =>
       if (pattern.map(StringUtils.filterPattern(Seq(ns), _).nonEmpty).getOrElse(true)) {
         rows += toCatalystRow(ns)
       }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
index b3693845c3b..80e545f6e3c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
@@ -42,6 +42,9 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
   protected def builtinTopNamespaces: Seq[String] = Seq.empty

   protected def isCasePreserving: Boolean = true
+  protected def createNamespaceWithSpecialName(ns: String): Unit = {
+    sql(s"CREATE NAMESPACE $catalog.`$ns`")
+  }

   test("default namespace") {
     withSQLConf(SQLConf.DEFAULT_CATALOG.key -> catalog) {
@@ -124,6 +127,20 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
     }
   }

+  test("SPARK-39149: keep the legacy no-quote behavior") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        withNamespace(s"$catalog.`123`") {
+          createNamespaceWithSpecialName("123")
+          val res = if (legacy) "123" else "`123`"
+          checkAnswer(
+            sql(s"SHOW NAMESPACES IN $catalog"),
+            (res +: builtinTopNamespaces).map(Row(_)))
+        }
+      }
+    }
+  }
+
   test("case sensitivity of the pattern string") {
     Seq(true, false).foreach { caseSensitive =>
       withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
index a1b32e42ae2..b65a9acb656 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
@@ -18,7 +18,9 @@ package org.apache.spark.sql.execution.command.v1

 import org.apache.spark.sql.AnalysisException
+import
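A hedged repro sketch of the user-visible difference (assumes an existing SparkSession `spark`; the config key for `LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA` is assumed to be `spark.sql.legacy.keepCommandOutputSchema`, and the database name `123` mirrors the new test above):

```python
spark.sql("CREATE DATABASE `123`")

# Legacy mode: the name is reported without backquotes.
spark.conf.set("spark.sql.legacy.keepCommandOutputSchema", True)
spark.sql("SHOW DATABASES").show()   # expect a row "123"

# Default mode: the name is quoted.
spark.conf.set("spark.sql.legacy.keepCommandOutputSchema", False)
spark.sql("SHOW DATABASES").show()   # expect a row "`123`"

spark.sql("DROP DATABASE `123`")
```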
[spark] branch master updated: [SPARK-39081][PYTHON][SQL] Implement DataFrame.resample and Series.resample
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6ddb5ae5766 [SPARK-39081][PYTHON][SQL] Implement DataFrame.resample and Series.resample
6ddb5ae5766 is described below

commit 6ddb5ae57665e10596182a0ed1d7c683be36078e
Author: Ruifeng Zheng
AuthorDate: Thu May 12 09:16:43 2022 +0900

    [SPARK-39081][PYTHON][SQL] Implement DataFrame.resample and Series.resample

    ### What changes were proposed in this pull request?
    Implement DataFrame.resample and Series.resample

    ### Why are the changes needed?
    To increase pandas API coverage in PySpark

    ### Does this PR introduce _any_ user-facing change?
    Yes, new methods added. For example:

    ```
    In [3]:
       ...: dates = [
       ...:     datetime.datetime(2011, 12, 31),
       ...:     datetime.datetime(2012, 1, 2),
       ...:     pd.NaT,
       ...:     datetime.datetime(2013, 5, 3),
       ...:     datetime.datetime(2022, 5, 3),
       ...: ]
       ...: pdf = pd.DataFrame(np.ones(len(dates)), index=pd.DatetimeIndex(dates), columns=['A'])
       ...: psdf = ps.from_pandas(pdf)
       ...: psdf.resample('3Y').sum().sort_index()
       ...:
    Out[3]:
                  A
    2011-12-31  1.0
    2014-12-31  2.0
    2017-12-31  0.0
    2020-12-31  0.0
    2023-12-31  1.0
    ```

    ### How was this patch tested?
    added UT

    Closes #36420 from zhengruifeng/impl_resample.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Hyukjin Kwon
---
 dev/sparktestsupport/modules.py                    |   1 +
 .../docs/source/reference/pyspark.pandas/frame.rst |   1 +
 .../source/reference/pyspark.pandas/series.rst     |   1 +
 .../pandas_on_spark/supported_pandas_api.rst       |   8 +-
 python/pyspark/pandas/frame.py                     |  70 +++
 python/pyspark/pandas/missing/frame.py             |   1 -
 python/pyspark/pandas/missing/resample.py          | 105 +
 python/pyspark/pandas/missing/series.py            |   1 -
 python/pyspark/pandas/resample.py                  | 500 +
 python/pyspark/pandas/series.py                    | 135 ++
 python/pyspark/pandas/tests/test_resample.py       | 281
 .../spark/sql/api/python/PythonSQLUtils.scala      |  21 +
 12 files changed, 1121 insertions(+), 4 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index ed1eeb9b807..fc9e2ced9a9 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -643,6 +643,7 @@ pyspark_pandas = Module(
         "pyspark.pandas.tests.test_ops_on_diff_frames_groupby_expanding",
         "pyspark.pandas.tests.test_ops_on_diff_frames_groupby_rolling",
         "pyspark.pandas.tests.test_repr",
+        "pyspark.pandas.tests.test_resample",
         "pyspark.pandas.tests.test_reshape",
         "pyspark.pandas.tests.test_rolling",
         "pyspark.pandas.tests.test_series_conversion",
diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst
index 75a8941ad78..bcf9694a7ae 100644
--- a/python/docs/source/reference/pyspark.pandas/frame.rst
+++ b/python/docs/source/reference/pyspark.pandas/frame.rst
@@ -260,6 +260,7 @@ Time series-related
 .. autosummary::
    :toctree: api/

+   DataFrame.resample
    DataFrame.shift
    DataFrame.first_valid_index
    DataFrame.last_valid_index
diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index 48f67192349..1cf63c1a8ae 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -257,6 +257,7 @@ Time series-related
    :toctree: api/

    Series.asof
+   Series.resample
    Series.shift
    Series.first_valid_index
    Series.last_valid_index
diff --git a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
index f63b4a2f05d..8b456445a1e 100644
--- a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
@@ -362,7 +362,9 @@ Supported DataFrame APIs
 +--------------------+-------------+--------------------------------------+
 | :func:`replace`    | P           | ``regex``, ``method``                |
 +--------------------+-------------+--------------------------------------+
-| resample           | N           |                                      |
+| :func:`resample`   | P           | ``axis``, ``convention``, ``kind``   |
+|
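A companion sketch for the new `Series.resample` (hedged: it assumes a build that includes this commit and that the `mean` downsampling function and a `2D` rule are supported, as in pandas; the data is illustrative):

```python
import numpy as np
import pandas as pd
import pyspark.pandas as ps

# Six daily values 0.0 .. 5.0, downsampled to 2-day bins by mean.
pser = pd.Series(np.arange(6.0), index=pd.date_range("2022-01-01", periods=6, freq="D"))
psser = ps.from_pandas(pser)
print(psser.resample("2D").mean().sort_index())
# 2022-01-01    0.5
# 2022-01-03    2.5
# 2022-01-05    4.5
```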
[spark] branch branch-3.2 updated: [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new d1cae9c5ac5 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
d1cae9c5ac5 is described below

commit d1cae9c5ac5393243d2f9661dc7957d0ebccb1d6
Author: Xinrong Meng
AuthorDate: Thu May 12 09:13:27 2022 +0900

    [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index fd2e975aa9d..a82962a4373 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence**: It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch branch-3.3 updated: [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 65dd72743e4 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
65dd72743e4 is described below

commit 65dd72743e40a6276b69938f04fc69655bd8270e
Author: Xinrong Meng
AuthorDate: Thu May 12 09:13:27 2022 +0900

    [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index c0d9b18c085..67b8f6841f5 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch master updated: [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f37150a5549 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
f37150a5549 is described below

commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac
Author: Xinrong Meng
AuthorDate: Thu May 12 09:13:27 2022 +0900

    [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index c0d9b18c085..67b8f6841f5 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch branch-3.2 updated: Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 345bf8c80ab Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
345bf8c80ab is described below

commit 345bf8c80abaf194d3cc4c25491faa23ac8de16c
Author: Hyukjin Kwon
AuthorDate: Thu May 12 09:12:39 2022 +0900

    Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"

    This reverts commit 0a9a21df7dea62c5f129d91c049d0d3408d0366e.
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index a82962a4373..fd2e975aa9d 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,7 +186,9 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence**: It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used. See the example below:
+index has to be used.
+Note that if more data are added to the data source after creating this index,
+then it does not guarantee the sequential index. See the example below:

 .. code-block:: python
[spark] branch branch-3.3 updated: Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new e4bb341d376 Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
e4bb341d376 is described below

commit e4bb341d37661e93097e56e0087699bca60825fb
Author: Hyukjin Kwon
AuthorDate: Thu May 12 09:12:13 2022 +0900

    Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"

    This reverts commit f75c00da3cf01e63d93cedbe480198413af41455.
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index 67b8f6841f5..c0d9b18c085 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,7 +186,9 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used. See the example below:
+index has to be used.
+Note that if more data are added to the data source after creating this index,
+then it does not guarantee the sequential index. See the example below:

 .. code-block:: python
[spark] branch master updated: Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 94cdbba699f Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
94cdbba699f is described below

commit 94cdbba699f2f92dc93e8229d8e87565737fd20a
Author: Hyukjin Kwon
AuthorDate: Thu May 12 09:11:55 2022 +0900

    Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"

    This reverts commit cec1e7b4e68deac321f409d424a3acdcd4cb91be.
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index 67b8f6841f5..c0d9b18c085 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,7 +186,9 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used. See the example below:
+index has to be used.
+Note that if more data are added to the data source after creating this index,
+then it does not guarantee the sequential index. See the example below:

 .. code-block:: python
[spark] branch branch-3.2 updated: [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 0a9a21df7de [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
0a9a21df7de is described below

commit 0a9a21df7dea62c5f129d91c049d0d3408d0366e
Author: Xinrong Meng
AuthorDate: Thu May 12 09:07:11 2022 +0900

    [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index fd2e975aa9d..a82962a4373 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence**: It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch branch-3.3 updated: [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new f75c00da3cf [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
f75c00da3cf is described below

commit f75c00da3cf01e63d93cedbe480198413af41455
Author: Xinrong Meng
AuthorDate: Thu May 12 09:07:11 2022 +0900

    [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index

    Remove outdated statements on the distributed-sequence default index.

    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    No. Doc change only.

    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index c0d9b18c085..67b8f6841f5 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch master updated: [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cec1e7b4e68 [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index cec1e7b4e68 is described below commit cec1e7b4e68deac321f409d424a3acdcd4cb91be Author: Xinrong Meng AuthorDate: Thu May 12 09:07:11 2022 +0900 [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on distributed-sequence default index. ### Why are the changes needed? Since the distributed-sequence default index is now enforced only during execution, there are stale statements in the documentation to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng Signed-off-by: Hyukjin Kwon --- python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst index c0d9b18c085..67b8f6841f5 100644 --- a/python/docs/source/user_guide/pandas_on_spark/options.rst +++ b/python/docs/source/user_guide/pandas_on_spark/options.rst @@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below: **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and group-map approach in a distributed manner. It still generates the sequential index globally. If the default index must be the sequence in a large dataset, this -index has to be used. -Note that if more data are added to the data source after creating this index, -then it does not guarantee the sequential index. See the example below: +index has to be used. See the example below: .. code-block:: python
[spark] branch master updated: [SPARK-39147][SQL] Code simplification, use count() instead of filter().size, etc
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 09564df8485 [SPARK-39147][SQL] Code simplification, use count() instead of filter().size, etc 09564df8485 is described below commit 09564df8485d4ba27ba6d77b18a4635038ab2a1e Author: morvenhuang AuthorDate: Wed May 11 18:27:29 2022 -0500 [SPARK-39147][SQL] Code simplification, use count() instead of filter().size, etc ### What changes were proposed in this pull request? Use count() instead of filter().size, use df.count() instead of df.collect().size. ### Why are the changes needed? Code simplification. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #36507 from morvenhuang/SPARK-39147. Authored-by: morvenhuang Signed-off-by: Sean Owen --- core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala | 2 +- .../org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala | 4 ++-- .../scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala index fe76b1bc322..cf2240a0511 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala @@ -263,7 +263,7 @@ class MapStatusSuite extends SparkFunSuite { val allBlocks = emptyBlocks ++: nonEmptyBlocks val skewThreshold = Utils.median(allBlocks, false) * accurateBlockSkewedFactor -assert(nonEmptyBlocks.filter(_ > skewThreshold).size == +assert(nonEmptyBlocks.count(_ > skewThreshold) == untrackedSkewedBlocksLength + trackedSkewedBlocksLength, "number of skewed block sizes") diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala index 3c5ab55a8a7..737d30a41d3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala @@ -132,8 +132,8 @@ object StreamingJoinHelper extends PredicateHelper with Logging { leftExpr.collect { case a: AttributeReference => a } ++ rightExpr.collect { case a: AttributeReference => a } ) -if (attributesInCondition.filter { attributesToFindStateWatermarkFor.contains(_) }.size > 1 || -attributesInCondition.filter { attributesWithEventWatermark.contains(_) }.size > 1) { +if (attributesInCondition.count(attributesToFindStateWatermarkFor.contains) > 1 || +attributesInCondition.count(attributesWithEventWatermark.contains) > 1) { // If more than attributes present in condition from one side, then it cannot be solved return None } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala index 8971f0c70af..d8081f4525a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala @@ -622,7 +622,7 @@ object PushFoldableIntoBranches extends Rule[LogicalPlan] with PredicateHelper { // To be 
conservative here: it's only a guaranteed win if all but at most only one branch // end up being not foldable. private def atMostOneUnfoldable(exprs: Seq[Expression]): Boolean = { -exprs.filterNot(_.foldable).size < 2 +exprs.count(!_.foldable) < 2 } // Not all UnaryExpression can be pushed into (if / case) branches, e.g. Alias.
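The refactoring above is mechanical; here is a minimal, self-contained Scala sketch of the pattern (the collection contents and threshold are made up for illustration and do not come from the commit):

```scala
// Seq.count(p) traverses once and allocates no intermediate collection,
// unlike filter(p).size, which first materializes the filtered Seq.
object CountVsFilterSize {
  def main(args: Array[String]): Unit = {
    val blockSizes = Seq(0L, 512L, 2048L, 0L, 4096L)
    val threshold = 1024L

    // Before: builds a temporary Seq just to take its size.
    val skewedBefore = blockSizes.filter(_ > threshold).size

    // After: a single pass with no intermediate allocation.
    val skewedAfter = blockSizes.count(_ > threshold)

    assert(skewedBefore == skewedAfter)
    println(s"skewed blocks: $skewedAfter") // skewed blocks: 2
  }
}
```

The two forms are always equivalent in result; the `count` form simply avoids the throwaway collection, which is why the change is safe to apply across the codebase.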
[spark] branch master updated: [SPARK-39079][SQL] Forbid dot in V2 catalog name
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c770daa9d46 [SPARK-39079][SQL] Forbid dot in V2 catalog name c770daa9d46 is described below commit c770daa9d468cf2c3cda117904478996cd88ebb9 Author: Cheng Pan AuthorDate: Wed May 11 23:42:39 2022 +0800 [SPARK-39079][SQL] Forbid dot in V2 catalog name ### What changes were proposed in this pull request? Forbid dot in V2 catalog name. ### Why are the changes needed? In the following configuration, we define 2 catalogs `test`, `test.cat`. ``` spark.sql.catalog.test=CatalogTestImpl spark.sql.catalog.test.k1=v1 spark.sql.catalog.test.cat=CatalogTestImpl spark.sql.catalog.test.cat.k1=v1 ``` In the current implementation, three keys will be treated as `test`'s configuration, which is a little bit weird. ``` k1=v1 cat=CatalogTestImpl cat.k1=v1 ``` Besides, the current implementation of `SHOW CATALOGS` uses a lazy-load strategy, which is not friendly for GUI clients like DBeaver that display the available catalog list; without this restriction, it is hard to find the catalog keys in the configuration to load catalogs eagerly. ### Does this PR introduce _any_ user-facing change? Yes. Previously Spark allowed the catalog name to contain `.`, but it does not after this change. ### How was this patch tested? New UT. Closes #36418 from pan3793/catalog. Authored-by: Cheng Pan Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/connector/catalog/Catalogs.scala | 11 --- .../org/apache/spark/sql/errors/QueryExecutionErrors.scala| 4 .../spark/sql/connector/catalog/CatalogLoadingSuite.java | 11 +++ 3 files changed, 23 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala index 9949f45d483..16983d239d1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala @@ -19,7 +19,6 @@ package org.apache.spark.sql.connector.catalog import java.lang.reflect.InvocationTargetException import java.util -import java.util.NoSuchElementException import java.util.regex.Pattern import org.apache.spark.SparkException @@ -39,13 +38,19 @@ private[sql] object Catalogs { * @param conf a SQLConf * @return an initialized CatalogPlugin * @throws CatalogNotFoundException if the plugin class cannot be found - * @throws org.apache.spark.SparkException if the plugin class cannot be instantiated + * @throws org.apache.spark.SparkException if the plugin class cannot be instantiated */ @throws[CatalogNotFoundException] @throws[SparkException] def load(name: String, conf: SQLConf): CatalogPlugin = { val pluginClassName = try { - conf.getConfString("spark.sql.catalog." 
+ name) + val _pluginClassName = conf.getConfString(s"spark.sql.catalog.$name") + // SPARK-39079 do configuration check first, otherwise some path-based table like + // `org.apache.spark.sql.json`.`/path/json_file` may fail on analyze phase + if (name.contains(".")) { +throw QueryExecutionErrors.invalidCatalogNameError(name) + } + _pluginClassName } catch { case _: NoSuchElementException => throw QueryExecutionErrors.catalogPluginClassNotFoundError(name) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index 53d32927cee..f9d3854fc5e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -1532,6 +1532,10 @@ object QueryExecutionErrors extends QueryErrorsBase { new UnsupportedOperationException(s"Invalid output mode: $outputMode") } + def invalidCatalogNameError(name: String): Throwable = { +new SparkException(s"Invalid catalog name: $name") + } + def catalogPluginClassNotFoundError(name: String): Throwable = { new CatalogNotFoundException( s"Catalog '$name' plugin class not found: spark.sql.catalog.$name is not defined") diff --git a/sql/catalyst/src/test/java/org/apache/spark/sql/connector/catalog/CatalogLoadingSuite.java b/sql/catalyst/src/test/java/org/apache/spark/sql/connector/catalog/CatalogLoadingSuite.java index 369e2fcaf1a..88ec8b6fc07 100644 --- a/sql/catalyst/src/test/java/org/apache/spark/sql/connector/catalog/CatalogLoadingSuite.java +++
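To see why the restriction is needed, here is a hypothetical, self-contained sketch of how every key under a catalog's configuration prefix is attributed to that catalog, as the commit message describes (the class name `com.example.CatalogTestImpl` and the option values are illustrative, not from the commit):

```scala
// With dotted catalog names, "spark.sql.catalog.test.cat" is ambiguous: it can
// mean catalog "test.cat" or option "cat" of catalog "test".
object CatalogKeyAmbiguity {
  private val Prefix = "spark.sql.catalog."

  def main(args: Array[String]): Unit = {
    // Illustrative configuration that tries to define catalogs `test` and `test.cat`.
    val conf = Map(
      "spark.sql.catalog.test" -> "com.example.CatalogTestImpl",
      "spark.sql.catalog.test.k1" -> "v1",
      "spark.sql.catalog.test.cat" -> "com.example.CatalogTestImpl",
      "spark.sql.catalog.test.cat.k1" -> "v1")

    // Everything under "spark.sql.catalog.test." is handed to catalog `test`,
    // including the keys that were meant to define the separate catalog `test.cat`.
    val testOptions = conf.collect {
      case (k, v) if k.startsWith(Prefix + "test.") =>
        k.stripPrefix(Prefix + "test.") -> v
    }
    println(testOptions)
    // Map(k1 -> v1, cat -> com.example.CatalogTestImpl, cat.k1 -> v1)
  }
}
```

Forbidding dots in catalog names removes the ambiguity and makes it possible to enumerate catalogs eagerly from the configuration keys.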
[spark] branch branch-3.3 updated: [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 8608baad7ab [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property 8608baad7ab is described below commit 8608baad7ab31eef0903b9229789e8112c9c1234 Author: Wenchen Fan AuthorDate: Wed May 11 14:49:54 2022 +0800 [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the tab [...] This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f) Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/connector/catalog/TableCatalog.java | 6 ++ .../org/apache/spark/sql/catalyst/parser/AstBuilder.scala| 12 ++-- .../apache/spark/sql/connector/catalog/CatalogV2Util.scala | 3 ++- .../org/apache/spark/sql/connector/catalog/V1Table.scala | 6 +++--- .../sql/execution/datasources/v2/ShowCreateTableExec.scala | 10 +++--- 5 files changed, 28 insertions(+), 9 deletions(-) diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java index 9336c2a1cae..ec773ab90ad 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java @@ -47,6 +47,12 @@ public interface TableCatalog extends CatalogPlugin { */ String PROP_LOCATION = "location"; + /** + * A reserved property to indicate that the table location is managed, not user-specified. + * If this property is "true", SHOW CREATE TABLE will not generate the LOCATION clause. + */ + String PROP_IS_MANAGED_LOCATION = "is_managed_location"; + /** * A reserved property to specify a table was created with EXTERNAL. 
*/ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 19544346447..ecc5360a4f7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -41,7 +41,7 @@ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.trees.CurrentOrigin import org.apache.spark.sql.catalyst.util.{CharVarcharUtils, DateTimeUtils, IntervalUtils} import org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, convertSpecialTimestamp, convertSpecialTimestampNTZ, getZoneId, stringToDate, stringToTimestamp, stringToTimestampWithoutTimeZone} -import org.apache.spark.sql.connector.catalog.{SupportsNamespaces, TableCatalog} +import org.apache.spark.sql.connector.catalog.{CatalogV2Util, SupportsNamespaces, TableCatalog} import org.apache.spark.sql.connector.catalog.TableChange.ColumnPosition import org.apache.spark.sql.connector.expressions.{ApplyTransform, BucketTransform, DaysTransform, Expression => V2Expression, FieldReference, HoursTransform, IdentityTransform, LiteralValue, MonthsTransform, Transform, YearsTransform} import org.apache.spark.sql.errors.QueryParsingErrors @@ -3215,7 +3215,15 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit throw QueryParsingErrors.cannotCleanReservedTablePropertyError( PROP_EXTERNAL, ctx, "please use CREATE EXTERNAL TABLE") case (PROP_EXTERNAL,
[spark] branch master updated: [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fa2bda5c4ea [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property fa2bda5c4ea is described below commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f Author: Wenchen Fan AuthorDate: Wed May 11 14:49:54 2022 +0800 [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the tab [...] This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/connector/catalog/TableCatalog.java | 6 ++ .../org/apache/spark/sql/catalyst/parser/AstBuilder.scala| 12 ++-- .../apache/spark/sql/connector/catalog/CatalogV2Util.scala | 3 ++- .../org/apache/spark/sql/connector/catalog/V1Table.scala | 6 +++--- .../sql/execution/datasources/v2/ShowCreateTableExec.scala | 10 +++--- 5 files changed, 28 insertions(+), 9 deletions(-) diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java index 9336c2a1cae..ec773ab90ad 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java @@ -47,6 +47,12 @@ public interface TableCatalog extends CatalogPlugin { */ String PROP_LOCATION = "location"; + /** + * A reserved property to indicate that the table location is managed, not user-specified. + * If this property is "true", SHOW CREATE TABLE will not generate the LOCATION clause. + */ + String PROP_IS_MANAGED_LOCATION = "is_managed_location"; + /** * A reserved property to specify a table was created with EXTERNAL. 
*/ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 448ebcb35b3..e6f7dba863b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -42,7 +42,7 @@ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.trees.CurrentOrigin import org.apache.spark.sql.catalyst.util.{CharVarcharUtils, DateTimeUtils, IntervalUtils, ResolveDefaultColumns} import org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, convertSpecialTimestamp, convertSpecialTimestampNTZ, getZoneId, stringToDate, stringToTimestamp, stringToTimestampWithoutTimeZone} -import org.apache.spark.sql.connector.catalog.{SupportsNamespaces, TableCatalog} +import org.apache.spark.sql.connector.catalog.{CatalogV2Util, SupportsNamespaces, TableCatalog} import org.apache.spark.sql.connector.catalog.TableChange.ColumnPosition import org.apache.spark.sql.connector.expressions.{ApplyTransform, BucketTransform, DaysTransform, Expression => V2Expression, FieldReference, HoursTransform, IdentityTransform, LiteralValue, MonthsTransform, Transform, YearsTransform} import org.apache.spark.sql.errors.QueryParsingErrors @@ -3288,7 +3288,15 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit throw QueryParsingErrors.cannotCleanReservedTablePropertyError( PROP_EXTERNAL, ctx, "please use CREATE EXTERNAL TABLE") case (PROP_EXTERNAL, _) => false - case _ => true + // It's safe to set whatever table comment, so we
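A minimal sketch of the decision rule this commit introduces, assuming only that the two reserved property names behave as described in the commit message (this is not the actual `ShowCreateTableExec` code):

```scala
// Emit a LOCATION clause only when a "location" property is present and the
// catalog has not flagged it as managed via "is_managed_location".
object LocationClauseSketch {
  val PropLocation = "location"
  val PropIsManagedLocation = "is_managed_location"

  def locationClause(properties: Map[String, String]): Option[String] = {
    val managed = properties.get(PropIsManagedLocation).exists(_.equalsIgnoreCase("true"))
    properties.get(PropLocation).filterNot(_ => managed).map(loc => s"LOCATION '$loc'")
  }

  def main(args: Array[String]): Unit = {
    // External table: the user-given location is printed.
    println(locationClause(Map("location" -> "/data/t")))
    // Some(LOCATION '/data/t')

    // Managed table: the location property is still carried, but suppressed.
    println(locationClause(Map("location" -> "/warehouse/t", "is_managed_location" -> "true")))
    // None
  }
}
```

This is why `V1Table` can keep the "location" property unconditionally: consumers such as DESCRIBE TABLE still see it, while SHOW CREATE TABLE suppresses it for managed tables.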
[spark] branch branch-3.3 updated: [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 7d57577037f [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost 7d57577037f is described below commit 7d57577037fe082b6b1ded093943669dd1f8dd05 Author: ulysses-you AuthorDate: Wed May 11 14:32:05 2022 +0800 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the following plans: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in the 3.3.0 branch via `spark.sql.legacy.useV1Command`. However, this is a behavior change between v1 and v2 commands. - v1 command: We resolve the logical plan to a command at the analyzer phase by `ResolveSessionCatalog` - v2 command: We resolve the logical plan to a v2 command at the physical phase by `DataSourceV2Strategy` For cost explain mode, we call `LogicalPlanStats.stats` on the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains a `LeafNode` which does not override `computeStats`. As a result, there is an error when running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add test. Closes #36488 from ulysses-you/SPARK-39112. 
Authored-by: ulysses-you Signed-off-by: Wenchen Fan (cherry picked from commit 06fd340daefd67a3e96393539401c9bf4b3cbde9) Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/v2ResolutionPlans.scala | 28 -- .../scala/org/apache/spark/sql/ExplainSuite.scala | 15 2 files changed, 36 insertions(+), 7 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala index 4cffead93b2..a87f9e0082d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.analysis import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{Attribute, LeafExpression, Unevaluable} -import org.apache.spark.sql.catalyst.plans.logical.LeafNode +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} import org.apache.spark.sql.catalyst.trees.TreePattern.{TreePattern, UNRESOLVED_FUNC} import org.apache.spark.sql.catalyst.util.CharVarcharUtils import org.apache.spark.sql.connector.catalog.{CatalogPlugin, FunctionCatalog,
[spark] branch master updated: [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 06fd340daef [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost 06fd340daef is described below commit 06fd340daefd67a3e96393539401c9bf4b3cbde9 Author: ulysses-you AuthorDate: Wed May 11 14:32:05 2022 +0800 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the following plans: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in the 3.3.0 branch via `spark.sql.legacy.useV1Command`. However, this is a behavior change between v1 and v2 commands. - v1 command: We resolve the logical plan to a command at the analyzer phase by `ResolveSessionCatalog` - v2 command: We resolve the logical plan to a v2 command at the physical phase by `DataSourceV2Strategy` For cost explain mode, we call `LogicalPlanStats.stats` on the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains a `LeafNode` which does not override `computeStats`. As a result, there is an error when running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add test. Closes #36488 from ulysses-you/SPARK-39112. 
Authored-by: ulysses-you Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/v2ResolutionPlans.scala | 28 -- .../scala/org/apache/spark/sql/ExplainSuite.scala | 15 2 files changed, 36 insertions(+), 7 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala index 4cffead93b2..a87f9e0082d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.analysis import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{Attribute, LeafExpression, Unevaluable} -import org.apache.spark.sql.catalyst.plans.logical.LeafNode +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} import org.apache.spark.sql.catalyst.trees.TreePattern.{TreePattern, UNRESOLVED_FUNC} import org.apache.spark.sql.catalyst.util.CharVarcharUtils import org.apache.spark.sql.connector.catalog.{CatalogPlugin, FunctionCatalog, Identifier, Table, TableCatalog} @@ -140,11 +140,19 @@ case class UnresolvedDBObjectName(nameParts: Seq[String],
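A self-contained sketch of the idea behind the fix, using simplified stand-in types (the real trait lives in Catalyst and is mixed into the resolved plans listed above; the dummy value chosen here is an assumption for illustration):

```scala
// Analysis-only leaf plans report placeholder statistics instead of throwing
// from the default computeStats, so cost-mode explain cannot blow up on them.
final case class Statistics(sizeInBytes: BigInt)

trait LeafNode {
  // Mirrors Catalyst's default behaviour that triggered the exception.
  def computeStats(): Statistics =
    throw new UnsupportedOperationException("this leaf node does not compute stats")
}

trait LeafNodeWithoutStats extends LeafNode {
  // These plans never survive past analysis, so a dummy value is safe.
  override def computeStats(): Statistics = Statistics(sizeInBytes = 1)
}

object ResolvedNamespaceSketch extends LeafNodeWithoutStats

object CostExplainDemo {
  def main(args: Array[String]): Unit = {
    // Cost-mode explain can now ask such a leaf for stats without throwing.
    println(ResolvedNamespaceSketch.computeStats()) // Statistics(1)
  }
}
```

The design choice is to keep the default `computeStats` strict for ordinary leaves (where missing stats indicate a real bug) and opt out only for the resolved v2 plans that exist purely during analysis.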