[spark] branch master updated: [SPARK-37702][SQL] Support ANSI Aggregate Function: regr_syy
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6604929fe6d [SPARK-37702][SQL] Support ANSI Aggregate Function: regr_syy
6604929fe6d is described below

commit 6604929fe6dee7216a24b4d0c96d478cae13e2b9
Author: Jiaan Geng
AuthorDate: Thu May 12 13:30:26 2022 +0800

    [SPARK-37702][SQL] Support ANSI Aggregate Function: regr_syy

    ### What changes were proposed in this pull request?
    This PR adds support for the ANSI aggregate function `regr_syy`.
    The mainstream databases that support `regr_syy` are shown below:

    **Teradata** https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/ZsXbiMrja5EYTft42VoiTQ
    **Snowflake** https://docs.snowflake.com/en/sql-reference/functions/regr_syy.html
    **Oracle** https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
    **DB2** https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
    **H2** http://www.h2database.com/html/functions-aggregate.html#regr_syy
    **Postgresql** https://www.postgresql.org/docs/8.4/functions-aggregate.html
    **Sybase** https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-syy-function.html
    **Exasol** https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

    ### Why are the changes needed?
    `regr_syy` belongs to the ANSI SQL standard set of linear-regression aggregates and is widely supported by mainstream databases (see the links above).

    ### Does this PR introduce _any_ user-facing change?
    Yes. New feature.

    ### How was this patch tested?
    New tests.

    Closes #36292 from beliefer/SPARK-37702.

    Authored-by: Jiaan Geng
    Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../expressions/aggregate/CentralMomentAgg.scala   |  4 +-
 .../expressions/aggregate/linearRegression.scala   | 33 +++-
 .../sql-functions/sql-expression-schema.md         |  1 +
 .../test/resources/sql-tests/inputs/group-by.sql   |  6 +++
 .../inputs/postgreSQL/aggregates_part1.sql         | 23
 .../inputs/udf/postgreSQL/udf-aggregates_part1.sql | 23
 .../resources/sql-tests/results/group-by.sql.out   | 35 +++-
 .../results/postgreSQL/aggregates_part1.sql.out    | 62 +-
 .../udf/postgreSQL/udf-aggregates_part1.sql.out    | 62 +-
 10 files changed, 220 insertions(+), 30 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index f7af5b35a3b..a56ef175b5e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -503,6 +503,7 @@ object FunctionRegistry {
     expression[RegrR2]("regr_r2"),
     expression[RegrSXX]("regr_sxx"),
     expression[RegrSXY]("regr_sxy"),
+    expression[RegrSYY]("regr_syy"),

     // string functions
     expression[Ascii]("ascii"),
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
index a40c5e4815f..f9bfe77ce5a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
@@ -264,7 +264,7 @@ case class VarianceSamp(
     copy(child = newChild)
 }

-case class RegrSXXReplacement(child: Expression)
+case class RegrReplacement(child: Expression)
   extends CentralMomentAgg(child, !SQLConf.get.legacyStatisticalAggregate) {

   override protected def momentOrder = 2
@@ -273,7 +273,7 @@ case class RegrSXXReplacement(child: Expression)
     If(n === 0.0, Literal.create(null, DoubleType), m2)
   }

-  override protected def withNewChildInternal(newChild: Expression): RegrSXXReplacement =
+  override protected def withNewChildInternal(newChild: Expression): RegrReplacement =
     copy(child = newChild)
 }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
index 568c186f06d..0372f94031e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
@@ -174,7 +174,7 @@ case class
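For reference, a minimal usage sketch of the new aggregate (not part of the commit above; it assumes a running `SparkSession` named `spark` on a build that includes this change, and the inline `VALUES` data is made up):

```python
# Hedged sketch: regr_syy(y, x) aggregates the sum of squares of the
# dependent variable y over rows where both y and x are non-null.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, 3), (2, 4) AS t(y, x)"
).show()
# For y = 1, 2, 2 the result is sum((y - avg(y))^2) = 2/3 ~= 0.6667
```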
[spark] branch branch-3.2 updated: [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 6f9e3034ada [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow
6f9e3034ada is described below

commit 6f9e3034ada72f372dafe93152e01ad5cb323989
Author: Vitalii Li
AuthorDate: Thu May 12 08:13:51 2022 +0300

    [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow

    ### What changes were proposed in this pull request?
    This PR removes an extra curly bracket from the debug string for the Decimal type in SQL. This is a backport from the master branch.
    Commit: https://github.com/apache/spark/commit/165ce4eb7d6d75201beb1bff879efa99fde24f94

    ### Why are the changes needed?
    Fixes a typo in the error messages of decimal overflow.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    By running tests:
    ```
    $ build/sbt "sql/testOnly"
    ```

    Closes #36458 from vli-databricks/SPARK-39060-3.2.

    Authored-by: Vitalii Li
    Signed-off-by: Max Gekk
---
 .../src/main/scala/org/apache/spark/sql/types/Decimal.scala    | 4 ++--
 .../sql-tests/results/ansi/decimalArithmeticOperations.sql.out | 8
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
index 46814297231..bc5fba8d0d8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
@@ -227,9 +227,9 @@ final class Decimal extends Ordered[Decimal] with Serializable {

   def toDebugString: String = {
     if (decimalVal.ne(null)) {
-      s"Decimal(expanded,$decimalVal,$precision,$scale})"
+      s"Decimal(expanded, $decimalVal, $precision, $scale)"
     } else {
-      s"Decimal(compact,$longVal,$precision,$scale})"
+      s"Decimal(compact, $longVal, $precision, $scale)"
     }
   }

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out
index 2f3513e734f..c65742e4d8b 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out
@@ -76,7 +76,7 @@ select (5e36BD + 0.1) + 5e36BD
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,10.1,39,1}) cannot be represented as Decimal(38, 1).
+Decimal(expanded, 10.1, 39, 1) cannot be represented as Decimal(38, 1).


 -- !query
@@ -85,7 +85,7 @@ select (-4e36BD - 0.1) - 7e36BD
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,-11.1,39,1}) cannot be represented as Decimal(38, 1).
+Decimal(expanded, -11.1, 39, 1) cannot be represented as Decimal(38, 1).


 -- !query
@@ -94,7 +94,7 @@ select 12345678901234567890.0 * 12345678901234567890.0
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,152415787532388367501905199875019052100,39,0}) cannot be represented as Decimal(38, 2).
+Decimal(expanded, 152415787532388367501905199875019052100, 39, 0) cannot be represented as Decimal(38, 2).


 -- !query
@@ -103,7 +103,7 @@ select 1e35BD / 0.1
 struct<>
 -- !query output
 java.lang.ArithmeticException
-Decimal(expanded,1000000000000000000000000000000000000,37,0}) cannot be represented as Decimal(38, 6).
+Decimal(expanded, 1000000000000000000000000000000000000, 37, 0) cannot be represented as Decimal(38, 6).
 -- !query
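A hedged repro sketch of the corrected message (assumes an existing SparkSession `spark`; ANSI mode matches the `ansi/` golden file above, where the overflow throws instead of returning null):

```python
# Minimal sketch: trigger a decimal overflow and inspect the error text.
spark.conf.set("spark.sql.ansi.enabled", True)
try:
    spark.sql("SELECT (5e36BD + 0.1) + 5e36BD").collect()
except Exception as e:
    # Expect: "Decimal(expanded, 10.1, 39, 1) cannot be represented as
    # Decimal(38, 1)." -- with no stray "}" before the closing parenthesis.
    print(e)
```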
[spark] branch branch-3.3 updated: [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 0baa5c7d2b7 [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
0baa5c7d2b7 is described below

commit 0baa5c7d2b71f379fad6a8a0b72f427acf70f4e4
Author: Wenchen Fan
AuthorDate: Wed May 11 21:58:14 2022 -0700

    [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject

    This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958

    In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be of any shape.

    Fixes perf regression

    No

    new tests

    Closes #36510 from cloud-fan/bug.

    Lead-authored-by: Wenchen Fan
    Co-authored-by: Wenchen Fan
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d)
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala      | 14 --
 .../sql/catalyst/optimizer/CollapseProjectSuite.scala | 18 +++---
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 753d81e4003..759a7044f15 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -997,12 +997,14 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
     }
   }

-  @scala.annotation.tailrec
-  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = expr match {
-    case a: Alias => isExtractOnly(a.child, ref)
-    case e: ExtractValue => isExtractOnly(e.children.head, ref)
-    case a: Attribute => a.semanticEquals(ref)
-    case _ => false
+  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = {
+    def hasRefInNonExtractValue(e: Expression): Boolean = e match {
+      case a: Attribute => a.semanticEquals(ref)
+      // The first child of `ExtractValue` is the complex type to be extracted.
+      case e: ExtractValue if e.children.head.semanticEquals(ref) => false
+      case _ => e.children.exists(hasRefInNonExtractValue)
+    }
+    !hasRefInNonExtractValue(expr)
   }

   /**
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
index c1d13d14b05..f6c3209726b 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
@@ -30,7 +30,8 @@ class CollapseProjectSuite extends PlanTest {
   object Optimize extends RuleExecutor[LogicalPlan] {
     val batches =
       Batch("Subqueries", FixedPoint(10), EliminateSubqueryAliases) ::
-      Batch("CollapseProject", Once, CollapseProject) :: Nil
+      Batch("CollapseProject", Once, CollapseProject) ::
+      Batch("SimplifyExtractValueOps", Once, SimplifyExtractValueOps) :: Nil
   }

   val testRelation = LocalRelation('a.int, 'b.int)
@@ -123,12 +124,23 @@ class CollapseProjectSuite extends PlanTest {

   test("SPARK-36718: do not collapse project if non-cheap expressions will be repeated") {
     val query = testRelation
-      .select(('a + 1).as('a_plus_1))
-      .select(('a_plus_1 + 'a_plus_1).as('a_2_plus_2))
+      .select(($"a" + 1).as("a_plus_1"))
+      .select(($"a_plus_1" + $"a_plus_1").as("a_2_plus_2"))
       .analyze
     val optimized = Optimize.execute(query)
     comparePlans(optimized, query)
+
+    // CreateStruct is an exception if it's only referenced by ExtractValue.
+    val query2 = testRelation
+      .select(namedStruct("a", $"a", "a_plus_1", $"a" + 1).as("struct"))
+      .select(($"struct".getField("a") + $"struct".getField("a_plus_1")).as("add"))
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val expected2 = testRelation
+      .select(($"a" + ($"a" + 1)).as("add"))
+      .analyze
+    comparePlans(optimized2, expected2)
   }

   test("preserve top-level alias metadata while collapsing projects") {
[spark] branch master updated: [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 547f032d04b [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject
547f032d04b is described below

commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d
Author: Wenchen Fan
AuthorDate: Wed May 11 21:58:14 2022 -0700

    [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject

    ### What changes were proposed in this pull request?
    This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958

    In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be of any shape.

    ### Why are the changes needed?
    Fixes perf regression

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    new tests

    Closes #36510 from cloud-fan/bug.

    Lead-authored-by: Wenchen Fan
    Co-authored-by: Wenchen Fan
    Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala      | 14 --
 .../sql/catalyst/optimizer/CollapseProjectSuite.scala | 18 +++---
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 3a36e506f4e..9215609f154 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1014,12 +1014,14 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {
     }
   }

-  @scala.annotation.tailrec
-  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = expr match {
-    case a: Alias => isExtractOnly(a.child, ref)
-    case e: ExtractValue => isExtractOnly(e.children.head, ref)
-    case a: Attribute => a.semanticEquals(ref)
-    case _ => false
+  private def isExtractOnly(expr: Expression, ref: Attribute): Boolean = {
+    def hasRefInNonExtractValue(e: Expression): Boolean = e match {
+      case a: Attribute => a.semanticEquals(ref)
+      // The first child of `ExtractValue` is the complex type to be extracted.
+      case e: ExtractValue if e.children.head.semanticEquals(ref) => false
+      case _ => e.children.exists(hasRefInNonExtractValue)
+    }
+    !hasRefInNonExtractValue(expr)
   }

   /**
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
index 93646b6f1bc..dd075837d51 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
@@ -30,7 +30,8 @@ class CollapseProjectSuite extends PlanTest {
   object Optimize extends RuleExecutor[LogicalPlan] {
     val batches =
       Batch("Subqueries", FixedPoint(10), EliminateSubqueryAliases) ::
-      Batch("CollapseProject", Once, CollapseProject) :: Nil
+      Batch("CollapseProject", Once, CollapseProject) ::
+      Batch("SimplifyExtractValueOps", Once, SimplifyExtractValueOps) :: Nil
   }

   val testRelation = LocalRelation($"a".int, $"b".int)
@@ -125,12 +126,23 @@ class CollapseProjectSuite extends PlanTest {

   test("SPARK-36718: do not collapse project if non-cheap expressions will be repeated") {
     val query = testRelation
-      .select(($"a" + 1).as(Symbol("a_plus_1")))
-      .select(($"a_plus_1" + $"a_plus_1").as(Symbol("a_2_plus_2")))
+      .select(($"a" + 1).as("a_plus_1"))
+      .select(($"a_plus_1" + $"a_plus_1").as("a_2_plus_2"))
       .analyze
     val optimized = Optimize.execute(query)
     comparePlans(optimized, query)
+
+    // CreateStruct is an exception if it's only referenced by ExtractValue.
+    val query2 = testRelation
+      .select(namedStruct("a", $"a", "a_plus_1", $"a" + 1).as("struct"))
+      .select(($"struct".getField("a") + $"struct".getField("a_plus_1")).as("add"))
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val expected2 = testRelation
+      .select(($"a" + ($"a" + 1)).as("add"))
+      .analyze
+    comparePlans(optimized2, expected2)
   }

   test("preserve top-level alias metadata while collapsing projects") {
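A hedged sketch of the behavior this fix restores, from the user side (assumes an existing SparkSession `spark`; the column names are illustrative): the struct column below is consumed only through field extraction, so `CollapseProject` should merge the two selects into a single Project instead of leaving the struct to be built and re-read.

```python
from pyspark.sql import functions as F

df = spark.range(5).select(
    F.struct(F.col("id").alias("a"), (F.col("id") + 1).alias("a_plus_1")).alias("s")
)
out = df.select((F.col("s.a") + F.col("s.a_plus_1")).alias("add"))
# The optimized plan should show one Project over Range, mirroring the
# `query2`/`expected2` pair in the test above.
out.explain(True)
```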
[spark] branch branch-3.3 updated: [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 70becf29070 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
70becf29070 is described below

commit 70becf290700e88c7be248e4277421dd17f3af4b
Author: Xinrong Meng
AuthorDate: Thu May 12 12:22:11 2022 +0900

    [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion

    ### What changes were proposed in this pull request?
    Access the JVM through the passed-in GatewayClient during type conversion.

    ### Why are the changes needed?
    In customized type converters, we may utilize the passed-in GatewayClient to access the JVM, rather than relying on `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access the JVM.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing tests.

    Closes #36504 from xinrong-databricks/gateway_client_jvm.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 92fcf214c107358c1a70566b644cec2d35c096c0)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/types.py | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 2a41508d634..123fd628980 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -44,7 +44,7 @@ from typing import (
 )

 from py4j.protocol import register_input_converter
-from py4j.java_gateway import JavaClass, JavaGateway, JavaObject
+from py4j.java_gateway import GatewayClient, JavaClass, JavaObject

 from pyspark.serializers import CloudPickleSerializer

@@ -1929,7 +1929,7 @@ class DateConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.date)

-    def convert(self, obj: datetime.date, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.date, gateway_client: GatewayClient) -> JavaObject:
         Date = JavaClass("java.sql.Date", gateway_client)
         return Date.valueOf(obj.strftime("%Y-%m-%d"))

@@ -1938,7 +1938,7 @@ class DatetimeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.datetime)

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         Timestamp = JavaClass("java.sql.Timestamp", gateway_client)
         seconds = (
             calendar.timegm(obj.utctimetuple()) if obj.tzinfo else time.mktime(obj.timetuple())
@@ -1958,27 +1958,25 @@ class DatetimeNTZConverter:
             and is_timestamp_ntz_preferred()
         )

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         seconds = calendar.timegm(obj.utctimetuple())
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.DateTimeUtils.microsToLocalDateTime(
-            int(seconds) * 1000000 + obj.microsecond
+        DateTimeUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.DateTimeUtils",
+            gateway_client,
         )
+        return DateTimeUtils.microsToLocalDateTime(int(seconds) * 1000000 + obj.microsecond)


 class DayTimeIntervalTypeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.timedelta)

-    def convert(self, obj: datetime.timedelta, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.IntervalUtils.microsToDuration(
+    def convert(self, obj: datetime.timedelta, gateway_client: GatewayClient) -> JavaObject:
+        IntervalUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.IntervalUtils",
+            gateway_client,
+        )
+        return IntervalUtils.microsToDuration(
             (math.floor(obj.total_seconds()) * 1000000) + obj.microseconds
         )
[spark] branch master updated: [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 92fcf214c10 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion
92fcf214c10 is described below

commit 92fcf214c107358c1a70566b644cec2d35c096c0
Author: Xinrong Meng
AuthorDate: Thu May 12 12:22:11 2022 +0900

    [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion

    ### What changes were proposed in this pull request?
    Access the JVM through the passed-in GatewayClient during type conversion.

    ### Why are the changes needed?
    In customized type converters, we may utilize the passed-in GatewayClient to access the JVM, rather than relying on `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access the JVM.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing tests.

    Closes #36504 from xinrong-databricks/gateway_client_jvm.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/types.py | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 2a41508d634..123fd628980 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -44,7 +44,7 @@ from typing import (
 )

 from py4j.protocol import register_input_converter
-from py4j.java_gateway import JavaClass, JavaGateway, JavaObject
+from py4j.java_gateway import GatewayClient, JavaClass, JavaObject

 from pyspark.serializers import CloudPickleSerializer

@@ -1929,7 +1929,7 @@ class DateConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.date)

-    def convert(self, obj: datetime.date, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.date, gateway_client: GatewayClient) -> JavaObject:
         Date = JavaClass("java.sql.Date", gateway_client)
         return Date.valueOf(obj.strftime("%Y-%m-%d"))

@@ -1938,7 +1938,7 @@ class DatetimeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.datetime)

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         Timestamp = JavaClass("java.sql.Timestamp", gateway_client)
         seconds = (
             calendar.timegm(obj.utctimetuple()) if obj.tzinfo else time.mktime(obj.timetuple())
@@ -1958,27 +1958,25 @@ class DatetimeNTZConverter:
             and is_timestamp_ntz_preferred()
         )

-    def convert(self, obj: datetime.datetime, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
+    def convert(self, obj: datetime.datetime, gateway_client: GatewayClient) -> JavaObject:
         seconds = calendar.timegm(obj.utctimetuple())
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.DateTimeUtils.microsToLocalDateTime(
-            int(seconds) * 1000000 + obj.microsecond
+        DateTimeUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.DateTimeUtils",
+            gateway_client,
        )
+        return DateTimeUtils.microsToLocalDateTime(int(seconds) * 1000000 + obj.microsecond)


 class DayTimeIntervalTypeConverter:
     def can_convert(self, obj: Any) -> bool:
         return isinstance(obj, datetime.timedelta)

-    def convert(self, obj: datetime.timedelta, gateway_client: JavaGateway) -> JavaObject:
-        from pyspark import SparkContext
-
-        jvm = SparkContext._jvm
-        assert jvm is not None
-        return jvm.org.apache.spark.sql.catalyst.util.IntervalUtils.microsToDuration(
+    def convert(self, obj: datetime.timedelta, gateway_client: GatewayClient) -> JavaObject:
+        IntervalUtils = JavaClass(
+            "org.apache.spark.sql.catalyst.util.IntervalUtils",
+            gateway_client,
+        )
+        return IntervalUtils.microsToDuration(
             (math.floor(obj.total_seconds()) * 1000000) + obj.microseconds
         )
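A minimal custom-converter sketch in the same style (hypothetical converter, not part of this commit): it resolves JVM classes through the `gateway_client` that Py4J passes in, rather than through `SparkContext._jvm`, exactly as the converters above now do.

```python
import decimal

from py4j.java_gateway import JavaClass
from py4j.protocol import register_input_converter


class PyDecimalConverter:
    def can_convert(self, obj):
        return isinstance(obj, decimal.Decimal)

    def convert(self, obj, gateway_client):
        # JavaClass is callable: invoking it calls the Java constructor.
        BigDecimal = JavaClass("java.math.BigDecimal", gateway_client)
        return BigDecimal(str(obj))


# Registered converters are consulted when Python arguments cross into the JVM.
register_input_converter(PyDecimalConverter())
```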
[spark] branch branch-3.3 updated: [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new ebe4252e415 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
ebe4252e415 is described below

commit ebe4252e415e6afdf888e21d0b89ab744fd2dac7
Author: Wenchen Fan
AuthorDate: Thu May 12 11:18:18 2022 +0800

    [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode

    ### What changes were proposed in this pull request?
    This is a bug in the command legacy mode, as it does not fully restore the legacy behavior: the legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No change by default; if people turn on legacy mode, the SHOW DATABASES command won't quote the database names.

    ### How was this patch tested?
    New tests.

    Closes #36508 from cloud-fan/regression.

    Authored-by: Wenchen Fan
    Signed-off-by: Wenchen Fan
    (cherry picked from commit 3094e495095635f6c9e83f4646d3321c2a9311f4)
    Signed-off-by: Wenchen Fan
---
 .../execution/datasources/v2/ShowNamespacesExec.scala    | 11 ++-
 .../sql/execution/command/ShowNamespacesSuiteBase.scala  | 17 +
 .../sql/execution/command/v1/ShowNamespacesSuite.scala   | 13 +
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
index 9dafbd79a52..c55c7b9f985 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
@@ -42,8 +42,17 @@ case class ShowNamespacesExec(
       catalog.listNamespaces()
     }

+    // Please refer to the rule `KeepLegacyOutputs` for details about legacy command.
+    // The legacy SHOW DATABASES command does not quote the database names.
+    val isLegacy = output.head.name == "databaseName"
+    val namespaceNames = if (isLegacy && namespaces.forall(_.length == 1)) {
+      namespaces.map(_.head)
+    } else {
+      namespaces.map(_.quoted)
+    }
+
     val rows = new ArrayBuffer[InternalRow]()
-    namespaces.map(_.quoted).map { ns =>
+    namespaceNames.map { ns =>
       if (pattern.map(StringUtils.filterPattern(Seq(ns), _).nonEmpty).getOrElse(true)) {
         rows += toCatalystRow(ns)
       }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
index b3693845c3b..80e545f6e3c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
@@ -42,6 +42,9 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
   protected def builtinTopNamespaces: Seq[String] = Seq.empty

   protected def isCasePreserving: Boolean = true
+  protected def createNamespaceWithSpecialName(ns: String): Unit = {
+    sql(s"CREATE NAMESPACE $catalog.`$ns`")
+  }

   test("default namespace") {
     withSQLConf(SQLConf.DEFAULT_CATALOG.key -> catalog) {
@@ -124,6 +127,20 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
     }
   }

+  test("SPARK-39149: keep the legacy no-quote behavior") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        withNamespace(s"$catalog.`123`") {
+          createNamespaceWithSpecialName("123")
+          val res = if (legacy) "123" else "`123`"
+          checkAnswer(
+            sql(s"SHOW NAMESPACES IN $catalog"),
+            (res +: builtinTopNamespaces).map(Row(_)))
+        }
+      }
+    }
+  }
+
   test("case sensitivity of the pattern string") {
     Seq(true, false).foreach { caseSensitive =>
       withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
index a1b32e42ae2..b65a9acb656 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
@@ -18,7 +18,9 @@ package
[spark] branch master updated: [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3094e495095 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode
3094e495095 is described below

commit 3094e495095635f6c9e83f4646d3321c2a9311f4
Author: Wenchen Fan
AuthorDate: Thu May 12 11:18:18 2022 +0800

    [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode

    ### What changes were proposed in this pull request?
    This is a bug in the command legacy mode, as it does not fully restore the legacy behavior: the legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No change by default; if people turn on legacy mode, the SHOW DATABASES command won't quote the database names.

    ### How was this patch tested?
    New tests.

    Closes #36508 from cloud-fan/regression.

    Authored-by: Wenchen Fan
    Signed-off-by: Wenchen Fan
---
 .../execution/datasources/v2/ShowNamespacesExec.scala    | 11 ++-
 .../sql/execution/command/ShowNamespacesSuiteBase.scala  | 17 +
 .../sql/execution/command/v1/ShowNamespacesSuite.scala   | 13 +
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
index 9dafbd79a52..c55c7b9f985 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
@@ -42,8 +42,17 @@ case class ShowNamespacesExec(
       catalog.listNamespaces()
     }

+    // Please refer to the rule `KeepLegacyOutputs` for details about legacy command.
+    // The legacy SHOW DATABASES command does not quote the database names.
+    val isLegacy = output.head.name == "databaseName"
+    val namespaceNames = if (isLegacy && namespaces.forall(_.length == 1)) {
+      namespaces.map(_.head)
+    } else {
+      namespaces.map(_.quoted)
+    }
+
     val rows = new ArrayBuffer[InternalRow]()
-    namespaces.map(_.quoted).map { ns =>
+    namespaceNames.map { ns =>
       if (pattern.map(StringUtils.filterPattern(Seq(ns), _).nonEmpty).getOrElse(true)) {
         rows += toCatalystRow(ns)
       }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
index b3693845c3b..80e545f6e3c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowNamespacesSuiteBase.scala
@@ -42,6 +42,9 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
   protected def builtinTopNamespaces: Seq[String] = Seq.empty

   protected def isCasePreserving: Boolean = true
+  protected def createNamespaceWithSpecialName(ns: String): Unit = {
+    sql(s"CREATE NAMESPACE $catalog.`$ns`")
+  }

   test("default namespace") {
     withSQLConf(SQLConf.DEFAULT_CATALOG.key -> catalog) {
@@ -124,6 +127,20 @@ trait ShowNamespacesSuiteBase extends QueryTest with DDLCommandTestUtils {
     }
   }

+  test("SPARK-39149: keep the legacy no-quote behavior") {
+    Seq(true, false).foreach { legacy =>
+      withSQLConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA.key -> legacy.toString) {
+        withNamespace(s"$catalog.`123`") {
+          createNamespaceWithSpecialName("123")
+          val res = if (legacy) "123" else "`123`"
+          checkAnswer(
+            sql(s"SHOW NAMESPACES IN $catalog"),
+            (res +: builtinTopNamespaces).map(Row(_)))
+        }
+      }
+    }
+  }
+
   test("case sensitivity of the pattern string") {
     Seq(true, false).foreach { caseSensitive =>
       withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
index a1b32e42ae2..b65a9acb656 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowNamespacesSuite.scala
@@ -18,7 +18,9 @@ package org.apache.spark.sql.execution.command.v1

 import org.apache.spark.sql.AnalysisException
+import
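A hedged repro sketch of the user-visible difference (assumes an existing SparkSession `spark`; the config key for `LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA` is assumed to be `spark.sql.legacy.keepCommandOutputSchema`, and the database name `123` mirrors the new test above):

```python
spark.sql("CREATE DATABASE `123`")

# Legacy mode: the name is reported without backquotes.
spark.conf.set("spark.sql.legacy.keepCommandOutputSchema", True)
spark.sql("SHOW DATABASES").show()   # expect a row "123"

# Default mode: the name is quoted.
spark.conf.set("spark.sql.legacy.keepCommandOutputSchema", False)
spark.sql("SHOW DATABASES").show()   # expect a row "`123`"

spark.sql("DROP DATABASE `123`")
```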
[spark] branch master updated: [SPARK-39081][PYTHON][SQL] Implement DataFrame.resample and Series.resample
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6ddb5ae5766 [SPARK-39081][PYTHON][SQL] Implement DataFrame.resample and Series.resample
6ddb5ae5766 is described below

commit 6ddb5ae57665e10596182a0ed1d7c683be36078e
Author: Ruifeng Zheng
AuthorDate: Thu May 12 09:16:43 2022 +0900

    [SPARK-39081][PYTHON][SQL] Implement DataFrame.resample and Series.resample

    ### What changes were proposed in this pull request?
    Implement DataFrame.resample and Series.resample

    ### Why are the changes needed?
    To increase pandas API coverage in PySpark

    ### Does this PR introduce _any_ user-facing change?
    Yes, new methods added. For example:

    ```
    In [3]:
       ...: dates = [
       ...:     datetime.datetime(2011, 12, 31),
       ...:     datetime.datetime(2012, 1, 2),
       ...:     pd.NaT,
       ...:     datetime.datetime(2013, 5, 3),
       ...:     datetime.datetime(2022, 5, 3),
       ...: ]
       ...: pdf = pd.DataFrame(np.ones(len(dates)), index=pd.DatetimeIndex(dates), columns=['A'])
       ...: psdf = ps.from_pandas(pdf)
       ...: psdf.resample('3Y').sum().sort_index()
       ...:
    Out[3]:
                  A
    2011-12-31  1.0
    2014-12-31  2.0
    2017-12-31  0.0
    2020-12-31  0.0
    2023-12-31  1.0
    ```

    ### How was this patch tested?
    added UT

    Closes #36420 from zhengruifeng/impl_resample.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Hyukjin Kwon
---
 dev/sparktestsupport/modules.py                    |   1 +
 .../docs/source/reference/pyspark.pandas/frame.rst |   1 +
 .../source/reference/pyspark.pandas/series.rst     |   1 +
 .../pandas_on_spark/supported_pandas_api.rst       |   8 +-
 python/pyspark/pandas/frame.py                     |  70 +++
 python/pyspark/pandas/missing/frame.py             |   1 -
 python/pyspark/pandas/missing/resample.py          | 105 +
 python/pyspark/pandas/missing/series.py            |   1 -
 python/pyspark/pandas/resample.py                  | 500 +
 python/pyspark/pandas/series.py                    | 135 ++
 python/pyspark/pandas/tests/test_resample.py       | 281
 .../spark/sql/api/python/PythonSQLUtils.scala      |  21 +
 12 files changed, 1121 insertions(+), 4 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index ed1eeb9b807..fc9e2ced9a9 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -643,6 +643,7 @@ pyspark_pandas = Module(
         "pyspark.pandas.tests.test_ops_on_diff_frames_groupby_expanding",
         "pyspark.pandas.tests.test_ops_on_diff_frames_groupby_rolling",
         "pyspark.pandas.tests.test_repr",
+        "pyspark.pandas.tests.test_resample",
         "pyspark.pandas.tests.test_reshape",
         "pyspark.pandas.tests.test_rolling",
         "pyspark.pandas.tests.test_series_conversion",
diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst
index 75a8941ad78..bcf9694a7ae 100644
--- a/python/docs/source/reference/pyspark.pandas/frame.rst
+++ b/python/docs/source/reference/pyspark.pandas/frame.rst
@@ -260,6 +260,7 @@ Time series-related
 .. autosummary::
    :toctree: api/

+   DataFrame.resample
    DataFrame.shift
    DataFrame.first_valid_index
    DataFrame.last_valid_index
diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index 48f67192349..1cf63c1a8ae 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -257,6 +257,7 @@ Time series-related
    :toctree: api/

    Series.asof
+   Series.resample
    Series.shift
    Series.first_valid_index
    Series.last_valid_index
diff --git a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
index f63b4a2f05d..8b456445a1e 100644
--- a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
@@ -362,7 +362,9 @@ Supported DataFrame APIs
 +--------------------+-------------+--------------------------------------+
 | :func:`replace`    | P           | ``regex``, ``method``                |
 +--------------------+-------------+--------------------------------------+
-| resample           | N           |                                      |
+| :func:`resample`   | P           | ``axis``, ``convention``, ``kind``   |
+|
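A companion sketch for the new `Series.resample` (hedged: it assumes a build that includes this commit and that the `mean` downsampling function and a `2D` rule are supported, as in pandas; the data is illustrative):

```python
import numpy as np
import pandas as pd
import pyspark.pandas as ps

# Six daily values 0.0 .. 5.0, downsampled to 2-day bins by mean.
pser = pd.Series(np.arange(6.0), index=pd.date_range("2022-01-01", periods=6, freq="D"))
psser = ps.from_pandas(pser)
print(psser.resample("2D").mean().sort_index())
# 2022-01-01    0.5
# 2022-01-03    2.5
# 2022-01-05    4.5
```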
[spark] branch branch-3.2 updated: [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new d1cae9c5ac5 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
d1cae9c5ac5 is described below

commit d1cae9c5ac5393243d2f9661dc7957d0ebccb1d6
Author: Xinrong Meng
AuthorDate: Thu May 12 09:13:27 2022 +0900

    [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index fd2e975aa9d..a82962a4373 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence**: It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch branch-3.3 updated: [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 65dd72743e4 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
65dd72743e4 is described below

commit 65dd72743e40a6276b69938f04fc69655bd8270e
Author: Xinrong Meng
AuthorDate: Thu May 12 09:13:27 2022 +0900

    [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index c0d9b18c085..67b8f6841f5 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch master updated: [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f37150a5549 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index
f37150a5549 is described below

commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac
Author: Xinrong Meng
AuthorDate: Thu May 12 09:13:27 2022 +0900

    [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index c0d9b18c085..67b8f6841f5 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch branch-3.2 updated: Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 345bf8c80ab Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
345bf8c80ab is described below

commit 345bf8c80abaf194d3cc4c25491faa23ac8de16c
Author: Hyukjin Kwon
AuthorDate: Thu May 12 09:12:39 2022 +0900

    Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"

    This reverts commit 0a9a21df7dea62c5f129d91c049d0d3408d0366e.
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index a82962a4373..fd2e975aa9d 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,7 +186,9 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence**: It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used. See the example below:
+index has to be used.
+Note that if more data are added to the data source after creating this index,
+then it does not guarantee the sequential index. See the example below:

 .. code-block:: python
[spark] branch branch-3.3 updated: Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new e4bb341d376 Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
e4bb341d376 is described below

commit e4bb341d37661e93097e56e0087699bca60825fb
Author: Hyukjin Kwon
AuthorDate: Thu May 12 09:12:13 2022 +0900

    Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"

    This reverts commit f75c00da3cf01e63d93cedbe480198413af41455.
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index 67b8f6841f5..c0d9b18c085 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,7 +186,9 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used. See the example below:
+index has to be used.
+Note that if more data are added to the data source after creating this index,
+then it does not guarantee the sequential index. See the example below:

 .. code-block:: python
[spark] branch master updated: Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 94cdbba699f Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"
94cdbba699f is described below

commit 94cdbba699f2f92dc93e8229d8e87565737fd20a
Author: Hyukjin Kwon
AuthorDate: Thu May 12 09:11:55 2022 +0900

    Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index"

    This reverts commit cec1e7b4e68deac321f409d424a3acdcd4cb91be.
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index 67b8f6841f5..c0d9b18c085 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,7 +186,9 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used. See the example below:
+index has to be used.
+Note that if more data are added to the data source after creating this index,
+then it does not guarantee the sequential index. See the example below:

 .. code-block:: python
[spark] branch branch-3.2 updated: [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 0a9a21df7de [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
0a9a21df7de is described below

commit 0a9a21df7dea62c5f129d91c049d0d3408d0366e
Author: Xinrong Meng
AuthorDate: Thu May 12 09:07:11 2022 +0900

    [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index

    ### What changes were proposed in this pull request?
    Remove outdated statements on the distributed-sequence default index.

    ### Why are the changes needed?
    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    ### Does this PR introduce _any_ user-facing change?
    No. Doc change only.

    ### How was this patch tested?
    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index fd2e975aa9d..a82962a4373 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence**: It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch branch-3.3 updated: [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new f75c00da3cf [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
f75c00da3cf is described below

commit f75c00da3cf01e63d93cedbe480198413af41455
Author: Xinrong Meng
AuthorDate: Thu May 12 09:07:11 2022 +0900

    [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index

    Remove outdated statements on the distributed-sequence default index.

    Since the distributed-sequence default index is now enforced only at execution time, there are stale statements in the documentation to remove.

    No. Doc change only.

    Manual tests.

    Closes #36513 from xinrong-databricks/defaultIndexDoc.

    Authored-by: Xinrong Meng
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be)
    Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index c0d9b18c085..67b8f6841f5 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below:
 **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and
 group-map approach in a distributed manner. It still generates the sequential index globally.
 If the default index must be the sequence in a large dataset, this
-index has to be used.
-Note that if more data are added to the data source after creating this index,
-then it does not guarantee the sequential index. See the example below:
+index has to be used. See the example below:

 .. code-block:: python
[spark] branch master updated: [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cec1e7b4e68 [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index cec1e7b4e68 is described below commit cec1e7b4e68deac321f409d424a3acdcd4cb91be Author: Xinrong Meng AuthorDate: Thu May 12 09:07:11 2022 +0900 [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on distributed-sequence default index. ### Why are the changes needed? Since the distributed-sequence default index is now enforced only during execution, there are stale statements in the documentation to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng Signed-off-by: Hyukjin Kwon --- python/docs/source/user_guide/pandas_on_spark/options.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst index c0d9b18c085..67b8f6841f5 100644 --- a/python/docs/source/user_guide/pandas_on_spark/options.rst +++ b/python/docs/source/user_guide/pandas_on_spark/options.rst @@ -186,9 +186,7 @@ This is conceptually equivalent to the PySpark example as below: **distributed-sequence** (default): It implements a sequence that increases one by one, by group-by and group-map approach in a distributed manner. It still generates the sequential index globally. If the default index must be the sequence in a large dataset, this -index has to be used. -Note that if more data are added to the data source after creating this index, -then it does not guarantee the sequential index. See the example below: +index has to be used. See the example below: .. code-block:: python
[spark] branch master updated: [SPARK-39147][SQL] Code simplification, use count() instead of filter().size, etc
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 09564df8485 [SPARK-39147][SQL] Code simplification, use count() instead of filter().size, etc 09564df8485 is described below commit 09564df8485d4ba27ba6d77b18a4635038ab2a1e Author: morvenhuang AuthorDate: Wed May 11 18:27:29 2022 -0500 [SPARK-39147][SQL] Code simplification, use count() instead of filter().size, etc ### What changes were proposed in this pull request? Use count() instead of filter().size, use df.count() instead of df.collect().size. ### Why are the changes needed? Code simplification. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #36507 from morvenhuang/SPARK-39147. Authored-by: morvenhuang Signed-off-by: Sean Owen --- core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala | 2 +- .../org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala | 4 ++-- .../scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala index fe76b1bc322..cf2240a0511 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/MapStatusSuite.scala @@ -263,7 +263,7 @@ class MapStatusSuite extends SparkFunSuite { val allBlocks = emptyBlocks ++: nonEmptyBlocks val skewThreshold = Utils.median(allBlocks, false) * accurateBlockSkewedFactor -assert(nonEmptyBlocks.filter(_ > skewThreshold).size == +assert(nonEmptyBlocks.count(_ > skewThreshold) == untrackedSkewedBlocksLength + trackedSkewedBlocksLength, "number of skewed block sizes") diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala index 3c5ab55a8a7..737d30a41d3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala @@ -132,8 +132,8 @@ object StreamingJoinHelper extends PredicateHelper with Logging { leftExpr.collect { case a: AttributeReference => a } ++ rightExpr.collect { case a: AttributeReference => a } ) -if (attributesInCondition.filter { attributesToFindStateWatermarkFor.contains(_) }.size > 1 || -attributesInCondition.filter { attributesWithEventWatermark.contains(_) }.size > 1) { +if (attributesInCondition.count(attributesToFindStateWatermarkFor.contains) > 1 || +attributesInCondition.count(attributesWithEventWatermark.contains) > 1) { // If more than attributes present in condition from one side, then it cannot be solved return None } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala index 8971f0c70af..d8081f4525a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala @@ -622,7 +622,7 @@ object PushFoldableIntoBranches extends Rule[LogicalPlan] with PredicateHelper { // To be 
conservative here: it's only a guaranteed win if all but at most only one branch // end up being not foldable. private def atMostOneUnfoldable(exprs: Seq[Expression]): Boolean = { -exprs.filterNot(_.foldable).size < 2 +exprs.count(!_.foldable) < 2 } // Not all UnaryExpression can be pushed into (if / case) branches, e.g. Alias.
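The refactoring above is mechanical; here is a minimal, self-contained Scala sketch of the pattern (the collection contents and threshold are made up for illustration and do not come from the commit):

```scala
// Seq.count(p) traverses once and allocates no intermediate collection,
// unlike filter(p).size, which first materializes the filtered Seq.
object CountVsFilterSize {
  def main(args: Array[String]): Unit = {
    val blockSizes = Seq(0L, 512L, 2048L, 0L, 4096L)
    val threshold = 1024L

    // Before: builds a temporary Seq just to take its size.
    val skewedBefore = blockSizes.filter(_ > threshold).size

    // After: a single pass with no intermediate allocation.
    val skewedAfter = blockSizes.count(_ > threshold)

    assert(skewedBefore == skewedAfter)
    println(s"skewed blocks: $skewedAfter") // skewed blocks: 2
  }
}
```

The two forms are always equivalent in result; the `count` form simply avoids the throwaway collection, which is why the change is safe to apply across the codebase.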
[spark] branch master updated: [SPARK-39079][SQL] Forbid dot in V2 catalog name
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c770daa9d46 [SPARK-39079][SQL] Forbid dot in V2 catalog name c770daa9d46 is described below commit c770daa9d468cf2c3cda117904478996cd88ebb9 Author: Cheng Pan AuthorDate: Wed May 11 23:42:39 2022 +0800 [SPARK-39079][SQL] Forbid dot in V2 catalog name ### What changes were proposed in this pull request? Forbid dot in V2 catalog name. ### Why are the changes needed? In the following configuration, we define 2 catalogs `test`, `test.cat`. ``` spark.sql.catalog.test=CatalogTestImpl spark.sql.catalog.test.k1=v1 spark.sql.catalog.test.cat=CatalogTestImpl spark.sql.catalog.test.cat.k1=v1 ``` In the current implementation, three keys will be treated as `test`'s configuration, which is a little bit weird. ``` k1=v1 cat=CatalogTestImpl cat.k1=v1 ``` Besides, the current implementation of `SHOW CATALOGS` uses a lazy-load strategy, which is not friendly for GUI clients like DBeaver that display the available catalog list; without this restriction, it is hard to find the catalog keys in the configuration to load catalogs eagerly. ### Does this PR introduce _any_ user-facing change? Yes. Previously Spark allowed the catalog name to contain `.`, but it does not after this change. ### How was this patch tested? New UT. Closes #36418 from pan3793/catalog. Authored-by: Cheng Pan Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/connector/catalog/Catalogs.scala | 11 --- .../org/apache/spark/sql/errors/QueryExecutionErrors.scala| 4 .../spark/sql/connector/catalog/CatalogLoadingSuite.java | 11 +++ 3 files changed, 23 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala index 9949f45d483..16983d239d1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/Catalogs.scala @@ -19,7 +19,6 @@ package org.apache.spark.sql.connector.catalog import java.lang.reflect.InvocationTargetException import java.util -import java.util.NoSuchElementException import java.util.regex.Pattern import org.apache.spark.SparkException @@ -39,13 +38,19 @@ private[sql] object Catalogs { * @param conf a SQLConf * @return an initialized CatalogPlugin * @throws CatalogNotFoundException if the plugin class cannot be found - * @throws org.apache.spark.SparkException if the plugin class cannot be instantiated + * @throws org.apache.spark.SparkException if the plugin class cannot be instantiated */ @throws[CatalogNotFoundException] @throws[SparkException] def load(name: String, conf: SQLConf): CatalogPlugin = { val pluginClassName = try { - conf.getConfString("spark.sql.catalog." 
+ name) + val _pluginClassName = conf.getConfString(s"spark.sql.catalog.$name") + // SPARK-39079 do configuration check first, otherwise some path-based table like + // `org.apache.spark.sql.json`.`/path/json_file` may fail on analyze phase + if (name.contains(".")) { +throw QueryExecutionErrors.invalidCatalogNameError(name) + } + _pluginClassName } catch { case _: NoSuchElementException => throw QueryExecutionErrors.catalogPluginClassNotFoundError(name) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index 53d32927cee..f9d3854fc5e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -1532,6 +1532,10 @@ object QueryExecutionErrors extends QueryErrorsBase { new UnsupportedOperationException(s"Invalid output mode: $outputMode") } + def invalidCatalogNameError(name: String): Throwable = { +new SparkException(s"Invalid catalog name: $name") + } + def catalogPluginClassNotFoundError(name: String): Throwable = { new CatalogNotFoundException( s"Catalog '$name' plugin class not found: spark.sql.catalog.$name is not defined") diff --git a/sql/catalyst/src/test/java/org/apache/spark/sql/connector/catalog/CatalogLoadingSuite.java b/sql/catalyst/src/test/java/org/apache/spark/sql/connector/catalog/CatalogLoadingSuite.java index 369e2fcaf1a..88ec8b6fc07 100644 --- a/sql/catalyst/src/test/java/org/apache/spark/sql/connector/catalog/CatalogLoadingSuite.java +++
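To see why the restriction is needed, here is a hypothetical, self-contained sketch of how every key under a catalog's configuration prefix is attributed to that catalog, as the commit message describes (the class name `com.example.CatalogTestImpl` and the option values are illustrative, not from the commit):

```scala
// With dotted catalog names, "spark.sql.catalog.test.cat" is ambiguous: it can
// mean catalog "test.cat" or option "cat" of catalog "test".
object CatalogKeyAmbiguity {
  private val Prefix = "spark.sql.catalog."

  def main(args: Array[String]): Unit = {
    // Illustrative configuration that tries to define catalogs `test` and `test.cat`.
    val conf = Map(
      "spark.sql.catalog.test" -> "com.example.CatalogTestImpl",
      "spark.sql.catalog.test.k1" -> "v1",
      "spark.sql.catalog.test.cat" -> "com.example.CatalogTestImpl",
      "spark.sql.catalog.test.cat.k1" -> "v1")

    // Everything under "spark.sql.catalog.test." is handed to catalog `test`,
    // including the keys that were meant to define the separate catalog `test.cat`.
    val testOptions = conf.collect {
      case (k, v) if k.startsWith(Prefix + "test.") =>
        k.stripPrefix(Prefix + "test.") -> v
    }
    println(testOptions)
    // Map(k1 -> v1, cat -> com.example.CatalogTestImpl, cat.k1 -> v1)
  }
}
```

Forbidding dots in catalog names removes the ambiguity and makes it possible to enumerate catalogs eagerly from the configuration keys.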
[spark] branch branch-3.3 updated: [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 8608baad7ab [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property 8608baad7ab is described below commit 8608baad7ab31eef0903b9229789e8112c9c1234 Author: Wenchen Fan AuthorDate: Wed May 11 14:49:54 2022 +0800 [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the tab [...] This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f) Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/connector/catalog/TableCatalog.java | 6 ++ .../org/apache/spark/sql/catalyst/parser/AstBuilder.scala| 12 ++-- .../apache/spark/sql/connector/catalog/CatalogV2Util.scala | 3 ++- .../org/apache/spark/sql/connector/catalog/V1Table.scala | 6 +++--- .../sql/execution/datasources/v2/ShowCreateTableExec.scala | 10 +++--- 5 files changed, 28 insertions(+), 9 deletions(-) diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java index 9336c2a1cae..ec773ab90ad 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java @@ -47,6 +47,12 @@ public interface TableCatalog extends CatalogPlugin { */ String PROP_LOCATION = "location"; + /** + * A reserved property to indicate that the table location is managed, not user-specified. + * If this property is "true", SHOW CREATE TABLE will not generate the LOCATION clause. + */ + String PROP_IS_MANAGED_LOCATION = "is_managed_location"; + /** * A reserved property to specify a table was created with EXTERNAL. 
*/ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 19544346447..ecc5360a4f7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -41,7 +41,7 @@ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.trees.CurrentOrigin import org.apache.spark.sql.catalyst.util.{CharVarcharUtils, DateTimeUtils, IntervalUtils} import org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, convertSpecialTimestamp, convertSpecialTimestampNTZ, getZoneId, stringToDate, stringToTimestamp, stringToTimestampWithoutTimeZone} -import org.apache.spark.sql.connector.catalog.{SupportsNamespaces, TableCatalog} +import org.apache.spark.sql.connector.catalog.{CatalogV2Util, SupportsNamespaces, TableCatalog} import org.apache.spark.sql.connector.catalog.TableChange.ColumnPosition import org.apache.spark.sql.connector.expressions.{ApplyTransform, BucketTransform, DaysTransform, Expression => V2Expression, FieldReference, HoursTransform, IdentityTransform, LiteralValue, MonthsTransform, Transform, YearsTransform} import org.apache.spark.sql.errors.QueryParsingErrors @@ -3215,7 +3215,15 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit throw QueryParsingErrors.cannotCleanReservedTablePropertyError( PROP_EXTERNAL, ctx, "please use CREATE EXTERNAL TABLE") case (PROP_EXTERNAL,
[spark] branch master updated: [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fa2bda5c4ea [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property fa2bda5c4ea is described below commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f Author: Wenchen Fan AuthorDate: Wed May 11 14:49:54 2022 +0800 [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the tab [...] This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/connector/catalog/TableCatalog.java | 6 ++ .../org/apache/spark/sql/catalyst/parser/AstBuilder.scala| 12 ++-- .../apache/spark/sql/connector/catalog/CatalogV2Util.scala | 3 ++- .../org/apache/spark/sql/connector/catalog/V1Table.scala | 6 +++--- .../sql/execution/datasources/v2/ShowCreateTableExec.scala | 10 +++--- 5 files changed, 28 insertions(+), 9 deletions(-) diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java index 9336c2a1cae..ec773ab90ad 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java @@ -47,6 +47,12 @@ public interface TableCatalog extends CatalogPlugin { */ String PROP_LOCATION = "location"; + /** + * A reserved property to indicate that the table location is managed, not user-specified. + * If this property is "true", SHOW CREATE TABLE will not generate the LOCATION clause. + */ + String PROP_IS_MANAGED_LOCATION = "is_managed_location"; + /** * A reserved property to specify a table was created with EXTERNAL. 
*/ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 448ebcb35b3..e6f7dba863b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -42,7 +42,7 @@ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.trees.CurrentOrigin import org.apache.spark.sql.catalyst.util.{CharVarcharUtils, DateTimeUtils, IntervalUtils, ResolveDefaultColumns} import org.apache.spark.sql.catalyst.util.DateTimeUtils.{convertSpecialDate, convertSpecialTimestamp, convertSpecialTimestampNTZ, getZoneId, stringToDate, stringToTimestamp, stringToTimestampWithoutTimeZone} -import org.apache.spark.sql.connector.catalog.{SupportsNamespaces, TableCatalog} +import org.apache.spark.sql.connector.catalog.{CatalogV2Util, SupportsNamespaces, TableCatalog} import org.apache.spark.sql.connector.catalog.TableChange.ColumnPosition import org.apache.spark.sql.connector.expressions.{ApplyTransform, BucketTransform, DaysTransform, Expression => V2Expression, FieldReference, HoursTransform, IdentityTransform, LiteralValue, MonthsTransform, Transform, YearsTransform} import org.apache.spark.sql.errors.QueryParsingErrors @@ -3288,7 +3288,15 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit throw QueryParsingErrors.cannotCleanReservedTablePropertyError( PROP_EXTERNAL, ctx, "please use CREATE EXTERNAL TABLE") case (PROP_EXTERNAL, _) => false - case _ => true + // It's safe to set whatever table comment, so we
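A minimal sketch of the decision rule this commit introduces, assuming only that the two reserved property names behave as described in the commit message (this is not the actual `ShowCreateTableExec` code):

```scala
// Emit a LOCATION clause only when a "location" property is present and the
// catalog has not flagged it as managed via "is_managed_location".
object LocationClauseSketch {
  val PropLocation = "location"
  val PropIsManagedLocation = "is_managed_location"

  def locationClause(properties: Map[String, String]): Option[String] = {
    val managed = properties.get(PropIsManagedLocation).exists(_.equalsIgnoreCase("true"))
    properties.get(PropLocation).filterNot(_ => managed).map(loc => s"LOCATION '$loc'")
  }

  def main(args: Array[String]): Unit = {
    // External table: the user-given location is printed.
    println(locationClause(Map("location" -> "/data/t")))
    // Some(LOCATION '/data/t')

    // Managed table: the location property is still carried, but suppressed.
    println(locationClause(Map("location" -> "/warehouse/t", "is_managed_location" -> "true")))
    // None
  }
}
```

This is why `V1Table` can keep the "location" property unconditionally: consumers such as DESCRIBE TABLE still see it, while SHOW CREATE TABLE suppresses it for managed tables.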
[spark] branch branch-3.3 updated: [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 7d57577037f [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost 7d57577037f is described below commit 7d57577037fe082b6b1ded093943669dd1f8dd05 Author: ulysses-you AuthorDate: Wed May 11 14:32:05 2022 +0800 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the following plans: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in the 3.3.0 branch via `spark.sql.legacy.useV1Command`. However, this is a behavior change between v1 and v2 commands. - v1 command: We resolve the logical plan to a command at the analyzer phase by `ResolveSessionCatalog` - v2 command: We resolve the logical plan to a v2 command at the physical phase by `DataSourceV2Strategy` For cost explain mode, we call `LogicalPlanStats.stats` on the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains a `LeafNode` which does not override `computeStats`. As a result, there is an error when running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add test. Closes #36488 from ulysses-you/SPARK-39112. 
Authored-by: ulysses-you Signed-off-by: Wenchen Fan (cherry picked from commit 06fd340daefd67a3e96393539401c9bf4b3cbde9) Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/v2ResolutionPlans.scala | 28 -- .../scala/org/apache/spark/sql/ExplainSuite.scala | 15 2 files changed, 36 insertions(+), 7 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala index 4cffead93b2..a87f9e0082d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.analysis import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{Attribute, LeafExpression, Unevaluable} -import org.apache.spark.sql.catalyst.plans.logical.LeafNode +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} import org.apache.spark.sql.catalyst.trees.TreePattern.{TreePattern, UNRESOLVED_FUNC} import org.apache.spark.sql.catalyst.util.CharVarcharUtils import org.apache.spark.sql.connector.catalog.{CatalogPlugin, FunctionCatalog,
[spark] branch master updated: [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 06fd340daef [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost 06fd340daef is described below commit 06fd340daefd67a3e96393539401c9bf4b3cbde9 Author: ulysses-you AuthorDate: Wed May 11 14:32:05 2022 +0800 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the following plans: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in the 3.3.0 branch via `spark.sql.legacy.useV1Command`. However, this is a behavior change between v1 and v2 commands. - v1 command: We resolve the logical plan to a command at the analyzer phase by `ResolveSessionCatalog` - v2 command: We resolve the logical plan to a v2 command at the physical phase by `DataSourceV2Strategy` For cost explain mode, we call `LogicalPlanStats.stats` on the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains a `LeafNode` which does not override `computeStats`. As a result, there is an error when running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add test. Closes #36488 from ulysses-you/SPARK-39112. 
Authored-by: ulysses-you Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/v2ResolutionPlans.scala | 28 -- .../scala/org/apache/spark/sql/ExplainSuite.scala | 15 2 files changed, 36 insertions(+), 7 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala index 4cffead93b2..a87f9e0082d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.analysis import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{Attribute, LeafExpression, Unevaluable} -import org.apache.spark.sql.catalyst.plans.logical.LeafNode +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} import org.apache.spark.sql.catalyst.trees.TreePattern.{TreePattern, UNRESOLVED_FUNC} import org.apache.spark.sql.catalyst.util.CharVarcharUtils import org.apache.spark.sql.connector.catalog.{CatalogPlugin, FunctionCatalog, Identifier, Table, TableCatalog} @@ -140,11 +140,19 @@ case class UnresolvedDBObjectName(nameParts: Seq[String],
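A self-contained sketch of the idea behind the fix, using simplified stand-in types (the real trait lives in Catalyst and is mixed into the resolved plans listed above; the dummy value chosen here is an assumption for illustration):

```scala
// Analysis-only leaf plans report placeholder statistics instead of throwing
// from the default computeStats, so cost-mode explain cannot blow up on them.
final case class Statistics(sizeInBytes: BigInt)

trait LeafNode {
  // Mirrors Catalyst's default behaviour that triggered the exception.
  def computeStats(): Statistics =
    throw new UnsupportedOperationException("this leaf node does not compute stats")
}

trait LeafNodeWithoutStats extends LeafNode {
  // These plans never survive past analysis, so a dummy value is safe.
  override def computeStats(): Statistics = Statistics(sizeInBytes = 1)
}

object ResolvedNamespaceSketch extends LeafNodeWithoutStats

object CostExplainDemo {
  def main(args: Array[String]): Unit = {
    // Cost-mode explain can now ask such a leaf for stats without throwing.
    println(ResolvedNamespaceSketch.computeStats()) // Statistics(1)
  }
}
```

The design choice is to keep the default `computeStats` strict for ordinary leaves (where missing stats indicate a real bug) and opt out only for the resolved v2 plans that exist purely during analysis.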