[spark] branch master updated: [SPARK-34309][BUILD][FOLLOWUP] Upgrade Caffeine to 2.9.2

2021-08-17 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 281b00a  [SPARK-34309][BUILD][FOLLOWUP] Upgrade Caffeine to 2.9.2
281b00a is described below

commit 281b00ab5b3dd3f21dd6af020ad5455f35498b79
Author: Kousuke Saruta 
AuthorDate: Wed Aug 18 13:40:52 2021 +0900

[SPARK-34309][BUILD][FOLLOWUP] Upgrade Caffeine to 2.9.2

### What changes were proposed in this pull request?

This PR upgrades Caffeine to `2.9.2`.
Caffeine was introduced in SPARK-34309 (#31517). At the time that PR was 
opened, the latest version of caffeine was `2.9.1` but now `2.9.2` is available.

### Why are the changes needed?

`2.9.2` has the following improvements 
(https://github.com/ben-manes/caffeine/releases/tag/v2.9.2).

* Fixed reading an intermittent null weak/soft value during a concurrent 
write
* Fixed extraneous eviction when concurrently removing a collected entry 
after a writer resurrects it with a new mapping
* Fixed excessive retries of discarding an expired entry when the fixed 
duration period is extended, thereby resurrecting it
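
For context, here is a minimal sketch (not taken from Spark's code) of the kind of Caffeine cache configuration these fixes concern — weak values combined with a fixed write expiry — using Caffeine's public builder API:

```scala
import java.util.concurrent.TimeUnit
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Illustrative only: a weak-value cache with a fixed write expiry, the kind of
// configuration affected by the 2.9.2 concurrency and expiration fixes above.
val cache: Cache[String, AnyRef] = Caffeine.newBuilder()
  .weakValues()
  .expireAfterWrite(10, TimeUnit.MINUTES)
  .maximumSize(10000)
  .build[String, AnyRef]()

cache.put("plan", "cached-value")
assert(cache.getIfPresent("plan") == "cached-value")
```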

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33772 from sarutak/upgrade-caffeine-2.9.2.

Authored-by: Kousuke Saruta 
Signed-off-by: Kousuke Saruta 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 2 +-
 pom.xml | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 1dc01b5..31dd02f 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -30,7 +30,7 @@ blas/2.2.0//blas-2.2.0.jar
 bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
 breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar
 breeze_2.12/1.2//breeze_2.12-1.2.jar
-caffeine/2.9.1//caffeine-2.9.1.jar
+caffeine/2.9.2//caffeine-2.9.2.jar
 cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar
 checker-qual/3.10.0//checker-qual-3.10.0.jar
 chill-java/0.10.0//chill-java-0.10.0.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index 698a03c..5b27680 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -25,7 +25,7 @@ blas/2.2.0//blas-2.2.0.jar
 bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
 breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar
 breeze_2.12/1.2//breeze_2.12-1.2.jar
-caffeine/2.9.1//caffeine-2.9.1.jar
+caffeine/2.9.2//caffeine-2.9.2.jar
 cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar
 checker-qual/3.10.0//checker-qual-3.10.0.jar
 chill-java/0.10.0//chill-java-0.10.0.jar
diff --git a/pom.xml b/pom.xml
index bd1722f..1452b0b 100644
--- a/pom.xml
+++ b/pom.xml
@@ -182,7 +182,7 @@
 2.6.2
 4.1.17
 14.0.1
-2.9.1
+2.9.2
 3.0.16
 2.34
 2.10.10




[spark] branch branch-3.1 updated: [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize spark.sql.redaction.string.regex

2021-08-17 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 31d771d  [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer 
recognize spark.sql.redaction.string.regex
31d771d is described below

commit 31d771dcf242cfa477b04f28950526bf87b7e90a
Author: Kousuke Saruta 
AuthorDate: Wed Aug 18 13:31:22 2021 +0900

[SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize 
spark.sql.redaction.string.regex

### What changes were proposed in this pull request?

This PR fixes an issue where ThriftServer doesn't recognize 
`spark.sql.redaction.string.regex`.
As a result, sensitive information included in queries can be exposed.

![thrift-password1](https://user-images.githubusercontent.com/4736016/129440772-46379cc5-987b-41ac-adce-aaf2139f6955.png)

![thrift-password2](https://user-images.githubusercontent.com/4736016/129440775-fd328c0f-d128-4a20-82b0-46c331b9fd64.png)

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connected to it, and executed `CREATE TABLE mytbl2(a int) 
OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", 
dbtable="test_tbl", user="test_usr", password="abcde");` with 
`spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`.
Then confirmed the UI.


![thrift-hide-password1](https://user-images.githubusercontent.com/4736016/129440863-cabea247-d51f-41a4-80ac-6c64141e1fb7.png)

![thrift-hide-password2](https://user-images.githubusercontent.com/4736016/129440874-96cd0f0c-720b-4010-968a-cffbc85d2be5.png)
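
For context, a hedged sketch of the redaction behaviour this patch enables. The helper below is only an illustrative stand-in for the `SparkUtils.redact(...)` call applied in the diff below; the pattern and statement mirror the test described above:

```scala
import scala.util.matching.Regex

// Illustrative stand-in: replace every match of the configured pattern with a
// mask before the statement is handed to the UI event manager.
def redactStatement(pattern: Option[Regex], text: String): String =
  pattern.map(_.replaceAllIn(text, "*********(redacted)")).getOrElse(text)

val pattern = Some("""((?i)(?<=password=))(".*")|('.*')""".r)
val statement =
  """CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", password="abcde")"""

// Prints the statement with the quoted password masked, which is roughly what
// the ThriftServer UI page shows after this change.
println(redactStatement(pattern, statement))
```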

Closes #33743 from sarutak/thrift-redact.

Authored-by: Kousuke Saruta 
Signed-off-by: Kousuke Saruta 
(cherry picked from commit b914ff7d54bd7c07e7313bb06a1fa22c36b628d2)
Signed-off-by: Kousuke Saruta 
---
 .../spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
index f7a4be9..acb00e4 100644
--- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
+++ 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
@@ -220,10 +220,11 @@ private[hive] class SparkExecuteStatementOperation(
   override def runInternal(): Unit = {
 setState(OperationState.PENDING)
 logInfo(s"Submitting query '$statement' with $statementId")
+val redactedStatement = 
SparkUtils.redact(sqlContext.conf.stringRedactionPattern, statement)
 HiveThriftServer2.eventManager.onStatementStart(
   statementId,
   parentSession.getSessionHandle.getSessionId.toString,
-  statement,
+  redactedStatement,
   statementId,
   parentSession.getUsername)
 setHasResultSet(true) // avoid no resultset for async run




[spark] branch branch-3.2 updated: [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize spark.sql.redaction.string.regex

2021-08-17 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new b749b49  [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer 
recognize spark.sql.redaction.string.regex
b749b49 is described below

commit b749b49a283800d3e12455a00a23da24bf6cd333
Author: Kousuke Saruta 
AuthorDate: Wed Aug 18 13:31:22 2021 +0900

[SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize 
spark.sql.redaction.string.regex

### What changes were proposed in this pull request?

This PR fixes an issue where ThriftServer doesn't recognize 
`spark.sql.redaction.string.regex`.
As a result, sensitive information included in queries can be exposed.

![thrift-password1](https://user-images.githubusercontent.com/4736016/129440772-46379cc5-987b-41ac-adce-aaf2139f6955.png)

![thrift-password2](https://user-images.githubusercontent.com/4736016/129440775-fd328c0f-d128-4a20-82b0-46c331b9fd64.png)

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connected to it, and executed `CREATE TABLE mytbl2(a int) 
OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", 
dbtable="test_tbl", user="test_usr", password="abcde");` with 
`spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`.
Then confirmed the UI.


![thrift-hide-password1](https://user-images.githubusercontent.com/4736016/129440863-cabea247-d51f-41a4-80ac-6c64141e1fb7.png)

![thrift-hide-password2](https://user-images.githubusercontent.com/4736016/129440874-96cd0f0c-720b-4010-968a-cffbc85d2be5.png)

Closes #33743 from sarutak/thrift-redact.

Authored-by: Kousuke Saruta 
Signed-off-by: Kousuke Saruta 
(cherry picked from commit b914ff7d54bd7c07e7313bb06a1fa22c36b628d2)
Signed-off-by: Kousuke Saruta 
---
 .../spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
index f43f8e7..0df5885 100644
--- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
+++ 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
@@ -186,10 +186,11 @@ private[hive] class SparkExecuteStatementOperation(
   override def runInternal(): Unit = {
 setState(OperationState.PENDING)
 logInfo(s"Submitting query '$statement' with $statementId")
+val redactedStatement = 
SparkUtils.redact(sqlContext.conf.stringRedactionPattern, statement)
 HiveThriftServer2.eventManager.onStatementStart(
   statementId,
   parentSession.getSessionHandle.getSessionId.toString,
-  statement,
+  redactedStatement,
   statementId,
   parentSession.getUsername)
 setHasResultSet(true) // avoid no resultset for async run




[spark] branch master updated: [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize spark.sql.redaction.string.regex

2021-08-17 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b914ff7  [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer 
recognize spark.sql.redaction.string.regex
b914ff7 is described below

commit b914ff7d54bd7c07e7313bb06a1fa22c36b628d2
Author: Kousuke Saruta 
AuthorDate: Wed Aug 18 13:31:22 2021 +0900

[SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize 
spark.sql.redaction.string.regex

### What changes were proposed in this pull request?

This PR fixes an issue where ThriftServer doesn't recognize 
`spark.sql.redaction.string.regex`.
As a result, sensitive information included in queries can be exposed.

![thrift-password1](https://user-images.githubusercontent.com/4736016/129440772-46379cc5-987b-41ac-adce-aaf2139f6955.png)

![thrift-password2](https://user-images.githubusercontent.com/4736016/129440775-fd328c0f-d128-4a20-82b0-46c331b9fd64.png)

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connected to it, and executed `CREATE TABLE mytbl2(a int) 
OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", 
dbtable="test_tbl", user="test_usr", password="abcde");` with 
`spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`.
Then confirmed the UI.


![thrift-hide-password1](https://user-images.githubusercontent.com/4736016/129440863-cabea247-d51f-41a4-80ac-6c64141e1fb7.png)

![thrift-hide-password2](https://user-images.githubusercontent.com/4736016/129440874-96cd0f0c-720b-4010-968a-cffbc85d2be5.png)

Closes #33743 from sarutak/thrift-redact.

Authored-by: Kousuke Saruta 
Signed-off-by: Kousuke Saruta 
---
 .../spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
index f43f8e7..0df5885 100644
--- 
a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
+++ 
b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
@@ -186,10 +186,11 @@ private[hive] class SparkExecuteStatementOperation(
   override def runInternal(): Unit = {
 setState(OperationState.PENDING)
 logInfo(s"Submitting query '$statement' with $statementId")
+val redactedStatement = 
SparkUtils.redact(sqlContext.conf.stringRedactionPattern, statement)
 HiveThriftServer2.eventManager.onStatementStart(
   statementId,
   parentSession.getSessionHandle.getSessionId.toString,
-  statement,
+  redactedStatement,
   statementId,
   parentSession.getUsername)
 setHasResultSet(true) // avoid no resultset for async run




[spark] branch branch-3.2 updated: [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version

2021-08-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 528fca8  [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of 
pkg_resources.parse_version
528fca8 is described below

commit 528fca8944036ebd7ded3be8fbb799de080f663a
Author: Takuya UESHIN 
AuthorDate: Wed Aug 18 10:36:09 2021 +0900

[SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of 
pkg_resources.parse_version

### What changes were proposed in this pull request?

This is a follow-up of #33687.

Use `LooseVersion` instead of `pkg_resources.parse_version`.

### Why are the changes needed?

In the previous PR, `pkg_resources.parse_version` was used, but we should 
use `LooseVersion` instead to be consistent with the rest of the code base.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.

Authored-by: Takuya UESHIN 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 7fb8ea319e4931f7721ac6f9c12100c95d252cd2)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/groupby.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 2daf80f..70ece9c 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -26,7 +26,6 @@ from collections import OrderedDict, namedtuple
 from distutils.version import LooseVersion
 from functools import partial
 from itertools import product
-from pkg_resources import parse_version  # type: ignore
 from typing import (
 Any,
 Callable,
@@ -47,7 +46,7 @@ from typing import (
 import pandas as pd
 from pandas.api.types import is_hashable, is_list_like
 
-if parse_version(pd.__version__) >= parse_version("1.3.0"):
+if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
 from pandas.core.common import _builtin_table
 else:
 from pandas.core.base import SelectionMixin




[spark] branch master updated: [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version

2021-08-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7fb8ea3  [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of 
pkg_resources.parse_version
7fb8ea3 is described below

commit 7fb8ea319e4931f7721ac6f9c12100c95d252cd2
Author: Takuya UESHIN 
AuthorDate: Wed Aug 18 10:36:09 2021 +0900

[SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of 
pkg_resources.parse_version

### What changes were proposed in this pull request?

This is a follow-up of #33687.

Use `LooseVersion` instead of `pkg_resources.parse_version`.

### Why are the changes needed?

In the previous PR, `pkg_resources.parse_version` was used, but we should 
use `LooseVersion` instead to be consistent with the rest of the code base.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.

Authored-by: Takuya UESHIN 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/groupby.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 1ced2ce..beb36e6 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -26,7 +26,6 @@ from collections import OrderedDict, namedtuple
 from distutils.version import LooseVersion
 from functools import partial
 from itertools import product
-from pkg_resources import parse_version  # type: ignore
 from typing import (
 Any,
 Callable,
@@ -47,7 +46,7 @@ from typing import (
 import pandas as pd
 from pandas.api.types import is_hashable, is_list_like
 
-if parse_version(pd.__version__) >= parse_version("1.3.0"):
+if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
 from pandas.core.common import _builtin_table
 else:
 from pandas.core.base import SelectionMixin




[spark] branch branch-3.2 updated: [SPARK-36535][SQL] Refine the sql reference doc

2021-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 5107ad3  [SPARK-36535][SQL] Refine the sql reference doc
5107ad3 is described below

commit 5107ad3157c07c91fec2e30fc97e72684b84cf14
Author: Wenchen Fan 
AuthorDate: Tue Aug 17 12:46:38 2021 -0700

[SPARK-36535][SQL] Refine the sql reference doc

### What changes were proposed in this pull request?

Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#` in `sql-ref-literals.md`

### Why are the changes needed?

The subitems in the sidebar are largely useless, as the menu page serves the 
same purpose:
https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png
It's also extra work to keep the menu page and sidebar subitems in sync 
(the ANSI compliance page is already out of sync).

The sub-menu pages are only referenced from the sidebar and duplicate the 
content of the menu page. As a result, `sql-ref-syntax-aux.md` is already 
outdated compared to the menu page. It's easier to just look at the menu page.

The `#` is not rendered properly:
https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png
It's better to avoid using it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #33767 from cloud-fan/doc.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 4b015e8d7d6f5972341104f2a359bb9d09c4385b)
Signed-off-by: Dongjoon Hyun 
---
 docs/_data/menu-sql.yaml  | 187 +-
 docs/sql-ref-literals.md  |  42 -
 docs/sql-ref-syntax-aux.md|  29 --
 docs/sql-ref-syntax-ddl.md|  37 
 docs/sql-ref-syntax-dml-insert.md |  27 --
 docs/sql-ref-syntax-dml.md|  25 -
 docs/sql-ref-syntax-qry.md|  53 ---
 docs/sql-ref-syntax.md|  12 +++
 8 files changed, 34 insertions(+), 378 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index e7b22c4..22e01df 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -75,28 +75,12 @@
   subitems:
 - text: ANSI Compliance
   url: sql-ref-ansi-compliance.html
-  subitems:
-- text: Arithmetic Operations
-  url: sql-ref-ansi-compliance.html#arithmetic-operations
-- text: Type Conversion
-  url: sql-ref-ansi-compliance.html#type-conversion
-- text: SQL Keywords
-  url: sql-ref-ansi-compliance.html#sql-keywords
 - text: Data Types
   url: sql-ref-datatypes.html
 - text: Datetime Pattern
   url: sql-ref-datetime-pattern.html
 - text: Functions
   url: sql-ref-functions.html
-  subitems:
-  - text: Built-in Functions
-url: sql-ref-functions-builtin.html
-  - text: Scalar UDFs (User-Defined Functions)
-url: sql-ref-functions-udf-scalar.html
-  - text: UDAFs (User-Defined Aggregate Functions)
-url: sql-ref-functions-udf-aggregate.html
-  - text: Integration with Hive UDFs/UDAFs/UDTFs
-url: sql-ref-functions-udf-hive.html
 - text: Identifiers
   url: sql-ref-identifier.html
 - text: Literals
@@ -107,173 +91,10 @@
   url: sql-ref-syntax.html
   subitems:
 - text: Data Definition Statements
-  url: sql-ref-syntax-ddl.html
-  subitems:
-- text: ALTER DATABASE
-  url: sql-ref-syntax-ddl-alter-database.html
-- text: ALTER TABLE
-  url: sql-ref-syntax-ddl-alter-table.html
-- text: ALTER VIEW
-  url: sql-ref-syntax-ddl-alter-view.html
-- text: CREATE DATABASE
-  url: sql-ref-syntax-ddl-create-database.html
-- text: CREATE FUNCTION
-  url: sql-ref-syntax-ddl-create-function.html
-- text: CREATE TABLE
-  url: sql-ref-syntax-ddl-create-table.html
-- text: CREATE VIEW
-  url: sql-ref-syntax-ddl-create-view.html
-- text: DROP DATABASE
-  url: sql-ref-syntax-ddl-drop-database.html
-- text: DROP FUNCTION
-  url: sql-ref-syntax-ddl-drop-function.html
-- text: DROP TABLE
-  url: sql-ref-syntax-ddl-drop-table.html
-- text: DROP VIEW
-  url: sql-ref-syntax-ddl-drop-view.html
-- text: TRUNCATE TABLE
-  url: sql-ref-syntax-ddl-truncate-table.html
-- text: REPAIR TABLE
-  url: sql-ref-syntax-

[spark] branch master updated: [SPARK-36535][SQL] Refine the sql reference doc

2021-08-17 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4b015e8  [SPARK-36535][SQL] Refine the sql reference doc
4b015e8 is described below

commit 4b015e8d7d6f5972341104f2a359bb9d09c4385b
Author: Wenchen Fan 
AuthorDate: Tue Aug 17 12:46:38 2021 -0700

[SPARK-36535][SQL] Refine the sql reference doc

### What changes were proposed in this pull request?

Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#` in `sql-ref-literals.md`

### Why are the changes needed?

The subitems in the sidebar are largely useless, as the menu page serves the 
same purpose:
https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png
It's also extra work to keep the menu page and sidebar subitems in sync 
(the ANSI compliance page is already out of sync).

The sub-menu pages are only referenced from the sidebar and duplicate the 
content of the menu page. As a result, `sql-ref-syntax-aux.md` is already 
outdated compared to the menu page. It's easier to just look at the menu page.

The `#` is not rendered properly:
https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png
It's better to avoid using it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #33767 from cloud-fan/doc.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 docs/_data/menu-sql.yaml  | 187 +-
 docs/sql-ref-literals.md  |  42 -
 docs/sql-ref-syntax-aux.md|  29 --
 docs/sql-ref-syntax-ddl.md|  37 
 docs/sql-ref-syntax-dml-insert.md |  27 --
 docs/sql-ref-syntax-dml.md|  25 -
 docs/sql-ref-syntax-qry.md|  53 ---
 docs/sql-ref-syntax.md|  12 +++
 8 files changed, 34 insertions(+), 378 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index e7b22c4..22e01df 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -75,28 +75,12 @@
   subitems:
 - text: ANSI Compliance
   url: sql-ref-ansi-compliance.html
-  subitems:
-- text: Arithmetic Operations
-  url: sql-ref-ansi-compliance.html#arithmetic-operations
-- text: Type Conversion
-  url: sql-ref-ansi-compliance.html#type-conversion
-- text: SQL Keywords
-  url: sql-ref-ansi-compliance.html#sql-keywords
 - text: Data Types
   url: sql-ref-datatypes.html
 - text: Datetime Pattern
   url: sql-ref-datetime-pattern.html
 - text: Functions
   url: sql-ref-functions.html
-  subitems:
-  - text: Built-in Functions
-url: sql-ref-functions-builtin.html
-  - text: Scalar UDFs (User-Defined Functions)
-url: sql-ref-functions-udf-scalar.html
-  - text: UDAFs (User-Defined Aggregate Functions)
-url: sql-ref-functions-udf-aggregate.html
-  - text: Integration with Hive UDFs/UDAFs/UDTFs
-url: sql-ref-functions-udf-hive.html
 - text: Identifiers
   url: sql-ref-identifier.html
 - text: Literals
@@ -107,173 +91,10 @@
   url: sql-ref-syntax.html
   subitems:
 - text: Data Definition Statements
-  url: sql-ref-syntax-ddl.html
-  subitems:
-- text: ALTER DATABASE
-  url: sql-ref-syntax-ddl-alter-database.html
-- text: ALTER TABLE
-  url: sql-ref-syntax-ddl-alter-table.html
-- text: ALTER VIEW
-  url: sql-ref-syntax-ddl-alter-view.html
-- text: CREATE DATABASE
-  url: sql-ref-syntax-ddl-create-database.html
-- text: CREATE FUNCTION
-  url: sql-ref-syntax-ddl-create-function.html
-- text: CREATE TABLE
-  url: sql-ref-syntax-ddl-create-table.html
-- text: CREATE VIEW
-  url: sql-ref-syntax-ddl-create-view.html
-- text: DROP DATABASE
-  url: sql-ref-syntax-ddl-drop-database.html
-- text: DROP FUNCTION
-  url: sql-ref-syntax-ddl-drop-function.html
-- text: DROP TABLE
-  url: sql-ref-syntax-ddl-drop-table.html
-- text: DROP VIEW
-  url: sql-ref-syntax-ddl-drop-view.html
-- text: TRUNCATE TABLE
-  url: sql-ref-syntax-ddl-truncate-table.html
-- text: REPAIR TABLE
-  url: sql-ref-syntax-ddl-repair-table.html
-- text: USE DATABASE
-  url: sql-ref-syntax-ddl-usedb.html
+

[spark] branch branch-3.2 updated: [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined

2021-08-17 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new e15daa3  [SPARK-36370][PYTHON] _builtin_table directly imported from 
pandas instead of being redefined
e15daa3 is described below

commit e15daa31b36669a7e29367e385f28b6ba25acf09
Author: Cedric-Magnan 
AuthorDate: Tue Aug 17 10:46:49 2021 -0700

[SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead 
of being redefined

### What changes were proposed in this pull request?
This refactors the way `_builtin_table` is defined in the 
`python/pyspark/pandas/groupby.py` module.
Pandas has recently refactored where `_builtin_table` lives: it is now part of the 
`pandas.core.common` module instead of being an attribute of the 
`pandas.core.base.SelectionMixin` class.

### Why are the changes needed?
This change is not strictly needed, but the current implementation redefines 
this table within pyspark, so any changes to this table in the pandas library 
would need to be mirrored in the pyspark repository as well.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the following command successfully:
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds

Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.

Authored-by: Cedric-Magnan 
Signed-off-by: Takuya UESHIN 
(cherry picked from commit 964dfe254ff8ebf9d7f5c7115ff8f79da3f28261)
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/groupby.py | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 376592d..2daf80f 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -20,13 +20,13 @@ A wrapper for GroupedData to behave similar to pandas 
GroupBy.
 """
 
 from abc import ABCMeta, abstractmethod
-import builtins
 import sys
 import inspect
 from collections import OrderedDict, namedtuple
 from distutils.version import LooseVersion
 from functools import partial
 from itertools import product
+from pkg_resources import parse_version  # type: ignore
 from typing import (
 Any,
 Callable,
@@ -44,10 +44,16 @@ from typing import (
 TYPE_CHECKING,
 )
 
-import numpy as np
 import pandas as pd
 from pandas.api.types import is_hashable, is_list_like
 
+if parse_version(pd.__version__) >= parse_version("1.3.0"):
+from pandas.core.common import _builtin_table
+else:
+from pandas.core.base import SelectionMixin
+
+_builtin_table = SelectionMixin._builtin_table
+
 from pyspark.sql import Column, DataFrame as SparkDataFrame, Window, functions 
as F
 from pyspark.sql.types import (  # noqa: F401
 DataType,
@@ -97,12 +103,6 @@ if TYPE_CHECKING:
 # to keep it the same as pandas
 NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
 
-_builtin_table = {
-builtins.sum: np.sum,
-builtins.max: np.max,
-builtins.min: np.min,
-}  # type: Dict[Callable, Callable]
-
 
 class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
 """




[spark] branch master updated (c0441bb -> 964dfe2)

2021-08-17 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c0441bb  [SPARK-36387][PYTHON] Fix Series.astype from datetime to 
nullable string
 add 964dfe2  [SPARK-36370][PYTHON] _builtin_table directly imported from 
pandas instead of being redefined

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/groupby.py | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)




[spark] branch master updated: [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string

2021-08-17 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c0441bb  [SPARK-36387][PYTHON] Fix Series.astype from datetime to 
nullable string
c0441bb is described below

commit c0441bb7e83e83e3240bf7e2991de34b01a182f5
Author: itholic 
AuthorDate: Tue Aug 17 10:29:16 2021 -0700

[SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string

### What changes were proposed in this pull request?

This PR proposes to fix `Series.astype` when converting datetime type to 
StringDtype, to match the behavior of pandas 1.3.

In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], 
name="datetime").astype("string")
02020-10-27 00:00:01
1NaT
Name: datetime, dtype: string
```

This is changed to

```python
>>> pd.Series(["2020-10-27 00:00:01", None], 
name="datetime").astype("string")
02020-10-27 00:00:01
1   
Name: datetime, dtype: string
```

in pandas >= 1.3, so we follow the behavior of the latest pandas.

### Why are the changes needed?

Because pandas-on-Spark always follows the behavior of the latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed to match the latest pandas when converting datetime to 
nullable string (StringDtype).

### How was this patch tested?

Unittest passed

Closes #33735 from itholic/SPARK-36387.

Authored-by: itholic 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/data_type_ops/base.py |  2 +-
 python/pyspark/pandas/data_type_ops/datetime_ops.py | 19 ---
 python/pyspark/pandas/tests/test_series.py  |  8 +---
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/base.py 
b/python/pyspark/pandas/data_type_ops/base.py
index c69715f..b4c8c3e 100644
--- a/python/pyspark/pandas/data_type_ops/base.py
+++ b/python/pyspark/pandas/data_type_ops/base.py
@@ -155,7 +155,7 @@ def _as_string_type(
 index_ops: IndexOpsLike, dtype: Union[str, type, Dtype], *, null_str: str 
= str(None)
 ) -> IndexOpsLike:
 """Cast `index_ops` to StringType Spark type, given `dtype` and `null_str`,
-representing null Spark column.
+representing null Spark column. Note that `null_str` is for non-extension 
dtypes only.
 """
 spark_type = StringType()
 if isinstance(dtype, extension_dtypes):
diff --git a/python/pyspark/pandas/data_type_ops/datetime_ops.py 
b/python/pyspark/pandas/data_type_ops/datetime_ops.py
index 071c22e..63d817b 100644
--- a/python/pyspark/pandas/data_type_ops/datetime_ops.py
+++ b/python/pyspark/pandas/data_type_ops/datetime_ops.py
@@ -23,7 +23,7 @@ import numpy as np
 import pandas as pd
 from pandas.api.types import CategoricalDtype
 
-from pyspark.sql import functions as F, Column
+from pyspark.sql import Column
 from pyspark.sql.types import BooleanType, LongType, StringType, TimestampType
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
@@ -33,10 +33,11 @@ from pyspark.pandas.data_type_ops.base import (
 _as_bool_type,
 _as_categorical_type,
 _as_other_type,
+_as_string_type,
 _sanitize_list_like,
 )
 from pyspark.pandas.spark import functions as SF
-from pyspark.pandas.typedef import extension_dtypes, pandas_on_spark_type
+from pyspark.pandas.typedef import pandas_on_spark_type
 
 
 class DatetimeOps(DataTypeOps):
@@ -133,18 +134,6 @@ class DatetimeOps(DataTypeOps):
 elif isinstance(spark_type, BooleanType):
 return _as_bool_type(index_ops, dtype)
 elif isinstance(spark_type, StringType):
-if isinstance(dtype, extension_dtypes):
-# seems like a pandas' bug?
-scol = F.when(index_ops.spark.column.isNull(), 
str(pd.NaT)).otherwise(
-index_ops.spark.column.cast(spark_type)
-)
-else:
-null_str = str(pd.NaT)
-casted = index_ops.spark.column.cast(spark_type)
-scol = F.when(index_ops.spark.column.isNull(), 
null_str).otherwise(casted)
-return index_ops._with_new_scol(
-scol.alias(index_ops._internal.data_spark_column_names[0]),
-field=index_ops._internal.data_fields[0].copy(dtype=dtype, 
spark_type=spark_type),
-)
+return _as_string_type(index_ops, dtype, null_str=str(pd.NaT))
 else:
 return _as_other_type(index_ops, dtype, spark_type)
diff --git a/python/pyspark/pandas/tests/test_series.py 
b/python/pyspark/pandas/tests/test_series.py
index d9ba3c76..58c87ed 100644
--- a/python/pyspark/pandas/tests/test_series.py
+++ b/python/pyspark/pandas/tests/t

[spark] branch branch-3.2 updated: Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"

2021-08-17 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 70635b4  Revert "[SPARK-35028][SQL] ANSI mode: disallow group by 
aliases"
70635b4 is described below

commit 70635b4b2633be544563c1cb00e6333fdb1f3782
Author: Gengliang Wang 
AuthorDate: Tue Aug 17 20:23:49 2021 +0800

Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"

### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users rely on the group-by-alias feature. Spark has its 
own precedence rule when an alias name conflicts with a column name in the GROUP BY 
clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMSs such as PostgreSQL and MySQL allow grouping by aliases, 
too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing 
the group by alias in ANSI mode.
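
A hedged illustration of that precedence rule (assumes an existing `spark` SparkSession; the table, column, and data are made up):

```scala
// When the alias `a` collides with the column `a`, GROUP BY resolves to the
// table column t.a, so the query groups on the original values 0, 1, 2 rather
// than on the projected expression (a + 1).
spark.range(3).selectExpr("id AS a").createOrReplaceTempView("t")
spark.sql("SELECT a + 1 AS a, count(*) AS cnt FROM t GROUP BY a").show()
```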

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
(cherry picked from commit 8bfb4f1e72f33205b94957f7dacf298b0c8bde17)
Signed-off-by: Gengliang Wang 
---
 docs/sql-ref-ansi-compliance.md|1 -
 .../spark/sql/catalyst/analysis/Analyzer.scala |2 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|   27 +-
 .../sql-tests/inputs/ansi/group-analytics.sql  |1 -
 .../sql-tests/results/ansi/group-analytics.sql.out | 1293 
 5 files changed, 14 insertions(+), 1310 deletions(-)

diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index a647abc..f0e1066 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -255,7 +255,6 @@ The behavior of some SQL functions can be different under 
ANSI mode (`spark.sql.
 The behavior of some SQL operators can be different under ANSI mode 
(`spark.sql.ansi.enabled=true`).
   - `array_col[index]`: This operator throws `ArrayIndexOutOfBoundsException` 
if using invalid indices.
   - `map_col[key]`: This operator throws `NoSuchElementException` if key does 
not exist in map.
-  - `GROUP BY`: aliases in a select list can not be used in GROUP BY clauses. 
Each column referenced in a GROUP BY clause shall unambiguously reference a 
column of the table resulting from the FROM clause.
 
 ### Useful Functions for ANSI Mode
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 2f0a709..92018eb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1951,7 +1951,7 @@ class Analyzer(override val catalogManager: 
CatalogManager)
   // mayResolveAttrByAggregateExprs requires the TreePattern 
UNRESOLVED_ATTRIBUTE.
   _.containsAllPatterns(AGGREGATE, UNRESOLVED_ATTRIBUTE), ruleId) {
   case agg @ Aggregate(groups, aggs, child)
-  if allowGroupByAlias && child.resolved && aggs.forall(_.resolved) &&
+  if conf.groupByAliases && child.resolved && aggs.forall(_.resolved) 
&&
 groups.exists(!_.resolved) =>
 agg.copy(groupingExpressions = mayResolveAttrByAggregateExprs(groups, 
aggs, child))
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 555242f..6869977 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -240,17 +240,6 @@ object SQLConf {
 .intConf
 .createWithDefault(100)
 
-  val ANSI_ENABLED = buildConf("spark.sql.ansi.enabled")
-.doc("When true, Spark SQL uses an ANSI compliant dialect instead of being 
Hive compliant. " +
-  "For example, Spark will throw an exception at runtime instead of 
returning null results " +
-  "when the inputs to a SQL operator/function are invalid." +
-  "For full details of this dialect, you can find them in the section 
\"ANSI Compliance\" of " +
-  "Spark's documentation. Some ANSI dialect features may be not from the 
ANSI SQL " +
-  "standard directly, but their behaviors align with ANSI SQL's style")
-.version("3.0.0")
-.booleanConf
-.createWithDefault(false)
-
   val OPTIMIZER_EXCLUDED_RULES = buildConf("spar

[spark] branch master updated (82a3150 -> 8bfb4f1)

2021-08-17 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 82a3150  [SPARK-36524][SQL] Common class for ANSI interval types
 add 8bfb4f1  Revert "[SPARK-35028][SQL] ANSI mode: disallow group by 
aliases"

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-ansi-compliance.md|1 -
 .../spark/sql/catalyst/analysis/Analyzer.scala |2 +-
 .../org/apache/spark/sql/internal/SQLConf.scala|   27 +-
 .../sql-tests/inputs/ansi/group-analytics.sql  |1 -
 .../sql-tests/results/ansi/group-analytics.sql.out | 1293 
 5 files changed, 14 insertions(+), 1310 deletions(-)
 delete mode 100644 
sql/core/src/test/resources/sql-tests/inputs/ansi/group-analytics.sql
 delete mode 100644 
sql/core/src/test/resources/sql-tests/results/ansi/group-analytics.sql.out




[spark] branch branch-3.1 updated: [SPARK-36379][SQL][3.1] Null at root level of a JSON array should not fail w/ permissive mode

2021-08-17 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 32d127d  [SPARK-36379][SQL][3.1] Null at root level of a JSON array 
should not fail w/ permissive mode
32d127d is described below

commit 32d127de4a4a628276e659bd6a5d572c625ed565
Author: Hyukjin Kwon 
AuthorDate: Tue Aug 17 21:10:44 2021 +0900

[SPARK-36379][SQL][3.1] Null at root level of a JSON array should not fail 
w/ permissive mode

This PR backports https://github.com/apache/spark/pull/33608 to branch-3.1


-

### What changes were proposed in this pull request?

This PR proposes to fail properly so the JSON parser can proceed and parse the 
input in permissive mode.
Previously, we passed `null`s through as-is, the root `InternalRow`s became 
`null`s, and that caused the query to fail even with permissive mode on.
Now, we fail explicitly when the input array contains `null`.

Note that this is consistent with non-array JSON input:

**Permissive mode:**

```scala
spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
```

**Failfast mode**:

```scala
spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", 
"""null""").toDS).collect()
```
```
org.apache.spark.SparkException: Malformed records are detected in record 
parsing. Parse Mode: FAILFAST. To process malformed records as null result, try 
setting the option 'mode' as 'PERMISSIVE'.
at 
org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
at 
org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### Why are the changes needed?

To make the permissive mode proceed and parse without throwing an 
exception.

### Does this PR introduce _any_ user-facing change?

**Permissive mode:**

```scala
spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Note that this behaviour is consistent with the case where a JSON object is malformed:

```scala
spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 
123}]""").toDS).collect()
```

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Since we're parsing _one_ JSON array, related records all fail together.

**Failfast mode:**

```scala
spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, 
null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
org.apache.spark.SparkException: Malformed records are detected in record 
parsing. Parse Mode: FAILFAST. To process malformed records as null result, try 
setting the option 'mode' as 'PERMISSIVE'.
at 
org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
at 
org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### How was this patch tested?

Manually tested, and unit test was added.

Closes #33762 from HyukjinKwon/cherry-pick-SPARK-36379.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/sql/catalyst/json/JacksonParser.scala |  9 ++---
 .../spark/sql/execution/datasources/json/JsonSuite.scala   | 14 ++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index bbcf

[spark] branch branch-3.2 updated: [SPARK-36524][SQL] Common class for ANSI interval types

2021-08-17 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 07c6976  [SPARK-36524][SQL] Common class for ANSI interval types
07c6976 is described below

commit 07c6976f79e418be8aed9bed8e7b396231a27c25
Author: Max Gekk 
AuthorDate: Tue Aug 17 12:27:56 2021 +0300

[SPARK-36524][SQL] Common class for ANSI interval types

### What changes were proposed in this pull request?
Add a new type `AnsiIntervalType` to `AbstractDataType.scala`, and have 
`YearMonthIntervalType` and `DayTimeIntervalType` extend it.

### Why are the changes needed?
To improve code maintenance. The change allows replacing checks of both 
`YearMonthIntervalType` and `DayTimeIntervalType` with a single check of 
`AnsiIntervalType`, for instance:
```scala
case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
(cherry picked from commit 82a31508afffd089048e28276c75b5deb1ada47f)
Signed-off-by: Max Gekk 
---
 .../avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala | 2 +-
 .../scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala   | 8 
 .../org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala | 2 +-
 .../org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala| 2 +-
 .../org/apache/spark/sql/catalyst/expressions/arithmetic.scala| 4 ++--
 .../spark/sql/catalyst/expressions/collectionOperations.scala | 2 +-
 .../spark/sql/catalyst/expressions/datetimeExpressions.scala  | 2 +-
 .../main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala | 4 ++--
 .../main/scala/org/apache/spark/sql/types/AbstractDataType.scala  | 5 +
 .../scala/org/apache/spark/sql/types/DayTimeIntervalType.scala| 2 +-
 .../scala/org/apache/spark/sql/types/YearMonthIntervalType.scala  | 2 +-
 .../spark/sql/execution/datasources/csv/CSVFileFormat.scala   | 2 +-
 .../spark/sql/execution/datasources/json/JsonFileFormat.scala | 2 +-
 .../spark/sql/execution/datasources/orc/OrcFileFormat.scala   | 2 +-
 .../sql/execution/datasources/parquet/ParquetFileFormat.scala | 2 +-
 .../apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala  | 4 ++--
 .../spark/sql/execution/datasources/v2/json/JsonTable.scala   | 2 +-
 .../apache/spark/sql/execution/datasources/v2/orc/OrcTable.scala  | 2 +-
 .../spark/sql/execution/datasources/v2/parquet/ParquetTable.scala | 2 +-
 .../sql/hive/thriftserver/SparkExecuteStatementOperation.scala| 2 +-
 .../spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala| 5 ++---
 .../main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala  | 2 +-
 22 files changed, 33 insertions(+), 29 deletions(-)

diff --git 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
index 68b393e..5b8afe8 100644
--- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
+++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
@@ -71,7 +71,7 @@ private[sql] object AvroUtils extends Logging {
   }
 
   def supportsDataType(dataType: DataType): Boolean = dataType match {
-case _: DayTimeIntervalType | _: YearMonthIntervalType => false
+case _: AnsiIntervalType => false
 
 case _: AtomicType => true
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 468986d..2f0a709 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -377,9 +377,9 @@ class Analyzer(override val catalogManager: CatalogManager)
 TimestampAddYMInterval(r, l)
   case (CalendarIntervalType, CalendarIntervalType) |
(_: DayTimeIntervalType, _: DayTimeIntervalType) => a
-  case (_: NullType, _: DayTimeIntervalType | _: 
YearMonthIntervalType) =>
+  case (_: NullType, _: AnsiIntervalType) =>
 a.copy(left = Cast(a.left, a.right.dataType))
-  case (_: DayTimeIntervalType | _: YearMonthIntervalType, _: 
NullType) =>
+  case (_: AnsiIntervalType, _: NullType) =>
 a.copy(right = Cast(a.right, a.left.dataType))
   case (DateType, CalendarIntervalType) => DateAddInterval(l, r, 
ansiEnabled = f)
   case (_, CalendarIntervalType | _: DayTimeInter

[spark] branch master updated (ea13c5a -> 82a3150)

2021-08-17 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ea13c5a  [SPARK-36052][K8S][FOLLOWUP] Update config version to 3.2.0
 add 82a3150  [SPARK-36524][SQL] Common class for ANSI interval types

No new revisions were added by this update.

Summary of changes:
 .../avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala | 2 +-
 .../scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala   | 8 
 .../org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala | 2 +-
 .../org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala| 2 +-
 .../org/apache/spark/sql/catalyst/expressions/arithmetic.scala| 4 ++--
 .../spark/sql/catalyst/expressions/collectionOperations.scala | 2 +-
 .../spark/sql/catalyst/expressions/datetimeExpressions.scala  | 2 +-
 .../main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala | 4 ++--
 .../main/scala/org/apache/spark/sql/types/AbstractDataType.scala  | 5 +
 .../scala/org/apache/spark/sql/types/DayTimeIntervalType.scala| 2 +-
 .../scala/org/apache/spark/sql/types/YearMonthIntervalType.scala  | 2 +-
 .../spark/sql/execution/datasources/csv/CSVFileFormat.scala   | 2 +-
 .../spark/sql/execution/datasources/json/JsonFileFormat.scala | 2 +-
 .../spark/sql/execution/datasources/orc/OrcFileFormat.scala   | 2 +-
 .../sql/execution/datasources/parquet/ParquetFileFormat.scala | 2 +-
 .../apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala  | 4 ++--
 .../spark/sql/execution/datasources/v2/json/JsonTable.scala   | 2 +-
 .../apache/spark/sql/execution/datasources/v2/orc/OrcTable.scala  | 2 +-
 .../spark/sql/execution/datasources/v2/parquet/ParquetTable.scala | 2 +-
 .../sql/hive/thriftserver/SparkExecuteStatementOperation.scala| 2 +-
 .../spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala| 5 ++---
 .../main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala  | 2 +-
 22 files changed, 33 insertions(+), 29 deletions(-)
