[spark] branch master updated: [SPARK-37672][SQL] Support ANSI Aggregate Function: regr_sxx

2022-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1e110f53fc [SPARK-37672][SQL] Support ANSI Aggregate Function: regr_sxx
e1e110f53fc is described below

commit e1e110f53fc980eb30b2684544eeb97b7acd3f45
Author: Jiaan Geng 
AuthorDate: Tue Apr 19 15:35:24 2022 +0800

[SPARK-37672][SQL] Support ANSI Aggregate Function: regr_sxx

### What changes were proposed in this pull request?
This PR adds support for the ANSI aggregate function `regr_sxx`.

Mainstream databases support `regr_sxx`, as shown below:
**Teradata**
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/PBEW1OPIaxqkIf3CJfIr6A
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/regr_sxx.html
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
**DB2**
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
**H2**
http://www.h2database.com/html/functions-aggregate.html#regr_sxx
**Postgresql**
https://www.postgresql.org/docs/8.4/functions-aggregate.html
**Sybase**
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-sxx-function.html
**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm

### Why are the changes needed?
`regr_sxx` is a standard ANSI aggregate function; exposing it makes migration from other SQL systems to Spark SQL easier.

### Does this PR introduce _any_ user-facing change?
'Yes'. New feature.

### How was this patch tested?
New tests.
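
For illustration, a hypothetical usage sketch (not part of the commit) showing the new function through `spark.sql`; the input values mirror those used in the examples for the other `regr_*` functions:

```python
# Sketch: regr_sxx(y, x) returns the sum of squared deviations of the
# independent variable x over the non-null (y, x) pairs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT regr_sxx(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x)"
).show()
# x values are 2, 2, 3, 4 with mean 2.75, so the result is 2.75
# (equivalently regr_count(y, x) * var_pop(x)).
```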

Closes #34943 from beliefer/SPARK-37672.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../expressions/aggregate/CentralMomentAgg.scala   | 13 
 .../expressions/aggregate/linearRegression.scala   | 33 +++-
 .../sql-functions/sql-expression-schema.md |  3 +-
 .../test/resources/sql-tests/inputs/group-by.sql   |  6 
 .../inputs/postgreSQL/aggregates_part1.sql |  2 +-
 .../inputs/udf/postgreSQL/udf-aggregates_part1.sql |  2 +-
 .../resources/sql-tests/results/group-by.sql.out   | 35 +-
 .../results/postgreSQL/aggregates_part1.sql.out| 10 ++-
 .../udf/postgreSQL/udf-aggregates_part1.sql.out| 10 ++-
 10 files changed, 108 insertions(+), 7 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 80374f769a2..47fdca8ebe4 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -501,6 +501,7 @@ object FunctionRegistry {
 expression[RegrAvgX]("regr_avgx"),
 expression[RegrAvgY]("regr_avgy"),
 expression[RegrR2]("regr_r2"),
+expression[RegrSXX]("regr_sxx"),
 
 // string functions
 expression[Ascii]("ascii"),
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
index c5c78e5062f..a40c5e4815f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
@@ -264,6 +264,19 @@ case class VarianceSamp(
 copy(child = newChild)
 }
 
+case class RegrSXXReplacement(child: Expression)
+  extends CentralMomentAgg(child, !SQLConf.get.legacyStatisticalAggregate) {
+
+  override protected def momentOrder = 2
+
+  override val evaluateExpression: Expression = {
+    If(n === 0.0, Literal.create(null, DoubleType), m2)
+  }
+
+  override protected def withNewChildInternal(newChild: Expression): RegrSXXReplacement =
+    copy(child = newChild)
+}
+
 @ExpressionDescription(
  usage = "_FUNC_(expr) - Returns the skewness value calculated from values of a group.",
   examples = """
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
index 7463ef59c78..4c1749fa00e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
@@ -18,7 +18,7 @@
 package org.apache.spark.sql.catalyst.expre

[spark] branch master updated: [SPARK-38720][SQL][TESTS] Test the error class: CANNOT_CHANGE_DECIMAL_PRECISION

2022-04-19 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 29eea67987d [SPARK-38720][SQL][TESTS] Test the error class: CANNOT_CHANGE_DECIMAL_PRECISION
29eea67987d is described below

commit 29eea67987d4715acd040426c127a77b69ced76b
Author: panbingkun 
AuthorDate: Tue Apr 19 12:49:10 2022 +0300

[SPARK-38720][SQL][TESTS] Test the error class: CANNOT_CHANGE_DECIMAL_PRECISION

### What changes were proposed in this pull request?
This PR aims to add a test for the error class CANNOT_CHANGE_DECIMAL_PRECISION to `QueryExecutionErrorsSuite`.

### Why are the changes needed?
The changes improve test coverage, and document expected error messages in tests.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running new test:
```
$ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*"
```
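
As a complementary illustration (not part of the commit), a hypothetical PySpark sketch of how this error class surfaces to users when ANSI mode is on, using the same cast as the new test:

```python
# Sketch: casting a string that cannot be represented as DECIMAL(8, 1) under
# ANSI mode raises an error carrying the CANNOT_CHANGE_DECIMAL_PRECISION class.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST('66.666' AS DECIMAL(8, 1))").collect()
except Exception as e:  # the JVM side throws SparkArithmeticException
    print(e)  # message suggests setting spark.sql.ansi.enabled=false to bypass
```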

Closes #36239 from panbingkun/SPARK-38720.

Lead-authored-by: panbingkun 
Co-authored-by: Maxim Gekk 
Signed-off-by: Max Gekk 
---
 .../spark/sql/errors/QueryExecutionErrorsSuite.scala   | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
index 85956bd8876..24fdaedabbe 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
@@ -400,4 +400,22 @@ class QueryExecutionErrorsSuite extends QueryTest
 "If necessary set spark.sql.ansi.enabled to false to bypass this 
error. ")
 }
   }
+
+  test("CANNOT_CHANGE_DECIMAL_PRECISION: cast string to decimal") {
+    withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
+      val e = intercept[SparkArithmeticException] {
+        sql("select CAST('66.666' AS DECIMAL(8, 1))").collect()
+      }
+      assert(e.getErrorClass === "CANNOT_CHANGE_DECIMAL_PRECISION")
+      assert(e.getSqlState === "22005")
+      assert(e.getMessage ===
+        "Decimal(expanded,66.666,17,3}) cannot be represented as Decimal(8, 1). " +
+        "If necessary set spark.sql.ansi.enabled to false to bypass this error." +
+        """
+          |== SQL(line 1, position 7) ==
+          |select CAST('66.666' AS DECIMAL(8, 1))
+          |   ^^^
+          |""".stripMargin)
+    }
+  }
 }





[spark] branch master updated (29eea67987d -> abb1df9d190)

2022-04-19 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 29eea67987d [SPARK-38720][SQL][TESTS] Test the error class: CANNOT_CHANGE_DECIMAL_PRECISION
 add abb1df9d190 [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint

No new revisions were added by this update.

Summary of changes:
 .../streaming/state/RocksDBFileManager.scala  |  4 +++-
 .../sql/execution/streaming/state/RocksDBSuite.scala  | 19 +++
 2 files changed, 22 insertions(+), 1 deletion(-)





[spark] branch branch-3.3 updated: [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint

2022-04-19 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 8811e8caaa8 [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint
8811e8caaa8 is described below

commit 8811e8caaa8540d1fa05fb77152043addc607b82
Author: Yun Tang 
AuthorDate: Tue Apr 19 20:31:04 2022 +0900

[SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint

### What changes were proposed in this pull request?
Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint.

### Why are the changes needed?
If this fix is not introduced, we might hit the exception below:
~~~java
File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_0gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist
java.io.FileNotFoundException: File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_0gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
    at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
    at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:93)
    at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353)
    at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400)
    at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:626)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:701)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:697)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.create(FileContext.java:703)
    at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:327)
    at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:140)
    at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:143)
    at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:333)
    at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.zipToDfsFile(RocksDBFileManager.scala:438)
    at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.saveCheckpointToDfs(RocksDBFileManager.scala:174)
    at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.saveCheckpointFiles(RocksDBSuite.scala:566)
    at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.$anonfun$new$35(RocksDBSuite.scala:179)
~~~

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested via RocksDBSuite.
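
For context, a hypothetical PySpark sketch (not part of the commit) of the kind of setup that exercises this path: a stateful streaming query on the RocksDB state store provider, whose very first checkpoint may be committed before any keys exist. The checkpoint path is an assumed placeholder.

```python
# Sketch: enable the RocksDB state store provider; the fix applies when a state
# store instance commits its first version with an unknown/zero number of keys.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

events = spark.readStream.format("rate").option("rowsPerSecond", "1").load()
deduped = events.dropDuplicates(["value"])  # stateful operator backed by RocksDB
query = (
    deduped.writeStream.format("noop")
    .option("checkpointLocation", "/tmp/spark-38931-demo")  # assumed path
    .start()
)
```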

Closes #36242 from Myasuka/SPARK-38931.

Authored-by: Yun Tang 
Signed-off-by: Jungtaek Lim 
(cherry picked from commit abb1df9d190e35a17b693f2b013b092af4f2528a)
Signed-off-by: Jungtaek Lim 
---
 .../streaming/state/RocksDBFileManager.scala  |  4 +++-
 .../sql/execution/streaming/state/RocksDBSuite.scala  | 19 +++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
index 4f2ce9b1237..26084747c32 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
@@ -161,11 +161,13 @@ class RocksDBFileManager(
 metadata.writeToFile(metadataFile)
 logInfo(s"Written metadata for version $version:\n${metadata.prettyJson}")
 
-    if (version <= 1 && numKeys == 0) {
+    if (version <= 1 && numKeys <= 0) {
       // If we're writing the initial version and there's no data, we have to explicitly initialize
       // the root directory. Normally saveImmutableFilesToDfs will do this initialization, but
       // when there's no data that method won't write any files, and zipToDfsFile uses the
       // 

[spark] branch branch-3.2 updated: [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint

2022-04-19 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new cc097a4d990 [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint
cc097a4d990 is described below

commit cc097a4d990e1f9c6ef7bf515ae966eaf35fc44c
Author: Yun Tang 
AuthorDate: Tue Apr 19 20:31:04 2022 +0900

[SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint

### What changes were proposed in this pull request?
Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint.

### Why are the changes needed?
If this fix is not introduced, we might hit the exception below:
~~~java
File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_0gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist
java.io.FileNotFoundException: File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_0gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
    at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
    at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:93)
    at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353)
    at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400)
    at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:626)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:701)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:697)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.create(FileContext.java:703)
    at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:327)
    at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:140)
    at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:143)
    at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:333)
    at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.zipToDfsFile(RocksDBFileManager.scala:438)
    at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.saveCheckpointToDfs(RocksDBFileManager.scala:174)
    at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.saveCheckpointFiles(RocksDBSuite.scala:566)
    at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.$anonfun$new$35(RocksDBSuite.scala:179)
~~~

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested via RocksDBSuite.

Closes #36242 from Myasuka/SPARK-38931.

Authored-by: Yun Tang 
Signed-off-by: Jungtaek Lim 
(cherry picked from commit abb1df9d190e35a17b693f2b013b092af4f2528a)
Signed-off-by: Jungtaek Lim 
---
 .../streaming/state/RocksDBFileManager.scala  |  4 +++-
 .../sql/execution/streaming/state/RocksDBSuite.scala  | 19 +++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
index 23cdbd01bc1..367062b90bc 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
@@ -161,11 +161,13 @@ class RocksDBFileManager(
 metadata.writeToFile(metadataFile)
 logInfo(s"Written metadata for version $version:\n${metadata.prettyJson}")
 
-    if (version <= 1 && numKeys == 0) {
+    if (version <= 1 && numKeys <= 0) {
       // If we're writing the initial version and there's no data, we have to explicitly initialize
       // the root directory. Normally saveImmutableFilesToDfs will do this initialization, but
       // when there's no data that method won't write any files, and zipToDfsFile uses the
       // 

[spark] branch master updated: [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

2022-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 639301a933f [SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`
639301a933f is described below

commit 639301a933f3f7b0a4cc2c1defb6c843afae180e
Author: Jiaan Geng 
AuthorDate: Tue Apr 19 20:56:56 2022 +0800

[SPARK-37691][SQL] Support ANSI Aggregation Function: `percentile_disc`

### What changes were proposed in this pull request?
`PERCENTILE_DISC` is an ANSI aggregate function.

Mainstream databases support `percentile_disc`, as shown below:
**Postgresql**
https://www.postgresql.org/docs/9.4/functions-aggregate.html
**Teradata**
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/cPkFySIBORL~M938Zv07Cg
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/PERCENTILE_DISC.html#GUID-7C34FDDA-C241-474F-8C5C-50CC0182E005
**DB2**
https://www.ibm.com/docs/en/db2/11.5?topic=functions-percentile-disc
**H2**
http://www.h2database.com/html/functions-aggregate.html#percentile_disc
**Sybase**
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01776.1601/doc/html/san1278453110413.html
**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/percentile_disc.htm
**RedShift**
https://docs.aws.amazon.com/redshift/latest/dg/r_APPROXIMATE_PERCENTILE_DISC.html
**Yellowbrick**
https://www.yellowbrick.com/docs/2.2/ybd_sqlref/percentile_disc.html
**Mariadb**
https://mariadb.com/kb/en/percentile_disc/
**Phoenix**
http://phoenix.incubator.apache.org/language/functions.html#percentile_disc
**Singlestore**
https://docs.singlestore.com/db/v7.6/en/reference/sql-reference/window-functions/percentile_disc.html

This PR references the implementation of H2. Please refer: https://github.com/h2database/h2database/blob/master/h2/src/main/org/h2/expression/aggregate/Percentile.java

### Why are the changes needed?
`PERCENTILE_DISC` is very useful. Exposing the expression can make the migration from other systems to Spark SQL easier.

### Does this PR introduce _any_ user-facing change?
'Yes'. New feature.

### How was this patch tested?
New tests.
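
For illustration, a hypothetical usage sketch (not part of the commit) of the new function through `spark.sql`, using the `WITHIN GROUP` syntax this change adds to the parser:

```python
# Sketch: percentile_disc returns an actual value from the input set (the first
# value whose cumulative distribution reaches the given fraction), whereas
# percentile_cont interpolates between values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1,), (2,), (3,), (4,)], ["v"]).createOrReplaceTempView("tab")
spark.sql(
    "SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY v) FROM tab"
).show()  # expected to return 2, one of the input values
```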

Closes #35041 from beliefer/SPARK-37691.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
---
 docs/sql-ref-ansi-compliance.md|   1 +
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4  |   1 +
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   3 +-
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   7 +-
 .../expressions/aggregate/PercentileCont.scala |  41 
 .../{Percentile.scala => percentiles.scala}| 260 ++---
 .../spark/sql/catalyst/parser/AstBuilder.scala |  18 +-
 .../expressions/aggregate/PercentileSuite.scala|  13 +-
 .../sql/catalyst/parser/PlanParserSuite.scala  |  24 +-
 .../test/resources/sql-tests/inputs/group-by.sql   |  17 +-
 .../inputs/postgreSQL/aggregates_part4.sql |   8 +-
 .../inputs/udf/postgreSQL/udf-aggregates_part4.sql |   8 +-
 .../src/test/resources/sql-tests/inputs/window.sql |  59 -
 .../resources/sql-tests/results/group-by.sql.out   |  39 +++-
 .../results/postgreSQL/aggregates_part4.sql.out|  31 ++-
 .../udf/postgreSQL/udf-aggregates_part4.sql.out|  31 ++-
 .../resources/sql-tests/results/window.sql.out | 184 +++
 17 files changed, 540 insertions(+), 205 deletions(-)

diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 66161a112b1..89ba2d17608 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -512,6 +512,7 @@ Below is a list of all the keywords in Spark SQL.
 |PARTITIONS|non-reserved|non-reserved|non-reserved|
 |PERCENT|non-reserved|non-reserved|non-reserved|
 |PERCENTILE_CONT|reserved|non-reserved|non-reserved|
+|PERCENTILE_DISC|reserved|non-reserved|non-reserved|
 |PIVOT|non-reserved|non-reserved|non-reserved|
 |PLACING|non-reserved|non-reserved|non-reserved|
 |POSITION|non-reserved|non-reserved|reserved|
diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
index e84d4fa45eb..c5199a601ce 100644
--- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
+++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
@@ -259,6 +259,7 @@ PARTITION: 'PARTITION';
 PARTITIONED: 'PARTITIONED';
 PARTITIONS: 'PARTITIONS';
 PERCENTILE_CONT: 'PERCENTILE_CONT';
+PERCENTILE_DISC: 'PERCENTIL

[spark] branch master updated: [SPARK-38747][SQL][TESTS] Move the tests for `PARSE_SYNTAX_ERROR` from ErrorParserSuite to QueryParsingErrorsSuite

2022-04-19 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bcb235f29b8 [SPARK-38747][SQL][TESTS] Move the tests for `PARSE_SYNTAX_ERROR` from ErrorParserSuite to QueryParsingErrorsSuite
bcb235f29b8 is described below

commit bcb235f29b862fe646f9e1683244e705ddb66641
Author: panbingkun 
AuthorDate: Tue Apr 19 19:28:22 2022 +0300

[SPARK-38747][SQL][TESTS] Move the tests for `PARSE_SYNTAX_ERROR` from ErrorParserSuite to QueryParsingErrorsSuite

### What changes were proposed in this pull request?
This PR aims to move tests for the error class PARSE_SYNTAX_ERROR from ErrorParserSuite to QueryParsingErrorsSuite.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
By running the moved tests:
```
$ build/sbt "sql/testOnly *QueryParsingErrorsSuite*"
```
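
For reference, a hypothetical PySpark sketch (not part of the commit) of how PARSE_SYNTAX_ERROR reaches users; the moved tests assert on exactly this kind of message:

```python
# Sketch: an extraneous token triggers PARSE_SYNTAX_ERROR, surfaced in Python
# as a ParseException.
from pyspark.sql import SparkSession
from pyspark.sql.utils import ParseException

spark = SparkSession.builder.getOrCreate()
try:
    spark.sql("select 1 1")  # one of the relocated test inputs
except ParseException as e:
    print(e)  # message contains "Syntax error at or near '1': extra input '1'"
```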

Closes #36224 from panbingkun/SPARK-38747.

Lead-authored-by: panbingkun 
Co-authored-by: Maxim Gekk 
Signed-off-by: Max Gekk 
---
 .../sql/catalyst/parser/ErrorParserSuite.scala |  40 --
 .../spark/sql/errors/QueryParsingErrorsSuite.scala | 147 +
 2 files changed, 147 insertions(+), 40 deletions(-)

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala
index aa9f096cfe2..52d0c6c7018 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ErrorParserSuite.scala
@@ -16,7 +16,6 @@
  */
 package org.apache.spark.sql.catalyst.parser
 
-import org.apache.spark.SparkThrowableHelper
 import org.apache.spark.sql.catalyst.analysis.AnalysisTest
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 
@@ -77,51 +76,12 @@ class ErrorParserSuite extends AnalysisTest {
   Some(line), Some(startPosition), Some(stopPosition), Some(errorClass))
   }
 
-  test("no viable input") {
-intercept("select ((r + 1) ", 1, 16, 16,
-  "Syntax error at or near", "^^^")
-  }
-
-  test("extraneous input") {
-intercept("select 1 1", 1, 9, 10,
-  "Syntax error at or near '1': extra input '1'", "-^^^")
-intercept("select *\nfrom r as q t", 2, 12, 13, "Syntax error at or near", 
"^^^")
-  }
-
-  test("mismatched input") {
-intercept("select * from r order by q from t", "PARSE_SYNTAX_ERROR",
-  1, 27, 31,
-  "Syntax error at or near",
-  "---^^^"
-)
-intercept("select *\nfrom r\norder by q\nfrom t", "PARSE_SYNTAX_ERROR",
-  4, 0, 4,
-  "Syntax error at or near", "^^^")
-  }
-
-  test("jargon token substitute to user-facing language") {
-// '' -> end of input
-intercept("select count(*", "PARSE_SYNTAX_ERROR",
-  1, 14, 14, "Syntax error at or near end of input")
-intercept("select 1 as a from", "PARSE_SYNTAX_ERROR",
-  1, 18, 18, "Syntax error at or near end of input")
-  }
-
   test("semantic errors") {
 intercept("select *\nfrom r\norder by q\ncluster by q", 3, 0, 11,
   "Combination of ORDER BY/SORT BY/DISTRIBUTE BY/CLUSTER BY is not 
supported",
   "^^^")
   }
 
-  test("SPARK-21136: misleading error message due to problematic antlr 
grammar") {
-intercept("select * from a left join_ b on a.id = b.id", None,
-  "Syntax error at or near 'join_': missing 'JOIN'")
-intercept("select * from test where test.t is like 'test'", 
Some("PARSE_SYNTAX_ERROR"),
-  SparkThrowableHelper.getMessage("PARSE_SYNTAX_ERROR", Array("'is'", "")))
-intercept("SELECT * FROM test WHERE x NOT NULL", 
Some("PARSE_SYNTAX_ERROR"),
-  SparkThrowableHelper.getMessage("PARSE_SYNTAX_ERROR", Array("'NOT'", 
"")))
-  }
-
   test("hyphen in identifier - DDL tests") {
 val msg = "unquoted identifier"
 intercept("USE test-test", 1, 8, 9, msg + " test-test")
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
index 225d4f33b41..032e2359b47 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
@@ -472,4 +472,151 @@ class QueryParsingErrorsSuite extends QueryTest with SharedSparkSession {
         |^^^
         |""".stripMargin)
   }
+
+  test("PARSE_SYNTAX_ERROR: no viable input") {
+    val sqlText = "select ((r + 1) "
+    validateParsingError(
+      sqlText = sqlText,
+      errorClass = "PARSE_SYNTAX_ERROR",
+      sqlState = "4

[spark] branch master updated (bcb235f29b8 -> 134ab484233)

2022-04-19 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from bcb235f29b8 [SPARK-38747][SQL][TESTS] Move the tests for `PARSE_SYNTAX_ERROR` from ErrorParserSuite to QueryParsingErrorsSuite
 add 134ab484233 [SPARK-38727][SQL][TESTS] Test the error class: FAILED_EXECUTE_UDF

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/errors/QueryExecutionErrorsSuite.scala   | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)





[spark] branch master updated: [SPARK-38929][SQL] Improve error messages for cast failures in ANSI

2022-04-19 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f76b3e766f7 [SPARK-38929][SQL] Improve error messages for cast failures in ANSI
f76b3e766f7 is described below

commit f76b3e766f79b4c2d4f1ecffaad25aeb962336b7
Author: Xinyi Yu 
AuthorDate: Tue Apr 19 22:25:09 2022 +0300

[SPARK-38929][SQL] Improve error messages for cast failures in ANSI

### What changes were proposed in this pull request?
Improve the error messages for cast failures in ANSI.
As mentioned in https://issues.apache.org/jira/browse/SPARK-38929, this PR targets two cast-to types: numeric types and date types.
* For numeric (`int`, `smallint`, `double`, `float`, `decimal`, ...) types, it embeds the cast-to type in the error message. For example,
  ```
  Invalid input value for type INT: '1.0'. To return NULL instead, use 'try_cast'. If necessary set %s to false to bypass this error.
  ```
  It uses the `toSQLType` and `toSQLValue` to wrap the corresponding types and literals.
* For date types, it does similarly as above. For example,
  ```
  Invalid input value for type TIMESTAMP: 'a'. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
  ```

### Why are the changes needed?
To improve the error message in general.

### Does this PR introduce _any_ user-facing change?
It changes the error messages.

### How was this patch tested?
The related unit tests are updated.
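
To make the change concrete, a hypothetical PySpark sketch (not part of the commit) of a query that hits the improved numeric-cast message under ANSI mode:

```python
# Sketch: under ANSI mode, casting the string '1.0' to INT fails, and after this
# change the error message names the cast-to type, as in the examples above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST('1.0' AS INT)").collect()
except Exception as e:
    print(e)  # expected to mention "type INT" and the offending value '1.0'
```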

Closes #36241 from anchovYu/ansi-error-improve.

Authored-by: Xinyi Yu 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   |   8 +-
 .../spark/sql/catalyst/expressions/Cast.scala  |  17 ++-
 .../spark/sql/catalyst/util/UTF8StringUtils.scala  |  13 +-
 .../spark/sql/errors/QueryExecutionErrors.scala|  16 +-
 .../scala/org/apache/spark/sql/types/Decimal.scala |   7 +-
 .../catalyst/expressions/AnsiCastSuiteBase.scala   |  58 +++
 .../sql/catalyst/expressions/TryCastSuite.scala|   3 +-
 .../sql/catalyst/util/DateFormatterSuite.scala |   2 +-
 .../catalyst/util/TimestampFormatterSuite.scala|   2 +-
 .../org/apache/spark/sql/types/DecimalSuite.scala  |   3 +-
 .../src/test/resources/sql-tests/inputs/cast.sql   |  10 +-
 .../resources/sql-tests/results/ansi/cast.sql.out  | 170 +++--
 .../resources/sql-tests/results/ansi/date.sql.out  |  10 +-
 .../results/ansi/datetime-parsing-invalid.sql.out  |   4 +-
 .../sql-tests/results/ansi/interval.sql.out|  20 +--
 .../results/ansi/string-functions.sql.out  |  16 +-
 .../test/resources/sql-tests/results/cast.sql.out  |  50 +-
 .../sql-tests/results/postgreSQL/float4.sql.out|   8 +-
 .../sql-tests/results/postgreSQL/float8.sql.out|   8 +-
 .../sql-tests/results/postgreSQL/text.sql.out  |   8 +-
 .../results/postgreSQL/window_part2.sql.out|   4 +-
 .../results/postgreSQL/window_part3.sql.out|   2 +-
 .../results/postgreSQL/window_part4.sql.out|   2 +-
 .../sql-tests/results/string-functions.sql.out |   2 +-
 .../results/timestampNTZ/timestamp-ansi.sql.out|   2 +-
 .../org/apache/spark/sql/SQLInsertTestSuite.scala  |   2 +-
 26 files changed, 297 insertions(+), 150 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index 26d75fa675e..23c1cee1c72 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -105,10 +105,6 @@
 "message" : [ "The fraction of sec must be zero. Valid range is [0, 60]. 
If necessary set %s to false to bypass this error. " ],
 "sqlState" : "22023"
   },
-  "INVALID_INPUT_SYNTAX_FOR_NUMERIC_TYPE" : {
-"message" : [ "invalid input syntax for type numeric: %s. To return NULL 
instead, use 'try_cast'. If necessary set %s to false to bypass this error.%s" 
],
-"sqlState" : "42000"
-  },
   "INVALID_JSON_SCHEMA_MAPTYPE" : {
 "message" : [ "Input schema %s can only contain StringType as a key type 
for a MapType." ]
   },
@@ -123,6 +119,10 @@
 "message" : [ "Invalid SQL syntax: %s" ],
 "sqlState" : "42000"
   },
+  "INVALID_SYNTAX_FOR_CAST" : {
+"message" : [ "Invalid input syntax for type %s: %s. To return NULL 
instead, use 'try_cast'. If necessary set %s to false to bypass this error.%s" 
],
+"sqlState" : "42000"
+  },
   "MAP_KEY_DOES_NOT_EXIST" : {
 "message" : [ "Key %s does not exist. If necessary set %s to false to 
bypass this error.%s" ]
   },
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
index e522c211cb2..865202ca

[GitHub] [spark-website] dongjoon-hyun opened a new pull request, #383: Add Apple Silicon CI link

2022-04-19 Thread GitBox


dongjoon-hyun opened a new pull request, #383:
URL: https://github.com/apache/spark-website/pull/383

   This PR aims to add `Apple Silicon CI` status link.
   
   ![Screen Shot 2022-04-19 at 3 41 54 PM](https://user-images.githubusercontent.com/9700541/164113213-32b10528-fd4c-4c5e-b836-b71c9b16e994.png)
   
   





[GitHub] [spark-website] dongjoon-hyun commented on pull request #383: Add Apple Silicon CI link

2022-04-19 Thread GitBox


dongjoon-hyun commented on PR #383:
URL: https://github.com/apache/spark-website/pull/383#issuecomment-1103235869

   cc @srowen , @HyukjinKwon , @viirya 





[GitHub] [spark-website] dongjoon-hyun commented on pull request #383: Add Apple Silicon CI link

2022-04-19 Thread GitBox


dongjoon-hyun commented on PR #383:
URL: https://github.com/apache/spark-website/pull/383#issuecomment-1103236270

   Thank you, @srowen . 





[GitHub] [spark-website] dongjoon-hyun merged pull request #383: Add Apple Silicon CI link

2022-04-19 Thread GitBox


dongjoon-hyun merged PR #383:
URL: https://github.com/apache/spark-website/pull/383





[spark-website] branch asf-site updated: Add Apple Silicon CI link (#383)

2022-04-19 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 0134b4017 Add Apple Silicon CI link (#383)
0134b4017 is described below

commit 0134b4017bb27df463fd55facfd28895b7278d54
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 19 15:49:20 2022 -0700

Add Apple Silicon CI link (#383)
---
 developer-tools.md| 2 +-
 site/developer-tools.html | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/developer-tools.md b/developer-tools.md
index e64d56c43..096b06fc4 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -29,7 +29,7 @@ Apache Spark community uses various resources to maintain the community test cov
 Scaleway
 
 [Scaleway](https://www.scaleway.com) provides the following on MacOS and Apple Silicon.
-- Java/Scala/Python/R unit tests with Java 17/Scala 2.12/Maven
+- [Java/Scala/Python/R unit tests with Java 17/Scala 2.12/SBT](https://apache-spark.s3.fr-par.scw.cloud/index.html)
 - K8s integration tests (TBD)
 - K8s integration tests (TBD)
 
 Useful developer tools
diff --git a/site/developer-tools.html b/site/developer-tools.html
index adb88579e..7517e03ca 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -171,7 +171,7 @@
 
 <a href="https://www.scaleway.com">Scaleway</a> provides the following on MacOS and Apple Silicon.
 
-  Java/Scala/Python/R unit tests with Java 17/Scala 2.12/Maven
+  <a href="https://apache-spark.s3.fr-par.scw.cloud/index.html">Java/Scala/Python/R unit tests with Java 17/Scala 2.12/SBT</a>
   K8s integration tests (TBD)
 
 





[GitHub] [spark-website] viirya commented on pull request #383: Add Apple Silicon CI link

2022-04-19 Thread GitBox


viirya commented on PR #383:
URL: https://github.com/apache/spark-website/pull/383#issuecomment-1103250987

   lgtm





[spark] branch master updated: [SPARK-38943][PYTHON] EWM support ignore_na

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1dfb4399334 [SPARK-38943][PYTHON] EWM support ignore_na
1dfb4399334 is described below

commit 1dfb4399334a02bf2e54faeb214c4a387753ddce
Author: Ruifeng Zheng 
AuthorDate: Wed Apr 20 10:46:58 2022 +0900

[SPARK-38943][PYTHON] EWM support ignore_na

### What changes were proposed in this pull request?
Make EWM support the `ignore_na` parameter.

### Why are the changes needed?
`ignore_na` is supported in pandas. After adding this parameter, EWM can deal with datasets containing NaN/null values.

### Does this PR introduce _any_ user-facing change?
Yes, a new parameter is added.

### How was this patch tested?
Added test suites.
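
For illustration, a hypothetical usage sketch (not part of the commit) of the new parameter on pandas-on-Spark, mirroring the pandas semantics documented in the docstring added below:

```python
# Sketch: with ignore_na=True, the missing value is skipped when computing the
# exponential weights, so it does not dilute the surrounding observations.
import pyspark.pandas as ps

s = ps.Series([1.0, None, 3.0])
print(s.ewm(alpha=0.5, ignore_na=True).mean())
print(s.ewm(alpha=0.5, ignore_na=False).mean())
```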

Closes #36257 from zhengruifeng/ewm_support_ingnore_na.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/generic.py   | 26 +-
 python/pyspark/pandas/missing/window.py|  1 -
 python/pyspark/pandas/tests/test_ewm.py| 92 ++
 python/pyspark/pandas/window.py| 27 +--
 .../catalyst/expressions/windowExpressions.scala   | 34 +---
 .../spark/sql/api/python/PythonSQLUtils.scala  |  3 +-
 6 files changed, 159 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py
index bb27c633a2b..21c880373ad 100644
--- a/python/pyspark/pandas/generic.py
+++ b/python/pyspark/pandas/generic.py
@@ -2619,7 +2619,7 @@ class Frame(object, metaclass=ABCMeta):
 
 return Expanding(self, min_periods=min_periods)
 
-    # TODO: 'adjust', 'ignore_na', 'axis', 'method' parameter should be implemented.
+    # TODO: 'adjust', 'axis', 'method' parameter should be implemented.
 def ewm(
 self: FrameLike,
 com: Optional[float] = None,
@@ -2627,6 +2627,7 @@ class Frame(object, metaclass=ABCMeta):
         halflife: Optional[float] = None,
         alpha: Optional[float] = None,
         min_periods: Optional[int] = None,
+        ignore_na: bool_type = False,
     ) -> "ExponentialMoving[FrameLike]":
         """
         Provide exponentially weighted window transformations.
@@ -2659,6 +2660,21 @@ class Frame(object, metaclass=ABCMeta):
             Minimum number of observations in window required to have a value
             (otherwise result is NA).
 
+        ignore_na : bool, default False
+            Ignore missing values when calculating weights.
+
+            - When ``ignore_na=False`` (default), weights are based on absolute positions.
+              For example, the weights of :math:`x_0` and :math:`x_2` used in calculating
+              the final weighted average of [:math:`x_0`, None, :math:`x_2`] are
+              :math:`(1-\alpha)^2` and :math:`1` if ``adjust=True``, and
+              :math:`(1-\alpha)^2` and :math:`\alpha` if ``adjust=False``.
+
+            - When ``ignore_na=True``, weights are based
+              on relative positions. For example, the weights of :math:`x_0` and :math:`x_2`
+              used in calculating the final weighted average of
+              [:math:`x_0`, None, :math:`x_2`] are :math:`1-\alpha` and :math:`1` if
+              ``adjust=True``, and :math:`1-\alpha` and :math:`\alpha` if ``adjust=False``.
+
         Returns
         -------
         a Window sub-classed for the particular operation
@@ -2666,7 +2682,13 @@ class Frame(object, metaclass=ABCMeta):
         from pyspark.pandas.window import ExponentialMoving
 
         return ExponentialMoving(
-            self, com=com, span=span, halflife=halflife, alpha=alpha, min_periods=min_periods
+            self,
+            com=com,
+            span=span,
+            halflife=halflife,
+            alpha=alpha,
+            min_periods=min_periods,
+            ignore_na=ignore_na,
         )
 
 def get(self, key: Any, default: Optional[Any] = None) -> Any:
diff --git a/python/pyspark/pandas/missing/window.py b/python/pyspark/pandas/missing/window.py
index e6ac39901ff..237dc85c82c 100644
--- a/python/pyspark/pandas/missing/window.py
+++ b/python/pyspark/pandas/missing/window.py
@@ -152,6 +152,5 @@ class MissingPandasLikeExponentialMoving:
 corr = _unsupported_function_exponential_moving("corr")
 
 adjust = _unsupported_property_exponential_moving("adjust")
-ignore_na = _unsupported_property_exponential_moving("ignore_na")
 axis = _unsupported_property_exponential_moving("axis")
 method = _unsupported_property_exponential_moving("method")
diff --git a/python/pyspark/pandas/tests/test_ewm.py b/python/pyspark/pandas/tests/test_ewm.py
index 7306aad44ff..d4c1e1ba06a 100644
--- a/python/pyspark/pandas/tests/test_ewm.py
+++ b/python/pyspark/pandas/tests/test_ewm.py
@@ -109,6 +109,98 @@ class EWMTest(

[spark] branch master updated: [SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 581000de243 [SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3
581000de243 is described below

commit 581000de24377ca373df7fa94b214baa7e9b0462
Author: itholic 
AuthorDate: Wed Apr 20 10:49:07 2022 +0900

[SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3

### What changes were proposed in this pull request?

This PR proposes to remove `TimestampNTZ` type Python support in Spark 3.3 from documentation and `pyspark.sql.types` module.

The purpose of this PR is just to hide the `TimestampNTZ` type from end-users.

### Why are the changes needed?

Because the `TimestampNTZ` project is not finished yet:

- Lack Hive metastore support
- Lack JDBC support
- Need to spend time scanning the codebase to find out any missing support. There are currently more code usages of TimestampType than of TimestampNTZType.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing tests should cover.

Closes #36255 from itholic/SPARK-38828.

Authored-by: itholic 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/reference/pyspark.sql.rst | 1 -
 python/pyspark/sql/types.py  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql.rst b/python/docs/source/reference/pyspark.sql.rst
index 1d34961a91a..adc1958822e 100644
--- a/python/docs/source/reference/pyspark.sql.rst
+++ b/python/docs/source/reference/pyspark.sql.rst
@@ -302,7 +302,6 @@ Data Types
 StringType
 StructField
 StructType
-TimestampNTZType
 TimestampType
 DayTimeIntervalType
 
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 23e54eb8889..2a41508d634 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -59,7 +59,6 @@ __all__ = [
 "BooleanType",
 "DateType",
 "TimestampType",
-"TimestampNTZType",
 "DecimalType",
 "DoubleType",
 "FloatType",





[spark] branch branch-3.3 updated: [SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new fb58c3e5071 [SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3
fb58c3e5071 is described below

commit fb58c3e507113e2e9e398cb77703e54603bfa29a
Author: itholic 
AuthorDate: Wed Apr 20 10:49:07 2022 +0900

[SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3

This PR proposes to remove `TimestampNTZ` type Python support in Spark 3.3 from documentation and `pyspark.sql.types` module.

The purpose of this PR is just to hide the `TimestampNTZ` type from end-users.

Because the `TimestampNTZ` project is not finished yet:

- Lack Hive metastore support
- Lack JDBC support
- Need to spend time scanning the codebase to find out any missing support. There are currently more code usages of TimestampType than of TimestampNTZType.

No.

The existing tests should cover.

Closes #36255 from itholic/SPARK-38828.

Authored-by: itholic 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 581000de24377ca373df7fa94b214baa7e9b0462)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/reference/pyspark.sql.rst | 1 -
 python/pyspark/sql/types.py  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql.rst b/python/docs/source/reference/pyspark.sql.rst
index 1d34961a91a..adc1958822e 100644
--- a/python/docs/source/reference/pyspark.sql.rst
+++ b/python/docs/source/reference/pyspark.sql.rst
@@ -302,7 +302,6 @@ Data Types
 StringType
 StructField
 StructType
-TimestampNTZType
 TimestampType
 DayTimeIntervalType
 
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 23e54eb8889..2a41508d634 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -59,7 +59,6 @@ __all__ = [
 "BooleanType",
 "DateType",
 "TimestampType",
-"TimestampNTZType",
 "DecimalType",
 "DoubleType",
 "FloatType",





[spark] branch master updated: [SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1b106ea32d5 [SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count
1b106ea32d5 is described below

commit 1b106ea32d567dd32ac697ed0d6cfd40ea7e6e08
Author: Jiaan Geng 
AuthorDate: Wed Apr 20 11:02:58 2022 +0900

[SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/34880 supported the ANSI aggregate function regr_count, but the docs of regr_count are not detailed enough.

### Why are the changes needed?
Make the docs of regr_count more detailed.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
N/A

Closes #36258 from beliefer/SPARK-37613_followup.

Authored-by: Jiaan Geng 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/expressions/aggregate/linearRegression.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
index 4c1749fa00e..098fc17b98a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
@@ -22,10 +22,9 @@ import org.apache.spark.sql.catalyst.expressions.{And, Expression, ExpressionDes
 import org.apache.spark.sql.catalyst.trees.BinaryLike
 import org.apache.spark.sql.types.{AbstractDataType, DoubleType, NumericType}
 
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = """
-    _FUNC_(expr) - Returns the number of non-null number pairs in a group.
-  """,
+  usage = "_FUNC_(y, x) - Returns the number of non-null number pairs in a group, where `y` is the dependent variable and `x` is the independent variable.",
   examples = """
     Examples:
       > SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
@@ -37,6 +36,7 @@ import org.apache.spark.sql.types.{AbstractDataType, DoubleType, NumericType}
   """,
   group = "agg_funcs",
   since = "3.3.0")
+// scalastyle:on line.size.limit
 case class RegrCount(left: Expression, right: Expression)
   extends AggregateFunction
   with RuntimeReplaceableAggregate





[spark] branch branch-3.3 updated: [SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 2b3df38b430 [SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count
2b3df38b430 is described below

commit 2b3df38b430b92e4a8392854988f071b795d543c
Author: Jiaan Geng 
AuthorDate: Wed Apr 20 11:02:58 2022 +0900

[SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/34880 supported the ANSI aggregate function regr_count, but the docs of regr_count are not detailed enough.

### Why are the changes needed?
Make the docs of regr_count more detailed.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
N/A

Closes #36258 from beliefer/SPARK-37613_followup.

Authored-by: Jiaan Geng 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 1b106ea32d567dd32ac697ed0d6cfd40ea7e6e08)
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/expressions/aggregate/linearRegression.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
index 7463ef59c78..ce37e69d9fd 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/linearRegression.scala
@@ -22,10 +22,9 @@ import org.apache.spark.sql.catalyst.expressions.{And, Expression, ExpressionDes
 import org.apache.spark.sql.catalyst.trees.BinaryLike
 import org.apache.spark.sql.types.{AbstractDataType, DoubleType, NumericType}
 
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = """
-    _FUNC_(expr) - Returns the number of non-null number pairs in a group.
-  """,
+  usage = "_FUNC_(y, x) - Returns the number of non-null number pairs in a group, where `y` is the dependent variable and `x` is the independent variable.",
   examples = """
     Examples:
       > SELECT _FUNC_(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
@@ -37,6 +36,7 @@ import org.apache.spark.sql.types.{AbstractDataType, DoubleType, NumericType}
   """,
   group = "agg_funcs",
   since = "3.3.0")
+// scalastyle:on line.size.limit
 case class RegrCount(left: Expression, right: Expression)
   extends AggregateFunction
   with RuntimeReplaceableAggregate





[spark] branch master updated: [SPARK-38956][TESTS] Fix FAILED_EXECUTE_UDF test case on Java 17

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef00aa1434a [SPARK-38956][TESTS] Fix FAILED_EXECUTE_UDF test case on Java 17
ef00aa1434a is described below

commit ef00aa1434ae5c5ecfec5e9b4ffaa2ed0f0e45d4
Author: William Hyun 
AuthorDate: Wed Apr 20 11:03:55 2022 +0900

[SPARK-38956][TESTS] Fix FAILED_EXECUTE_UDF test case on Java 17

### What changes were proposed in this pull request?
This PR aims to fix FAILED_EXECUTE_UDF test case on Java 17.

### Why are the changes needed?
**BEFORE (Java17)**
```
[info] QueryExecutionErrorsSuite:
16:04:22.234 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:04:24.377 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Failed to execute user defined function (QueryExecutionErrorsSuite$$Lambda$1582/0x000801653be8: (string, int) => string)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test on Java 17.
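
As an aside, a small hypothetical Python illustration (not part of the commit) of why the regex needed to change: on Java 17 the generated lambda class name ends in a hexadecimal identifier, which `\d+` no longer matches but `\w+` does.

```python
import re

name = "QueryExecutionErrorsSuite$$Lambda$1582/0x000801653be8"
old_pattern = r"QueryExecutionErrorsSuite\$\$Lambda\$\d+/\d+"   # pre-fix
new_pattern = r"QueryExecutionErrorsSuite\$\$Lambda\$\d+/\w+"   # post-fix
print(bool(re.fullmatch(old_pattern, name)))  # False: '0x...' contains letters
print(bool(re.fullmatch(new_pattern, name)))  # True
```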

Closes #36270 from williamhyun/w.

Authored-by: William Hyun 
Signed-off-by: Hyukjin Kwon 
---
 .../scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
index 6b8f255c7e6..1d5ffc516e7 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
@@ -432,6 +432,6 @@ class QueryExecutionErrorsSuite extends QueryTest
     val e2 = e1.getCause.asInstanceOf[SparkException]
     assert(e2.getErrorClass === "FAILED_EXECUTE_UDF")
     assert(e2.getMessage.matches("Failed to execute user defined function " +
-      "\\(QueryExecutionErrorsSuite\\$\\$Lambda\\$\\d+/\\d+: \\(string, int\\) => string\\)"))
+      "\\(QueryExecutionErrorsSuite\\$\\$Lambda\\$\\d+/\\w+: \\(string, int\\) => string\\)"))
   }
 }





[spark] branch master updated: [SPARK-38844][PYTHON][TESTS][FOLLOW-UP] Test pyspark.pandas.tests.test_generic_functions

2022-04-19 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7144e1ae1c2 [SPARK-38844][PYTHON][TESTS][FOLLOW-UP] Test pyspark.pandas.tests.test_generic_functions
7144e1ae1c2 is described below

commit 7144e1ae1c26615d10a193ddd62b6097aa480cb5
Author: Hyukjin Kwon 
AuthorDate: Wed Apr 20 11:06:00 2022 +0900

[SPARK-38844][PYTHON][TESTS][FOLLOW-UP] Test 
pyspark.pandas.tests.test_generic_functions

### What changes were proposed in this pull request?

This is a minor follow-up of https://github.com/apache/spark/pull/36127 that actually activates the tests it added.

### Why are the changes needed?

To make sure the regression tests are run in CI.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #36271 from HyukjinKwon/SPARK-38844.

Lead-authored-by: Hyukjin Kwon 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index b0e09884035..5514df11f9a 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -634,6 +634,7 @@ pyspark_pandas = Module(
 "pyspark.pandas.tests.test_expanding",
 "pyspark.pandas.tests.test_extension",
 "pyspark.pandas.tests.test_frame_spark",
+"pyspark.pandas.tests.test_generic_functions",
 "pyspark.pandas.tests.test_indexops_spark",
 "pyspark.pandas.tests.test_internal",
 "pyspark.pandas.tests.test_namespace",





[spark] branch master updated: [SPARK-37575][SQL][FOLLOWUP] Update the migration guide for added legacy flag for the breaking change of write null value in csv to unquoted empty string

2022-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a67acbaa29d [SPARK-37575][SQL][FOLLOWUP] Update the migration guide 
for added legacy flag for the breaking change of write null value in csv to 
unquoted empty string
a67acbaa29d is described below

commit a67acbaa29d1ab9071910cac09323c2544d65303
Author: Xinyi Yu 
AuthorDate: Wed Apr 20 10:48:00 2022 +0800

[SPARK-37575][SQL][FOLLOWUP] Update the migration guide for added legacy 
flag for the breaking change of write null value in csv to unquoted empty string

### What changes were proposed in this pull request?
This is a follow-up that updates the migration guide for https://github.com/apache/spark/pull/36110, which added a legacy flag to restore the pre-change behavior.
It also fixes a typo in the previous flag description.

### Why are the changes needed?
The flag needs to be documented.
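
A minimal usage sketch of the flag (assumes an existing `SparkSession` named `spark`, e.g. in spark-shell; the output path is illustrative):

```
// Hedged sketch, not part of this change: restore the pre-3.3 CSV behavior of
// writing nulls as quoted empty strings ("") instead of unquoted empty strings.
import spark.implicits._ // already in scope in spark-shell
spark.conf.set("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv", "true")
Seq(("a", null: String), ("b", "c")).toDF("k", "v")
  .write.mode("overwrite").csv("/tmp/csv-null-demo") // illustrative path
// With the flag set to true, the null in column v is written as "";
// with the default (false), it is written as an unquoted empty string.
```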

### Does this PR introduce _any_ user-facing change?
It changes the migration doc for users.

### How was this patch tested?
No tests

Closes #36268 from anchovYu/flags-null-to-csv-migration-guide.

Authored-by: Xinyi Yu 
Signed-off-by: Wenchen Fan 
---
 docs/sql-migration-guide.md | 2 +-
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index c4f1bd188bf..32b90da1917 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -54,7 +54,7 @@ license: |
 
   - Since Spark 3.3, the `strfmt` in `format_string(strfmt, obj, ...)` and 
`printf(strfmt, obj, ...)` will no longer support to use "0$" to specify the 
first argument, the first argument should always reference by "1$" when use 
argument index to indicating the position of the argument in the argument list.
 
-  - Since Spark 3.3, nulls are written as empty strings in CSV data source by 
default. In Spark 3.2 or earlier, nulls were written as empty strings as quoted 
empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`.
+  - Since Spark 3.3, nulls are written as empty strings in CSV data source by 
default. In Spark 3.2 or earlier, nulls were written as empty strings as quoted 
empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`, 
or set the configuration 
`spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` to `true`.
 
   - Since Spark 3.3, DESCRIBE FUNCTION fails if the function does not exist. 
In Spark 3.2 or earlier, DESCRIBE FUNCTION can still run and print "Function: 
func_name not found".
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 36b666fd59c..301d792bb3e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3758,7 +3758,7 @@ object SQLConf {
 buildConf("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv")
   .internal()
   .doc("When set to false, nulls are written as unquoted empty strings in 
CSV data source. " +
-"If set to false, it restores the legacy behavior that nulls were 
written as quoted " +
+"If set to true, it restores the legacy behavior that nulls were 
written as quoted " +
 "empty strings, `\"\"`.")
   .version("3.3.0")
   .booleanConf





[spark] branch branch-3.3 updated: [SPARK-37575][SQL][FOLLOWUP] Update the migration guide for added legacy flag for the breaking change of write null value in csv to unquoted empty string

2022-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 27c75eae923 [SPARK-37575][SQL][FOLLOWUP] Update the migration guide 
for added legacy flag for the breaking change of write null value in csv to 
unquoted empty string
27c75eae923 is described below

commit 27c75eae92333add3ba6854b6c46410ec8e6743f
Author: Xinyi Yu 
AuthorDate: Wed Apr 20 10:48:00 2022 +0800

[SPARK-37575][SQL][FOLLOWUP] Update the migration guide for added legacy 
flag for the breaking change of write null value in csv to unquoted empty string

### What changes were proposed in this pull request?
This is a follow-up that updates the migration guide for https://github.com/apache/spark/pull/36110, which added a legacy flag to restore the pre-change behavior.
It also fixes a typo in the previous flag description.

### Why are the changes needed?
The flag needs to be documented.

### Does this PR introduce _any_ user-facing change?
It changes the migration doc for users.

### How was this patch tested?
No tests

Closes #36268 from anchovYu/flags-null-to-csv-migration-guide.

Authored-by: Xinyi Yu 
Signed-off-by: Wenchen Fan 
(cherry picked from commit a67acbaa29d1ab9071910cac09323c2544d65303)
Signed-off-by: Wenchen Fan 
---
 docs/sql-migration-guide.md | 2 +-
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 607100b0850..b6bfb0ed2be 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -54,7 +54,7 @@ license: |
 
   - Since Spark 3.3, the `strfmt` in `format_string(strfmt, obj, ...)` and 
`printf(strfmt, obj, ...)` will no longer support to use "0$" to specify the 
first argument, the first argument should always reference by "1$" when use 
argument index to indicating the position of the argument in the argument list.
 
-  - Since Spark 3.3, nulls are written as empty strings in CSV data source by 
default. In Spark 3.2 or earlier, nulls were written as empty strings as quoted 
empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`.
+  - Since Spark 3.3, nulls are written as empty strings in CSV data source by 
default. In Spark 3.2 or earlier, nulls were written as empty strings as quoted 
empty strings, `""`. To restore the previous behavior, set `nullValue` to `""`, 
or set the configuration 
`spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` to `true`.
 
   - Since Spark 3.3, DESCRIBE FUNCTION fails if the function does not exist. 
In Spark 3.2 or earlier, DESCRIBE FUNCTION can still run and print "Function: 
func_name not found".
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 5f803ed963b..e8d99a2d44d 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3728,7 +3728,7 @@ object SQLConf {
 buildConf("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv")
   .internal()
   .doc("When set to false, nulls are written as unquoted empty strings in 
CSV data source. " +
-"If set to false, it restores the legacy behavior that nulls were 
written as quoted " +
+"If set to true, it restores the legacy behavior that nulls were 
written as quoted " +
 "empty strings, `\"\"`.")
   .version("3.3.0")
   .booleanConf





[spark] branch master updated: [SPARK-38962][SQL] Fix wrong computeStats at DataSourceV2Relation

2022-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f0fa8b119b4 [SPARK-38962][SQL] Fix wrong computeStats at 
DataSourceV2Relation
f0fa8b119b4 is described below

commit f0fa8b119b42eb1e0623b8918e33a14f6ae80a51
Author: ulysses-you 
AuthorDate: Wed Apr 20 14:09:10 2022 +0800

[SPARK-38962][SQL] Fix wrong computeStats at DataSourceV2Relation

### What changes were proposed in this pull request?

Use `Scan` to match `SupportsReportStatistics`.

### Why are the changes needed?

The interface `SupportsReportStatistics` is meant to be mixed into `Scan` rather than `ScanBuilder`, so matching on the `ScanBuilder` never finds it and the statistics branch is never taken.
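
For context, a self-contained sketch of why the `.build()` step matters; `DemoScan` is a made-up example class, while the interfaces are the DataSource V2 ones from `org.apache.spark.sql.connector.read` (paste into spark-shell):

```
// Hedged sketch: SupportsReportStatistics extends Scan, so only a built Scan can
// report statistics; a ScanBuilder never matches the case below.
import java.util.OptionalLong
import org.apache.spark.sql.connector.read.{Scan, Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

class DemoScan extends Scan with SupportsReportStatistics {
  override def readSchema(): StructType = new StructType()
  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(1024L)
    override def numRows(): OptionalLong = OptionalLong.of(10L)
  }
}

val stats = (new DemoScan: Scan) match {
  case r: SupportsReportStatistics => Some(r.estimateStatistics())
  case _ => None // the branch the old code always hit, because it matched the ScanBuilder
}
```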

### Does this PR introduce _any_ user-facing change?

Almost none; the affected branch is dead code in normal usage.

### How was this patch tested?

No new tests; this is an obvious bug in a code path that is dead code during testing.

Closes #36276 from ulysses-you/statistics.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/v2/DataSourceV2Relation.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
index 6b0760ca163..61fe3602bb6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
@@ -80,7 +80,7 @@ case class DataSourceV2Relation(
 s"BUG: computeStats called before pushdown on DSv2 relation: $name")
 } else {
   // when not testing, return stats because bad stats are better than 
failing a query
-  table.asReadable.newScanBuilder(options) match {
+  table.asReadable.newScanBuilder(options).build() match {
 case r: SupportsReportStatistics =>
   val statistics = r.estimateStatistics()
   DataSourceV2Relation.transformV2Stats(statistics, None, 
conf.defaultSizeInBytes)





[spark] branch branch-3.3 updated: [SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

2022-04-19 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 83a365edf16 [SPARK-38922][CORE] TaskLocation.apply throw 
NullPointerException
83a365edf16 is described below

commit 83a365edf163bdd30974756c6c58fdca2e16f7f3
Author: Kent Yao 
AuthorDate: Wed Apr 20 14:38:26 2022 +0800

[SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

### What changes were proposed in this pull request?

TaskLocation.apply w/o NULL check may throw NPE and fail job scheduling

```

Caused by: java.lang.NullPointerException
at 
scala.collection.immutable.StringLike$class.stripPrefix(StringLike.scala:155)
at scala.collection.immutable.StringOps.stripPrefix(StringOps.scala:29)
at org.apache.spark.scheduler.TaskLocation$.apply(TaskLocation.scala:71)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal
```

For instance, `org.apache.spark.rdd.HadoopRDD#convertSplitLocationInfo` might generate unexpected `Some(null)` elements; these should instead be produced with `Option.apply`, which maps null to `None`.
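
A tiny, standalone illustration of the difference (REPL-style Scala, unrelated to the actual `HadoopRDD` code; the prefix string is illustrative):

```
// Standalone illustration: Some keeps a null payload, Option.apply does not.
val location: String = null
assert(Some(location).isDefined)  // Some(null): a later stripPrefix on it throws NPE
assert(Option(location).isEmpty)  // Option(null) == None, so nothing blows up
assert(Option(location).map(_.stripPrefix("hdfs_cache_")).isEmpty)
```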

### Why are the changes needed?

fix NPE

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #36222 from yaooqinn/SPARK-38922.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit 33e07f3cd926105c6d28986eb6218f237505549e)
Signed-off-by: Kent Yao 
---
 .../scala/org/apache/spark/rdd/HadoopRDD.scala |  2 +-
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  2 +-
 .../org/apache/spark/rdd/HadoopRDDSuite.scala  | 30 ++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index fcc2275585e..0d905b46953 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -454,7 +454,7 @@ private[spark] object HadoopRDD extends Logging {
infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
 Option(infos).map(_.flatMap { loc =>
   val locationStr = loc.getLocation
-  if (locationStr != "localhost") {
+  if (locationStr != null && locationStr != "localhost") {
 if (loc.isInMemory) {
   logDebug(s"Partition $locationStr is cached by Hadoop.")
   Some(HDFSCacheTaskLocation(locationStr).toString)
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index ffaabba71e8..ea3a333b19e 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -2736,7 +2736,7 @@ private[spark] class DAGScheduler(
 // If the RDD has some placement preferences (as is the case for input 
RDDs), get those
 val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
 if (rddPrefs.nonEmpty) {
-  return rddPrefs.map(TaskLocation(_))
+  return rddPrefs.filter(_ != null).map(TaskLocation(_))
 }
 
 // If the RDD has narrow dependencies, pick the first partition of the 
first narrow dependency
diff --git a/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala 
b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
new file mode 100644
index 000..b43d76c114c
--- /dev/null
+++ b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import org.apache.hadoop.mapred.SplitLocationInfo
+
+import org.apache.spark.SparkFunSuite
+
+class HadoopRDDSuite extends SparkFunSuite {
+
+  test("SPARK-38922: HadoopRDD convertSplitLocationInfo contains Some(null) 
cause NPE") {
+val locs = Array(new SplitLocationInfo(null, false))
+assert

[spark] branch master updated (f0fa8b119b4 -> 33e07f3cd92)

2022-04-19 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f0fa8b119b4 [SPARK-38962][SQL] Fix wrong computeStats at 
DataSourceV2Relation
 add 33e07f3cd92 [SPARK-38922][CORE] TaskLocation.apply throw 
NullPointerException

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala  |  2 +-
 .../scala/org/apache/spark/scheduler/DAGScheduler.scala   |  2 +-
 .../test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala  | 15 ---
 3 files changed, 10 insertions(+), 9 deletions(-)
 copy 
sql/core/src/test/scala/org/apache/spark/sql/test/TestSparkSessionSuite.scala 
=> core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala (71%)





[spark] branch branch-3.2 updated: [SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

2022-04-19 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 662c5df7534 [SPARK-38922][CORE] TaskLocation.apply throw 
NullPointerException
662c5df7534 is described below

commit 662c5df75341473e5ea4057d5f8300516ca025fa
Author: Kent Yao 
AuthorDate: Wed Apr 20 14:38:26 2022 +0800

[SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

### What changes were proposed in this pull request?

TaskLocation.apply w/o NULL check may throw NPE and fail job scheduling

```

Caused by: java.lang.NullPointerException
at 
scala.collection.immutable.StringLike$class.stripPrefix(StringLike.scala:155)
at scala.collection.immutable.StringOps.stripPrefix(StringOps.scala:29)
at org.apache.spark.scheduler.TaskLocation$.apply(TaskLocation.scala:71)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal
```

For instance, `org.apache.spark.rdd.HadoopRDD#convertSplitLocationInfo` might generate unexpected `Some(null)` elements; these should instead be produced with `Option.apply`, which maps null to `None`.

### Why are the changes needed?

fix NPE

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #36222 from yaooqinn/SPARK-38922.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit 33e07f3cd926105c6d28986eb6218f237505549e)
Signed-off-by: Kent Yao 
---
 .../scala/org/apache/spark/rdd/HadoopRDD.scala |  2 +-
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  2 +-
 .../org/apache/spark/rdd/HadoopRDDSuite.scala  | 30 ++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index 5fc0b4f736d..ec9ab9c0663 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -452,7 +452,7 @@ private[spark] object HadoopRDD extends Logging {
infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
 Option(infos).map(_.flatMap { loc =>
   val locationStr = loc.getLocation
-  if (locationStr != "localhost") {
+  if (locationStr != null && locationStr != "localhost") {
 if (loc.isInMemory) {
   logDebug(s"Partition $locationStr is cached by Hadoop.")
   Some(HDFSCacheTaskLocation(locationStr).toString)
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index 9a27d9cbad2..a82d261d545 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -2528,7 +2528,7 @@ private[spark] class DAGScheduler(
 // If the RDD has some placement preferences (as is the case for input 
RDDs), get those
 val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
 if (rddPrefs.nonEmpty) {
-  return rddPrefs.map(TaskLocation(_))
+  return rddPrefs.filter(_ != null).map(TaskLocation(_))
 }
 
 // If the RDD has narrow dependencies, pick the first partition of the 
first narrow dependency
diff --git a/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala 
b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
new file mode 100644
index 000..b43d76c114c
--- /dev/null
+++ b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import org.apache.hadoop.mapred.SplitLocationInfo
+
+import org.apache.spark.SparkFunSuite
+
+class HadoopRDDSuite extends SparkFunSuite {
+
+  test("SPARK-38922: HadoopRDD convertSplitLocationInfo contains Some(null) 
cause NPE") {
+val locs = Array(new SplitLocationInfo(null, false))
+assert

[spark] branch branch-3.1 updated: [SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

2022-04-19 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new fefd54d7fa1 [SPARK-38922][CORE] TaskLocation.apply throw 
NullPointerException
fefd54d7fa1 is described below

commit fefd54d7fa11c22517603d498a6273b237b867ef
Author: Kent Yao 
AuthorDate: Wed Apr 20 14:38:26 2022 +0800

[SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

### What changes were proposed in this pull request?

TaskLocation.apply w/o NULL check may throw NPE and fail job scheduling

```

Caused by: java.lang.NullPointerException
at 
scala.collection.immutable.StringLike$class.stripPrefix(StringLike.scala:155)
at scala.collection.immutable.StringOps.stripPrefix(StringOps.scala:29)
at org.apache.spark.scheduler.TaskLocation$.apply(TaskLocation.scala:71)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal
```

For instance, `org.apache.spark.rdd.HadoopRDD#convertSplitLocationInfo` might generate unexpected `Some(null)` elements; these should instead be produced with `Option.apply`, which maps null to `None`.

### Why are the changes needed?

fix NPE

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #36222 from yaooqinn/SPARK-38922.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit 33e07f3cd926105c6d28986eb6218f237505549e)
Signed-off-by: Kent Yao 
---
 .../scala/org/apache/spark/rdd/HadoopRDD.scala |  2 +-
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  2 +-
 .../org/apache/spark/rdd/HadoopRDDSuite.scala  | 30 ++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index 5fc0b4f736d..ec9ab9c0663 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -452,7 +452,7 @@ private[spark] object HadoopRDD extends Logging {
infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
 Option(infos).map(_.flatMap { loc =>
   val locationStr = loc.getLocation
-  if (locationStr != "localhost") {
+  if (locationStr != null && locationStr != "localhost") {
 if (loc.isInMemory) {
   logDebug(s"Partition $locationStr is cached by Hadoop.")
   Some(HDFSCacheTaskLocation(locationStr).toString)
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index bc69f4c804e..759fd20ff2e 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -2377,7 +2377,7 @@ private[spark] class DAGScheduler(
 // If the RDD has some placement preferences (as is the case for input 
RDDs), get those
 val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
 if (rddPrefs.nonEmpty) {
-  return rddPrefs.map(TaskLocation(_))
+  return rddPrefs.filter(_ != null).map(TaskLocation(_))
 }
 
 // If the RDD has narrow dependencies, pick the first partition of the 
first narrow dependency
diff --git a/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala 
b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
new file mode 100644
index 000..b43d76c114c
--- /dev/null
+++ b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import org.apache.hadoop.mapred.SplitLocationInfo
+
+import org.apache.spark.SparkFunSuite
+
+class HadoopRDDSuite extends SparkFunSuite {
+
+  test("SPARK-38922: HadoopRDD convertSplitLocationInfo contains Some(null) 
cause NPE") {
+val locs = Array(new SplitLocationInfo(null, false))
+assert

[spark] branch branch-3.0 updated: [SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

2022-04-19 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 156d1aa9146 [SPARK-38922][CORE] TaskLocation.apply throw 
NullPointerException
156d1aa9146 is described below

commit 156d1aa9146723c1a4287c7e7ad8d52d52ddc109
Author: Kent Yao 
AuthorDate: Wed Apr 20 14:38:26 2022 +0800

[SPARK-38922][CORE] TaskLocation.apply throw NullPointerException

### What changes were proposed in this pull request?

TaskLocation.apply w/o NULL check may throw NPE and fail job scheduling

```

Caused by: java.lang.NullPointerException
at 
scala.collection.immutable.StringLike$class.stripPrefix(StringLike.scala:155)
at scala.collection.immutable.StringOps.stripPrefix(StringOps.scala:29)
at org.apache.spark.scheduler.TaskLocation$.apply(TaskLocation.scala:71)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal
```

For instance, `org.apache.spark.rdd.HadoopRDD#convertSplitLocationInfo` might generate unexpected `Some(null)` elements; these should instead be produced with `Option.apply`, which maps null to `None`.

### Why are the changes needed?

fix NPE

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #36222 from yaooqinn/SPARK-38922.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit 33e07f3cd926105c6d28986eb6218f237505549e)
Signed-off-by: Kent Yao 
---
 .../scala/org/apache/spark/rdd/HadoopRDD.scala |  2 +-
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  2 +-
 .../org/apache/spark/rdd/HadoopRDDSuite.scala  | 30 ++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala 
b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index 9742d12cfe0..e15a7d6f10a 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -448,7 +448,7 @@ private[spark] object HadoopRDD extends Logging {
infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
 Option(infos).map(_.flatMap { loc =>
   val locationStr = loc.getLocation
-  if (locationStr != "localhost") {
+  if (locationStr != null && locationStr != "localhost") {
 if (loc.isInMemory) {
   logDebug(s"Partition $locationStr is cached by Hadoop.")
   Some(HDFSCacheTaskLocation(locationStr).toString)
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index 28f36d76884..2f7b26c7d9d 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -2178,7 +2178,7 @@ private[spark] class DAGScheduler(
 // If the RDD has some placement preferences (as is the case for input 
RDDs), get those
 val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
 if (rddPrefs.nonEmpty) {
-  return rddPrefs.map(TaskLocation(_))
+  return rddPrefs.filter(_ != null).map(TaskLocation(_))
 }
 
 // If the RDD has narrow dependencies, pick the first partition of the 
first narrow dependency
diff --git a/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala 
b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
new file mode 100644
index 000..b43d76c114c
--- /dev/null
+++ b/core/src/test/scala/org/apache/spark/rdd/HadoopRDDSuite.scala
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import org.apache.hadoop.mapred.SplitLocationInfo
+
+import org.apache.spark.SparkFunSuite
+
+class HadoopRDDSuite extends SparkFunSuite {
+
+  test("SPARK-38922: HadoopRDD convertSplitLocationInfo contains Some(null) 
cause NPE") {
+val locs = Array(new SplitLocationInfo(null, false))
+assert