[spark] branch master updated (fab4ceb -> b425156)

2022-02-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from fab4ceb  [SPARK-38240][SQL] Improve RuntimeReplaceable and add a 
guideline for adding new functions
 add b425156  [SPARK-38162][SQL] Optimize one row plan in normal and AQE 
Optimizer

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/dsl/package.scala|   2 +-
 .../catalyst/optimizer/OptimizeOneRowPlan.scala|  49 ++
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |  13 ++-
 .../sql/catalyst/rules/RuleIdCollection.scala  |   1 +
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala |   2 +-
 .../catalyst/optimizer/EliminateSortsSuite.scala   |  10 --
 .../optimizer/OptimizeOneRowPlanSuite.scala| 104 +
 .../sql/execution/adaptive/AQEOptimizer.scala  |   5 +-
 .../adaptive/AdaptiveQueryExecSuite.scala  |  54 +++
 9 files changed, 219 insertions(+), 21 deletions(-)
 create mode 100644 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowPlan.scala
 create mode 100644 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeOneRowPlanSuite.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38240][SQL] Improve RuntimeReplaceable and add a guideline for adding new functions

2022-02-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fab4ceb  [SPARK-38240][SQL] Improve RuntimeReplaceable and add a 
guideline for adding new functions
fab4ceb is described below

commit fab4ceb157baac870f6d50b942084bb9b2cd4ad2
Author: Wenchen Fan 
AuthorDate: Wed Feb 23 15:32:00 2022 +0800

[SPARK-38240][SQL] Improve RuntimeReplaceable and add a guideline for 
adding new functions

### What changes were proposed in this pull request?

This PR improves `RuntimeReplaceable` so that it can
1. Customize the type coercion behavior instead of always inheriting it from the replacement expression. This is useful for expressions like `ToBinary`, whose replacement can be a `Cast`, which does not apply type coercion on its own.
2. Support aggregate functions.

This PR also adds a guideline for adding new SQL functions, with 
`RuntimeReplaceable` and `ExpressionBuilder`. See 
https://github.com/apache/spark/pull/35534/files#diff-6c6ba3e220b9d155160e4e25305fdd3a4835b7ce9eba230a7ae70bdd97047313R330
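
For readers skimming the digest, here is a minimal, self-contained sketch of the idea behind `RuntimeReplaceable` (the names and types below are illustrative toys, not Spark's actual Catalyst classes): the user-facing function carries no evaluation logic of its own, only a `replacement` expression that a rewrite pass substitutes before execution.

```scala
// Toy expression tree -- illustrative only, not Spark's Catalyst API.
sealed trait Expr { def eval(row: Map[String, Any]): Any }

case class Lit(v: Any) extends Expr { def eval(row: Map[String, Any]): Any = v }
case class Col(name: String) extends Expr { def eval(row: Map[String, Any]): Any = row(name) }
case class Iff(p: Expr, t: Expr, f: Expr) extends Expr {
  def eval(row: Map[String, Any]): Any =
    if (p.eval(row).asInstanceOf[Boolean]) t.eval(row) else f.eval(row)
}
case class Eq(l: Expr, r: Expr) extends Expr {
  def eval(row: Map[String, Any]): Any = l.eval(row) == r.eval(row)
}

// A "runtime replaceable" expression: never evaluated directly; the
// optimizer rewrites it into `replacement` first.
trait Replaceable extends Expr {
  def replacement: Expr
  final def eval(row: Map[String, Any]): Any =
    sys.error(s"$this should have been replaced before evaluation")
}

// NULLIF(a, b) defined purely as a rewrite to IF(a = b, NULL, a).
case class NullIf(a: Expr, b: Expr) extends Replaceable {
  def replacement: Expr = Iff(Eq(a, b), Lit(null), a)
}

// The rewrite pass, applied recursively over the tree.
def replaceAll(e: Expr): Expr = e match {
  case r: Replaceable => replaceAll(r.replacement)
  case Iff(p, t, f)   => Iff(replaceAll(p), replaceAll(t), replaceAll(f))
  case Eq(l, r)       => Eq(replaceAll(l), replaceAll(r))
  case leaf           => leaf
}

// replaceAll(NullIf(Col("a"), Col("b"))).eval(Map("a" -> 1, "b" -> 2))  // 1
```

Adding a new function then usually means writing only the rewrite; what this PR changes, per the summary above, is that type coercion no longer has to be inherited from the replacement, and aggregate functions can participate too.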

### Why are the changes needed?

Since we keep adding new functions, it's better to make `RuntimeReplaceable` more useful and set up a standard for adding functions.

### Does this PR introduce _any_ user-facing change?

Improves the error messages of some functions.

### How was this patch tested?

Existing tests.

Closes #35534 from cloud-fan/refactor.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/examples/extensions/AgeExample.scala |  13 +-
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   4 +
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  63 +++-
 .../sql/catalyst/analysis/TimeTravelSpec.scala |   2 +-
 .../sql/catalyst/expressions/Expression.scala  |  81 +++--
 .../spark/sql/catalyst/expressions/TryEval.scala   |  51 ++-
 .../catalyst/expressions/aggregate/CountIf.scala   |  35 +--
 .../catalyst/expressions/aggregate/RegrCount.scala |  19 +-
 ...{UnevaluableAggs.scala => boolAggregates.scala} |  41 +--
 .../expressions/collectionOperations.scala |  53 +++-
 .../catalyst/expressions/datetimeExpressions.scala | 343 ++---
 .../catalyst/expressions/intervalExpressions.scala |  10 +-
 .../sql/catalyst/expressions/mathExpressions.scala |  97 +++---
 .../spark/sql/catalyst/expressions/misc.scala  |  91 +++---
 .../sql/catalyst/expressions/nullExpressions.scala |  54 +---
 .../catalyst/expressions/regexpExpressions.scala   |  19 +-
 .../catalyst/expressions/stringExpressions.scala   | 207 ++---
 .../sql/catalyst/optimizer/finishAnalysis.scala|  21 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |   2 +-
 .../spark/sql/catalyst/trees/TreePatterns.scala|   3 -
 .../apache/spark/sql/catalyst/util/package.scala   |   4 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  24 +-
 .../spark/sql/errors/QueryExecutionErrors.scala|   8 +-
 .../expressions/DateExpressionsSuite.scala |   8 +-
 .../scala/org/apache/spark/sql/functions.scala |   4 +-
 .../sql-functions/sql-expression-schema.md |  20 +-
 .../sql-tests/inputs/string-functions.sql  |   9 +-
 .../resources/sql-tests/results/ansi/map.sql.out   |   4 +-
 .../results/ansi/string-functions.sql.out  |  28 +-
 .../results/ceil-floor-with-scale-param.sql.out|  14 +-
 .../resources/sql-tests/results/extract.sql.out|   4 +-
 .../resources/sql-tests/results/group-by.sql.out   |  12 +-
 .../test/resources/sql-tests/results/map.sql.out   |   4 +-
 .../sql-tests/results/string-functions.sql.out |  28 +-
 .../sql-tests/results/timestamp-ltz.sql.out|   2 +-
 .../sql-tests/results/udf/udf-group-by.sql.out |   8 +-
 .../apache/spark/sql/DataFrameAggregateSuite.scala |   3 +-
 37 files changed, 657 insertions(+), 736 deletions(-)

diff --git 
a/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala 
b/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala
index d25f220..e484024 100644
--- 
a/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala
+++ 
b/examples/src/main/scala/org/apache/spark/examples/extensions/AgeExample.scala
@@ -18,14 +18,15 @@
 package org.apache.spark.examples.extensions
 
 import org.apache.spark.sql.catalyst.expressions.{CurrentDate, Expression, 
RuntimeReplaceable, SubtractDates}
+import org.apache.spark.sql.catalyst.trees.UnaryLike
 
 /**
  * How old are you in days?
  */
-case class AgeExample(birthday: Expression, child: Expression) extends 
RuntimeReplaceable {
-
-  def this(birthday: Expression) = this(birthday, SubtractDates(CurrentDate(), 
birthday))
-  override def exprsReplaced: Seq[Expression] = Seq(birthday)
-
-  override 

[GitHub] [spark-website] AngersZhuuuu commented on pull request #380: Fix wrong issue link

2022-02-22 Thread GitBox


AngersZhuuuu commented on pull request #380:
URL: https://github.com/apache/spark-website/pull/380#issuecomment-1048486813


   > oh wait, we should also update generated HTMLs too.
   
   updated and double checked.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [spark-website] AngersZhuuuu commented on pull request #380: Fix wrong issue link

2022-02-22 Thread GitBox


AngersZhuuuu commented on pull request #380:
URL: https://github.com/apache/spark-website/pull/380#issuecomment-1048485159


   > oh wait, we should also update generated HTMLs too.
   
   OK,  let me change it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [spark-website] HyukjinKwon commented on pull request #380: Fix wrong issue link

2022-02-22 Thread GitBox


HyukjinKwon commented on pull request #380:
URL: https://github.com/apache/spark-website/pull/380#issuecomment-1048484840


   oh wait, we should also update generated HTMLs too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [spark-website] AngersZhuuuu opened a new pull request #380: Fix wrong issue link

2022-02-22 Thread GitBox


AngersZhuuuu opened a new pull request #380:
URL: https://github.com/apache/spark-website/pull/380


   Fix wrong issue link


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[spark] branch branch-3.2 updated: [SPARK-38297][PYTHON] Explicitly cast the return value at DataFrame.to_numpy in POS

2022-02-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new a0d2be5  [SPARK-38297][PYTHON] Explicitly cast the return value at 
DataFrame.to_numpy in POS
a0d2be5 is described below

commit a0d2be565486367abd6b637c98634c35420994ce
Author: Hyukjin Kwon 
AuthorDate: Wed Feb 23 14:12:39 2022 +0900

[SPARK-38297][PYTHON] Explicitly cast the return value at 
DataFrame.to_numpy in POS

The MyPy build currently fails as below:

```
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/pandas/generic.py:585: error: Incompatible return value type 
(got "Union[ndarray[Any, Any], ExtensionArray]", expected "ndarray[Any, Any]")  
[return-value]
Found 1 error in 1 file (checked 324 source files)
1
```

https://github.com/apache/spark/runs/5298261168?check_suite_focus=true

I tried to reproduce it locally by matching the NumPy and MyPy versions, but failed. So I decided to work around the problem for now by explicitly casting to make MyPy happy.

To make the build pass.

No, dev-only.

CI in this PR should verify if it's fixed.

Closes #35617 from HyukjinKwon/SPARK-38297.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit b46b74ce0521d1d5e7c09cadad0e9639e31214cb)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/generic.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py
index cdd8f67..c26b516 100644
--- a/python/pyspark/pandas/generic.py
+++ b/python/pyspark/pandas/generic.py
@@ -573,7 +573,7 @@ class Frame(object, metaclass=ABCMeta):
         >>> ps.Series(['a', 'b', 'a']).to_numpy()
         array(['a', 'b', 'a'], dtype=object)
         """
-        return self.to_pandas().values
+        return cast(np.ndarray, self._to_pandas().values)
 
     @property
     def values(self) -> np.ndarray:

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (43e93b5 -> b46b74c)

2022-02-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 43e93b5  [SPARK-38241][K8S][TESTS] Close KubernetesClient in K8S 
integrations tests
 add b46b74c  [SPARK-38297][PYTHON] Explicitly cast the return value at 
DataFrame.to_numpy in POS

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/generic.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (2534217 -> 43e93b5)

2022-02-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 2534217  [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in 
`hadoop-3` profile
 add 43e93b5  [SPARK-38241][K8S][TESTS] Close KubernetesClient in K8S 
integrations tests

No new revisions were added by this update.

Summary of changes:
 .../deploy/k8s/integrationtest/backend/cloud/KubeConfigBackend.scala   | 3 +++
 .../k8s/integrationtest/backend/minikube/MinikubeTestBackend.scala | 3 +++
 2 files changed, 6 insertions(+)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in `hadoop-3` profile

2022-02-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2534217  [SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in 
`hadoop-3` profile
2534217 is described below

commit 25342179447914d76123b8d3ae7bddf34e4bcfba
Author: yangjie01 
AuthorDate: Tue Feb 22 18:47:11 2022 -0800

[SPARK-38260][BUILD][CORE] Remove `commons-net` dependency in `hadoop-3` 
profile

### What changes were proposed in this pull request?
[SPARK-1189](https://github.com/apache/spark/pull/33/files) introduced a Maven dependency on `commons-net` because `org.apache.commons.net.util.Base64` was used in `SparkSaslServer`. `SparkSaslServer` has since changed to use `io.netty.handler.codec.base64.Base64`, and there is no explicit dependency on `commons-net` left in Spark code, so this PR removes the dependency.

After this PR, Spark built with the `hadoop-3` profile no longer needs `commons-net`, but Spark built with `hadoop-2` still does, because `hadoop-2.7.4` uses `commons-net` directly.

### Why are the changes needed?
Remove unnecessary maven dependency.

### Does this PR introduce _any_ user-facing change?
The `commons-net` jar no longer ships in the Spark client built with hadoop-3.x.

### How was this patch tested?
Pass GA

Closes #35582 from LuciferYang/SPARK-38260.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 core/pom.xml  | 4 
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 -
 pom.xml   | 5 -
 3 files changed, 10 deletions(-)

diff --git a/core/pom.xml b/core/pom.xml
index ac429fc..3d09591 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -251,10 +251,6 @@
       <artifactId>RoaringBitmap</artifactId>
     </dependency>
     <dependency>
-      <groupId>commons-net</groupId>
-      <artifactId>commons-net</artifactId>
-    </dependency>
-    <dependency>
       <groupId>org.scala-lang.modules</groupId>
       <artifactId>scala-xml_${scala.binary.version}</artifactId>
     </dependency>
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 73644ee..2de677e 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -49,7 +49,6 @@ commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/3.12.0//commons-lang3-3.12.0.jar
 commons-logging/1.1.3//commons-logging-1.1.3.jar
 commons-math3/3.6.1//commons-math3-3.6.1.jar
-commons-net/3.1//commons-net-3.1.jar
 commons-pool/1.5.4//commons-pool-1.5.4.jar
 commons-text/1.9//commons-text-1.9.jar
 compress-lzf/1.0.3//compress-lzf-1.0.3.jar
diff --git a/pom.xml b/pom.xml
index 23e567c..d1e391c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -804,11 +804,6 @@
         <version>0.9.23</version>
       </dependency>
       <dependency>
-        <groupId>commons-net</groupId>
-        <artifactId>commons-net</artifactId>
-        <version>3.1</version>
-      </dependency>
-      <dependency>
         <groupId>io.netty</groupId>
         <artifactId>netty-all</artifactId>
         <version>4.1.74.Final</version>

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (4d75d47 -> ceb32c9)

2022-02-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 4d75d47  [SPARK-38062][CORE] Avoid resolving placeholder hostname for 
FallbackStorage in BlockManagerDecommissioner
 add ceb32c9  [SPARK-38272][K8S][TESTS] Use `docker-desktop` instead of 
`docker-for-desktop` for Docker K8S IT deployMode and context name

No new revisions were added by this update.

Summary of changes:
 resource-managers/kubernetes/integration-tests/README.md  | 8 
 .../integration-tests/scripts/setup-integration-test-env.sh   | 2 +-
 .../apache/spark/deploy/k8s/integrationtest/TestConstants.scala   | 1 +
 .../k8s/integrationtest/backend/IntegrationTestBackend.scala  | 2 +-
 .../integrationtest/backend/docker/DockerForDesktopBackend.scala  | 2 +-
 5 files changed, 8 insertions(+), 7 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (a11f799 -> 4d75d47)

2022-02-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a11f799  [SPARK-38121][PYTHON][SQL][FOLLOW-UP] Make df.sparkSession 
return the session that created DataFrame when SQLContext is used
 add 4d75d47  [SPARK-38062][CORE] Avoid resolving placeholder hostname for 
FallbackStorage in BlockManagerDecommissioner

No new revisions were added by this update.

Summary of changes:
 .../spark/storage/BlockManagerDecommissioner.scala | 31 +-
 .../spark/storage/FallbackStorageSuite.scala   | 14 +++---
 2 files changed, 22 insertions(+), 23 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (27dbf6f -> a11f799)

2022-02-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 27dbf6f  [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 
to 42.3.3
 add a11f799  [SPARK-38121][PYTHON][SQL][FOLLOW-UP] Make df.sparkSession 
return the session that created DataFrame when SQLContext is used

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 to 42.3.3

2022-02-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 27dbf6f  [SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 
to 42.3.3
27dbf6f is described below

commit 27dbf6fe67c81887ee656a69fc327f3cb5ae56f2
Author: bjornjorgensen 
AuthorDate: Tue Feb 22 13:02:14 2022 -0800

[SPARK-38291][BUILD][TESTS] Upgrade `postgresql` from 42.3.0 to 42.3.3

### What changes were proposed in this pull request?
Upgrade `postgresql` from 42.3.0 to 42.3.3.
[Postgresql changelog 
42.3.3](https://jdbc.postgresql.org/documentation/changelog.html#version_42.3.3)

### Why are the changes needed?
[CVE-2022-21724](https://nvd.nist.gov/vuln/detail/CVE-2022-21724)
and
[Arbitrary File Write 
Vulnerability](https://github.com/advisories/GHSA-673j-qm5f-xpv8)

Upgrading `postgresql` from 42.3.0 to 42.3.3 resolves these issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
All tests must pass.

Closes #35614 from bjornjorgensen/postgresql-from-42.3.0-to-42.3.3.

Authored-by: bjornjorgensen 
Signed-off-by: Dongjoon Hyun 
---
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index 788cf8c..23e567c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1181,7 +1181,7 @@
       <dependency>
         <groupId>org.postgresql</groupId>
         <artifactId>postgresql</artifactId>
-        <version>42.3.0</version>
+        <version>42.3.3</version>
         <scope>test</scope>
       </dependency>
       <dependency>

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (43822cd -> bd44611)

2022-02-22 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 43822cd  [SPARK-38060][SQL] Respect allowNonNumericNumbers when 
parsing quoted NaN and Infinity values in JSON reader
 add bd44611  [SPARK-38290][SQL] Fix JsonSuite and ParquetIOSuite under 
ANSI mode

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/json/JsonSuite.scala | 39 +-
 .../datasources/parquet/ParquetIOSuite.scala   |  7 +++-
 2 files changed, 29 insertions(+), 17 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader

2022-02-22 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 43822cd  [SPARK-38060][SQL] Respect allowNonNumericNumbers when 
parsing quoted NaN and Infinity values in JSON reader
43822cd is described below

commit 43822cdd228a3ba49c47637c525d731d00772f64
Author: Andy Grove 
AuthorDate: Tue Feb 22 08:42:47 2022 -0600

[SPARK-38060][SQL] Respect allowNonNumericNumbers when parsing quoted NaN 
and Infinity values in JSON reader

Signed-off-by: Andy Grove 

### What changes were proposed in this pull request?

When parsing unquoted `NaN` and `Infinity` values in JSON for floating-point columns, we get the expected behavior shown below: valid values are returned when the parsing option `allowNonNumericNumbers` is enabled, and `null` otherwise.

| Value | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
| --- | --- | --- |
| NaN   | Double.NaN  | null |
| +INF  | Double.PositiveInfinity | null |
| +Infinity | Double.PositiveInfinity | null |
| Infinity  | Double.PositiveInfinity | null |
| -INF  | Double.NegativeInfinity | null |
| -Infinity | Double.NegativeInfinity | null |

However, when these values are quoted, we get the following unexpected behavior: a different code path is used that is inconsistent with Jackson's parsing and ignores the `allowNonNumericNumbers` parser option.

| Value   | allowNonNumericNumbers=true | allowNonNumericNumbers=false |
| --- | --- | --- |
| "NaN"   | Double.NaN  | Double.NaN   |
| "+INF"  | null| null |
| "+Infinity" | null| null |
| "Infinity"  | Double.PositiveInfinity | Double.PositiveInfinity  |
| "-INF"  | null| null |
| "-Infinity" | Double.NegativeInfinity | Double.NegativeInfinity  |

This PR updates the code path that handles quoted non-numeric numbers to 
make it consistent with the path that handles the unquoted values.
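
As a rough illustration of the intended post-fix behavior, here is a spark-shell sketch (assuming a session named `spark` with `spark.implicits._` available; the column name `v` and the two sample rows are made up for illustration):

```scala
import spark.implicits._
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(StructField("v", DoubleType)))
// Quoted non-numeric values in the JSON input.
val ds = Seq("""{"v": "NaN"}""", """{"v": "-INF"}""").toDS()

// With the option enabled, both rows should parse per the first table:
// NaN and -Infinity.
spark.read.schema(schema).option("allowNonNumericNumbers", "true").json(ds).show()

// With the option disabled, both rows should now come back as null.
spark.read.schema(schema).option("allowNonNumericNumbers", "false").json(ds).show()
```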

### Why are the changes needed?

The current behavior does not match the documented behavior in 
https://spark.apache.org/docs/latest/sql-data-sources-json.html

### Does this PR introduce _any_ user-facing change?

Yes, parsing of quoted `NaN` and `Infinity` values will now be consistent 
with the unquoted versions.

### How was this patch tested?

Unit tests are updated.

Closes #35573 from andygrove/SPARK-38060.

Authored-by: Andy Grove 
Signed-off-by: Sean Owen 
---
 docs/core-migration-guide.md   |  2 ++
 .../spark/sql/catalyst/json/JacksonParser.scala| 18 ++
 .../datasources/json/JsonParsingOptionsSuite.scala | 39 ++
 .../sql/execution/datasources/json/JsonSuite.scala |  6 
 4 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/docs/core-migration-guide.md b/docs/core-migration-guide.md
index 745b80d..588433c 100644
--- a/docs/core-migration-guide.md
+++ b/docs/core-migration-guide.md
@@ -26,6 +26,8 @@ license: |
 
 - Since Spark 3.3, Spark migrates its log4j dependency from 1.x to 2.x because 
log4j 1.x has reached end of life and is no longer supported by the community. 
Vulnerabilities reported after August 2015 against log4j 1.x were not checked 
and will not be fixed. Users should rewrite original log4j properties files 
using log4j2 syntax (XML, JSON, YAML, or properties format). Spark rewrites the 
`conf/log4j.properties.template` which is included in Spark distribution, to 
`conf/log4j2.properties [...]
 
+- Since Spark 3.3, when reading values from a JSON attribute defined as 
`FloatType` or `DoubleType`, the strings `"+Infinity"`, `"+INF"`, and `"-INF"` 
are now parsed to the appropriate values, in addition to the already supported 
`"Infinity"` and `"-Infinity"` variations. This change was made to improve 
consistency with Jackson's parsing of the unquoted versions of these values. 
Also, the `allowNonNumericNumbers` option is now respected so these strings 
will now be considered invalid if  [...]
+
 ## Upgrading from Core 3.1 to 3.2
 
 - Since Spark 3.2, `spark.scheduler.allocation.file` supports read remote file 
using hadoop filesystem which means if the path has no scheme Spark will 
respect hadoop configuration to read it. To restore the behavior before 

[spark] branch branch-3.2 updated: [SPARK-38271] PoissonSampler may output more rows than MaxRows

2022-02-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 7d36329  [SPARK-38271] PoissonSampler may output more rows than MaxRows
7d36329 is described below

commit 7d363294b7af212836e7a444ad82c716f3560278
Author: Ruifeng Zheng 
AuthorDate: Tue Feb 22 21:04:43 2022 +0800

[SPARK-38271] PoissonSampler may output more rows than MaxRows

### What changes were proposed in this pull request?
When `withReplacement=true`, `Sample.maxRows` now returns `None`.

### Why are the changes needed?
The underlying implementation of `SampleExec` cannot guarantee that its number of output rows is <= `Sample.maxRows`:

```
scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.count
res0: Long = 1000

scala> df.sample(true, 0.99, 10).count
res1: Long = 1004
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #35593 from zhengruifeng/fix_sample_maxRows.

Authored-by: Ruifeng Zheng 
Signed-off-by: Wenchen Fan 
(cherry picked from commit b68327968a7a5f7ac1afa9cc270204c9eaddcb75)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/plans/logical/basicLogicalOperators.scala  |  6 +-
 .../spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala | 13 +
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 7f33f28..6748db5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1344,7 +1344,11 @@ case class Sample(
   s"Sampling fraction ($fraction) must be on interval [0, 1] without 
replacement")
   }
 
-  override def maxRows: Option[Long] = child.maxRows
+  override def maxRows: Option[Long] = {
+    // when withReplacement is true, PoissonSampler is applied in SampleExec,
+    // which may output more rows than child.maxRows.
+    if (withReplacement) None else child.maxRows
+  }
   override def output: Seq[Attribute] = child.output
 
   override protected def withNewChildInternal(newChild: LogicalPlan): Sample =
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
index 46e9dea..d3cbaa8 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala
@@ -159,6 +159,19 @@ class CombiningLimitsSuite extends PlanTest {
 )
   }
 
+  test("SPARK-38271: PoissonSampler may output more rows than child.maxRows") {
+    val query = testRelation.select().sample(0, 0.2, true, 1)
+    assert(query.maxRows.isEmpty)
+    val optimized = Optimize.execute(query.analyze)
+    assert(optimized.maxRows.isEmpty)
+    // can not eliminate Limit since Sample.maxRows is None
+    checkPlanAndMaxRow(
+      query.limit(10),
+      query.limit(10),
+      10
+    )
+  }
+
   test("SPARK-33497: Eliminate Limit if Deduplicate max rows not larger than 
Limit") {
 checkPlanAndMaxRow(
   testRelation.deduplicate("a".attr).limit(10),

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c82e0fe -> b683279)

2022-02-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c82e0fe  [SPARK-37422][PYTHON][MLLIB] Inline typehints for 
pyspark.mllib.feature
 add b683279  [SPARK-38271] PoissonSampler may output more rows than MaxRows

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/plans/logical/basicLogicalOperators.scala  |  6 +-
 .../spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala | 13 +
 2 files changed, 18 insertions(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (ef818ed -> c82e0fe)

2022-02-22 Thread zero323
This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ef818ed  [SPARK-38283][SQL] Test invalid datetime parsing under ANSI 
mode
 add c82e0fe  [SPARK-37422][PYTHON][MLLIB] Inline typehints for 
pyspark.mllib.feature

No new revisions were added by this update.

Summary of changes:
 python/pyspark/mllib/feature.py  | 218 ---
 python/pyspark/mllib/feature.pyi | 169 --
 2 files changed, 155 insertions(+), 232 deletions(-)
 delete mode 100644 python/pyspark/mllib/feature.pyi

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-38283][SQL] Test invalid datetime parsing under ANSI mode

2022-02-22 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef818ed  [SPARK-38283][SQL] Test invalid datetime parsing under ANSI 
mode
ef818ed is described below

commit ef818ed86ce41be55bd962a5c809974f957f8734
Author: Gengliang Wang 
AuthorDate: Tue Feb 22 19:12:02 2022 +0800

[SPARK-38283][SQL] Test invalid datetime parsing under ANSI mode

### What changes were proposed in this pull request?

Run datetime-parsing-invalid.sql under ANSI mode in SQLQueryTestSuite to improve test coverage.

Also, we can simply set ANSI mode to off in DateFunctionsSuite, so that the test suite can pass once we set up a new test job with ANSI on.
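
For example, one of the newly generated results below can be reproduced directly in spark-shell (a sketch, assuming a session named `spark` on a build that includes this change):

```scala
// With ANSI mode on, an invalid datetime raises an error at execution time...
spark.conf.set("spark.sql.ansi.enabled", true)
spark.sql("select to_timestamp('366', 'D')").collect()
// java.time.DateTimeException: Invalid date 'DayOfYear 366' as '1970'
// is not a leap year. If necessary set spark.sql.ansi.enabled to false
// to bypass this error.

// ...while with ANSI mode off, the same query should return null instead.
spark.conf.set("spark.sql.ansi.enabled", false)
spark.sql("select to_timestamp('366', 'D')").show()
```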

### Why are the changes needed?

Improve test coverage and fix DateFunctionsSuite under ANSI mode.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #35606 from gengliangwang/fixDateFuncSuite.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 .../inputs/ansi/datetime-parsing-invalid.sql   |   2 +
 .../results/ansi/datetime-parsing-invalid.sql.out  | 263 +
 .../org/apache/spark/sql/DateFunctionsSuite.scala  |   6 +-
 3 files changed, 270 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/test/resources/sql-tests/inputs/ansi/datetime-parsing-invalid.sql
 
b/sql/core/src/test/resources/sql-tests/inputs/ansi/datetime-parsing-invalid.sql
new file mode 100644
index 000..70022f3
--- /dev/null
+++ 
b/sql/core/src/test/resources/sql-tests/inputs/ansi/datetime-parsing-invalid.sql
@@ -0,0 +1,2 @@
+--IMPORT datetime-parsing-invalid.sql
+
diff --git 
a/sql/core/src/test/resources/sql-tests/results/ansi/datetime-parsing-invalid.sql.out
 
b/sql/core/src/test/resources/sql-tests/results/ansi/datetime-parsing-invalid.sql.out
new file mode 100644
index 000..e6dd07b
--- /dev/null
+++ 
b/sql/core/src/test/resources/sql-tests/results/ansi/datetime-parsing-invalid.sql.out
@@ -0,0 +1,263 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 29
+
+
+-- !query
+select to_timestamp('294248', 'y')
+-- !query schema
+struct<>
+-- !query output
+java.lang.ArithmeticException
+long overflow
+
+
+-- !query
+select to_timestamp('1', 'yy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to 
parse '1' in the new parser. You can set spark.sql.legacy.timeParserPolicy to 
LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat 
it as an invalid datetime string.
+
+
+-- !query
+select to_timestamp('-12', 'yy')
+-- !query schema
+struct<>
+-- !query output
+java.time.format.DateTimeParseException
+Text '-12' could not be parsed at index 0. If necessary set 
spark.sql.ansi.enabled to false to bypass this error.
+
+
+-- !query
+select to_timestamp('123', 'yy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to 
parse '123' in the new parser. You can set spark.sql.legacy.timeParserPolicy to 
LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat 
it as an invalid datetime string.
+
+
+-- !query
+select to_timestamp('1', 'yyy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to 
parse '1' in the new parser. You can set spark.sql.legacy.timeParserPolicy to 
LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat 
it as an invalid datetime string.
+
+
+-- !query
+select to_timestamp('1234567', 'yyy')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to 
recognize 'yyy' pattern in the DateTimeFormatter. 1) You can set 
spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before 
Spark 3.0. 2) You can form a valid datetime pattern with the guide from 
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select to_timestamp('366', 'D')
+-- !query schema
+struct<>
+-- !query output
+java.time.DateTimeException
+Invalid date 'DayOfYear 366' as '1970' is not a leap year. If necessary set 
spark.sql.ansi.enabled to false to bypass this error.
+
+
+-- !query
+select to_timestamp('9', 'DD')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to 
parse '9' in the new parser. You can set spark.sql.legacy.timeParserPolicy to 
LEGACY to restore the behavior before Spark 

[spark] branch master updated (a103a49 -> 48b56c0)

2022-02-22 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a103a49  [SPARK-38279][TESTS][3.2] Pin MarkupSafe to 2.0.1 fix linter 
failure
 add 48b56c0  [SPARK-38278][PYTHON] Add SparkContext.addArchive in PySpark

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.rst |  1 +
 python/pyspark/context.py| 44 
 2 files changed, 45 insertions(+)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org