[spark] branch branch-3.1 updated: [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new c93d23b  [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
c93d23b is described below

commit c93d23bd14bdd6ea0d25bbe03bef4755f3ad2616
Author: Dongjoon Hyun 
AuthorDate: Mon Sep 20 10:52:45 2021 -0700

[SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

### What changes were proposed in this pull request?

This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image.

### Why are the changes needed?

The `openjdk:11-jre-slim` image has been upgraded to `Debian 11`.

```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

It causes `R 3.5` installation failures in our K8s integration test environment.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
 r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
   Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update &&   apt install -y gnupg &&   echo 
"deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> 
/etc/apt/sources.list &&   apt-key adv --keyserver keyserver.ubuntu.com 
--recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' &&   apt-get update &&
apt install -y -t buster-cran35 r-base r-base-dev &&   rm -rf
```

### Does this PR introduce _any_ user-facing change?

Yes, this will recover the installation.

### How was this patch tested?

Succeeded in building the SparkR Docker image in the K8s integration test on Jenkins CI.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```

Closes #34048 from dongjoon-hyun/SPARK-36806.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit a178752540e2d37a6da847a381de7c8d6b4797d3)
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 5d0e51e943615d65b28e245fbf1fa3e575e20128)
Signed-off-by: Dongjoon Hyun 
---
 .../docker/src/main/dockerfiles/spark/bindings/R/Dockerfile   | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
index aabd04b..03e4210 100644
--- 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
+++ 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
@@ -25,14 +25,10 @@ USER 0
 
 RUN mkdir ${SPARK_HOME}/R
 
-# Install R 3.6.3 (http://cloud.r-project.org/bin/linux/debian/)
+# Install R 4.0.4 (http://cloud.r-project.org/bin/linux/debian/)
 RUN \
   apt-get update && \
-  apt install -y gnupg && \
-  echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> 
/etc/apt/sources.list && \
-  apt-key adv --keyserver keyserver.ubuntu.com --recv-key 
'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && \
-  apt-get update && \
-  apt install -y -t buster-cran35 r-base r-base-dev && \
+  apt install -y r-base r-base-dev && \
   rm -rf /var/cache/apt/*
 
 COPY R ${SPARK_HOME}/R

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value

2021-09-20 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 3d47c69  [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has 
NaN value
3d47c69 is described below

commit 3d47c692d276d2d489664aa2d4e66e23e9bae0f7
Author: dgd-contributor 
AuthorDate: Mon Sep 20 17:52:51 2021 -0700

[SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value

### What changes were proposed in this pull request?
Fix DataFrame.isin when DataFrame has NaN value

### Why are the changes needed?
`DataFrame.isin` returns `None` instead of `False` when the comparison involves NaN/None values, while pandas returns proper booleans, as shown below.

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
      a     b     c
0  None  None  True
1  True  None  None
2  None  None  True
3  None  None  None
4  None  True  True
5  None  True  True
6  None  None  True
7  None  None  None
8  None  None  None

>>> psdf.to_pandas().isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```

### Does this PR introduce _any_ user-facing change?
After this PR

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```
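
The core of the fix, as the diff below shows, is to wrap each column's `isin` result in `coalesce(..., lit(False))` so that null comparison results become `False`, matching pandas. A minimal sketch of that column expression in plain PySpark (the DataFrame and column name here are illustrative, not the internal frame used by pandas API on Spark):

``` python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.createDataFrame([(None,), (2.0,), (3.0,)], "a double")

# Column.isin yields null (not False) when the comparison involves null/NaN,
# so coalesce falls back to False to match pandas' boolean output.
cond = F.coalesce(F.col("a").isin([1, 2, None]), F.lit(False))
sdf.select(cond.alias("a_isin")).show()
```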

### How was this patch tested?
Unit tests

Closes #34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix.

Authored-by: dgd-contributor 
Signed-off-by: Takuya UESHIN 
(cherry picked from commit cc182fe6f61eab494350b81196b3cce356814a25)
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/frame.py| 34 +++---
 python/pyspark/pandas/tests/test_dataframe.py | 35 +++
 2 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index ec6b261..e576789 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -7394,31 +7394,37 @@ defaultdict(, {'col..., 'col...})]
 if col in values:
 item = values[col]
 item = item.tolist() if isinstance(item, np.ndarray) else 
list(item)
-data_spark_columns.append(
-
self._internal.spark_column_for(self._internal.column_labels[i])
-.isin(item)
-.alias(self._internal.data_spark_column_names[i])
+
+scol = 
self._internal.spark_column_for(self._internal.column_labels[i]).isin(
+[SF.lit(v) for v in item]
 )
+scol = F.coalesce(scol, F.lit(False))
 else:
-data_spark_columns.append(
-
SF.lit(False).alias(self._internal.data_spark_column_names[i])
-)
+scol = SF.lit(False)
+
data_spark_columns.append(scol.alias(self._internal.data_spark_column_names[i]))
 elif is_list_like(values):
 values = (
 cast(np.ndarray, values).tolist()
 if isinstance(values, np.ndarray)
 else list(values)
 )
-data_spark_columns += [
-self._internal.spark_column_for(label)
-.isin(values)
-.alias(self._internal.spark_column_name_for(label))
-for label in self._internal.column_labels
-]
+
+for label in self._internal.column_labels:
+scol = self._internal.spark_column_for(label).isin([SF.lit(v) 
for v in values])
+scol = 

[spark] branch master updated (4b61c62 -> cc182fe)

2021-09-20 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 4b61c62  [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in 
`iLocIndexer` to use `Column.isin`
 add cc182fe  [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has 
NaN value

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py| 34 +++---
 python/pyspark/pandas/tests/test_dataframe.py | 35 +++
 2 files changed, 55 insertions(+), 14 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`

2021-09-20 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4b61c62  [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in 
`iLocIndexer` to use `Column.isin`
4b61c62 is described below

commit 4b61c623b52032c7e4749db641c5e63c1317c3a4
Author: Xinrong Meng 
AuthorDate: Mon Sep 20 15:00:10 2021 -0700

[SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in `iLocIndexer` 
to use `Column.isin`

### What changes were proposed in this pull request?
Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`.

### Why are the changes needed?
For better performance.

A rough benchmark shows that a long projection performs worse than 
`Column.isin`, even when the length of the filtering conditions exceeds 
`compute.isin_limit`.

So we use `Column.isin` instead.
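
A rough sketch of the two approaches being compared, written with plain PySpark column expressions (the DataFrame and the `seq` column are illustrative stand-ins for the internal sequence column):

``` python
from functools import reduce

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.range(10).withColumnRenamed("id", "seq")
keys = [1, 3, 5, 7]

# Old approach: OR-chain of one equality condition per key.
cond_or_chain = reduce(lambda x, y: x | y, [F.col("seq") == F.lit(int(k)) for k in keys])

# New approach: a single Column.isin over all keys.
cond_isin = F.col("seq").isin([int(k) for k in keys])

sdf.filter(cond_isin).show()
```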

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33964 from xinrong-databricks/iloc_select.

Authored-by: Xinrong Meng 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/indexing.py | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/indexing.py 
b/python/pyspark/pandas/indexing.py
index c46e250..3e07975 100644
--- a/python/pyspark/pandas/indexing.py
+++ b/python/pyspark/pandas/indexing.py
@@ -1657,14 +1657,13 @@ class iLocIndexer(LocIndexerLike):
 "however, normalised index was [%s]" % new_rows_sel
 )
 
-sequence_scol = sdf[self._sequence_col]
-cond = []
-for key in new_rows_sel:
-cond.append(sequence_scol == SF.lit(int(key)).cast(LongType()))
-
-if len(cond) == 0:
-cond = [SF.lit(False)]
-return reduce(lambda x, y: x | y, cond), None, None
+if len(new_rows_sel) == 0:
+cond = SF.lit(False)
+else:
+cond = sdf[self._sequence_col].isin(
+[SF.lit(int(key)).cast(LongType()) for key in new_rows_sel]
+)
+return cond, None, None
 
 def _select_rows_else(
 self, rows_sel: Any

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36618][PYTHON] Support dropping rows of a single-indexed DataFrame

2021-09-20 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4cf86d3  [SPARK-36618][PYTHON] Support dropping rows of a 
single-indexed DataFrame
4cf86d3 is described below

commit 4cf86d33adc483382a2803486db628e21cad44e9
Author: Xinrong Meng 
AuthorDate: Mon Sep 20 14:50:50 2021 -0700

[SPARK-36618][PYTHON] Support dropping rows of a single-indexed DataFrame

### What changes were proposed in this pull request?
Support dropping rows of a single-indexed DataFrame.

Dropping rows and columns at the same time is supported in this PR as well.

### Why are the changes needed?
To increase pandas API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, dropping rows of a single-indexed DataFrame is supported now.

```py
>>> df = ps.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
```
 From
```py
>>> df.drop([0, 1])
Traceback (most recent call last):
...
KeyError: [(0,), (1,)]

>>> df.drop([0, 1], axis=0)
Traceback (most recent call last):
...
NotImplementedError: Drop currently only works for axis=1

>>> df.drop(1)
Traceback (most recent call last):
...
KeyError: [(1,)]

>>> df.drop(index=1)
Traceback (most recent call last):
...
TypeError: drop() got an unexpected keyword argument 'index'

>>> df.drop(index=[0, 1], columns='A')
Traceback (most recent call last):
...
TypeError: drop() got an unexpected keyword argument 'index'
```
 To
```py
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

>>> df.drop([0, 1], axis=0)
   A  B   C   D
2  8  9  10  11

>>> df.drop(1)
   A  B   C   D
0  0  1   2   3
2  8  9  10  11

>>> df.drop(index=1)
   A  B   C   D
0  0  1   2   3
2  8  9  10  11

>>> df.drop(index=[0, 1], columns='A')
   B   C   D
2  9  10  11
```
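
As the new migration note in the diff below states, `drop` now drops by index rather than by column by default, so column dropping should be requested explicitly. A short, illustrative recap (assuming `pyspark.pandas` is importable as `ps`):

``` python
import numpy as np
import pyspark.pandas as ps

df = ps.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

df.drop([0, 1])                     # new default: drops rows 0 and 1
df.drop(columns='A')                # drops column 'A' explicitly
df.drop(index=[0, 1], columns='A')  # drops rows and a column together
```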

### How was this patch tested?
Unit tests.

Closes #33929 from xinrong-databricks/frame_drop.

Authored-by: Xinrong Meng 
Signed-off-by: Takuya UESHIN 
---
 .../source/migration_guide/pyspark_3.2_to_3.3.rst  |  23 +++
 python/pyspark/pandas/frame.py | 176 ++---
 python/pyspark/pandas/indexing.py  |   4 +-
 python/pyspark/pandas/tests/test_dataframe.py  | 106 +++--
 python/pyspark/pandas/tests/test_groupby.py|   8 +-
 5 files changed, 241 insertions(+), 76 deletions(-)

diff --git a/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst 
b/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst
new file mode 100644
index 000..060f24c
--- /dev/null
+++ b/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst
@@ -0,0 +1,23 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+=================================
+Upgrading from PySpark 3.2 to 3.3
+=================================
+
+* In Spark 3.3, the ``drop`` method of pandas API on Spark DataFrame supports 
dropping rows by ``index``, and sets dropping by index instead of column by 
default.
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 8d37d31..f863890 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -2817,7 +2817,7 @@ defaultdict(, {'col..., 'col...})]
 3   NaN
 """
 result = self[item]
-self._update_internal_frame(self.drop(item)._internal)
+self._update_internal_frame(self.drop(columns=item)._internal)
 return result
 
 # TODO: add axis parameter can work when '1' or 'columns'
@@ -6586,23 +6586,31 @@ defaultdict(, {'col..., 'col...})]
 def drop(
 self,
 labels: Optional[Union[Name, List[Name]]] = None,
-axis: Axis = 1,
+axis: Optional[Axis] = 0,
+index: Union[Name, List[Name]] = 

[spark] branch branch-3.2 updated: [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 5d0e51e  [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
5d0e51e is described below

commit 5d0e51e943615d65b28e245fbf1fa3e575e20128
Author: Dongjoon Hyun 
AuthorDate: Mon Sep 20 10:52:45 2021 -0700

[SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

### What changes were proposed in this pull request?

This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image.

### Why are the changes needed?

The `openjdk:11-jre-slim` image has been upgraded to `Debian 11`.

```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

It causes `R 3.5` installation failures in our K8s integration test environment.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
 r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
   Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update &&   apt install -y gnupg &&   echo 
"deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> 
/etc/apt/sources.list &&   apt-key adv --keyserver keyserver.ubuntu.com 
--recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' &&   apt-get update &&
apt install -y -t buster-cran35 r-base r-base-dev &&   rm -rf
```

### Does this PR introduce _any_ user-facing change?

Yes, this will recover the installation.

### How was this patch tested?

Succeeded in building the SparkR Docker image in the K8s integration test on Jenkins CI.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```

Closes #34048 from dongjoon-hyun/SPARK-36806.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit a178752540e2d37a6da847a381de7c8d6b4797d3)
Signed-off-by: Dongjoon Hyun 
---
 .../docker/src/main/dockerfiles/spark/bindings/R/Dockerfile   | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
index aabd04b..03e4210 100644
--- 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
+++ 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
@@ -25,14 +25,10 @@ USER 0
 
 RUN mkdir ${SPARK_HOME}/R
 
-# Install R 3.6.3 (http://cloud.r-project.org/bin/linux/debian/)
+# Install R 4.0.4 (http://cloud.r-project.org/bin/linux/debian/)
 RUN \
   apt-get update && \
-  apt install -y gnupg && \
-  echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> 
/etc/apt/sources.list && \
-  apt-key adv --keyserver keyserver.ubuntu.com --recv-key 
'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && \
-  apt-get update && \
-  apt install -y -t buster-cran35 r-base r-base-dev && \
+  apt install -y r-base r-base-dev && \
   rm -rf /var/cache/apt/*
 
 COPY R ${SPARK_HOME}/R

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated (3d0e631 -> b4a370b)

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3d0e631  [SPARK-36754][SQL] ArrayIntersect handle duplicated 
Double.NaN and Float.NaN
 add b4a370b  [SPARK-36706][SQL][3.1] OverwriteByExpression conversion in 
DataSourceV2Strategy use wrong param in translateFilter

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (75e71ef -> a178752)

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 75e71ef  [SPARK-36808][BUILD] Upgrade Kafka to 2.8.1
 add a178752  [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

No new revisions were added by this update.

Summary of changes:
 .../docker/src/main/dockerfiles/spark/bindings/R/Dockerfile   | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3b56039 -> 75e71ef)

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3b56039  [SPARK-36805][BUILD][K8S] Upgrade kubernetes-client to 5.7.3
 add 75e71ef  [SPARK-36808][BUILD] Upgrade Kafka to 2.8.1

No new revisions were added by this update.

Summary of changes:
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (30d17b6 -> 3b56039)

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 30d17b6  [SPARK-36683][SQL] Add new built-in SQL functions: SEC and CSC
 add 3b56039  [SPARK-36805][BUILD][K8S] Upgrade kubernetes-client to 5.7.3

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 42 -
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 42 -
 pom.xml |  2 +-
 3 files changed, 43 insertions(+), 43 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (4cc39cf -> 30d17b6)

2021-09-20 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 4cc39cf  [SPARK-36101][CORE] Grouping exception in core/api
 add 30d17b6  [SPARK-36683][SQL] Add new built-in SQL functions: SEC and CSC

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst   |  2 +
 python/pyspark/sql/functions.py| 58 
 python/pyspark/sql/functions.pyi   |  2 +
 python/pyspark/sql/tests/test_functions.py | 78 --
 python/pyspark/testing/sqlutils.py |  8 +++
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  2 +
 .../sql/catalyst/expressions/mathExpressions.scala | 46 +
 .../expressions/MathExpressionsSuite.scala | 28 
 .../scala/org/apache/spark/sql/functions.scala | 18 +
 .../sql-functions/sql-expression-schema.md |  4 +-
 .../test/resources/sql-tests/inputs/operators.sql  |  8 +++
 .../resources/sql-tests/results/operators.sql.out  | 66 +-
 .../org/apache/spark/sql/MathFunctionsSuite.scala  | 15 +
 13 files changed, 299 insertions(+), 36 deletions(-)
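
The new functions are exposed as built-in SQL functions, so a quick way to try them is through `spark.sql` (a minimal sketch assuming a running `SparkSession`; the Python wrappers suggested by the `functions.py` changes above are not shown here):

``` python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# SEC(x) = 1 / COS(x) and CSC(x) = 1 / SIN(x), per the new built-in functions.
spark.sql("SELECT SEC(0.5) AS sec_val, CSC(0.5) AS csc_val").show()
```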

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36101][CORE] Grouping exception in core/api

2021-09-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4cc39cf  [SPARK-36101][CORE] Grouping exception in core/api
4cc39cf is described below

commit 4cc39cfe153aca70c97dc8702b3b55cf22f3ce80
Author: dgd-contributor 
AuthorDate: Mon Sep 20 17:19:29 2021 +0800

[SPARK-36101][CORE] Grouping exception in core/api

### What changes were proposed in this pull request?
This PR groups exception messages in core/src/main/scala/org/apache/spark/api.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests; passing all original tests confirms the change doesn't break any existing behavior.

Closes #33536 from dgd-contributor/SPARK-36101.

Authored-by: dgd-contributor 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/api/python/Py4JServer.scala   |  7 +++---
 .../spark/api/python/PythonWorkerFactory.scala | 10 -
 .../python/WriteInputFormatTestDataGenerator.scala |  6 ++---
 .../org/apache/spark/errors/SparkCoreErrors.scala  | 26 ++
 .../main/scala/org/apache/spark/rdd/PipedRDD.scala |  2 +-
 .../apache/spark/scheduler/TaskSchedulerImpl.scala |  5 +++--
 6 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala 
b/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala
index 2edc492..ef85bbd 100644
--- a/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala
@@ -21,6 +21,7 @@ import java.net.InetAddress
 import java.util.Locale
 
 import org.apache.spark.SparkConf
+import org.apache.spark.errors.SparkCoreErrors
 import org.apache.spark.internal.Logging
 import org.apache.spark.util.Utils
 
@@ -52,18 +53,18 @@ private[spark] class Py4JServer(sparkConf: SparkConf) 
extends Logging {
   def start(): Unit = server match {
 case clientServer: py4j.ClientServer => clientServer.startServer()
 case gatewayServer: py4j.GatewayServer => gatewayServer.start()
-case other => throw new RuntimeException(s"Unexpected Py4J server 
${other.getClass}")
+case other => throw SparkCoreErrors.unexpectedPy4JServerError(other)
   }
 
   def getListeningPort: Int = server match {
 case clientServer: py4j.ClientServer => 
clientServer.getJavaServer.getListeningPort
 case gatewayServer: py4j.GatewayServer => gatewayServer.getListeningPort
-case other => throw new RuntimeException(s"Unexpected Py4J server 
${other.getClass}")
+case other => throw SparkCoreErrors.unexpectedPy4JServerError(other)
   }
 
   def shutdown(): Unit = server match {
 case clientServer: py4j.ClientServer => clientServer.shutdown()
 case gatewayServer: py4j.GatewayServer => gatewayServer.shutdown()
-case other => throw new RuntimeException(s"Unexpected Py4J server 
${other.getClass}")
+case other => throw SparkCoreErrors.unexpectedPy4JServerError(other)
   }
 }
diff --git 
a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
index 7b2c36b..2beca6f 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
@@ -27,6 +27,7 @@ import scala.collection.JavaConverters._
 import scala.collection.mutable
 
 import org.apache.spark._
+import org.apache.spark.errors.SparkCoreErrors
 import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config.Python._
 import org.apache.spark.security.SocketAuthHelper
@@ -219,12 +220,11 @@ private[spark] class PythonWorkerFactory(pythonExec: 
String, envVars: Map[String
   daemonPort = in.readInt()
 } catch {
   case _: EOFException if daemon.isAlive =>
-throw new SparkException("EOFException occurred while reading the 
port number " +
-  s"from $daemonModule's stdout")
+throw SparkCoreErrors.eofExceptionWhileReadPortNumberError(
+  daemonModule)
   case _: EOFException =>
-throw new SparkException(
-  s"EOFException occurred while reading the port number from 
$daemonModule's" +
-  s" stdout and terminated with code: ${daemon.exitValue}.")
+throw SparkCoreErrors.
+  eofExceptionWhileReadPortNumberError(daemonModule, 
Some(daemon.exitValue))
 }
 
 // test that the returned port number is within a valid range.
diff --git 

[spark] branch branch-3.0 updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

2021-09-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 016eeaf  [SPARK-36754][SQL] ArrayIntersect handle duplicated 
Double.NaN and Float.NaN
016eeaf is described below

commit 016eeafd1196fb20449b457599487e1e06b6ffa7
Author: Angerszh 
AuthorDate: Mon Sep 20 16:48:59 2021 +0800

[SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?
For the query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.
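
For reference, a minimal way to reproduce the behaviour from PySpark (the session setup is illustrative; any existing `SparkSession` works):

``` python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Before this fix the result array contains NaN; with the fix it is empty.
spark.sql(
    "SELECT array_intersect(array(CAST('nan' AS DOUBLE), 1d), "
    "array(CAST('nan' AS DOUBLE))) AS result"
).show(truncate=False)
```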

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes: `ArrayIntersect` no longer returns equal `NaN` values; the example above now returns `[]`.

### How was this patch tested?
Added UT

Closes #33995 from AngersZh/SPARK-36754.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3)
Signed-off-by: Wenchen Fan 
---
 .../expressions/collectionOperations.scala | 66 ++
 .../expressions/CollectionExpressionsSuite.scala   | 17 ++
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 479b1d7..c153181 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3583,33 +3583,42 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
 if (TypeUtils.typeWithProperEquals(elementType)) {
   (array1, array2) =>
 if (array1.numElements() != 0 && array2.numElements() != 0) {
-  val hs = new OpenHashSet[Any]
-  val hsResult = new OpenHashSet[Any]
-  var foundNullElement = false
+  val hs = new SQLOpenHashSet[Any]
+  val hsResult = new SQLOpenHashSet[Any]
+  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+  val withArray2NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hs,
+(value: Any) => hs.add(value),
+(valueNaN: Any) => {} )
+  val withArray1NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult,
+(value: Any) =>
+  if (hs.contains(value) && !hsResult.contains(value)) {
+arrayBuffer += value
+hsResult.add(value)
+  },
+(valueNaN: Any) =>
+  if (hs.containsNaN()) {
+arrayBuffer += valueNaN
+  })
   var i = 0
   while (i < array2.numElements()) {
 if (array2.isNullAt(i)) {
-  foundNullElement = true
+  hs.addNull()
 } else {
   val elem = array2.get(i, elementType)
-  hs.add(elem)
+  withArray2NaNCheckFunc(elem)
 }
 i += 1
   }
-  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
   i = 0
   while (i < array1.numElements()) {
 if (array1.isNullAt(i)) {
-  if (foundNullElement) {
+  if (hs.containsNull() && !hsResult.containsNull()) {
 arrayBuffer += null
-foundNullElement = false
+hsResult.addNull()
   }
 } else {
   val elem = array1.get(i, elementType)
-  if (hs.contains(elem) && !hsResult.contains(elem)) {
-arrayBuffer += elem
-hsResult.add(elem)
-  }
+  withArray1NaNCheckFunc(elem)
 }
 i += 1
   }
@@ -3684,10 +3693,9 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
   val ptName = CodeGenerator.primitiveTypeName(jt)
 
   nullSafeCodeGen(ctx, ev, (array1, array2) => {
-val foundNullElement = ctx.freshName("foundNullElement")
 val nullElementIndex = ctx.freshName("nullElementIndex")
 val builder = ctx.freshName("builder")
-val openHashSet = classOf[OpenHashSet[_]].getName
+val openHashSet = classOf[SQLOpenHashSet[_]].getName
 val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()"
 val hashSet = ctx.freshName("hashSet")
 val hashSetResult = 

[spark] branch branch-3.1 updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

2021-09-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 3d0e631  [SPARK-36754][SQL] ArrayIntersect handle duplicated 
Double.NaN and Float.NaN
3d0e631 is described below

commit 3d0e631b1b16e267968f1793fabc8c62a97efc7f
Author: Angerszh 
AuthorDate: Mon Sep 20 16:48:59 2021 +0800

[SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?
For the query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes: `ArrayIntersect` no longer returns equal `NaN` values; the example above now returns `[]`.

### How was this patch tested?
Added UT

Closes #33995 from AngersZh/SPARK-36754.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3)
Signed-off-by: Wenchen Fan 
---
 .../expressions/collectionOperations.scala | 66 ++
 .../expressions/CollectionExpressionsSuite.scala   | 17 ++
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 6cf1ead..77340c6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3632,33 +3632,42 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
 if (TypeUtils.typeWithProperEquals(elementType)) {
   (array1, array2) =>
 if (array1.numElements() != 0 && array2.numElements() != 0) {
-  val hs = new OpenHashSet[Any]
-  val hsResult = new OpenHashSet[Any]
-  var foundNullElement = false
+  val hs = new SQLOpenHashSet[Any]
+  val hsResult = new SQLOpenHashSet[Any]
+  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+  val withArray2NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hs,
+(value: Any) => hs.add(value),
+(valueNaN: Any) => {} )
+  val withArray1NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult,
+(value: Any) =>
+  if (hs.contains(value) && !hsResult.contains(value)) {
+arrayBuffer += value
+hsResult.add(value)
+  },
+(valueNaN: Any) =>
+  if (hs.containsNaN()) {
+arrayBuffer += valueNaN
+  })
   var i = 0
   while (i < array2.numElements()) {
 if (array2.isNullAt(i)) {
-  foundNullElement = true
+  hs.addNull()
 } else {
   val elem = array2.get(i, elementType)
-  hs.add(elem)
+  withArray2NaNCheckFunc(elem)
 }
 i += 1
   }
-  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
   i = 0
   while (i < array1.numElements()) {
 if (array1.isNullAt(i)) {
-  if (foundNullElement) {
+  if (hs.containsNull() && !hsResult.containsNull()) {
 arrayBuffer += null
-foundNullElement = false
+hsResult.addNull()
   }
 } else {
   val elem = array1.get(i, elementType)
-  if (hs.contains(elem) && !hsResult.contains(elem)) {
-arrayBuffer += elem
-hsResult.add(elem)
-  }
+  withArray1NaNCheckFunc(elem)
 }
 i += 1
   }
@@ -3733,10 +3742,9 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
   val ptName = CodeGenerator.primitiveTypeName(jt)
 
   nullSafeCodeGen(ctx, ev, (array1, array2) => {
-val foundNullElement = ctx.freshName("foundNullElement")
 val nullElementIndex = ctx.freshName("nullElementIndex")
 val builder = ctx.freshName("builder")
-val openHashSet = classOf[OpenHashSet[_]].getName
+val openHashSet = classOf[SQLOpenHashSet[_]].getName
 val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()"
 val hashSet = ctx.freshName("hashSet")
 val hashSetResult = 

[spark] branch branch-3.2 updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

2021-09-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 337a197  [SPARK-36754][SQL] ArrayIntersect handle duplicated 
Double.NaN and Float.NaN
337a197 is described below

commit 337a1979d291f7c836635d065e0bdda4ee188992
Author: Angerszh 
AuthorDate: Mon Sep 20 16:48:59 2021 +0800

[SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?
For the query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes: `ArrayIntersect` no longer returns equal `NaN` values; the example above now returns `[]`.

### How was this patch tested?
Added UT

Closes #33995 from AngersZh/SPARK-36754.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3)
Signed-off-by: Wenchen Fan 
---
 .../expressions/collectionOperations.scala | 66 ++
 .../expressions/CollectionExpressionsSuite.scala   | 17 ++
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 1182194..b325e9a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3847,33 +3847,42 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
 if (TypeUtils.typeWithProperEquals(elementType)) {
   (array1, array2) =>
 if (array1.numElements() != 0 && array2.numElements() != 0) {
-  val hs = new OpenHashSet[Any]
-  val hsResult = new OpenHashSet[Any]
-  var foundNullElement = false
+  val hs = new SQLOpenHashSet[Any]
+  val hsResult = new SQLOpenHashSet[Any]
+  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+  val withArray2NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hs,
+(value: Any) => hs.add(value),
+(valueNaN: Any) => {} )
+  val withArray1NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult,
+(value: Any) =>
+  if (hs.contains(value) && !hsResult.contains(value)) {
+arrayBuffer += value
+hsResult.add(value)
+  },
+(valueNaN: Any) =>
+  if (hs.containsNaN()) {
+arrayBuffer += valueNaN
+  })
   var i = 0
   while (i < array2.numElements()) {
 if (array2.isNullAt(i)) {
-  foundNullElement = true
+  hs.addNull()
 } else {
   val elem = array2.get(i, elementType)
-  hs.add(elem)
+  withArray2NaNCheckFunc(elem)
 }
 i += 1
   }
-  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
   i = 0
   while (i < array1.numElements()) {
 if (array1.isNullAt(i)) {
-  if (foundNullElement) {
+  if (hs.containsNull() && !hsResult.containsNull()) {
 arrayBuffer += null
-foundNullElement = false
+hsResult.addNull()
   }
 } else {
   val elem = array1.get(i, elementType)
-  if (hs.contains(elem) && !hsResult.contains(elem)) {
-arrayBuffer += elem
-hsResult.add(elem)
-  }
+  withArray1NaNCheckFunc(elem)
 }
 i += 1
   }
@@ -3948,10 +3957,9 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
   val ptName = CodeGenerator.primitiveTypeName(jt)
 
   nullSafeCodeGen(ctx, ev, (array1, array2) => {
-val foundNullElement = ctx.freshName("foundNullElement")
 val nullElementIndex = ctx.freshName("nullElementIndex")
 val builder = ctx.freshName("builder")
-val openHashSet = classOf[OpenHashSet[_]].getName
+val openHashSet = classOf[SQLOpenHashSet[_]].getName
 val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()"
 val hashSet = ctx.freshName("hashSet")
 val hashSetResult = 

[spark] branch master updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

2021-09-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2fc7f2f  [SPARK-36754][SQL] ArrayIntersect handle duplicated 
Double.NaN and Float.NaN
2fc7f2f is described below

commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3
Author: Angerszh 
AuthorDate: Mon Sep 20 16:48:59 2021 +0800

[SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?
For the query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes: `ArrayIntersect` no longer returns equal `NaN` values; the example above now returns `[]`.

### How was this patch tested?
Added UT

Closes #33995 from AngersZh/SPARK-36754.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../expressions/collectionOperations.scala | 66 ++
 .../expressions/CollectionExpressionsSuite.scala   | 17 ++
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 1182194..b325e9a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3847,33 +3847,42 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
 if (TypeUtils.typeWithProperEquals(elementType)) {
   (array1, array2) =>
 if (array1.numElements() != 0 && array2.numElements() != 0) {
-  val hs = new OpenHashSet[Any]
-  val hsResult = new OpenHashSet[Any]
-  var foundNullElement = false
+  val hs = new SQLOpenHashSet[Any]
+  val hsResult = new SQLOpenHashSet[Any]
+  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
+  val withArray2NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hs,
+(value: Any) => hs.add(value),
+(valueNaN: Any) => {} )
+  val withArray1NaNCheckFunc = 
SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult,
+(value: Any) =>
+  if (hs.contains(value) && !hsResult.contains(value)) {
+arrayBuffer += value
+hsResult.add(value)
+  },
+(valueNaN: Any) =>
+  if (hs.containsNaN()) {
+arrayBuffer += valueNaN
+  })
   var i = 0
   while (i < array2.numElements()) {
 if (array2.isNullAt(i)) {
-  foundNullElement = true
+  hs.addNull()
 } else {
   val elem = array2.get(i, elementType)
-  hs.add(elem)
+  withArray2NaNCheckFunc(elem)
 }
 i += 1
   }
-  val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
   i = 0
   while (i < array1.numElements()) {
 if (array1.isNullAt(i)) {
-  if (foundNullElement) {
+  if (hs.containsNull() && !hsResult.containsNull()) {
 arrayBuffer += null
-foundNullElement = false
+hsResult.addNull()
   }
 } else {
   val elem = array1.get(i, elementType)
-  if (hs.contains(elem) && !hsResult.contains(elem)) {
-arrayBuffer += elem
-hsResult.add(elem)
-  }
+  withArray1NaNCheckFunc(elem)
 }
 i += 1
   }
@@ -3948,10 +3957,9 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends ArrayBina
   val ptName = CodeGenerator.primitiveTypeName(jt)
 
   nullSafeCodeGen(ctx, ev, (array1, array2) => {
-val foundNullElement = ctx.freshName("foundNullElement")
 val nullElementIndex = ctx.freshName("nullElementIndex")
 val builder = ctx.freshName("builder")
-val openHashSet = classOf[OpenHashSet[_]].getName
+val openHashSet = classOf[SQLOpenHashSet[_]].getName
 val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()"
 val hashSet = ctx.freshName("hashSet")
 val hashSetResult = ctx.freshName("hashSetResult")
@@ -3963,7 +3971,7 @@ case class ArrayIntersect(left: Expression, right: 
Expression) extends 

[spark] branch master updated: [SPARK-34112][BUILD] Upgrade ORC to 1.7.0

2021-09-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a396dd6  [SPARK-34112][BUILD] Upgrade ORC to 1.7.0
a396dd6 is described below

commit a396dd6216b786c6665a07b8706154e4e9d14bbe
Author: Dongjoon Hyun 
AuthorDate: Mon Sep 20 01:09:15 2021 -0700

[SPARK-34112][BUILD] Upgrade ORC to 1.7.0

### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC from 1.6.11 to 1.7.0 for Apache Spark 
3.3.0.

### Why are the changes needed?

[Apache ORC 1.7.0](https://orc.apache.org/news/2021/09/15/ORC-1.7.0/) is a 
new release with the following new features and improvements.
  - ORC-377 Support Snappy compression in C++ Writer
  - ORC-577 Support row-level filtering
  - ORC-716 Build and test on Java 17-EA
  - ORC-731 Improve Java Tools
  - ORC-742 LazyIO of non-filter columns
  - ORC-751 Implement Predicate Pushdown in C++ Reader
  - ORC-755 Introduce OrcFilterContext
  - ORC-757 Add Hashtable implementation for dictionary
  - ORC-780 Support LZ4 Compression in C++ Writer
  - ORC-797 Allow writers to get the stripe information
  - ORC-818 Build and test in Apple Silicon
  - ORC-861 Bump CMake minimum requirement to 2.8.12
  - ORC-867 Upgrade hive-storage-api to 2.8.1
  - ORC-984 Save the software version that wrote each ORC file

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the existing CIs because this is a dependency change.

Closes #34045 from dongjoon-hyun/SPARK-34112.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 pom.xml | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 6f91caf..83b0c8e 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -195,9 +195,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.11//orc-core-1.6.11.jar
-orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
-orc-shims/1.6.11//orc-shims-1.6.11.jar
+orc-core/1.7.0//orc-core-1.7.0.jar
+orc-mapreduce/1.7.0//orc-mapreduce-1.7.0.jar
+orc-shims/1.7.0//orc-shims-1.7.0.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index ecf448f..c72ac56 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -165,9 +165,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.11//orc-core-1.6.11.jar
-orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
-orc-shims/1.6.11//orc-shims-1.6.11.jar
+orc-core/1.7.0//orc-core-1.7.0.jar
+orc-mapreduce/1.7.0//orc-mapreduce-1.7.0.jar
+orc-shims/1.7.0//orc-shims-1.7.0.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/pom.xml b/pom.xml
index 419c6b7..3f18e6d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -137,7 +137,7 @@
 
 10.14.2.0
 1.12.1
-1.6.11
+1.7.0
 9.4.43.v20210629
 4.0.3
 0.10.0

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org