[spark] branch branch-3.1 updated: [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new c93d23b  [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
c93d23b is described below

commit c93d23bd14bdd6ea0d25bbe03bef4755f3ad2616
Author: Dongjoon Hyun
AuthorDate: Mon Sep 20 10:52:45 2021 -0700

    [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

    ### What changes were proposed in this pull request?

    This PR aims to upgrade R from 3.6.3 to 4.0.4 in the K8s R Docker image.

    ### Why are the changes needed?

    The `openjdk:11-jre-slim` image has been upgraded to `Debian 11`.
    ```
    $ docker run -it openjdk:11-jre-slim cat /etc/os-release
    PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
    NAME="Debian GNU/Linux"
    VERSION_ID="11"
    VERSION="11 (bullseye)"
    VERSION_CODENAME=bullseye
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"
    ```
    This causes `R 3.5` installation failures in our K8s integration test environment.
    - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
    ```
    The following packages have unmet dependencies:
     r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
                   Depends: libreadline7 (>= 6.0) but it is not installable
    E: Unable to correct problems, you have held broken packages.
    The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update && apt install -y -t buster-cran35 r-base r-base-dev && rm -rf
    ```

    ### Does this PR introduce _any_ user-facing change?

    Yes, this will recover the installation.

    ### How was this patch tested?

    Succeeded in building the SparkR docker image in the K8s integration test in Jenkins CI.
    - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
    ```
    Successfully built 32e1a0cd5ff8
    Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
    ```

    Closes #34048 from dongjoon-hyun/SPARK-36806.
    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit a178752540e2d37a6da847a381de7c8d6b4797d3)
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 5d0e51e943615d65b28e245fbf1fa3e575e20128)
    Signed-off-by: Dongjoon Hyun
---
 .../docker/src/main/dockerfiles/spark/bindings/R/Dockerfile | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
index aabd04b..03e4210 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
@@ -25,14 +25,10 @@ USER 0
 
 RUN mkdir ${SPARK_HOME}/R
 
-# Install R 3.6.3 (http://cloud.r-project.org/bin/linux/debian/)
+# Install R 4.0.4 (http://cloud.r-project.org/bin/linux/debian/)
 RUN \
   apt-get update && \
-  apt install -y gnupg && \
-  echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && \
-  apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && \
-  apt-get update && \
-  apt install -y -t buster-cran35 r-base r-base-dev && \
+  apt install -y r-base r-base-dev && \
   rm -rf /var/cache/apt/*
 
 COPY R ${SPARK_HOME}/R
[spark] branch branch-3.2 updated: [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 3d47c69  [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value
3d47c69 is described below

commit 3d47c692d276d2d489664aa2d4e66e23e9bae0f7
Author: dgd-contributor
AuthorDate: Mon Sep 20 17:52:51 2021 -0700

    [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value

    ### What changes were proposed in this pull request?
    Fix DataFrame.isin when DataFrame has NaN value

    ### Why are the changes needed?
    Fix DataFrame.isin when DataFrame has NaN value

    ``` python
    >>> psdf = ps.DataFrame(
    ...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
    ... )
    >>> psdf
         a    b  c
    0  NaN  NaN  1
    1  2.0  5.0  5
    2  3.0  NaN  1
    3  4.0  3.0  3
    4  5.0  2.0  2
    5  6.0  1.0  1
    6  7.0  NaN  1
    7  8.0  0.0  0
    8  NaN  0.0  0

    >>> other = [1, 2, None]

    >>> psdf.isin(other)
          a     b     c
    0  None  None  True
    1  True  None  None
    2  None  None  True
    3  None  None  None
    4  None  True  True
    5  None  True  True
    6  None  None  True
    7  None  None  None
    8  None  None  None

    >>> psdf.to_pandas().isin(other)
           a      b      c
    0  False  False   True
    1   True  False  False
    2  False  False   True
    3  False  False  False
    4  False   True   True
    5  False   True   True
    6  False  False   True
    7  False  False  False
    8  False  False  False
    ```

    ### Does this PR introduce _any_ user-facing change?
    After this PR

    ``` python
    >>> psdf = ps.DataFrame(
    ...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
    ... )
    >>> psdf
         a    b  c
    0  NaN  NaN  1
    1  2.0  5.0  5
    2  3.0  NaN  1
    3  4.0  3.0  3
    4  5.0  2.0  2
    5  6.0  1.0  1
    6  7.0  NaN  1
    7  8.0  0.0  0
    8  NaN  0.0  0

    >>> other = [1, 2, None]
    >>> psdf.isin(other)
           a      b      c
    0  False  False   True
    1   True  False  False
    2  False  False   True
    3  False  False  False
    4  False   True   True
    5  False   True   True
    6  False  False   True
    7  False  False  False
    8  False  False  False
    ```

    ### How was this patch tested?
    Unit tests

    Closes #34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix.
Authored-by: dgd-contributor Signed-off-by: Takuya UESHIN (cherry picked from commit cc182fe6f61eab494350b81196b3cce356814a25) Signed-off-by: Takuya UESHIN --- python/pyspark/pandas/frame.py| 34 +++--- python/pyspark/pandas/tests/test_dataframe.py | 35 +++ 2 files changed, 55 insertions(+), 14 deletions(-) diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py index ec6b261..e576789 100644 --- a/python/pyspark/pandas/frame.py +++ b/python/pyspark/pandas/frame.py @@ -7394,31 +7394,37 @@ defaultdict(, {'col..., 'col...})] if col in values: item = values[col] item = item.tolist() if isinstance(item, np.ndarray) else list(item) -data_spark_columns.append( - self._internal.spark_column_for(self._internal.column_labels[i]) -.isin(item) -.alias(self._internal.data_spark_column_names[i]) + +scol = self._internal.spark_column_for(self._internal.column_labels[i]).isin( +[SF.lit(v) for v in item] ) +scol = F.coalesce(scol, F.lit(False)) else: -data_spark_columns.append( - SF.lit(False).alias(self._internal.data_spark_column_names[i]) -) +scol = SF.lit(False) + data_spark_columns.append(scol.alias(self._internal.data_spark_column_names[i])) elif is_list_like(values): values = ( cast(np.ndarray, values).tolist() if isinstance(values, np.ndarray) else list(values) ) -data_spark_columns += [ -self._internal.spark_column_for(label) -.isin(values) -.alias(self._internal.spark_column_name_for(label)) -for label in self._internal.column_labels -] + +for label in self._internal.column_labels: +scol = self._internal.spark_column_for(label).isin([SF.lit(v) for v in values]) +scol =
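The core of the fix is visible in the diff above: the per-column `isin` result is wrapped in `F.coalesce(scol, F.lit(False))` so that SQL three-valued logic cannot surface `None` in the boolean answer. Below is a minimal standalone PySpark sketch of that idea; it is illustrative only (local session setup and column names are not the pandas-on-Spark internals).

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A double column with a missing value; None becomes a SQL NULL.
df = spark.createDataFrame([(1.0,), (2.0,), (None,)], ["a"])

# Raw isin(): when NULLs are involved, Spark's IN yields NULL rather than False.
raw = df.select(F.col("a").isin([1, None]).alias("raw_isin"))
raw.show()

# Coalescing to False restores a strictly-boolean, pandas-style result.
fixed = df.select(F.coalesce(F.col("a").isin([1, None]), F.lit(False)).alias("isin"))
fixed.show()
```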
[spark] branch master updated (4b61c62 -> cc182fe)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 4b61c62  [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`
     add cc182fe  [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py                | 34 +++---
 python/pyspark/pandas/tests/test_dataframe.py | 35 +++
 2 files changed, 55 insertions(+), 14 deletions(-)
[spark] branch master updated: [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4b61c62  [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`
4b61c62 is described below

commit 4b61c623b52032c7e4749db641c5e63c1317c3a4
Author: Xinrong Meng
AuthorDate: Mon Sep 20 15:00:10 2021 -0700

    [SPARK-36746][PYTHON] Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`

    ### What changes were proposed in this pull request?
    Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`.

    ### Why are the changes needed?
    For better performance. After a rough benchmark, a long projection performs worse than `Column.isin`, even when the length of the filtering conditions exceeds `compute.isin_limit`. So we use `Column.isin` instead.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing tests.

    Closes #33964 from xinrong-databricks/iloc_select.

    Authored-by: Xinrong Meng
    Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/indexing.py | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/indexing.py b/python/pyspark/pandas/indexing.py
index c46e250..3e07975 100644
--- a/python/pyspark/pandas/indexing.py
+++ b/python/pyspark/pandas/indexing.py
@@ -1657,14 +1657,13 @@ class iLocIndexer(LocIndexerLike):
                 "however, normalised index was [%s]" % new_rows_sel
             )
 
-        sequence_scol = sdf[self._sequence_col]
-        cond = []
-        for key in new_rows_sel:
-            cond.append(sequence_scol == SF.lit(int(key)).cast(LongType()))
-
-        if len(cond) == 0:
-            cond = [SF.lit(False)]
-        return reduce(lambda x, y: x | y, cond), None, None
+        if len(new_rows_sel) == 0:
+            cond = SF.lit(False)
+        else:
+            cond = sdf[self._sequence_col].isin(
+                [SF.lit(int(key)).cast(LongType()) for key in new_rows_sel]
+            )
+        return cond, None, None
 
     def _select_rows_else(
         self, rows_sel: Any
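The refactor swaps a fold of per-key equality predicates for a single `Column.isin` call. A rough standalone illustration of the two shapes follows; `sequence` and the non-empty `keys` list are stand-ins for the internal row-sequence column and selection, not the real `iLocIndexer` code.

```python
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.range(10).withColumnRenamed("id", "sequence")
keys = [1, 3, 5]  # assumed non-empty; the real code falls back to lit(False) when empty

# Before: one equality predicate per key, OR-ed together into a long expression tree.
cond_or = reduce(lambda x, y: x | y, [sdf["sequence"] == F.lit(int(k)) for k in keys])

# After: a single isin() predicate over the same keys.
cond_isin = sdf["sequence"].isin([int(k) for k in keys])

sdf.filter(cond_or).show()
sdf.filter(cond_isin).show()  # same rows selected, with a much smaller expression tree
```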
[spark] branch master updated: [SPARK-36618][PYTHON] Support dropping rows of a single-indexed DataFrame
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4cf86d3 [SPARK-36618][PYTHON] Support dropping rows of a single-indexed DataFrame 4cf86d3 is described below commit 4cf86d33adc483382a2803486db628e21cad44e9 Author: Xinrong Meng AuthorDate: Mon Sep 20 14:50:50 2021 -0700 [SPARK-36618][PYTHON] Support dropping rows of a single-indexed DataFrame ### What changes were proposed in this pull request? Support dropping rows of a single-indexed DataFrame. Dropping rows and columns at the same time is supported in this PR as well. ### Why are the changes needed? To increase pandas API coverage. ### Does this PR introduce _any_ user-facing change? Yes, dropping rows of a single-indexed DataFrame is supported now. ```py >>> df = ps.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D']) >>> df A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 ``` From ```py >>> df.drop([0, 1]) Traceback (most recent call last): ... KeyError: [(0,), (1,)] >>> df.drop([0, 1], axis=0) Traceback (most recent call last): ... NotImplementedError: Drop currently only works for axis=1 >>> df.drop(1) Traceback (most recent call last): ... KeyError: [(1,)] >>> df.drop(index=1) Traceback (most recent call last): ... TypeError: drop() got an unexpected keyword argument 'index' >>> df.drop(index=[0, 1], columns='A') Traceback (most recent call last): ... TypeError: drop() got an unexpected keyword argument 'index' ``` To ```py >>> df.drop([0, 1]) A B C D 2 8 9 10 11 >>> df.drop([0, 1], axis=0) A B C D 2 8 9 10 11 >>> df.drop(1) A B C D 0 0 1 2 3 2 8 9 10 11 >>> df.drop(index=1) A B C D 0 0 1 2 3 2 8 9 10 11 >>> df.drop(index=[0, 1], columns='A') B C D 2 9 10 11 ``` ### How was this patch tested? Unit tests. Closes #33929 from xinrong-databricks/frame_drop. Authored-by: Xinrong Meng Signed-off-by: Takuya UESHIN --- .../source/migration_guide/pyspark_3.2_to_3.3.rst | 23 +++ python/pyspark/pandas/frame.py | 176 ++--- python/pyspark/pandas/indexing.py | 4 +- python/pyspark/pandas/tests/test_dataframe.py | 106 +++-- python/pyspark/pandas/tests/test_groupby.py| 8 +- 5 files changed, 241 insertions(+), 76 deletions(-) diff --git a/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst b/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst new file mode 100644 index 000..060f24c --- /dev/null +++ b/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst @@ -0,0 +1,23 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Upgrading from PySpark 3.2 to 3.3 += + +* In Spark 3.3, the ``drop`` method of pandas API on Spark DataFrame supports dropping rows by ``index``, and sets dropping by index instead of column by default. 
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py index 8d37d31..f863890 100644 --- a/python/pyspark/pandas/frame.py +++ b/python/pyspark/pandas/frame.py @@ -2817,7 +2817,7 @@ defaultdict(, {'col..., 'col...})] 3 NaN """ result = self[item] -self._update_internal_frame(self.drop(item)._internal) +self._update_internal_frame(self.drop(columns=item)._internal) return result # TODO: add axis parameter can work when '1' or 'columns' @@ -6586,23 +6586,31 @@ defaultdict(, {'col..., 'col...})] def drop( self, labels: Optional[Union[Name, List[Name]]] = None, -axis: Axis = 1, +axis: Optional[Axis] = 0, +index: Union[Name, List[Name]] =
[spark] branch branch-3.2 updated: [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 5d0e51e [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image 5d0e51e is described below commit 5d0e51e943615d65b28e245fbf1fa3e575e20128 Author: Dongjoon Hyun AuthorDate: Mon Sep 20 10:52:45 2021 -0700 [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image ### What changes were proposed in this pull request? This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image. ### Why are the changes needed? `openjdk:11-jre-slim` image is upgraded to `Debian 11`. ``` $ docker run -it openjdk:11-jre-slim cat /etc/os-release PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/; SUPPORT_URL="https://www.debian.org/support; BUG_REPORT_URL="https://bugs.debian.org/; ``` It causes `R 3.5` installation failures in our K8s integration test environment. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/ ``` The following packages have unmet dependencies: r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable Depends: libreadline7 (>= 6.0) but it is not installable E: Unable to correct problems, you have held broken packages. The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update && apt install -y -t buster-cran35 r-base r-base-dev && rm -rf ``` ### Does this PR introduce _any_ user-facing change? Yes, this will recover the installation. ### How was this patch tested? Succeed to build SparkR docker image in the K8s integration test in Jenkins CI. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/ ``` Successfully built 32e1a0cd5ff8 Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51 ``` Closes #34048 from dongjoon-hyun/SPARK-36806. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit a178752540e2d37a6da847a381de7c8d6b4797d3) Signed-off-by: Dongjoon Hyun --- .../docker/src/main/dockerfiles/spark/bindings/R/Dockerfile | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile index aabd04b..03e4210 100644 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile @@ -25,14 +25,10 @@ USER 0 RUN mkdir ${SPARK_HOME}/R -# Install R 3.6.3 (http://cloud.r-project.org/bin/linux/debian/) +# Install R 4.0.4 (http://cloud.r-project.org/bin/linux/debian/) RUN \ apt-get update && \ - apt install -y gnupg && \ - echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && \ - apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && \ - apt-get update && \ - apt install -y -t buster-cran35 r-base r-base-dev && \ + apt install -y r-base r-base-dev && \ rm -rf /var/cache/apt/* COPY R ${SPARK_HOME}/R - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.1 updated (3d0e631 -> b4a370b)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 3d0e631  [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
     add b4a370b  [SPARK-36706][SQL][3.1] OverwriteByExpression conversion in DataSourceV2Strategy use wrong param in translateFilter

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (75e71ef -> a178752)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 75e71ef  [SPARK-36808][BUILD] Upgrade Kafka to 2.8.1
     add a178752  [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image

No new revisions were added by this update.

Summary of changes:
 .../docker/src/main/dockerfiles/spark/bindings/R/Dockerfile | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)
[spark] branch master updated (3b56039 -> 75e71ef)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 3b56039  [SPARK-36805][BUILD][K8S] Upgrade kubernetes-client to 5.7.3
     add 75e71ef  [SPARK-36808][BUILD] Upgrade Kafka to 2.8.1

No new revisions were added by this update.

Summary of changes:
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (30d17b6 -> 3b56039)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 30d17b6  [SPARK-36683][SQL] Add new built-in SQL functions: SEC and CSC
     add 3b56039  [SPARK-36805][BUILD][K8S] Upgrade kubernetes-client to 5.7.3

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 42 -
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 42 -
 pom.xml                                 |  2 +-
 3 files changed, 43 insertions(+), 43 deletions(-)
[spark] branch master updated (4cc39cf -> 30d17b6)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 4cc39cf  [SPARK-36101][CORE] Grouping exception in core/api
     add 30d17b6  [SPARK-36683][SQL] Add new built-in SQL functions: SEC and CSC

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst        |  2 +
 python/pyspark/sql/functions.py                     | 58
 python/pyspark/sql/functions.pyi                    |  2 +
 python/pyspark/sql/tests/test_functions.py          | 78 --
 python/pyspark/testing/sqlutils.py                  |  8 +++
 .../sql/catalyst/analysis/FunctionRegistry.scala    |  2 +
 .../sql/catalyst/expressions/mathExpressions.scala  | 46 +
 .../expressions/MathExpressionsSuite.scala          | 28
 .../scala/org/apache/spark/sql/functions.scala      | 18 +
 .../sql-functions/sql-expression-schema.md          |  4 +-
 .../test/resources/sql-tests/inputs/operators.sql   |  8 +++
 .../resources/sql-tests/results/operators.sql.out   | 66 +-
 .../org/apache/spark/sql/MathFunctionsSuite.scala   | 15 +
 13 files changed, 299 insertions(+), 36 deletions(-)
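For context on what the added `SEC`/`CSC` functions do, here is a small hedged example. It assumes the textbook definitions `sec(x) = 1/cos(x)` and `csc(x) = 1/sin(x)` and that the functions are reachable from plain SQL; the summary above does not show their exact signatures or Python wrappers.

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# New built-in SQL functions added by SPARK-36683 (Spark 3.3.0 and later).
spark.sql("SELECT sec(0.5) AS sec, csc(0.5) AS csc").show()

# Expected values, if the functions follow their usual definitions:
print(1 / math.cos(0.5), 1 / math.sin(0.5))
```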
[spark] branch master updated: [SPARK-36101][CORE] Grouping exception in core/api
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4cc39cf [SPARK-36101][CORE] Grouping exception in core/api 4cc39cf is described below commit 4cc39cfe153aca70c97dc8702b3b55cf22f3ce80 Author: dgd-contributor AuthorDate: Mon Sep 20 17:19:29 2021 +0800 [SPARK-36101][CORE] Grouping exception in core/api ### What changes were proposed in this pull request? This PR group exception messages in core/src/main/scala/org/apache/spark/api ### Why are the changes needed? It will largely help with standardization of error messages and its maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #33536 from dgd-contributor/SPARK-36101. Authored-by: dgd-contributor Signed-off-by: Wenchen Fan --- .../org/apache/spark/api/python/Py4JServer.scala | 7 +++--- .../spark/api/python/PythonWorkerFactory.scala | 10 - .../python/WriteInputFormatTestDataGenerator.scala | 6 ++--- .../org/apache/spark/errors/SparkCoreErrors.scala | 26 ++ .../main/scala/org/apache/spark/rdd/PipedRDD.scala | 2 +- .../apache/spark/scheduler/TaskSchedulerImpl.scala | 5 +++-- 6 files changed, 33 insertions(+), 23 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala b/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala index 2edc492..ef85bbd 100644 --- a/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala +++ b/core/src/main/scala/org/apache/spark/api/python/Py4JServer.scala @@ -21,6 +21,7 @@ import java.net.InetAddress import java.util.Locale import org.apache.spark.SparkConf +import org.apache.spark.errors.SparkCoreErrors import org.apache.spark.internal.Logging import org.apache.spark.util.Utils @@ -52,18 +53,18 @@ private[spark] class Py4JServer(sparkConf: SparkConf) extends Logging { def start(): Unit = server match { case clientServer: py4j.ClientServer => clientServer.startServer() case gatewayServer: py4j.GatewayServer => gatewayServer.start() -case other => throw new RuntimeException(s"Unexpected Py4J server ${other.getClass}") +case other => throw SparkCoreErrors.unexpectedPy4JServerError(other) } def getListeningPort: Int = server match { case clientServer: py4j.ClientServer => clientServer.getJavaServer.getListeningPort case gatewayServer: py4j.GatewayServer => gatewayServer.getListeningPort -case other => throw new RuntimeException(s"Unexpected Py4J server ${other.getClass}") +case other => throw SparkCoreErrors.unexpectedPy4JServerError(other) } def shutdown(): Unit = server match { case clientServer: py4j.ClientServer => clientServer.shutdown() case gatewayServer: py4j.GatewayServer => gatewayServer.shutdown() -case other => throw new RuntimeException(s"Unexpected Py4J server ${other.getClass}") +case other => throw SparkCoreErrors.unexpectedPy4JServerError(other) } } diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala index 7b2c36b..2beca6f 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala @@ -27,6 +27,7 @@ import scala.collection.JavaConverters._ import 
scala.collection.mutable import org.apache.spark._ +import org.apache.spark.errors.SparkCoreErrors import org.apache.spark.internal.Logging import org.apache.spark.internal.config.Python._ import org.apache.spark.security.SocketAuthHelper @@ -219,12 +220,11 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String daemonPort = in.readInt() } catch { case _: EOFException if daemon.isAlive => -throw new SparkException("EOFException occurred while reading the port number " + - s"from $daemonModule's stdout") +throw SparkCoreErrors.eofExceptionWhileReadPortNumberError( + daemonModule) case _: EOFException => -throw new SparkException( - s"EOFException occurred while reading the port number from $daemonModule's" + - s" stdout and terminated with code: ${daemon.exitValue}.") +throw SparkCoreErrors. + eofExceptionWhileReadPortNumberError(daemonModule, Some(daemon.exitValue)) } // test that the returned port number is within a valid range. diff --git
[spark] branch branch-3.0 updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 016eeaf [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN 016eeaf is described below commit 016eeafd1196fb20449b457599487e1e06b6ffa7 Author: Angerszh AuthorDate: Mon Sep 20 16:48:59 2021 +0800 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN], but it should return []. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayIntersect won't show equal `NaN` value ### How was this patch tested? Added UT Closes #33995 from AngersZh/SPARK-36754. Authored-by: Angerszh Signed-off-by: Wenchen Fan (cherry picked from commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3) Signed-off-by: Wenchen Fan --- .../expressions/collectionOperations.scala | 66 ++ .../expressions/CollectionExpressionsSuite.scala | 17 ++ 2 files changed, 58 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 479b1d7..c153181 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -3583,33 +3583,42 @@ case class ArrayIntersect(left: Expression, right: Expression) extends ArrayBina if (TypeUtils.typeWithProperEquals(elementType)) { (array1, array2) => if (array1.numElements() != 0 && array2.numElements() != 0) { - val hs = new OpenHashSet[Any] - val hsResult = new OpenHashSet[Any] - var foundNullElement = false + val hs = new SQLOpenHashSet[Any] + val hsResult = new SQLOpenHashSet[Any] + val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] + val withArray2NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hs, +(value: Any) => hs.add(value), +(valueNaN: Any) => {} ) + val withArray1NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult, +(value: Any) => + if (hs.contains(value) && !hsResult.contains(value)) { +arrayBuffer += value +hsResult.add(value) + }, +(valueNaN: Any) => + if (hs.containsNaN()) { +arrayBuffer += valueNaN + }) var i = 0 while (i < array2.numElements()) { if (array2.isNullAt(i)) { - foundNullElement = true + hs.addNull() } else { val elem = array2.get(i, elementType) - hs.add(elem) + withArray2NaNCheckFunc(elem) } i += 1 } - val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] i = 0 while (i < array1.numElements()) { if (array1.isNullAt(i)) { - if (foundNullElement) { + if (hs.containsNull() && !hsResult.containsNull()) { arrayBuffer += null -foundNullElement = false +hsResult.addNull() } } else { val elem = array1.get(i, elementType) - if (hs.contains(elem) && !hsResult.contains(elem)) { -arrayBuffer += elem -hsResult.add(elem) - } + withArray1NaNCheckFunc(elem) } i += 1 } @@ -3684,10 +3693,9 @@ case class ArrayIntersect(left: Expression, right: Expression) 
extends ArrayBina val ptName = CodeGenerator.primitiveTypeName(jt) nullSafeCodeGen(ctx, ev, (array1, array2) => { -val foundNullElement = ctx.freshName("foundNullElement") val nullElementIndex = ctx.freshName("nullElementIndex") val builder = ctx.freshName("builder") -val openHashSet = classOf[OpenHashSet[_]].getName +val openHashSet = classOf[SQLOpenHashSet[_]].getName val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()" val hashSet = ctx.freshName("hashSet") val hashSetResult =
[spark] branch branch-3.1 updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 3d0e631 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN 3d0e631 is described below commit 3d0e631b1b16e267968f1793fabc8c62a97efc7f Author: Angerszh AuthorDate: Mon Sep 20 16:48:59 2021 +0800 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN], but it should return []. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayIntersect won't show equal `NaN` value ### How was this patch tested? Added UT Closes #33995 from AngersZh/SPARK-36754. Authored-by: Angerszh Signed-off-by: Wenchen Fan (cherry picked from commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3) Signed-off-by: Wenchen Fan --- .../expressions/collectionOperations.scala | 66 ++ .../expressions/CollectionExpressionsSuite.scala | 17 ++ 2 files changed, 58 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 6cf1ead..77340c6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -3632,33 +3632,42 @@ case class ArrayIntersect(left: Expression, right: Expression) extends ArrayBina if (TypeUtils.typeWithProperEquals(elementType)) { (array1, array2) => if (array1.numElements() != 0 && array2.numElements() != 0) { - val hs = new OpenHashSet[Any] - val hsResult = new OpenHashSet[Any] - var foundNullElement = false + val hs = new SQLOpenHashSet[Any] + val hsResult = new SQLOpenHashSet[Any] + val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] + val withArray2NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hs, +(value: Any) => hs.add(value), +(valueNaN: Any) => {} ) + val withArray1NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult, +(value: Any) => + if (hs.contains(value) && !hsResult.contains(value)) { +arrayBuffer += value +hsResult.add(value) + }, +(valueNaN: Any) => + if (hs.containsNaN()) { +arrayBuffer += valueNaN + }) var i = 0 while (i < array2.numElements()) { if (array2.isNullAt(i)) { - foundNullElement = true + hs.addNull() } else { val elem = array2.get(i, elementType) - hs.add(elem) + withArray2NaNCheckFunc(elem) } i += 1 } - val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] i = 0 while (i < array1.numElements()) { if (array1.isNullAt(i)) { - if (foundNullElement) { + if (hs.containsNull() && !hsResult.containsNull()) { arrayBuffer += null -foundNullElement = false +hsResult.addNull() } } else { val elem = array1.get(i, elementType) - if (hs.contains(elem) && !hsResult.contains(elem)) { -arrayBuffer += elem -hsResult.add(elem) - } + withArray1NaNCheckFunc(elem) } i += 1 } @@ -3733,10 +3742,9 @@ case class ArrayIntersect(left: Expression, right: Expression) 
extends ArrayBina val ptName = CodeGenerator.primitiveTypeName(jt) nullSafeCodeGen(ctx, ev, (array1, array2) => { -val foundNullElement = ctx.freshName("foundNullElement") val nullElementIndex = ctx.freshName("nullElementIndex") val builder = ctx.freshName("builder") -val openHashSet = classOf[OpenHashSet[_]].getName +val openHashSet = classOf[SQLOpenHashSet[_]].getName val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()" val hashSet = ctx.freshName("hashSet") val hashSetResult =
[spark] branch branch-3.2 updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 337a197 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN 337a197 is described below commit 337a1979d291f7c836635d065e0bdda4ee188992 Author: Angerszh AuthorDate: Mon Sep 20 16:48:59 2021 +0800 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN], but it should return []. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayIntersect won't show equal `NaN` value ### How was this patch tested? Added UT Closes #33995 from AngersZh/SPARK-36754. Authored-by: Angerszh Signed-off-by: Wenchen Fan (cherry picked from commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3) Signed-off-by: Wenchen Fan --- .../expressions/collectionOperations.scala | 66 ++ .../expressions/CollectionExpressionsSuite.scala | 17 ++ 2 files changed, 58 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 1182194..b325e9a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -3847,33 +3847,42 @@ case class ArrayIntersect(left: Expression, right: Expression) extends ArrayBina if (TypeUtils.typeWithProperEquals(elementType)) { (array1, array2) => if (array1.numElements() != 0 && array2.numElements() != 0) { - val hs = new OpenHashSet[Any] - val hsResult = new OpenHashSet[Any] - var foundNullElement = false + val hs = new SQLOpenHashSet[Any] + val hsResult = new SQLOpenHashSet[Any] + val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] + val withArray2NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hs, +(value: Any) => hs.add(value), +(valueNaN: Any) => {} ) + val withArray1NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult, +(value: Any) => + if (hs.contains(value) && !hsResult.contains(value)) { +arrayBuffer += value +hsResult.add(value) + }, +(valueNaN: Any) => + if (hs.containsNaN()) { +arrayBuffer += valueNaN + }) var i = 0 while (i < array2.numElements()) { if (array2.isNullAt(i)) { - foundNullElement = true + hs.addNull() } else { val elem = array2.get(i, elementType) - hs.add(elem) + withArray2NaNCheckFunc(elem) } i += 1 } - val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] i = 0 while (i < array1.numElements()) { if (array1.isNullAt(i)) { - if (foundNullElement) { + if (hs.containsNull() && !hsResult.containsNull()) { arrayBuffer += null -foundNullElement = false +hsResult.addNull() } } else { val elem = array1.get(i, elementType) - if (hs.contains(elem) && !hsResult.contains(elem)) { -arrayBuffer += elem -hsResult.add(elem) - } + withArray1NaNCheckFunc(elem) } i += 1 } @@ -3948,10 +3957,9 @@ case class ArrayIntersect(left: Expression, right: Expression) 
extends ArrayBina val ptName = CodeGenerator.primitiveTypeName(jt) nullSafeCodeGen(ctx, ev, (array1, array2) => { -val foundNullElement = ctx.freshName("foundNullElement") val nullElementIndex = ctx.freshName("nullElementIndex") val builder = ctx.freshName("builder") -val openHashSet = classOf[OpenHashSet[_]].getName +val openHashSet = classOf[SQLOpenHashSet[_]].getName val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()" val hashSet = ctx.freshName("hashSet") val hashSetResult =
[spark] branch master updated: [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2fc7f2f [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN 2fc7f2f is described below commit 2fc7f2f702c6c08d9c76332f45e2902728ba2ee3 Author: Angerszh AuthorDate: Mon Sep 20 16:48:59 2021 +0800 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [NaN], but it should return []. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayIntersect won't show equal `NaN` value ### How was this patch tested? Added UT Closes #33995 from AngersZh/SPARK-36754. Authored-by: Angerszh Signed-off-by: Wenchen Fan --- .../expressions/collectionOperations.scala | 66 ++ .../expressions/CollectionExpressionsSuite.scala | 17 ++ 2 files changed, 58 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 1182194..b325e9a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -3847,33 +3847,42 @@ case class ArrayIntersect(left: Expression, right: Expression) extends ArrayBina if (TypeUtils.typeWithProperEquals(elementType)) { (array1, array2) => if (array1.numElements() != 0 && array2.numElements() != 0) { - val hs = new OpenHashSet[Any] - val hsResult = new OpenHashSet[Any] - var foundNullElement = false + val hs = new SQLOpenHashSet[Any] + val hsResult = new SQLOpenHashSet[Any] + val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] + val withArray2NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hs, +(value: Any) => hs.add(value), +(valueNaN: Any) => {} ) + val withArray1NaNCheckFunc = SQLOpenHashSet.withNaNCheckFunc(elementType, hsResult, +(value: Any) => + if (hs.contains(value) && !hsResult.contains(value)) { +arrayBuffer += value +hsResult.add(value) + }, +(valueNaN: Any) => + if (hs.containsNaN()) { +arrayBuffer += valueNaN + }) var i = 0 while (i < array2.numElements()) { if (array2.isNullAt(i)) { - foundNullElement = true + hs.addNull() } else { val elem = array2.get(i, elementType) - hs.add(elem) + withArray2NaNCheckFunc(elem) } i += 1 } - val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] i = 0 while (i < array1.numElements()) { if (array1.isNullAt(i)) { - if (foundNullElement) { + if (hs.containsNull() && !hsResult.containsNull()) { arrayBuffer += null -foundNullElement = false +hsResult.addNull() } } else { val elem = array1.get(i, elementType) - if (hs.contains(elem) && !hsResult.contains(elem)) { -arrayBuffer += elem -hsResult.add(elem) - } + withArray1NaNCheckFunc(elem) } i += 1 } @@ -3948,10 +3957,9 @@ case class ArrayIntersect(left: Expression, right: Expression) extends ArrayBina val ptName = CodeGenerator.primitiveTypeName(jt) nullSafeCodeGen(ctx, ev, (array1, 
array2) => { -val foundNullElement = ctx.freshName("foundNullElement") val nullElementIndex = ctx.freshName("nullElementIndex") val builder = ctx.freshName("builder") -val openHashSet = classOf[OpenHashSet[_]].getName +val openHashSet = classOf[SQLOpenHashSet[_]].getName val classTag = s"scala.reflect.ClassTag$$.MODULE$$.$hsTypeName()" val hashSet = ctx.freshName("hashSet") val hashSetResult = ctx.freshName("hashSetResult") @@ -3963,7 +3971,7 @@ case class ArrayIntersect(left: Expression, right: Expression) extends
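The query from the commit message can be replayed directly to check the behavior. Per the description above, it returned `[NaN]` before this change and is expected to return `[]` afterwards; the session setup here is only illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Query quoted in the SPARK-36754 commit message.
spark.sql(
    "SELECT array_intersect(array(CAST('nan' AS DOUBLE), 1D), "
    "array(CAST('nan' AS DOUBLE))) AS result"
).show(truncate=False)
# Commit message: [NaN] before the fix, [] after it.
```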
[spark] branch master updated: [SPARK-34112][BUILD] Upgrade ORC to 1.7.0
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a396dd6  [SPARK-34112][BUILD] Upgrade ORC to 1.7.0
a396dd6 is described below

commit a396dd6216b786c6665a07b8706154e4e9d14bbe
Author: Dongjoon Hyun
AuthorDate: Mon Sep 20 01:09:15 2021 -0700

    [SPARK-34112][BUILD] Upgrade ORC to 1.7.0

    ### What changes were proposed in this pull request?

    This PR aims to upgrade Apache ORC from 1.6.11 to 1.7.0 for Apache Spark 3.3.0.

    ### Why are the changes needed?

    [Apache ORC 1.7.0](https://orc.apache.org/news/2021/09/15/ORC-1.7.0/) is a new release with the following new features and improvements.
      - ORC-377 Support Snappy compression in C++ Writer
      - ORC-577 Support row-level filtering
      - ORC-716 Build and test on Java 17-EA
      - ORC-731 Improve Java Tools
      - ORC-742 LazyIO of non-filter columns
      - ORC-751 Implement Predicate Pushdown in C++ Reader
      - ORC-755 Introduce OrcFilterContext
      - ORC-757 Add Hashtable implementation for dictionary
      - ORC-780 Support LZ4 Compression in C++ Writer
      - ORC-797 Allow writers to get the stripe information
      - ORC-818 Build and test in Apple Silicon
      - ORC-861 Bump CMake minimum requirement to 2.8.12
      - ORC-867 Upgrade hive-storage-api to 2.8.1
      - ORC-984 Save the software version that wrote each ORC file

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the existing CIs because this is a dependency change.

    Closes #34045 from dongjoon-hyun/SPARK-34112.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 pom.xml                                 | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 6f91caf..83b0c8e 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -195,9 +195,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.11//orc-core-1.6.11.jar
-orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
-orc-shims/1.6.11//orc-shims-1.6.11.jar
+orc-core/1.7.0//orc-core-1.7.0.jar
+orc-mapreduce/1.7.0//orc-mapreduce-1.7.0.jar
+orc-shims/1.7.0//orc-shims-1.7.0.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index ecf448f..c72ac56 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -165,9 +165,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.11//orc-core-1.6.11.jar
-orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
-orc-shims/1.6.11//orc-shims-1.6.11.jar
+orc-core/1.7.0//orc-core-1.7.0.jar
+orc-mapreduce/1.7.0//orc-mapreduce-1.7.0.jar
+orc-shims/1.7.0//orc-shims-1.7.0.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/pom.xml b/pom.xml
index 419c6b7..3f18e6d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -137,7 +137,7 @@
     10.14.2.0
     1.12.1
-    1.6.11
+    1.7.0
     9.4.43.v20210629
     4.0.3
     0.10.0
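Since this is a dependency-only bump, a quick ORC round trip is one way to confirm nothing user-visible changed. A minimal smoke-test sketch follows; the output path is just an example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
df.write.mode("overwrite").orc("/tmp/orc_roundtrip_check")  # written via the bundled ORC libraries
spark.read.orc("/tmp/orc_roundtrip_check").show()
```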