[spark] branch master updated: [SPARK-43030][SQL][FOLLOWUP] CTE ref should keep the output attributes duplicated when renew
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 84431a6 [SPARK-43030][SQL][FOLLOWUP] CTE ref should keep the output attributes duplicated when renew 84431a6 is described below commit 84431a6e4631afca22b3a4763eabeeef1398 Author: Wenchen Fan AuthorDate: Tue May 30 13:51:38 2023 +0800 [SPARK-43030][SQL][FOLLOWUP] CTE ref should keep the output attributes duplicated when renew ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/40662 to fix a regression. `CTERelationRef` inherits the output attributes from a query, which may contain duplicated attributes, for queries like SELECT a, a FROM t. It's important to keep the duplicated attributes to have the same id in the new instance, as column resolution allows more than one matching attribute if their ids are the same. For example, `Project('a, CTERelationRef(a#1, a#1))` can be resolved properly as the matching attributes a have the same id, but `Project('a, CTERelationRef(a#2, a#3))` can't be resolved. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? the bug is not released yet. ### How was this patch tested? new test Closes #41363 from cloud-fan/fix. 
Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../sql/catalyst/plans/logical/basicLogicalOperators.scala | 12 +++- .../apache/spark/sql/catalyst/analysis/AnalysisSuite.scala | 11 +++ 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala index ceed7b0cc54..4bde26a7d6e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala @@ -844,7 +844,17 @@ case class CTERelationRef( override lazy val resolved: Boolean = _resolved - override def newInstance(): LogicalPlan = copy(output = output.map(_.newInstance())) + override def newInstance(): LogicalPlan = { +// CTERelationRef inherits the output attributes from a query, which may contain duplicated +// attributes, for queries like `SELECT a, a FROM t`. It's important to keep the duplicated +// attributes to have the same id in the new instance, as column resolution allows more than one +// matching attributes if their ids are the same. +// For example, `Project('a, CTERelationRef(a#1, a#1))` can be resolved properly as the matching +// attributes `a` have the same id, but `Project('a, CTERelationRef(a#2, a#3))` can't be +// resolved. 
+    val oldAttrToNewAttr = AttributeMap(output.zip(output.map(_.newInstance())))
+    copy(output = output.map(attr => oldAttrToNewAttr(attr)))
+  }

   def withNewStats(statsOpt: Option[Statistics]): CTERelationRef = copy(statsOpt = statsOpt)

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
index 1029f7f8fab..e1050e91e59 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
@@ -1500,6 +1500,17 @@ class AnalysisSuite extends AnalysisTest with Matchers {
       assert(refs.map(_.output).distinct.length == 2)
     }

+    withClue("CTE relation has duplicated attributes") {
+      val cteDef = CTERelationDef(testRelation.select($"a", $"a"))
+      val cteRef = CTERelationRef(cteDef.id, false, Nil)
+      val plan = WithCTE(cteRef.join(cteRef.select($"a")), Seq(cteDef)).analyze
+      val refs = plan.collect {
+        case r: CTERelationRef => r
+      }
+      assert(refs.length == 2)
+      assert(refs.map(_.output).distinct.length == 2)
+    }
+
     withClue("references in both CTE relation definition and main query") {
       val cteDef2 = CTERelationDef(cteRef.where($"a" > 2))
       val cteRef2 = CTERelationRef(cteDef2.id, false, Nil)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
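The one-line change above is subtle: calling `output.map(_.newInstance())` alone would give the two copies of `a` two different ids, while zipping the old output against the new instances and resolving each attribute through an `AttributeMap` keeps duplicated attributes shared. A minimal Python sketch of that idea, with a hypothetical `Attr` class standing in for Catalyst's `AttributeReference` (none of these names are Spark's API):

```python
import itertools

_next_id = itertools.count()


class Attr:
    """Tiny stand-in for an attribute: a name plus a unique expression id."""

    def __init__(self, name, expr_id=None):
        self.name = name
        self.expr_id = next(_next_id) if expr_id is None else expr_id

    def new_instance(self):
        # A fresh copy with a brand-new expression id.
        return Attr(self.name)


def renew_output(output):
    # Map each *distinct* old attribute (keyed by expr_id) to one new
    # instance, so duplicates keep a single shared id — mirroring
    # `AttributeMap(output.zip(output.map(_.newInstance())))`.
    old_to_new = {}
    for attr in output:
        if attr.expr_id not in old_to_new:
            old_to_new[attr.expr_id] = attr.new_instance()
    return [old_to_new[a.expr_id] for a in output]


a = Attr("a")                   # e.g. a#1
renewed = renew_output([a, a])  # output of `SELECT a, a FROM t`
assert renewed[0].expr_id == renewed[1].expr_id  # duplicates stay duplicated
assert renewed[0].expr_id != a.expr_id           # but under a fresh id
```

With the naive `[a.new_instance() for a in output]` instead, the two columns would come back as `a#2, a#3` and, per the commit message, column resolution of `Project('a, ...)` would fail as ambiguous.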
[spark] branch master updated: [SPARK-43353][PYTHON] Migrate remaining session errors into error class
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ff2b9c2ddc0 [SPARK-43353][PYTHON] Migrate remaining session errors into error class ff2b9c2ddc0 is described below commit ff2b9c2ddc07c02f6ba4c68a2fd66243919acfb6 Author: itholic AuthorDate: Tue May 30 13:49:45 2023 +0900 [SPARK-43353][PYTHON] Migrate remaining session errors into error class ### What changes were proposed in this pull request? This PR proposes to migrate remaining Spark session errors into error class ### Why are the changes needed? To leverage PySpark error framework. ### Does this PR introduce _any_ user-facing change? No API changes. ### How was this patch tested? The existing CI should pass. Closes #41031 from itholic/error_session. Authored-by: itholic Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/error_classes.py | 20 ++ python/pyspark/sql/session.py | 48 +++--- 2 files changed, 58 insertions(+), 10 deletions(-) diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index 817b8ce60db..2d82d03eb6d 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -89,6 +89,11 @@ ERROR_CLASSES_JSON = """ "Cannot convert into ." ] }, + "CANNOT_DETERMINE_TYPE": { +"message": [ + "Some of types cannot be determined after inferring." +] + }, "CANNOT_GET_BATCH_ID": { "message": [ "Could not get batch id from ." @@ -470,6 +475,11 @@ ERROR_CLASSES_JSON = """ "Argument `` should be a list[str], got ." ] }, + "NOT_LIST_OR_NONE_OR_STRUCT" : { +"message" : [ + "Argument `` should be a list, None or StructType, got ." +] + }, "NOT_LIST_OR_STR_OR_TUPLE" : { "message" : [ "Argument `` should be a list, str or tuple, got ." 
@@ -576,6 +586,11 @@ ERROR_CLASSES_JSON = """ "Result vector from pandas_udf was not the required length: expected , got ." ] }, + "SESSION_ALREADY_EXIST" : { +"message" : [ + "Cannot start a remote Spark session because there is a regular Spark session already running." +] + }, "SESSION_NOT_SAME" : { "message" : [ "Both Datasets must belong to the same SparkSession." @@ -586,6 +601,11 @@ ERROR_CLASSES_JSON = """ "There should not be an existing Spark Session or Spark Context." ] }, + "SHOULD_NOT_DATAFRAME": { +"message": [ + "Argument `` should not be a DataFrame." +] + }, "SLICE_WITH_STEP" : { "message" : [ "Slice with step is not supported." diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py index df970a0bf37..e96dc9cee3f 100644 --- a/python/pyspark/sql/session.py +++ b/python/pyspark/sql/session.py @@ -64,6 +64,7 @@ from pyspark.sql.types import ( ) from pyspark.errors.exceptions.captured import install_exception_handler from pyspark.sql.utils import is_timestamp_ntz_preferred, to_str +from pyspark.errors import PySparkValueError, PySparkTypeError if TYPE_CHECKING: from pyspark.sql._typing import AtomicValue, RowLike, OptionalPrimitiveType @@ -873,7 +874,10 @@ class SparkSession(SparkConversionMixin): :class:`pyspark.sql.types.StructType` """ if not data: -raise ValueError("can not infer schema from empty dataset") +raise PySparkValueError( +error_class="CANNOT_INFER_EMPTY_SCHEMA", +message_parameters={}, +) infer_dict_as_struct = self._jconf.inferDictAsStruct() infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement() prefer_timestamp_ntz = is_timestamp_ntz_preferred() @@ -891,7 +895,10 @@ class SparkSession(SparkConversionMixin): ), ) if _has_nulltype(schema): -raise ValueError("Some of types cannot be determined after inferring") +raise PySparkValueError( +error_class="CANNOT_DETERMINE_TYPE", +message_parameters={}, +) return schema def _inferSchema( @@ -917,7 +924,10 @@ class 
SparkSession(SparkConversionMixin): """ first = rdd.first() if isinstance(first, Sized) and len(first) == 0: -raise ValueError("The first row in RDD is empty, can not infer schema") +raise PySparkValueError( +error_class="CANNOT_INFER_EMPTY_SCHEMA", +message_parameters={}, +) infer_dict_as_struct = self._jconf.inferDictAsStruct() infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement() @@ -944,9 +954,9
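The pattern in this diff is uniform: each bare `ValueError(...)` becomes a `PySparkValueError` whose `error_class` keys into the JSON message templates added to `error_classes.py`, with `<placeholders>` filled from `message_parameters`. A rough sketch of how such an error-class framework can work (the `SketchPySparkValueError` class and the dict below are illustrative, not PySpark's actual implementation):

```python
# Minimal stand-in for the ERROR_CLASSES_JSON registry: class name -> message
# template lines containing <placeholder> tokens.
ERROR_CLASSES = {
    "NOT_LIST_OR_NONE_OR_STRUCT": [
        "Argument `<arg_name>` should be a list, None or StructType, got <arg_type>."
    ],
    "CANNOT_DETERMINE_TYPE": [
        "Some of types cannot be determined after inferring."
    ],
}


class SketchPySparkValueError(ValueError):
    """Sketch of pyspark.errors.PySparkValueError: resolves a template by class."""

    def __init__(self, error_class, message_parameters):
        template = " ".join(ERROR_CLASSES[error_class])
        message = template
        for name, value in message_parameters.items():
            message = message.replace(f"<{name}>", value)
        super().__init__(message)
        self.error_class = error_class


try:
    raise SketchPySparkValueError(
        error_class="NOT_LIST_OR_NONE_OR_STRUCT",
        message_parameters={"arg_name": "schema", "arg_type": "int"},
    )
except SketchPySparkValueError as e:
    assert e.error_class == "NOT_LIST_OR_NONE_OR_STRUCT"
    assert "should be a list, None or StructType" in str(e)
```

The payoff of the migration is that callers and tests can match on the stable `error_class` identifier instead of brittle message strings.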
[spark] branch master updated: [SPARK-43603][PS][CONNECT][TESTS][FOLLOW-UP] Delete unused `test_parity_dataframe.py`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dada63c24ec [SPARK-43603][PS][CONNECT][TESTS][FOLLOW-UP] Delete unused `test_parity_dataframe.py` dada63c24ec is described below commit dada63c24ecbe1b65dc4dac6632b34db05fbef98 Author: Ruifeng Zheng AuthorDate: Tue May 30 12:05:16 2023 +0900 [SPARK-43603][PS][CONNECT][TESTS][FOLLOW-UP] Delete unused `test_parity_dataframe.py` ### What changes were proposed in this pull request? `pyspark.pandas.tests.connect.test_parity_dataframe` had been split in https://github.com/apache/spark/commit/cb5bd57aac40c06921321929df00d2086e37ba34, but I forgot to remove the test file. this PR removes unused `test_parity_dataframe.py` ### Why are the changes needed? `test_parity_dataframe.py` is no longer needed ### Does this PR introduce _any_ user-facing change? No, test-only ### How was this patch tested? CI Closes #41372 from zhengruifeng/reorg_ps_df_tests_followup. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- .../pandas/tests/connect/test_parity_dataframe.py | 154 - 1 file changed, 154 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/test_parity_dataframe.py b/python/pyspark/pandas/tests/connect/test_parity_dataframe.py deleted file mode 100644 index c1b9ae2ee11..000 --- a/python/pyspark/pandas/tests/connect/test_parity_dataframe.py +++ /dev/null @@ -1,154 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. 
You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# -import unittest - -from pyspark import pandas as ps -from pyspark.pandas.tests.test_dataframe import DataFrameTestsMixin -from pyspark.testing.connectutils import ReusedConnectTestCase -from pyspark.testing.pandasutils import PandasOnSparkTestUtils - - -class DataFrameParityTests(DataFrameTestsMixin, PandasOnSparkTestUtils, ReusedConnectTestCase): -@property -def psdf(self): -return ps.from_pandas(self.pdf) - -@unittest.skip( -"TODO(SPARK-43610): Enable `InternalFrame.attach_distributed_column` in Spark Connect." -) -def test_aggregate(self): -super().test_aggregate() - -@unittest.skip("TODO(SPARK-41876): Implement DataFrame `toLocalIterator`") -def test_iterrows(self): -super().test_iterrows() - -@unittest.skip("TODO(SPARK-41876): Implement DataFrame `toLocalIterator`") -def test_itertuples(self): -super().test_itertuples() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cummax(self): -super().test_cummax() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cummax_multiindex_columns(self): -super().test_cummax_multiindex_columns() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cummin(self): -super().test_cummin() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." 
-) -def test_cummin_multiindex_columns(self): -super().test_cummin_multiindex_columns() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cumprod(self): -super().test_cumprod() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cumprod_multiindex_columns(self): -super().test_cumprod_multiindex_columns() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cumsum(self): -super().test_cumsum() - -@unittest.skip( -"TODO(SPARK-43611): Fix unexpected `AnalysisException` from Spark Connect client." -) -def test_cumsum_multiindex_columns(self): -
[spark] branch master updated (1da7b7f9c21 -> 014dd357656)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 1da7b7f9c21 [SPARK-43024][PYTHON] Upgrade pandas to 2.0.0 add 014dd357656 [SPARK-43863][CONNECT] Remove redundant `toSeq` from `SparkConnectPlanner` for Scala 2.13 No new revisions were added by this update. Summary of changes: .../sql/connect/planner/SparkConnectPlanner.scala | 33 +++--- 1 file changed, 17 insertions(+), 16 deletions(-)
[spark] branch master updated: [SPARK-43024][PYTHON] Upgrade pandas to 2.0.0
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1da7b7f9c21 [SPARK-43024][PYTHON] Upgrade pandas to 2.0.0 1da7b7f9c21 is described below commit 1da7b7f9c21f4b1981e9c52ed88d71a6b317f104 Author: itholic AuthorDate: Tue May 30 09:02:54 2023 +0900 [SPARK-43024][PYTHON] Upgrade pandas to 2.0.0 ### What changes were proposed in this pull request? This PR proposes to upgrade pandas to 2.0.0. ### Why are the changes needed? To support latest pandas. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Addressed the existing UTs. Closes #41211 from itholic/pandas_2. Authored-by: itholic Signed-off-by: Hyukjin Kwon --- dev/infra/Dockerfile | 4 +- python/pyspark/mlv2/tests/test_feature.py | 10 + python/pyspark/mlv2/tests/test_summarizer.py | 6 + python/pyspark/pandas/base.py | 14 +- python/pyspark/pandas/frame.py | 11 +- python/pyspark/pandas/generic.py | 6 +- python/pyspark/pandas/groupby.py | 8 +- python/pyspark/pandas/indexes/base.py | 63 +++--- python/pyspark/pandas/indexes/category.py | 2 +- python/pyspark/pandas/indexes/datetimes.py | 63 +++--- python/pyspark/pandas/indexes/numeric.py | 12 +- python/pyspark/pandas/namespace.py | 22 +- python/pyspark/pandas/series.py| 10 +- python/pyspark/pandas/spark/accessors.py | 9 +- python/pyspark/pandas/strings.py | 30 +-- python/pyspark/pandas/supported_api_gen.py | 2 +- .../pandas/tests/computation/test_any_all.py | 4 + .../pandas/tests/computation/test_combine.py | 4 + .../pandas/tests/computation/test_compute.py | 13 ++ .../pyspark/pandas/tests/computation/test_cov.py | 4 + .../pandas/tests/computation/test_describe.py | 8 + .../pandas/tests/data_type_ops/test_date_ops.py| 10 + .../pyspark/pandas/tests/frame/test_reindexing.py | 4 + python/pyspark/pandas/tests/indexes/test_base.py | 237 + 
.../pyspark/pandas/tests/indexes/test_category.py | 13 ++ .../pyspark/pandas/tests/indexes/test_datetime.py | 10 + .../pyspark/pandas/tests/indexes/test_indexing.py | 5 + .../pyspark/pandas/tests/indexes/test_reindex.py | 5 + .../pyspark/pandas/tests/indexes/test_timedelta.py | 6 + .../tests/plot/test_frame_plot_matplotlib.py | 56 + python/pyspark/pandas/tests/test_categorical.py| 22 ++ python/pyspark/pandas/tests/test_csv.py| 6 + .../pandas/tests/test_dataframe_conversion.py | 5 + python/pyspark/pandas/tests/test_groupby.py| 38 python/pyspark/pandas/tests/test_groupby_slow.py | 9 + python/pyspark/pandas/tests/test_namespace.py | 5 + .../pandas/tests/test_ops_on_diff_frames.py| 5 + .../tests/test_ops_on_diff_frames_groupby.py | 11 + .../test_ops_on_diff_frames_groupby_rolling.py | 5 + python/pyspark/pandas/tests/test_rolling.py| 9 + python/pyspark/pandas/tests/test_series.py | 44 .../pyspark/pandas/tests/test_series_conversion.py | 5 + .../pyspark/pandas/tests/test_series_datetime.py | 65 ++ python/pyspark/pandas/tests/test_series_string.py | 14 ++ python/pyspark/pandas/tests/test_stats.py | 15 ++ .../pyspark/sql/tests/connect/test_parity_arrow.py | 6 + python/pyspark/sql/tests/test_arrow.py | 4 + 47 files changed, 746 insertions(+), 173 deletions(-) diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile index 189bd606499..888b4e00b39 100644 --- a/dev/infra/Dockerfile +++ b/dev/infra/Dockerfile @@ -64,8 +64,8 @@ RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='ht # See more in SPARK-39735 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library" -RUN pypy3 -m pip install numpy 'pandas<=1.5.3' scipy coverage matplotlib -RUN python3.9 -m pip install numpy pyarrow 'pandas<=1.5.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' +RUN pypy3 -m pip install numpy 'pandas<=2.0.0' scipy coverage matplotlib +RUN python3.9 -m pip install 
numpy pyarrow 'pandas<=2.0.0' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' # Add Python deps for Spark Connect. RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos grpcio-status diff --git a/python/pyspark/mlv2/tests/test_feature.py
[spark] branch master updated: [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ff9d41abaff [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API ff9d41abaff is described below commit ff9d41abaffcbd6f0c26ce5be9d2324fe9f01d5c Author: Raghu Angadi AuthorDate: Tue May 30 09:01:32 2023 +0900 [SPARK-43799][PYTHON] Add descriptor binary option to Pyspark Protobuf API ### What changes were proposed in this pull request? This updated Protobuf Pyspark API to allow passing binary FileDescriptorSet rather than a file name. This is a Python follow up to feature implemented in Scala in #41192. ### Why are the changes needed? - This allows flexibility for Pyspark users to provide binary descriptor set directly. - Even if users are using file path, Pyspark avoids passing file name to Scala and reads the descriptor file in Python. This avoids having to read the file in Scala. ### Does this PR introduce _any_ user-facing change? - This adds extra arg to `from_protobuf()` and `to_protobuf()` API. ### How was this patch tested? - Doc tests - Manual tests Closes #41343 from rangadi/py-proto-file-buffer. Authored-by: Raghu Angadi Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/protobuf/functions.py | 74 +++- 1 file changed, 64 insertions(+), 10 deletions(-) diff --git a/python/pyspark/sql/protobuf/functions.py b/python/pyspark/sql/protobuf/functions.py index a303cf91493..42165938eb7 100644 --- a/python/pyspark/sql/protobuf/functions.py +++ b/python/pyspark/sql/protobuf/functions.py @@ -37,13 +37,17 @@ def from_protobuf( messageName: str, descFilePath: Optional[str] = None, options: Optional[Dict[str, str]] = None, +binaryDescriptorSet: Optional[bytes] = None, ) -> Column: """ Converts a binary column of Protobuf format into its corresponding catalyst value. 
-The Protobuf definition is provided in one of these two ways: +The Protobuf definition is provided in one of these ways: - Protobuf descriptor file: E.g. a descriptor file created with `protoc --include_imports --descriptor_set_out=abc.desc abc.proto` + - Protobuf descriptor as binary: Rather than file path as in previous option, + we can provide the binary content of the file. This allows flexibility in how the + descriptor set is created and fetched. - Jar containing Protobuf Java class: The jar containing Java class should be shaded. Specifically, `com.google.protobuf.*` should be shaded to `org.sparkproject.spark_protobuf.protobuf.*`. @@ -52,6 +56,9 @@ def from_protobuf( .. versionadded:: 3.4.0 +.. versionchanged:: 3.5.0 +Supports `binaryDescriptorSet` arg to pass binary descriptor directly. + Parameters -- data : :class:`~pyspark.sql.Column` or str @@ -61,9 +68,11 @@ def from_protobuf( The Protobuf class name when descFilePath parameter is not set. E.g. `com.example.protos.ExampleEvent`. descFilePath : str, optional -The protobuf descriptor file. +The Protobuf descriptor file. options : dict, optional options to control how the protobuf record is parsed. +binaryDescriptorSet: bytes, optional +The Protobuf `FileDescriptorSet` serialized as binary. Notes - @@ -92,9 +101,14 @@ def from_protobuf( ... proto_df = df.select( ... to_protobuf(df.value, message_name, desc_file_path).alias("value")) ... proto_df.show(truncate=False) -... proto_df = proto_df.select( +... proto_df_1 = proto_df.select( # With file name for descriptor ... from_protobuf(proto_df.value, message_name, desc_file_path).alias("value")) -... proto_df.show(truncate=False) +... proto_df_1.show(truncate=False) +... proto_df_2 = proto_df.select( # With binary for descriptor +... from_protobuf(proto_df.value, message_name, +... binaryDescriptorSet = bytearray.fromhex(desc_hex)) +... .alias("value")) +... 
proto_df_2.show(truncate=False) ++ |value | ++ @@ -105,6 +119,11 @@ def from_protobuf( +--+ |{2, Alice, 109200}| +--+ ++--+ +|value | ++--+ +|{2, Alice, 109200}| ++--+ >>> data = [([(1668035962, 2020)])] >>> ddl_schema = "value struct" >>> df = spark.createDataFrame(data, ddl_schema)
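As the commit message notes, even when a file path is given, the descriptor file is now read on the Python side, so the two call styles converge on handing Spark a block of bytes. A sketch of both options (the file name and descriptor bytes are made up, and the `from_protobuf` calls are left as comments because they require a running Spark session):

```python
import pathlib
import tempfile

# Stand-in for the output of:
#   protoc --include_imports --descriptor_set_out=abc.desc abc.proto
# Real FileDescriptorSet bytes would come from protoc; these are placeholders.
fake_descriptor = b"\x0a\x07example"

with tempfile.TemporaryDirectory() as d:
    desc_path = pathlib.Path(d) / "abc.desc"
    desc_path.write_bytes(fake_descriptor)

    # Option 1: pass the file path; from_protobuf reads it in Python.
    # proto_df.select(from_protobuf(proto_df.value, message_name,
    #                               descFilePath=str(desc_path)))

    # Option 2: load the bytes yourself (from a file, a registry service,
    # an embedded constant, ...) and pass them directly.
    binary_set = desc_path.read_bytes()
    # proto_df.select(from_protobuf(proto_df.value, message_name,
    #                               binaryDescriptorSet=binary_set))

assert binary_set == fake_descriptor
```

Option 2 is what gives the flexibility the PR description mentions: the descriptor set no longer has to exist as a file reachable from the driver.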
[spark] branch master updated: [SPARK-43858][INFRA][FOLLOWUP] Restore the Scala version to 2.13 after run benchmark
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ce568b7b56b [SPARK-43858][INFRA][FOLLOWUP] Restore the Scala version to 2.13 after run benchmark ce568b7b56b is described below commit ce568b7b56b0f8ee89045f2de7a1eb6816d8156b Author: yangjie01 AuthorDate: Mon May 29 10:15:53 2023 -0700 [SPARK-43858][INFRA][FOLLOWUP] Restore the Scala version to 2.13 after run benchmark ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/41358, this pr change to restore Scala version to 2.13 after running benchmark to avoid containing changed `pom.xml` in the results tarball. ### Why are the changes needed? After running benchmark, should restore Scala to default version to results tarball includes changed `pom.xml`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #41371 from LuciferYang/SPARK-43858-FOLLOWUP. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- .github/workflows/benchmark.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml index 055f238bc61..f9699ac80f2 100644 --- a/.github/workflows/benchmark.yml +++ b/.github/workflows/benchmark.yml @@ -182,7 +182,7 @@ jobs: "`find . 
-name 'spark-core*-SNAPSHOT-tests.jar'`" \ "${{ github.event.inputs.class }}" # Revert to default Scala version to clean up unnecessary git diff -dev/change-scala-version.sh 2.12 +dev/change-scala-version.sh 2.13 # To keep the directory structure and file permissions, tar them # See also https://github.com/actions/upload-artifact#maintaining-file-permissions-and-case-sensitive-files echo "Preparing the benchmark results:"

[spark] branch master updated (8e18d687bd0 -> 15ffe91f447)
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8e18d687bd0 [SPARK-43858][INFRA] Make benchmark Github Action task use Scala 2.13 as default add 15ffe91f447 [SPARK-43860][SQL] Enable tail-recursion wherever possible No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala | 1 + .../scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala | 1 + .../spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala | 1 + 3 files changed, 3 insertions(+)
[spark] branch master updated (a9f13e1e54d -> 8e18d687bd0)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from a9f13e1e54d [SPARK-43849][BUILD] Enable unused imports check for Scala 2.13 add 8e18d687bd0 [SPARK-43858][INFRA] Make benchmark Github Action task use Scala 2.13 as default No new revisions were added by this update. Summary of changes: .github/workflows/benchmark.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (383d9fb4df8 -> a9f13e1e54d)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 383d9fb4df8 [SPARK-43680][SPARK-43681][SPARK-43682][SPARK-43683][PS] Fix `NullOps` for Spark Connect add a9f13e1e54d [SPARK-43849][BUILD] Enable unused imports check for Scala 2.13 No new revisions were added by this update. Summary of changes: .../apache/spark/sql/connect/planner/SparkConnectPlanner.scala | 4 pom.xml | 8 +++- project/SparkBuild.scala | 9 ++--- .../apache/spark/sql/catalyst/expressions/ExpressionSet.scala | 2 +- 4 files changed, 14 insertions(+), 9 deletions(-)
[spark] branch master updated: [SPARK-43680][SPARK-43681][SPARK-43682][SPARK-43683][PS] Fix `NullOps` for Spark Connect
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 383d9fb4df8 [SPARK-43680][SPARK-43681][SPARK-43682][SPARK-43683][PS] Fix `NullOps` for Spark Connect 383d9fb4df8 is described below commit 383d9fb4df81429d2d7d31a35f200ff152bf77f6 Author: itholic AuthorDate: Mon May 29 19:38:27 2023 +0800 [SPARK-43680][SPARK-43681][SPARK-43682][SPARK-43683][PS] Fix `NullOps` for Spark Connect ### What changes were proposed in this pull request? This PR proposes to fix `NullOps` test for pandas API on Spark with Spark Connect. This includes SPARK-43680, SPARK-43681, SPARK-43682, SPARK-43683 at once, because they are all related similar modifications in single file. ### Why are the changes needed? To support all features for pandas API on Spark with Spark Connect. ### Does this PR introduce _any_ user-facing change? Yes, `NullOps.lt`, `NullOps.le`, `NullOps.ge`, `NullOps.gt` are now working as expected on Spark Connect. ### How was this patch tested? Uncomment the UTs, and tested manually. Closes #41361 from itholic/SPARK-43680-3. 
Authored-by: itholic Signed-off-by: Ruifeng Zheng --- python/pyspark/pandas/data_type_ops/null_ops.py| 34 +- .../connect/data_type_ops/test_parity_null_ops.py | 16 -- 2 files changed, 21 insertions(+), 29 deletions(-) diff --git a/python/pyspark/pandas/data_type_ops/null_ops.py b/python/pyspark/pandas/data_type_ops/null_ops.py index 9205d5e2407..ddd7bddcfbd 100644 --- a/python/pyspark/pandas/data_type_ops/null_ops.py +++ b/python/pyspark/pandas/data_type_ops/null_ops.py @@ -30,8 +30,8 @@ from pyspark.pandas.data_type_ops.base import ( ) from pyspark.pandas._typing import SeriesOrIndex from pyspark.pandas.typedef import pandas_on_spark_type -from pyspark.sql import Column from pyspark.sql.types import BooleanType, StringType +from pyspark.sql.utils import pyspark_column_op, is_remote class NullOps(DataTypeOps): @@ -44,28 +44,36 @@ class NullOps(DataTypeOps): return "nulls" def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex: -from pyspark.pandas.base import column_op - _sanitize_list_like(right) -return column_op(Column.__lt__)(left, right) +result = pyspark_column_op("__lt__")(left, right) +if is_remote: +# In Spark Connect, it returns None instead of False, so we manually cast it. +result = result.fillna(False) +return result def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex: -from pyspark.pandas.base import column_op - _sanitize_list_like(right) -return column_op(Column.__le__)(left, right) +result = pyspark_column_op("__le__")(left, right) +if is_remote: +# In Spark Connect, it returns None instead of False, so we manually cast it. +result = result.fillna(False) +return result def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex: -from pyspark.pandas.base import column_op - _sanitize_list_like(right) -return column_op(Column.__ge__)(left, right) +result = pyspark_column_op("__ge__")(left, right) +if is_remote: +# In Spark Connect, it returns None instead of False, so we manually cast it. 
+result = result.fillna(False) +return result def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex: -from pyspark.pandas.base import column_op - _sanitize_list_like(right) -return column_op(Column.__gt__)(left, right) +result = pyspark_column_op("__gt__")(left, right) +if is_remote: +# In Spark Connect, it returns None instead of False, so we manually cast it. +result = result.fillna(False) +return result def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike: dtype, spark_type = pandas_on_spark_type(dtype) diff --git a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_null_ops.py b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_null_ops.py index 00bfb75087a..1b53a064971 100644 --- a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_null_ops.py +++ b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_null_ops.py @@ -33,22 +33,6 @@ class NullOpsParityTests( def test_eq(self): super().test_eq() -@unittest.skip("TODO(SPARK-43680): Fix NullOps.ge to work with Spark Connect Column.") -def test_ge(self): -super().test_ge() - -@unittest.skip("TODO(SPARK-43681): Fix NullOps.gt to work with Spark Connect Column.") -def test_gt(self): -
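The `fillna(False)` calls above compensate for SQL's three-valued logic: a comparison involving NULL evaluates to NULL, which Spark Connect surfaces as `None` where pandas semantics require `False`. A small pure-Python illustration of the mismatch and of the fix (the helper names are invented for this sketch):

```python
def sql_lt(left, right):
    """SQL-style `<`: any comparison involving NULL (None) yields NULL."""
    if left is None or right is None:
        return None
    return left < right


def fillna(values, fill):
    """Replace None entries with a fill value, like Series.fillna."""
    return [fill if v is None else v for v in values]


lefts = [None, None, 1]
rights = [None, 2, 2]

raw = [sql_lt(left, right) for left, right in zip(lefts, rights)]
assert raw == [None, None, True]  # Connect-style result: None, not False

# pandas expects comparisons against missing values to be False,
# which is exactly what the post-comparison fillna(False) restores.
assert fillna(raw, False) == [False, False, True]
```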
[spark] branch master updated: [SPARK-43676][SPARK-43677][SPARK-43678][SPARK-43679][PS] Fix `DatetimeOps` for Spark Connect
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 373975b8ef3 [SPARK-43676][SPARK-43677][SPARK-43678][SPARK-43679][PS] Fix `DatetimeOps` for Spark Connect
373975b8ef3 is described below

commit 373975b8ef35afc80d396e69081173875022a970
Author: itholic
AuthorDate: Mon May 29 19:36:38 2023 +0800

    [SPARK-43676][SPARK-43677][SPARK-43678][SPARK-43679][PS] Fix `DatetimeOps` for Spark Connect

    ### What changes were proposed in this pull request?
    This PR proposes to fix the `DatetimeOps` tests for pandas API on Spark with Spark Connect. It covers SPARK-43676, SPARK-43677, SPARK-43678 and SPARK-43679 at once, because they all involve similar modifications to a single file.

    ### Why are the changes needed?
    To support all features for pandas API on Spark with Spark Connect.

    ### Does this PR introduce _any_ user-facing change?
    Yes, `DatetimeOps.lt`, `DatetimeOps.le`, `DatetimeOps.ge` and `DatetimeOps.gt` now work as expected on Spark Connect.

    ### How was this patch tested?
    Uncommented the UTs and tested manually.

    Closes #41306 from itholic/SPARK-43676-9.
Authored-by: itholic
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/pandas/data_type_ops/datetime_ops.py    | 17 +
 .../connect/data_type_ops/test_parity_datetime_ops.py  | 16 -
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/datetime_ops.py b/python/pyspark/pandas/data_type_ops/datetime_ops.py
index 5069598..c5f4df96bde 100644
--- a/python/pyspark/pandas/data_type_ops/datetime_ops.py
+++ b/python/pyspark/pandas/data_type_ops/datetime_ops.py
@@ -33,6 +33,7 @@ from pyspark.sql.types import (
     TimestampNTZType,
     NumericType,
 )
+from pyspark.sql.utils import pyspark_column_op
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
 from pyspark.pandas.base import IndexOpsMixin
@@ -109,28 +110,20 @@ class DatetimeOps(DataTypeOps):
         raise TypeError("Datetime subtraction can only be applied to datetime series.")
 
     def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__lt__)(left, right)
+        return pyspark_column_op("__lt__")(left, right)
 
     def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__le__)(left, right)
+        return pyspark_column_op("__le__")(left, right)
 
     def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__ge__)(left, right)
+        return pyspark_column_op("__ge__")(left, right)
 
     def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__gt__)(left, right)
+        return pyspark_column_op("__gt__")(left, right)
 
     def prepare(self, col: pd.Series) -> pd.Series:
         """Prepare column when from_pandas."""
diff --git a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_datetime_ops.py b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_datetime_ops.py
index 697c191b743..6d081b10aba 100644
--- a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_datetime_ops.py
+++ b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_datetime_ops.py
@@ -34,22 +34,6 @@ class DatetimeOpsParityTests(
     def test_astype(self):
         super().test_astype()
 
-    @unittest.skip("TODO(SPARK-43676): Fix DatetimeOps.ge to work with Spark Connect Column.")
-    def test_ge(self):
-        super().test_ge()
-
-    @unittest.skip("TODO(SPARK-43677): Fix DatetimeOps.gt to work with Spark Connect Column.")
-    def test_gt(self):
-        super().test_gt()
-
-    @unittest.skip("TODO(SPARK-43678): Fix DatetimeOps.le to work with Spark Connect Column.")
-    def test_le(self):
-        super().test_le()
-
-    @unittest.skip("TODO(SPARK-43679): Fix DatetimeOps.lt to work with Spark Connect Column.")
-    def test_lt(self):
-        super().test_lt()
-
 
 if __name__ == "__main__":
     from pyspark.pandas.tests.connect.data_type_ops.test_parity_datetime_ops import *  # noqa: F401

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
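`pyspark_column_op` selects the comparison by its dunder name instead of referencing `Column.__lt__` directly, which keeps one code path for both classic and Connect Columns. The name-based dispatch can be sketched in plain Python (a hypothetical helper, not the real PySpark API):

```python
from typing import Any, Callable

def column_op(op_name: str) -> Callable[[Any, Any], Any]:
    """Resolve a binary operator by its dunder name, the way the patch
    replaces column_op(Column.__lt__) with pyspark_column_op("__lt__")."""
    def apply(left: Any, right: Any) -> Any:
        # Look the method up on the runtime type of the left operand.
        return getattr(type(left), op_name)(left, right)
    return apply

print(column_op("__lt__")(3, 5))  # True
print(column_op("__ge__")(3, 5))  # False
```

Because the operator is named rather than bound at import time, the same call works whatever Column class the session hands back.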
[spark] branch master updated: [SPARK-43779][SQL] ParseToDate should load the EvalMode in main thread
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 88f69d6f928 [SPARK-43779][SQL] ParseToDate should load the EvalMode in main thread
88f69d6f928 is described below

commit 88f69d6f92860823b1a90bc162ebca2b7c8132fc
Author: Rui Wang
AuthorDate: Mon May 29 16:57:16 2023 +0800

    [SPARK-43779][SQL] ParseToDate should load the EvalMode in main thread

    ### What changes were proposed in this pull request?
    ParseToDate should load the EvalMode on the main thread instead of in a lazy val.

    ### Why are the changes needed?
    It is hard to predict when a lazy val is evaluated, while the SQLConf from which the EvalMode is loaded is thread-local, so a lazy val may read the configuration of the wrong thread.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Existing UT

    Closes #41298 from amaliujia/SPARK-43779.

    Authored-by: Rui Wang
    Signed-off-by: Wenchen Fan
---
 .../catalyst/expressions/datetimeExpressions.scala |  9 +---
 .../sql-tests/analyzer-results/ansi/date.sql.out   |  8 +++
 .../ansi/datetime-parsing-invalid.sql.out          |  4 ++--
 .../sql-tests/analyzer-results/date.sql.out        |  8 +++
 .../analyzer-results/datetime-legacy.sql.out       |  8 +++
 .../datetime-parsing-invalid.sql.out               |  4 ++--
 .../analyzer-results/group-by-filter.sql.out       |  8 +++
 .../analyzer-results/postgreSQL/text.sql.out       |  4 ++--
 .../analyzer-results/predicate-functions.sql.out   | 26 +++---
 .../analyzer-results/timestamp-ltz.sql.out         |  2 +-
 .../analyzer-results/timestamp-ntz.sql.out         |  2 +-
 11 files changed, 43 insertions(+), 40 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 57c70f4d9bd..51ddf2b85f8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -2044,12 +2044,15 @@ case class MonthsBetween(
 case class ParseToDate(
     left: Expression,
     format: Option[Expression],
-    timeZoneId: Option[String] = None)
+    timeZoneId: Option[String] = None,
+    ansiEnabled: Boolean = SQLConf.get.ansiEnabled)
   extends RuntimeReplaceable with ImplicitCastInputTypes with TimeZoneAwareExpression {
 
   override lazy val replacement: Expression = format.map { f =>
-    Cast(GetTimestamp(left, f, TimestampType, timeZoneId), DateType, timeZoneId)
-  }.getOrElse(Cast(left, DateType, timeZoneId)) // backwards compatibility
+    Cast(GetTimestamp(left, f, TimestampType, timeZoneId, ansiEnabled), DateType, timeZoneId,
+      EvalMode.fromBoolean(ansiEnabled))
+  }.getOrElse(Cast(left, DateType, timeZoneId,
+    EvalMode.fromBoolean(ansiEnabled))) // backwards compatibility
 
   def this(left: Expression, format: Expression) = {
     this(left, Option(format))
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/date.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/date.sql.out
index 28fe86d930f..3765d65ec3b 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/date.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/date.sql.out
@@ -148,21 +148,21 @@ select UNIX_DATE(DATE('1970-01-01')), UNIX_DATE(DATE('2020-12-04')), UNIX_DATE(null)
 -- !query
 select to_date(null), to_date('2016-12-31'), to_date('2016-12-31', 'yyyy-MM-dd')
 -- !query analysis
-Project [to_date(cast(null as string), None, Some(America/Los_Angeles)) AS to_date(NULL)#x, to_date(2016-12-31, None, Some(America/Los_Angeles)) AS to_date(2016-12-31)#x, to_date(2016-12-31, Some(yyyy-MM-dd), Some(America/Los_Angeles)) AS to_date(2016-12-31, yyyy-MM-dd)#x]
+Project [to_date(cast(null as string), None, Some(America/Los_Angeles), true) AS to_date(NULL)#x, to_date(2016-12-31, None, Some(America/Los_Angeles), true) AS to_date(2016-12-31)#x, to_date(2016-12-31, Some(yyyy-MM-dd), Some(America/Los_Angeles), true) AS to_date(2016-12-31, yyyy-MM-dd)#x]
 +- OneRowRelation
 
 
 -- !query
 select to_date("16", "dd")
 -- !query analysis
-Project [to_date(16, Some(dd), Some(America/Los_Angeles)) AS to_date(16, dd)#x]
+Project [to_date(16, Some(dd), Some(America/Los_Angeles), true) AS to_date(16, dd)#x]
 +- OneRowRelation
 
 
 -- !query
 select to_date("02-29", "MM-dd")
 -- !query analysis
-Project [to_date(02-29, Some(MM-dd), Some(America/Los_Angeles)) AS to_date(02-29, MM-dd)#x]
+Project [to_date(02-29, Some(MM-dd), Some(America/Los_Angeles), true) AS to_date(02-29, MM-dd)#x]
 +-
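The hazard this patch fixes is general: a lazily evaluated field that reads a thread-local value observes whatever thread happens to trigger it, not the thread that built the expression. A toy Python sketch of the idea (Spark's `SQLConf` is Scala; the class and attribute names here are illustrative only):

```python
import threading

_conf = threading.local()  # stands in for the thread-local SQLConf

class ParseToDateLazy:
    """Reads the thread-local conf only on first use (the old behavior)."""
    def __init__(self):
        self._ansi = None
    @property
    def ansi(self):
        if self._ansi is None:
            self._ansi = getattr(_conf, "ansi", False)  # evaluated lazily
        return self._ansi

class ParseToDateEager:
    """Captures the conf on the constructing (main) thread (the fix)."""
    def __init__(self):
        self.ansi = getattr(_conf, "ansi", False)

_conf.ansi = True  # the main thread enables ANSI mode
lazy, eager = ParseToDateLazy(), ParseToDateEager()

seen = {}
def worker():  # another thread, where the lazy field fires first
    seen["lazy"] = lazy.ansi
    seen["eager"] = eager.ansi

t = threading.Thread(target=worker)
t.start(); t.join()
print(seen)  # {'lazy': False, 'eager': True}
```

Capturing `ansiEnabled` as a constructor default (`SQLConf.get.ansiEnabled`) is the moral equivalent of `ParseToDateEager` here: the value is pinned when the expression is created on the main thread.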
[spark] branch master updated (27bb384947e -> 8b464df9fcf)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 27bb384947e [SPARK-43841][SQL] Handle candidate attributes with no prefix in `StringUtils#orderSuggestedIdentifiersBySimilarity` add 8b464df9fcf [SPARK-43846][SQL][TESTS] Use checkError() to check Exception in SessionCatalogSuite No new revisions were added by this update. Summary of changes: .../sql/catalyst/catalog/SessionCatalogSuite.scala | 462 - 1 file changed, 277 insertions(+), 185 deletions(-)
[spark] branch master updated (31a8ef803a8 -> 27bb384947e)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 31a8ef803a8 [SPARK-43821][CONNECT][TESTS] Make the prompt for `findJar` method in IntegrationTestUtils clearer add 27bb384947e [SPARK-43841][SQL] Handle candidate attributes with no prefix in `StringUtils#orderSuggestedIdentifiersBySimilarity` No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/util/StringUtils.scala | 2 +- .../spark/sql/catalyst/util/StringUtilsSuite.scala | 7 ++ .../sql/errors/QueryCompilationErrorsSuite.scala | 27 ++ 3 files changed, 35 insertions(+), 1 deletion(-)