[spark] branch slow deleted (was 6650667c407)
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to branch slow
in repository https://gitbox.apache.org/repos/asf/spark.git

     was 6650667c407 fix

This change permanently discards the following revisions:

 discard 6650667c407 fix
 discard e0df57fdbbd Use ExpressionSet and add UT
 discard b042b755f93 [CARMEL-6796] Pull out complex aggregate expressions
 discard 29bd591cd2f [carmel-6785] Add configuration spark.sql.materializedView.name.prefix to support definition of available mv (#1307)
 discard b3cce6eb3a0 [MINOR] Bug fix for HiveUDF codegen (#1303)
 discard 905523e4cd2 [CARMEL-6734] Fix reorder filter condition issue (#1293)
 discard b3036d65479 [CARMEL-6741] Tag Queries with Join Expansion in Runtime (#1299)
 discard a6d9e7c163d [CARMEL-5941] Add how to request HDM data access to error message (#1301)
 discard 285aff3d2a4 [CARMEL-6739][SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions (#1300)
 discard d803e943edb [CARMEL-6751] More reasonable error message when heavily skewed partition (#1297)
 discard 2e9d18bd5dd [CARMEL-6760] Correct metadata 'mv_updated_time' which is used for data validation (#1298)
 discard 2973a317042 [CARMEL-6633] Reduce Skew Join Split Size Considering Expand Node (#1295)
 discard 46be35accc9 [CARMEL-6664] Fix the InterruptedException error and incorrect state for cancelled download (#1291)
 discard ee430edaf39 [CARMEL-6703] Take too much time to build bloom filter (#1292)
 discard f76533f3acc [Carmel 6640] Support create materialized view as datasource table (#1280)
 discard 7efe9b78cff [CARMEL-6647] Enhance DemoteBucketJoin to support Alias Aware Output Partitioning (#1290)
 discard e7d9e88ddcb [CARMEL-6583][SPARK-42500][SQL] ConstantPropagation support more cases (#1287)
 discard d8fa6e4e368 [CARMEL-6621][SPARK-42789][SQL] Rewrite multiple GetJsonObjects to a JsonTuple if their json expressions are the same (#1286)
 discard 5b0fffed633 [CARMEL-6705] Bug fix for query output row count (#1289)
 discard 96c7333cb22 [CARMEL-6615] Backport [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF (#1288)
 discard a53b95c0c4d [CARMEL-6675] Support to enable decommission nodes when hive service discovery is disabled (#1282)
 discard c08771ff897 [CARMEL-6683][SPARK-31008][SQL] Support json_array_length function (#1284)
 discard b243ff8b59e [CARMEL-6674] Project fail to be collapsed (#1283)
 discard 93f89f941dc [CARMEL-6587] Support Generic Skew Join Patterns (#1281)
 discard 81d76f6b116 [CARMEL-6652][SPARK-40501][SQL] Add PushProjectionThroughLimit for Optimizer (#1278)
 discard 3e8e629a2d0 [CARMEL-6655] Do not trim whitespaces by default when downloading data as CSV file (#1279)
 discard b3e8faafbda [CARMEL-6651] Remove repartition if it is the child of LocalLimit (#1277)
 discard 2ddd418edf4 [CARMEL-6632] Fix the running time of download statement in query log (#1275)
 discard 9e899c166ce [CARMEL-6586] Ignore SinglePartition when determining expectedChildrenNumPartitions (#1252)
 discard 25a3b904b61 [CARMEL-6327] Support Broadcast Join with Stream Side Skew (#1272)
 discard c3f17a52289 [CARMEL-6608] Increase bucket table scan partitions (#1269)
 discard 434b16e0cc8 [CARMEL-6439] Define new query execution event and log column lineage asynchronously (#1268)
 discard e77e54fd7da [CARMEL-6609] Casts types according to bucket info support view (#1270)
 discard 872399e7a5e [CARMEL-6593] Upgrade parquet to 1.12.3.0.1.0 (#1271)
 discard 40900d24073 [CARMEL-6604] Stop posting duplicate execution event (#1267)
 discard e2036ae1eda [CARMEL-6591][SPARK-42597] Support unwrap date type to timestamp type (#1263)
 discard f7c77ff2bbd [CARMEL-6582][SPARK-42513][SQL] Push down topK through join (#1257)
 discard a214b2b67fd [CARMEL-6541] Support Query Level SQL Conf leveraging Hint (#1256)
 discard 10a27944ccf [CARMEL-6568] Analyze join operator and support data expansion check (#1265)
 discard 56c9bad9395 [CARMEL-6511] Disable rename temp table (#1254)
 discard 71489e91204 [CARMEL-6371] Partial aggregation push through left/right outer join (#1233)
 discard 7905f88b1f3 [CARMEL-6581] TakeOrderedAndProject should not replace project if project expression is not deterministic (#1249)
 discard 142b10b1580 [CARMEL-6339] Implementation of materialized view (#1244)
 discard 1cd2cbcaa2f [CARMEL-6556] Avoid coalesce partitions from different UNION sides (#1246)
 discard 413224b3270 [CARMEL-6495] Do not quit am even when all nodes are in blacklist (#1248)
 discard 9f306b4ddf4 [CARMEL-6439] Add configuration to enable log column lineage (#1238)
 discard 5f91f872777 [CARMEL-6553] Backport [SPARK-35673][SQL] Fix user-defined hint and unrecognized hint in subquery (#1236)
 discard d8b439948dd [CARMEL-6525][MINOR] Support tag different drivers in the queue (#1237)
 discard c4930985ae1 [CARMEL-6537] [Followup] Support Iceberg with maven dependency in Carmel - use correct jar (#1234)
 discard
[spark] branch master updated: [SPARK-43817][SPARK-43702][PYTHON] Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 916b0d3de97 [SPARK-43817][SPARK-43702][PYTHON] Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas
916b0d3de97 is described below

commit 916b0d3de973b8b30a8ede3d56b9f8a70512
Author: Takuya UESHIN
AuthorDate: Sun May 28 08:47:35 2023 +0800

    [SPARK-43817][SPARK-43702][PYTHON] Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas

    ### What changes were proposed in this pull request?

    Support `UserDefinedType` in `createDataFrame` from pandas DataFrame and `toPandas`.

    For the following schema and pandas DataFrame:

    ```py
    schema = (
        StructType()
        .add("point", ExamplePointUDT())
        .add("struct", StructType().add("point", ExamplePointUDT()))
        .add("array", ArrayType(ExamplePointUDT()))
        .add("map", MapType(StringType(), ExamplePointUDT()))
    )
    data = [
        Row(
            ExamplePoint(1.0, 2.0),
            Row(ExamplePoint(3.0, 4.0)),
            [ExamplePoint(5.0, 6.0)],
            dict(point=ExamplePoint(7.0, 8.0)),
        )
    ]
    df = spark.createDataFrame(data, schema)
    pdf = pd.DataFrame.from_records(data, columns=schema.names)
    ```

    # `spark.createDataFrame()`

    All variants return the same results:

    ```py
    >>> spark.createDataFrame(pdf, schema).show(truncate=False)
    +----------+------------+------------+---------------------+
    |point     |struct      |array       |map                  |
    +----------+------------+------------+---------------------+
    |(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}|
    +----------+------------+------------+---------------------+
    ```

    # `df.toPandas()`

    ```py
    >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
    >>> df.toPandas()
           point        struct        array                   map
    0  (1.0,2.0)  ((3.0,4.0),)  [(5.0,6.0)]  {'point': (7.0,8.0)}
    ```

    ### Why are the changes needed?

    Currently `UserDefinedType` in `spark.createDataFrame()` with pandas DataFrame and `df.toPandas()` is not supported with Arrow enabled, or in Spark Connect.

    # `spark.createDataFrame()`

    Works without Arrow:

    ```py
    >>> spark.createDataFrame(pdf, schema).show(truncate=False)
    +----------+------------+------------+---------------------+
    |point     |struct      |array       |map                  |
    +----------+------------+------------+---------------------+
    |(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}|
    +----------+------------+------------+---------------------+
    ```

    , whereas:

    - With Arrow: works with fallback:

    ```py
    >>> spark.createDataFrame(pdf, schema).show(truncate=False)
    /.../python/pyspark/sql/pandas/conversion.py:351: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
      [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not supported in conversion to Arrow.
    Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
      warn(msg)
    +----------+------------+------------+---------------------+
    |point     |struct      |array       |map                  |
    +----------+------------+------------+---------------------+
    |(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}|
    +----------+------------+------------+---------------------+
    ```

    - Spark Connect:

    ```py
    >>> spark.createDataFrame(pdf, schema).show(truncate=False)
    Traceback (most recent call last):
    ...
    pyspark.errors.exceptions.base.PySparkTypeError: [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not supported in conversion to Arrow.
    ```

    # `df.toPandas()`

    Works without Arrow:

    ```py
    >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
    >>> df.toPandas()
           point        struct        array                   map
    0  (1.0,2.0)  ((3.0,4.0),)  [(5.0,6.0)]  {'point': (7.0,8.0)}
    ```

    , whereas:

    - With Arrow: works with fallback:

    ```py
    >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
    >>> df.toPandas()
    /.../python/pyspark/sql/pandas/conversion.py:111: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
      [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not
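Conceptually, supporting a UDT here means walking the nested types (struct, array, map) and applying the UDT's `serialize` method to each value before the Arrow conversion. A minimal, PySpark-free sketch of that recursive walk — `ExamplePointUDT` and `apply_udt` below are hypothetical stand-ins for illustration, not Spark's actual classes:

```python
from typing import Any


class ExamplePointUDT:
    """Hypothetical UDT: serializes a point tuple to a plain list."""

    def serialize(self, obj: Any) -> list:
        return [obj[0], obj[1]]


def apply_udt(value: Any, dtype: Any) -> Any:
    """Recursively convert UDT values nested inside structs/arrays."""
    if isinstance(dtype, ExamplePointUDT):
        return dtype.serialize(value)
    if isinstance(dtype, dict):  # struct-like: field name -> field type
        return {k: apply_udt(value[k], t) for k, t in dtype.items()}
    if isinstance(dtype, list):  # array-like: [element type]
        return [apply_udt(v, dtype[0]) for v in value]
    return value  # non-UDT leaf types pass through unchanged


point = ExamplePointUDT()
schema = {"point": point, "struct": {"point": point}, "array": [point]}
row = {"point": (1.0, 2.0), "struct": {"point": (3.0, 4.0)}, "array": [(5.0, 6.0)]}
print(apply_udt(row, schema))
# {'point': [1.0, 2.0], 'struct': {'point': [3.0, 4.0]}, 'array': [[5.0, 6.0]]}
```

The inverse direction (`toPandas`) applies the UDT's `deserialize` in the same recursive fashion after the Arrow-to-pandas conversion.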
[spark] branch master updated: [SPARK-43671][PS][FOLLOWUP] Refine `CategoricalOps` functions
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 001da5d003c [SPARK-43671][PS][FOLLOWUP] Refine `CategoricalOps` functions
001da5d003c is described below

commit 001da5d003caef3cda9978d35967ade55837e0bc
Author: itholic
AuthorDate: Sun May 28 08:44:16 2023 +0800

    [SPARK-43671][PS][FOLLOWUP] Refine `CategoricalOps` functions

    ### What changes were proposed in this pull request?
    This PR is a follow-up for SPARK-43671, refining the functions to use the `pyspark_column_op` util to clean up the code.

    ### Why are the changes needed?
    To avoid using `is_remote` in too many places, for future maintenance.

    ### Does this PR introduce _any_ user-facing change?
    No, it's code cleanup.

    ### How was this patch tested?
    The existing CI should pass.

    Closes #41326 from itholic/categorical_followup.

    Authored-by: itholic
    Signed-off-by: Ruifeng Zheng
---
 .../pandas/data_type_ops/categorical_ops.py | 69 +-
 1 file changed, 14 insertions(+), 55 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/categorical_ops.py b/python/pyspark/pandas/data_type_ops/categorical_ops.py
index 9f14a4b1ee7..66e181a6079 100644
--- a/python/pyspark/pandas/data_type_ops/categorical_ops.py
+++ b/python/pyspark/pandas/data_type_ops/categorical_ops.py
@@ -16,19 +16,18 @@
 #
 from itertools import chain
-from typing import cast, Any, Callable, Union
+from typing import cast, Any, Union
 
 import pandas as pd
 import numpy as np
 from pandas.api.types import is_list_like, CategoricalDtype  # type: ignore[attr-defined]
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
-from pyspark.pandas.base import column_op, IndexOpsMixin
+from pyspark.pandas.base import IndexOpsMixin
 from pyspark.pandas.data_type_ops.base import _sanitize_list_like, DataTypeOps
 from pyspark.pandas.typedef import pandas_on_spark_type
 from pyspark.sql import functions as F
-from pyspark.sql.column import Column as PySparkColumn
-from pyspark.sql.utils import is_remote
+from pyspark.sql.utils import pyspark_column_op
 
 
 class CategoricalOps(DataTypeOps):
@@ -66,73 +65,33 @@ class CategoricalOps(DataTypeOps):
     def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
-        if is_remote():
-            from pyspark.sql.connect.column import Column as ConnectColumn
-
-            Column = ConnectColumn
-        else:
-            Column = PySparkColumn  # type: ignore[assignment]
-        return _compare(
-            left, right, Column.__eq__, is_equality_comparison=True  # type: ignore[arg-type]
-        )
+        return _compare(left, right, "__eq__", is_equality_comparison=True)
 
     def ne(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
-        if is_remote():
-            from pyspark.sql.connect.column import Column as ConnectColumn
-
-            Column = ConnectColumn
-        else:
-            Column = PySparkColumn  # type: ignore[assignment]
-        return _compare(
-            left, right, Column.__ne__, is_equality_comparison=True  # type: ignore[arg-type]
-        )
+        return _compare(left, right, "__ne__", is_equality_comparison=True)
 
     def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
-        if is_remote():
-            from pyspark.sql.connect.column import Column as ConnectColumn
-
-            Column = ConnectColumn
-        else:
-            Column = PySparkColumn  # type: ignore[assignment]
-        return _compare(left, right, Column.__lt__)  # type: ignore[arg-type]
+        return _compare(left, right, "__lt__")
 
     def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
-        if is_remote():
-            from pyspark.sql.connect.column import Column as ConnectColumn
-
-            Column = ConnectColumn
-        else:
-            Column = PySparkColumn  # type: ignore[assignment]
-        return _compare(left, right, Column.__le__)  # type: ignore[arg-type]
+        return _compare(left, right, "__le__")
 
     def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
-        if is_remote():
-            from pyspark.sql.connect.column import Column as ConnectColumn
-
-            Column = ConnectColumn
-        else:
-            Column = PySparkColumn  # type: ignore[assignment]
-        return _compare(left, right, Column.__gt__)  # type: ignore[arg-type]
+        return _compare(left, right, "__gt__")
 
     def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
         _sanitize_list_like(right)
-        if is_remote():
-            from
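The refactor above replaces per-method branching on `is_remote()` with passing the comparison's dunder *name* as a string and resolving it against whichever Column class the operand actually is. The general pattern can be sketched without PySpark — this `column_op` is a simplified stand-in for `pyspark_column_op`, not its real implementation:

```python
def column_op(method_name: str):
    """Build a binary op that looks up `method_name` on the left operand's
    own class, so one code path serves any Column implementation."""

    def op(left, right):
        return getattr(type(left), method_name)(left, right)

    return op


# The same factory works for any type that defines the dunder method,
# which is what lets one code path serve both plain and Connect Columns:
lt = column_op("__lt__")
print(lt(1, 2))      # True
print(lt("a", "b"))  # True
```

The design win is that the remote/local decision happens once, inside the shared util, instead of being repeated in every comparison method.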
[spark] branch master updated: [SPARK-43692][SPARK-43693][SPARK-43694][SPARK-43695][PS] Fix `StringOps` for Spark Connect
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6f0a73e457d [SPARK-43692][SPARK-43693][SPARK-43694][SPARK-43695][PS] Fix `StringOps` for Spark Connect
6f0a73e457d is described below

commit 6f0a73e457dd3c49a4adce996d7201010cdd2651
Author: itholic
AuthorDate: Sun May 28 08:41:44 2023 +0800

    [SPARK-43692][SPARK-43693][SPARK-43694][SPARK-43695][PS] Fix `StringOps` for Spark Connect

    ### What changes were proposed in this pull request?
    This PR proposes to fix the `StringOps` tests for pandas API on Spark with Spark Connect. It covers SPARK-43692, SPARK-43693, SPARK-43694, and SPARK-43695 at once, because they all involve similar modifications in a single file.

    ### Why are the changes needed?
    To support all features for pandas API on Spark with Spark Connect.

    ### Does this PR introduce _any_ user-facing change?
    Yes, `StringOps.lt`, `StringOps.le`, `StringOps.ge`, and `StringOps.gt` now work as expected on Spark Connect.

    ### How was this patch tested?
    Uncommented the UTs, and tested manually.

    Closes #41308 from itholic/SPARK-43692-5.

    Authored-by: itholic
    Signed-off-by: Ruifeng Zheng
---
 python/pyspark/pandas/data_type_ops/string_ops.py      | 18 +-
 .../connect/data_type_ops/test_parity_string_ops.py    | 16 
 python/pyspark/sql/utils.py                            | 17 +
 3 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/string_ops.py b/python/pyspark/pandas/data_type_ops/string_ops.py
index 0b9eb87a163..e5818cb4635 100644
--- a/python/pyspark/pandas/data_type_ops/string_ops.py
+++ b/python/pyspark/pandas/data_type_ops/string_ops.py
@@ -22,6 +22,7 @@ from pandas.api.types import CategoricalDtype
 
 from pyspark.sql import functions as F
 from pyspark.sql.types import IntegralType, StringType
+from pyspark.sql.utils import pyspark_column_op
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
 from pyspark.pandas.base import column_op, IndexOpsMixin
@@ -34,7 +35,6 @@ from pyspark.pandas.data_type_ops.base import (
 )
 from pyspark.pandas.spark import functions as SF
 from pyspark.pandas.typedef import extension_dtypes, pandas_on_spark_type
-from pyspark.sql import Column
 from pyspark.sql.types import BooleanType
 
 
@@ -104,28 +104,20 @@ class StringOps(DataTypeOps):
             raise TypeError("Multiplication can not be applied to given types.")
 
     def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__lt__)(left, right)
+        return pyspark_column_op("__lt__")(left, right)
 
     def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__le__)(left, right)
+        return pyspark_column_op("__le__")(left, right)
 
     def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__ge__)(left, right)
+        return pyspark_column_op("__ge__")(left, right)
 
     def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-        from pyspark.pandas.base import column_op
-
         _sanitize_list_like(right)
-        return column_op(Column.__gt__)(left, right)
+        return pyspark_column_op("__gt__")(left, right)
 
     def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:
         dtype, spark_type = pandas_on_spark_type(dtype)
diff --git a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py
index 9abfe1d1e09..2d81db1c701 100644
--- a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py
+++ b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py
@@ -34,22 +34,6 @@ class StringOpsParityTests(
     def test_astype(self):
         super().test_astype()
 
-    @unittest.skip("TODO(SPARK-43692): Fix StringOps.ge to work with Spark Connect.")
-    def test_ge(self):
-        super().test_ge()
-
-    @unittest.skip("TODO(SPARK-43693): Fix StringOps.gt to work with Spark Connect.")
-    def test_gt(self):
-        super().test_gt()
-
-    @unittest.skip("TODO(SPARK-43694): Fix StringOps.le to work with Spark Connect.")
-    def test_le(self):
-        super().test_le()
-
-    @unittest.skip("TODO(SPARK-43695): Fix StringOps.lt to work with Spark Connect.")
-    def test_lt(self):
-
[spark] branch master updated: [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new dc186c5e6b6 [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client
dc186c5e6b6 is described below

commit dc186c5e6b6bdb63345081ee9f70b8c102792cdd
Author: panbingkun
AuthorDate: Sun May 28 08:38:32 2023 +0800

    [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

    ### What changes were proposed in this pull request?
    The PR aims to implement the 'levenshtein(str1, str2[, threshold])' function in the Python client.

    ### Why are the changes needed?
    A max-distance argument was added to the levenshtein() function and has already been implemented on the Scala side, so we need to align `pyspark` with it.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    - Manual testing:
      python/run-tests --testnames 'python.pyspark.sql.tests.test_functions FunctionsTests.test_levenshtein_function'
    - Pass GA.

    Closes #41296 from panbingkun/SPARK-43773.

    Lead-authored-by: panbingkun
    Co-authored-by: panbingkun <84731...@qq.com>
    Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/connect/functions.py            |  9 +++--
 python/pyspark/sql/functions.py                    | 19 +--
 .../sql/tests/connect/test_connect_function.py     |  5 +
 python/pyspark/sql/tests/test_functions.py         |  7 +++
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/connect/functions.py b/python/pyspark/sql/connect/functions.py
index b7d7bc937cf..d3a05d6a1c6 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -1878,8 +1878,13 @@ def substring_index(str: "ColumnOrName", delim: str, count: int) -> Column:
 substring_index.__doc__ = pysparkfuncs.substring_index.__doc__
 
 
-def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
-    return _invoke_function_over_columns("levenshtein", left, right)
+def levenshtein(
+    left: "ColumnOrName", right: "ColumnOrName", threshold: Optional[int] = None
+) -> Column:
+    if threshold is None:
+        return _invoke_function_over_columns("levenshtein", left, right)
+    else:
+        return _invoke_function("levenshtein", _to_col(left), _to_col(right), lit(threshold))
 
 
 levenshtein.__doc__ = pysparkfuncs.levenshtein.__doc__
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index e9b71f7d617..fe35f12c402 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -6594,7 +6594,9 @@ def substring_index(str: "ColumnOrName", delim: str, count: int) -> Column:
 
 
 @try_remote_functions
-def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
+def levenshtein(
+    left: "ColumnOrName", right: "ColumnOrName", threshold: Optional[int] = None
+) -> Column:
     """Computes the Levenshtein distance of the two given strings.
 
     .. versionadded:: 1.5.0
@@ -6608,6 +6610,12 @@ def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
         first column value.
     right : :class:`~pyspark.sql.Column` or str
         second column value.
+    threshold : int, optional
+        if set when the levenshtein distance of the two given strings
+        less than or equal to a given threshold then return result distance, or -1
+
+        .. versionchanged: 3.5.0
+            Added ``threshold`` argument.
 
     Returns
     -------
@@ -6619,8 +6627,15 @@ def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
     >>> df0 = spark.createDataFrame([('kitten', 'sitting',)], ['l', 'r'])
     >>> df0.select(levenshtein('l', 'r').alias('d')).collect()
     [Row(d=3)]
+    >>> df0.select(levenshtein('l', 'r', 2).alias('d')).collect()
+    [Row(d=-1)]
     """
-    return _invoke_function_over_columns("levenshtein", left, right)
+    if threshold is None:
+        return _invoke_function_over_columns("levenshtein", left, right)
+    else:
+        return _invoke_function(
+            "levenshtein", _to_java_column(left), _to_java_column(right), threshold
+        )
 
 
 @try_remote_functions
diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py b/python/pyspark/sql/tests/connect/test_connect_function.py
index e274635d3c6..3e3b4dd5b16 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -1924,6 +1924,11 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, PandasOnSparkTestUtils, S
             cdf.select(CF.levenshtein(cdf.b, cdf.c)).toPandas(),
             sdf.select(SF.levenshtein(sdf.b,
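The threshold semantics introduced here (return -1 when the distance exceeds the threshold) can be illustrated with a plain-Python reference implementation. This is a sketch of the function's contract only, not Spark's actual code path, which delegates to the JVM implementation:

```python
from typing import Optional


def levenshtein(s1: str, s2: str, threshold: Optional[int] = None) -> int:
    """Edit distance between s1 and s2; -1 if a threshold is set and exceeded."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution
        prev = cur
    dist = prev[-1]
    return -1 if threshold is not None and dist > threshold else dist


print(levenshtein("kitten", "sitting"))     # 3, matching the docstring example
print(levenshtein("kitten", "sitting", 2))  # -1, distance 3 exceeds threshold 2
```

Note that Spark's JVM implementation can use the threshold to prune the dynamic-programming table and exit early; the sketch above computes the full distance and only applies the cutoff at the end.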
[spark] branch master updated: [SPARK-43830][BUILD] Update scalatest and scalatestplus related dependencies to newest version
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4c800478d5a [SPARK-43830][BUILD] Update scalatest and scalatestplus related dependencies to newest version
4c800478d5a is described below

commit 4c800478d5a6c76dca3ed1fc945de71182fd65e3
Author: panbingkun
AuthorDate: Sat May 27 18:12:52 2023 -0500

    [SPARK-43830][BUILD] Update scalatest and scalatestplus related dependencies to newest version

    ### What changes were proposed in this pull request?
    The PR aims to update the scalatest- and scalatestplus-related test dependencies to their newest versions:
    - scalatest: upgrade from 3.2.15 to 3.2.16
    - mockito
      - mockito-core: upgrade from 4.6.1 to 4.11.0
      - mockito-inline: upgrade from 4.6.1 to 4.11.0
    - selenium-java: upgrade from 4.7.2 to 4.9.1
    - htmlunit-driver: upgrade from 4.7.2 to 4.9.1
    - htmlunit: upgrade from 2.67.0 to 2.70.0
    - scalatestplus
      - scalacheck-1-17: upgrade from 3.2.15.0 to 3.2.16.0
      - mockito: upgrade from `mockito-4-6` 3.2.15.0 to `mockito-4-11` 3.2.16.0
      - selenium: upgrade from `selenium-4-7` 3.2.15.0 to `selenium-4-9` 3.2.16.0

    ### Why are the changes needed?
    The relevant release notes are as follows:
    - scalatest:
      - https://github.com/scalatest/scalatest/releases/tag/release-3.2.16
    - [mockito](https://github.com/mockito/mockito)
      - https://github.com/mockito/mockito/releases/tag/v4.11.0
      - https://github.com/mockito/mockito/releases/tag/v4.10.0
      - https://github.com/mockito/mockito/releases/tag/v4.9.0
      - https://github.com/mockito/mockito/releases/tag/v4.8.1
      - https://github.com/mockito/mockito/releases/tag/v4.8.0
      - https://github.com/mockito/mockito/releases/tag/v4.7.0
    - [selenium-java](https://github.com/SeleniumHQ/selenium)
      - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.9.1
      - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.9.0
      - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.3-java
      - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.2-java
      - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.1
      - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.0
    - [htmlunit-driver](https://github.com/SeleniumHQ/htmlunit-driver)
      - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.9.1
      - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.9.0
      - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.8.3
      - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.8.1.1
      - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/4.8.1
      - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/4.8.0
    - [htmlunit](https://github.com/HtmlUnit/htmlunit)
      - https://github.com/HtmlUnit/htmlunit/releases/tag/2.70.0
      - Why this version: Selenium 4.9.1 depends on it, see
        https://github.com/SeleniumHQ/selenium/blob/selenium-4.9.1/java/maven_deps.bzl#L83
    - [org.scalatestplus:scalacheck-1-17](https://github.com/scalatest/scalatestplus-scalacheck)
      - https://github.com/scalatest/scalatestplus-scalacheck/releases/tag/release-3.2.16.0-for-scalacheck-1.17
    - [org.scalatestplus:mockito-4-11](https://github.com/scalatest/scalatestplus-mockito)
      - https://github.com/scalatest/scalatestplus-mockito/releases/tag/release-3.2.16.0-for-mockito-4.11
    - [org.scalatestplus:selenium-4-9](https://github.com/scalatest/scalatestplus-selenium)
      - https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.16.0-for-selenium-4.9

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    - Pass GitHub Actions
    - Manual test:
      - ChromeUISeleniumSuite
      - RocksDBBackendChromeUIHistoryServerSuite

    ```
    build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.ui.ChromeUISeleniumSuite"

    build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.deploy.history.RocksDBBackendChromeUIHistoryServerSuite"
    ```

    Screenshot: https://github.com/apache/spark/assets/15246973/73349ffb-4198-4371-a741-411712d14712

    Closes #41341 from
[spark] branch master updated (7ce4dc64273 -> d052a454fda)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 7ce4dc64273 [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle instead of `cloudpickle` in torch distributor
     add d052a454fda [SPARK-43824][SPARK-43825][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_128[1-2]

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json     | 20 ++--
 .../spark/sql/errors/QueryCompilationErrors.scala    | 14 +++---
 .../apache/spark/sql/execution/command/views.scala   |  3 +--
 .../spark/sql/execution/SQLViewTestSuite.scala       | 16
 4 files changed, 26 insertions(+), 27 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle instead of `cloudpickle` in torch distributor
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7ce4dc64273 [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle instead of `cloudpickle` in torch distributor
7ce4dc64273 is described below

commit 7ce4dc642736d78e79dff8e0b671cfd1b5d44166
Author: Weichen Xu
AuthorDate: Sat May 27 09:06:39 2023 -0700

    [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle instead of `cloudpickle` in torch distributor

    ### What changes were proposed in this pull request?
    Use `pyspark.cloudpickle` instead of `cloudpickle` in the torch distributor.

    ### Why are the changes needed?
    To make the serialization and deserialization code consistent.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?

    Closes #41337 from WeichenXu123/fix-torch-distributor-import-cloudpickle.

    Authored-by: Weichen Xu
    Signed-off-by: Dongjoon Hyun
---
 python/pyspark/ml/torch/distributor.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/ml/torch/distributor.py b/python/pyspark/ml/torch/distributor.py
index 8a41bdcc886..ad8b4d8cc25 100644
--- a/python/pyspark/ml/torch/distributor.py
+++ b/python/pyspark/ml/torch/distributor.py
@@ -814,7 +814,7 @@ class TorchDistributor(Distributor):
     ) -> str:
         code = textwrap.dedent(
             f"""
-            import cloudpickle
+            from pyspark import cloudpickle
             import os
 
             if __name__ == "__main__":
-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
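Some background on why the distributor pickles at all: plain `pickle` serializes functions *by reference* (module plus qualified name), while cloudpickle serializes them *by value*, and importing the copy vendored as `pyspark.cloudpickle` on both ends keeps the pickling and unpickling library versions consistent. A stdlib-only sketch of the limitation being worked around (an illustration, not the distributor's code):

```python
import pickle

# Plain pickle stores functions by reference (module + qualified name), so a
# lambda or closure -- the typical shape of a train() function handed to
# TorchDistributor -- cannot be round-tripped at all:
try:
    pickle.dumps(lambda batch: batch * 2)
    plain_pickle_ok = True
except Exception:
    plain_pickle_ok = False

print(plain_pickle_ok)  # False

# cloudpickle instead captures the code object and closed-over state by value.
# PySpark vendors it as `pyspark.cloudpickle`, so having the generated worker
# script import that same copy means bytes written by the driver's cloudpickle
# are always read back by the exact same cloudpickle version.
```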
[spark] branch master updated: [SPARK-43530][PROTOBUF] Read descriptor file only once
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 64642b48351 [SPARK-43530][PROTOBUF] Read descriptor file only once
64642b48351 is described below

commit 64642b48351c0c4ef8f40ce7902b85b6f953bd8f
Author: Raghu Angadi
AuthorDate: Sat May 27 23:18:54 2023 +0800

    [SPARK-43530][PROTOBUF] Read descriptor file only once

    ### What changes were proposed in this pull request?
    Protobuf functions (`from_protobuf()` & `to_protobuf()`) take the file path of a descriptor file and use it to construct Protobuf descriptors. The main problem with this is that the file is read many times (e.g., at each executor). This is unnecessary and error-prone: for example, the file contents may be updated a couple of days after a streaming query starts, which could lead to various errors.

    **The fix**: Use the byte content (which is a serialized `FileDescriptorSet` proto). We read the content from the file once and carry the byte buffer. This also adds a new API where the byte buffer can be passed directly. That is useful when users fetch the content themselves (e.g., from S3, or extracted from Python Protobuf classes) and pass it to the Protobuf functions.

    **Note to reviewers**: This includes a lot of updates to test files, mainly because of the interface change to pass the buffer. I have left a few PR comments to help with the review.

    ### Why are the changes needed?
    Described above.

    ### Does this PR introduce _any_ user-facing change?
    Yes, this adds two new versions of the `from_protobuf()` and `to_protobuf()` API that take Protobuf bytes rather than a file path.

    ### How was this patch tested?
    - Unit tests

    Closes #41192 from rangadi/proto-file-buffer.

    Authored-by: Raghu Angadi
    Signed-off-by: yangjie01
---
 .../org/apache/spark/sql/protobuf/functions.scala      | 135 +++--
 .../org/apache/spark/sql/FunctionTestSuite.scala       |  12 +-
 .../apache/spark/sql/PlanGenerationTestSuite.scala     |   7 +-
 ..._protobuf_messageClassName_descFilePath.explain     |   2 +-
 ...f_messageClassName_descFilePath_options.explain     |   2 +-
 ..._protobuf_messageClassName_descFilePath.explain     |   2 +-
 ...f_messageClassName_descFilePath_options.explain     |   2 +-
 ...rom_protobuf_messageClassName_descFilePath.json     |   2 +-
 ...rotobuf_messageClassName_descFilePath.proto.bin     | Bin 156 -> 361 bytes
 ...obuf_messageClassName_descFilePath_options.json     |   2 +-
 ...messageClassName_descFilePath_options.proto.bin     | Bin 206 -> 409 bytes
 .../to_protobuf_messageClassName_descFilePath.json     |   2 +-
 ...rotobuf_messageClassName_descFilePath.proto.bin     | Bin 154 -> 359 bytes
 ...obuf_messageClassName_descFilePath_options.json     |   2 +-
 ...messageClassName_descFilePath_options.proto.bin     | Bin 204 -> 407 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala      |  33 ++---
 .../sql/connect/ProtoToParsedPlanTestSuite.scala       |   4 +-
 .../sql/protobuf/CatalystDataToProtobuf.scala          |   7 +-
 .../sql/protobuf/ProtobufDataToCatalyst.scala          |  14 +--
 .../org/apache/spark/sql/protobuf/functions.scala      | 114 +++--
 .../spark/sql/protobuf/utils/ProtobufUtils.scala       |  67 +-
 .../ProtobufCatalystDataConversionSuite.scala          |  29 ++---
 .../sql/protobuf/ProtobufFunctionsSuite.scala          |  65 +-
 .../spark/sql/protobuf/ProtobufSerdeSuite.scala        |  30 +++--
 core/src/main/resources/error/error-classes.json       |   7 +-
 docs/sql-error-conditions.md                           |   6 -
 .../spark/sql/errors/QueryCompilationErrors.scala      |  11 +-
 27 files changed, 393 insertions(+), 164 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
index c42f8417155..57ce013065e 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
@@ -16,10 +16,19 @@
  */
 package org.apache.spark.sql.protobuf
 
+import java.io.File
+import java.io.FileNotFoundException
+import java.nio.file.NoSuchFileException
+import java.util.Collections
+
 import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
+
+import org.apache.commons.io.FileUtils
 
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.sql.Column
+import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.functions.{fnWithOptions, lit}
 
 // scalastyle:off: object.name
@@ -35,7 +44,8 @@ object functions {
   * @param messageName
   *   the protobuf message name to look for in descriptor file.
   * @param descFilePath
   *
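The design point of this patch — read the descriptor file once at plan time and carry only the bytes into the distributed closures — can be sketched in a few lines of plain Python. The helper names `load_descriptor_bytes` and `make_task` are hypothetical illustrations, not Spark's API:

```python
import os
import tempfile
from pathlib import Path


def load_descriptor_bytes(desc_file_path: str) -> bytes:
    """Read the serialized FileDescriptorSet exactly once, on the driver."""
    return Path(desc_file_path).read_bytes()


def make_task(message_name: str, desc_bytes: bytes):
    # The closure captures the *bytes*, not the path: executors never touch
    # the filesystem, and later edits to the file cannot affect a running query.
    def task(payload: bytes) -> str:
        return f"decode {message_name} with {len(desc_bytes)}-byte descriptor"

    return task


with tempfile.NamedTemporaryFile(delete=False, suffix=".desc") as f:
    f.write(b"\x0a\x04test")  # stand-in for real FileDescriptorSet bytes
    path = f.name

desc = load_descriptor_bytes(path)  # the single read, at plan time
os.unlink(path)  # deleting the file afterwards is now harmless
task = make_task("BasicMessage", desc)
print(task(b""))  # decode BasicMessage with 6-byte descriptor
```

This is also what makes the new byte-buffer overloads natural: once everything downstream consumes bytes, a caller who already holds the serialized `FileDescriptorSet` (fetched from S3, embedded in generated code, etc.) can skip the file entirely.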