[spark] branch slow deleted (was 6650667c407)

2023-05-27 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to branch slow
in repository https://gitbox.apache.org/repos/asf/spark.git


 was 6650667c407 fix

This change permanently discards the following revisions:

 discard 6650667c407 fix
 discard e0df57fdbbd Use ExpressionSet and add UT
 discard b042b755f93 [CARMEL-6796] Pull out complex aggregate expressions
 discard 29bd591cd2f [carmel-6785] Add configuration 
spark.sql.materializedView.name.prefix to support definition of available mv 
(#1307)
 discard b3cce6eb3a0 [MINOR] Bug fix for HiveUDF codegen (#1303)
 discard 905523e4cd2 [CARMEL-6734] Fix reorder filter condition issue (#1293)
 discard b3036d65479 [CARMEL-6741] Tag Queries with Join Expansion in Runtime 
(#1299)
 discard a6d9e7c163d [CARMEL-5941] Add how to request HDM data access to error 
message (#1301)
 discard 285aff3d2a4 [CARMEL-6739][SPARK-43050][SQL] Fix construct aggregate 
expressions by replacing grouping functions (#1300)
 discard d803e943edb [CARMEL-6751] More reasonable error message when heavily 
skewed partition (#1297)
 discard 2e9d18bd5dd [CARMEL-6760] Correct metadata ‘mv_updated_time’ which is 
used for data validation (#1298)
 discard 2973a317042 [CARMEL-6633] Reduce Skew Join Split Size Considering 
Expand Node (#1295)
 discard 46be35accc9 [CARMEL-6664] Fix the InterruptedException error and 
incorrect state for cancelled download (#1291)
 discard ee430edaf39 [CARMEL-6703] Take too much time to build bloom filter 
(#1292)
 discard f76533f3acc [Carmel 6640] Support create materialized view as 
datasource table (#1280)
 discard 7efe9b78cff [CARMEL-6647] Enhance DemoteBucketJoin to support Alias 
Aware Output Partitioning (#1290)
 discard e7d9e88ddcb [CARMEL-6583][SPARK-42500][SQL] ConstantPropagation 
support more cases (#1287)
 discard d8fa6e4e368 [CARMEL-6621][SPARK-42789][SQL] Rewrite multiple 
GetJsonObjects to a JsonTuple if their json expressions are the same (#1286)
 discard 5b0fffed633 [CARMEL-6705] Bug fix for query output row count (#1289)
 discard 96c7333cb22 [CARMEL-6615] Backport [SPARK-42052][SQL] Codegen Support 
for HiveSimpleUDF (#1288)
 discard a53b95c0c4d [CARMEL-6675] Support to enable decommission nodes when 
hive service discovery is disabled (#1282)
 discard c08771ff897 [CARMEL-6683][SPARK-31008][SQL] Support json_array_length 
function (#1284)
 discard b243ff8b59e [CARMEL-6674] Project fail to be collapsed (#1283)
 discard 93f89f941dc [CARMEL-6587] Support Generic Skew Join Patterns (#1281)
 discard 81d76f6b116 [CARMEL-6652][SPARK-40501][SQL] Add 
PushProjectionThroughLimit for Optimizer (#1278)
 discard 3e8e629a2d0 [CARMEL-6655] Do not trim whitespaces by default when 
downloading data as CSV file (#1279)
 discard b3e8faafbda [CARMEL-6651] Remove repartition if it is the child of 
LocalLimit (#1277)
 discard 2ddd418edf4 [CARMEL-6632] Fix the running time of download statement 
in query log (#1275)
 discard 9e899c166ce [CARMEL-6586] Ignore SinglePartition when determining 
expectedChildrenNumPartitions (#1252)
 discard 25a3b904b61 [CARMEL-6327] Support Broadcast Join with Stream Side Skew 
(#1272)
 discard c3f17a52289 [CARMEL-6608] Increase bucket table scan partitions (#1269)
 discard 434b16e0cc8 [CARMEL-6439] Define new query execution event and log 
column lineage asynchronously (#1268)
 discard e77e54fd7da [CARMEL-6609] Casts types according to bucket info support 
view (#1270)
 discard 872399e7a5e [CARMEL-6593] Upgrade parquet to 1.12.3.0.1.0 (#1271)
 discard 40900d24073 [CARMEL-6604] Stop posting duplicate execution event 
(#1267)
 discard e2036ae1eda [CARMEL-6591][SPARK-42597] Support unwrap date type to 
timestamp type (#1263)
 discard f7c77ff2bbd [CARMEL-6582][SPARK-42513][SQL] Push down topK through 
join (#1257)
 discard a214b2b67fd [CARMEL-6541] Support Query Level SQL Conf leveraging Hint 
(#1256)
 discard 10a27944ccf [CARMEL-6568] Analyze join operator and support data 
expansion check (#1265)
 discard 56c9bad9395 [CARMEL-6511] Disable rename temp table (#1254)
 discard 71489e91204 [CARMEL-6371] Partial aggregation push through left/right 
outer join (#1233)
 discard 7905f88b1f3 [CARMEL-6581] TakeOrderedAndProject should not replace 
project if project expression is not deterministic (#1249)
 discard 142b10b1580 [CARMEL-6339] Implementation of materialized view (#1244)
 discard 1cd2cbcaa2f [CARMEL-6556] Avoid coalesce partitions from different 
UNION sides (#1246)
 discard 413224b3270 [CARMEL-6495] Do not quit am even when all nodes are in 
blacklist (#1248)
 discard 9f306b4ddf4 [CARMEL-6439] Add configuration to enable log column 
lineage (#1238)
 discard 5f91f872777 [CARMEL-6553] Backport [SPARK-35673][SQL] Fix user-defined 
hint and unrecognized hint in subquery (#1236)
 discard d8b439948dd [CARMEL-6525][MINOR] Support tag different drivers in the 
queue (#1237)
 discard c4930985ae1 [CARMEL-6537] [Followup] Support Iceberg with maven 
dependency in Carmel - use correct jar (#1234)
 discard 

[spark] branch master updated: [SPARK-43817][SPARK-43702][PYTHON] Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas

2023-05-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 916b0d3de97 [SPARK-43817][SPARK-43702][PYTHON] Support UserDefinedType 
in createDataFrame from pandas DataFrame and toPandas
916b0d3de97 is described below

commit 916b0d3de973b8b30a8ede3d56b9f8a70512
Author: Takuya UESHIN 
AuthorDate: Sun May 28 08:47:35 2023 +0800

[SPARK-43817][SPARK-43702][PYTHON] Support UserDefinedType in 
createDataFrame from pandas DataFrame and toPandas

### What changes were proposed in this pull request?

Support `UserDefinedType` in `createDataFrame` from pandas DataFrame and 
`toPandas`.

For the following schema and pandas DataFrame:

```py
schema = (
StructType()
.add("point", ExamplePointUDT())
.add("struct", StructType().add("point", ExamplePointUDT()))
.add("array", ArrayType(ExamplePointUDT()))
.add("map", MapType(StringType(), ExamplePointUDT()))
)
data = [
Row(
ExamplePoint(1.0, 2.0),
Row(ExamplePoint(3.0, 4.0)),
[ExamplePoint(5.0, 6.0)],
dict(point=ExamplePoint(7.0, 8.0)),
)
]

df = spark.createDataFrame(data, schema)

pdf = pd.DataFrame.from_records(data, columns=schema.names)
```

# `spark.createDataFrame()`

In all cases, it returns the same results:

```py
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
+----------+------------+------------+---------------------+
|point     |struct      |array       |map                  |
+----------+------------+------------+---------------------+
|(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}|
+----------+------------+------------+---------------------+
```

# `df.toPandas()`

```py
>>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
>>> df.toPandas()
       point        struct        array                   map
0  (1.0,2.0)  ((3.0,4.0),)  [(5.0,6.0)]  {'point': (7.0,8.0)}
```

### Why are the changes needed?

Currently, `UserDefinedType` is not supported in `spark.createDataFrame()` from 
pandas DataFrame and in `df.toPandas()` when Arrow is enabled or in Spark 
Connect.
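
For illustration, a minimal sketch (reusing the `pdf` and `schema` defined 
above) of toggling the Arrow-related configs involved, to compare the plain 
path with the Arrow path demonstrated below:

```py
# Non-Arrow path: UDTs already work here.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
spark.createDataFrame(pdf, schema).show(truncate=False)

# Arrow path: before this change it warns and falls back (or fails on Spark Connect).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
spark.createDataFrame(pdf, schema).show(truncate=False)
```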

# `spark.createDataFrame()`

Works without Arrow:

```py
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
+----------+------------+------------+---------------------+
|point     |struct      |array       |map                  |
+----------+------------+------------+---------------------+
|(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}|
+----------+------------+------------+---------------------+
```

, whereas:

- With Arrow:

Works with fallback:

```py
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
/.../python/pyspark/sql/pandas/conversion.py:351: UserWarning: 
createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not 
supported in conversion to Arrow.
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
+----------+------------+------------+---------------------+
|point     |struct      |array       |map                  |
+----------+------------+------------+---------------------+
|(1.0, 2.0)|{(3.0, 4.0)}|[(5.0, 6.0)]|{point -> (7.0, 8.0)}|
+----------+------------+------------+---------------------+
```

- Spark Connect

```py
>>> spark.createDataFrame(pdf, schema).show(truncate=False)
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkTypeError: 
[UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not supported 
in conversion to Arrow.
```

# `df.toPandas()`

Works without Arrow:

```py
>>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
>>> df.toPandas()
       point        struct        array                   map
0  (1.0,2.0)  ((3.0,4.0),)  [(5.0,6.0)]  {'point': (7.0,8.0)}
```

, whereas:

- With Arrow

Works with fallback:

```py
>>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
>>> df.toPandas()
/.../python/pyspark/sql/pandas/conversion.py:111: UserWarning: toPandas 
attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] ExamplePointUDT() is not 

[spark] branch master updated: [SPARK-43671][PS][FOLLOWUP] Refine `CategoricalOps` functions

2023-05-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 001da5d003c [SPARK-43671][PS][FOLLOWUP] Refine `CategoricalOps` 
functions
001da5d003c is described below

commit 001da5d003caef3cda9978d35967ade55837e0bc
Author: itholic 
AuthorDate: Sun May 28 08:44:16 2023 +0800

[SPARK-43671][PS][FOLLOWUP] Refine `CategoricalOps` functions

### What changes were proposed in this pull request?

This PR is a follow-up for SPARK-43671, refining the functions to use the 
`pyspark_column_op` util and clean up the code.

### Why are the changes needed?

To avoid checking `is_remote` in too many places, which eases future maintenance.
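
For illustration, a rough sketch of the dispatch idea behind such a helper 
(the name `column_dunder` is hypothetical, and this is not the actual 
`pyspark_column_op` implementation in `pyspark/sql/utils.py`):

```py
from pyspark.sql.utils import is_remote


def column_dunder(name: str):
    # Resolve the Column class once, depending on whether we run on Spark
    # Connect, then return the requested operator (e.g. "__eq__", "__lt__").
    if is_remote():
        from pyspark.sql.connect.column import Column
    else:
        from pyspark.sql.column import Column
    return getattr(Column, name)
```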

### Does this PR introduce _any_ user-facing change?

No, it is just code cleanup.

### How was this patch tested?

The existing CI should pass

Closes #41326 from itholic/categorical_followup.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 .../pandas/data_type_ops/categorical_ops.py| 69 +-
 1 file changed, 14 insertions(+), 55 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/categorical_ops.py 
b/python/pyspark/pandas/data_type_ops/categorical_ops.py
index 9f14a4b1ee7..66e181a6079 100644
--- a/python/pyspark/pandas/data_type_ops/categorical_ops.py
+++ b/python/pyspark/pandas/data_type_ops/categorical_ops.py
@@ -16,19 +16,18 @@
 #
 
 from itertools import chain
-from typing import cast, Any, Callable, Union
+from typing import cast, Any, Union
 
 import pandas as pd
 import numpy as np
 from pandas.api.types import is_list_like, CategoricalDtype  # type: 
ignore[attr-defined]
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
-from pyspark.pandas.base import column_op, IndexOpsMixin
+from pyspark.pandas.base import IndexOpsMixin
 from pyspark.pandas.data_type_ops.base import _sanitize_list_like, DataTypeOps
 from pyspark.pandas.typedef import pandas_on_spark_type
 from pyspark.sql import functions as F
-from pyspark.sql.column import Column as PySparkColumn
-from pyspark.sql.utils import is_remote
+from pyspark.sql.utils import pyspark_column_op
 
 
 class CategoricalOps(DataTypeOps):
@@ -66,73 +65,33 @@ class CategoricalOps(DataTypeOps):
 
 def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-if is_remote():
-from pyspark.sql.connect.column import Column as ConnectColumn
-
-Column = ConnectColumn
-else:
-Column = PySparkColumn  # type: ignore[assignment]
-return _compare(
-left, right, Column.__eq__, is_equality_comparison=True  # type: 
ignore[arg-type]
-)
+return _compare(left, right, "__eq__", is_equality_comparison=True)
 
 def ne(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-if is_remote():
-from pyspark.sql.connect.column import Column as ConnectColumn
-
-Column = ConnectColumn
-else:
-Column = PySparkColumn  # type: ignore[assignment]
-return _compare(
-left, right, Column.__ne__, is_equality_comparison=True  # type: 
ignore[arg-type]
-)
+return _compare(left, right, "__ne__", is_equality_comparison=True)
 
 def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-if is_remote():
-from pyspark.sql.connect.column import Column as ConnectColumn
-
-Column = ConnectColumn
-else:
-Column = PySparkColumn  # type: ignore[assignment]
-return _compare(left, right, Column.__lt__)  # type: ignore[arg-type]
+return _compare(left, right, "__lt__")
 
 def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-if is_remote():
-from pyspark.sql.connect.column import Column as ConnectColumn
-
-Column = ConnectColumn
-else:
-Column = PySparkColumn  # type: ignore[assignment]
-return _compare(left, right, Column.__le__)  # type: ignore[arg-type]
+return _compare(left, right, "__le__")
 
 def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-if is_remote():
-from pyspark.sql.connect.column import Column as ConnectColumn
-
-Column = ConnectColumn
-else:
-Column = PySparkColumn  # type: ignore[assignment]
-return _compare(left, right, Column.__gt__)  # type: ignore[arg-type]
+return _compare(left, right, "__gt__")
 
 def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-if is_remote():
-from 

[spark] branch master updated: [SPARK-43692][SPARK-43693][SPARK-43694][SPARK-43695][PS] Fix `StringOps` for Spark Connect

2023-05-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f0a73e457d [SPARK-43692][SPARK-43693][SPARK-43694][SPARK-43695][PS] 
Fix `StringOps` for Spark Connect
6f0a73e457d is described below

commit 6f0a73e457dd3c49a4adce996d7201010cdd2651
Author: itholic 
AuthorDate: Sun May 28 08:41:44 2023 +0800

[SPARK-43692][SPARK-43693][SPARK-43694][SPARK-43695][PS] Fix `StringOps` 
for Spark Connect

### What changes were proposed in this pull request?

This PR proposes to fix the `StringOps` tests for the pandas API on Spark with 
Spark Connect.

This covers SPARK-43692, SPARK-43693, SPARK-43694 and SPARK-43695 at once, 
because they all involve similar modifications in a single file.

### Why are the changes needed?

To support all features for pandas API on Spark with Spark Connect.

### Does this PR introduce _any_ user-facing change?

Yes, `StringOps.lt`, `StringOps.le`, `StringOps.ge` and `StringOps.gt` now 
work as expected on Spark Connect.
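
For illustration, a minimal sketch of the user-facing behavior this unblocks 
(assuming a running Spark Connect session); the comparisons dispatch to the 
`StringOps` methods touched in the diff below:

```py
import pyspark.pandas as ps

psser = ps.Series(["apple", "banana", "cherry"])
print((psser > "banana").to_pandas())   # dispatches to StringOps.gt
print((psser <= "banana").to_pandas())  # dispatches to StringOps.le
```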

### How was this patch tested?

Uncommented the UTs and tested manually.

Closes #41308 from itholic/SPARK-43692-5.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/data_type_ops/string_ops.py  | 18 +-
 .../connect/data_type_ops/test_parity_string_ops.py| 16 
 python/pyspark/sql/utils.py| 17 +
 3 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/string_ops.py 
b/python/pyspark/pandas/data_type_ops/string_ops.py
index 0b9eb87a163..e5818cb4635 100644
--- a/python/pyspark/pandas/data_type_ops/string_ops.py
+++ b/python/pyspark/pandas/data_type_ops/string_ops.py
@@ -22,6 +22,7 @@ from pandas.api.types import CategoricalDtype
 
 from pyspark.sql import functions as F
 from pyspark.sql.types import IntegralType, StringType
+from pyspark.sql.utils import pyspark_column_op
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
 from pyspark.pandas.base import column_op, IndexOpsMixin
@@ -34,7 +35,6 @@ from pyspark.pandas.data_type_ops.base import (
 )
 from pyspark.pandas.spark import functions as SF
 from pyspark.pandas.typedef import extension_dtypes, pandas_on_spark_type
-from pyspark.sql import Column
 from pyspark.sql.types import BooleanType
 
 
@@ -104,28 +104,20 @@ class StringOps(DataTypeOps):
 raise TypeError("Multiplication can not be applied to given 
types.")
 
 def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-from pyspark.pandas.base import column_op
-
 _sanitize_list_like(right)
-return column_op(Column.__lt__)(left, right)
+return pyspark_column_op("__lt__")(left, right)
 
 def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-from pyspark.pandas.base import column_op
-
 _sanitize_list_like(right)
-return column_op(Column.__le__)(left, right)
+return pyspark_column_op("__le__")(left, right)
 
 def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-from pyspark.pandas.base import column_op
-
 _sanitize_list_like(right)
-return column_op(Column.__ge__)(left, right)
+return pyspark_column_op("__ge__")(left, right)
 
 def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
-from pyspark.pandas.base import column_op
-
 _sanitize_list_like(right)
-return column_op(Column.__gt__)(left, right)
+return pyspark_column_op("__gt__")(left, right)
 
 def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) 
-> IndexOpsLike:
 dtype, spark_type = pandas_on_spark_type(dtype)
diff --git 
a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py 
b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py
index 9abfe1d1e09..2d81db1c701 100644
--- 
a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py
+++ 
b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_string_ops.py
@@ -34,22 +34,6 @@ class StringOpsParityTests(
 def test_astype(self):
 super().test_astype()
 
-@unittest.skip("TODO(SPARK-43692): Fix StringOps.ge to work with Spark 
Connect.")
-def test_ge(self):
-super().test_ge()
-
-@unittest.skip("TODO(SPARK-43693): Fix StringOps.gt to work with Spark 
Connect.")
-def test_gt(self):
-super().test_gt()
-
-@unittest.skip("TODO(SPARK-43694): Fix StringOps.le to work with Spark 
Connect.")
-def test_le(self):
-super().test_le()
-
-@unittest.skip("TODO(SPARK-43695): Fix StringOps.lt to work with Spark 
Connect.")
-def test_lt(self):
-

[spark] branch master updated: [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

2023-05-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dc186c5e6b6 [SPARK-43773][CONNECT][PYTHON] Implement 
'levenshtein(str1, str2[, threshold])' functions in python client
dc186c5e6b6 is described below

commit dc186c5e6b6bdb63345081ee9f70b8c102792cdd
Author: panbingkun 
AuthorDate: Sun May 28 08:38:32 2023 +0800

[SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, 
threshold])' functions in python client

### What changes were proposed in this pull request?
The PR aims to implement the 'levenshtein(str1, str2[, threshold])' function 
in the Python client.

### Why are the changes needed?
A max distance argument was added to the levenshtein() function and has 
already been implemented on the Scala side, so we need to align it in `pyspark`.
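
For illustration, a minimal usage sketch of the new optional argument 
(assuming an active `spark` session), mirroring the doctest added in the diff 
below:

```py
from pyspark.sql import functions as F

df = spark.createDataFrame([("kitten", "sitting")], ["l", "r"])
df.select(F.levenshtein("l", "r").alias("d")).show()     # d = 3
df.select(F.levenshtein("l", "r", 2).alias("d")).show()  # d = -1, distance exceeds the threshold
```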

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manual testing
python/run-tests --testnames 'python.pyspark.sql.tests.test_functions 
FunctionsTests.test_levenshtein_function'
- Pass GA

Closes #41296 from panbingkun/SPARK-43773.

Lead-authored-by: panbingkun 
Co-authored-by: panbingkun <84731...@qq.com>
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/functions.py   |  9 +++--
 python/pyspark/sql/functions.py   | 19 +--
 .../sql/tests/connect/test_connect_function.py|  5 +
 python/pyspark/sql/tests/test_functions.py|  7 +++
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/connect/functions.py 
b/python/pyspark/sql/connect/functions.py
index b7d7bc937cf..d3a05d6a1c6 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -1878,8 +1878,13 @@ def substring_index(str: "ColumnOrName", delim: str, 
count: int) -> Column:
 substring_index.__doc__ = pysparkfuncs.substring_index.__doc__
 
 
-def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
-return _invoke_function_over_columns("levenshtein", left, right)
+def levenshtein(
+left: "ColumnOrName", right: "ColumnOrName", threshold: Optional[int] = 
None
+) -> Column:
+if threshold is None:
+return _invoke_function_over_columns("levenshtein", left, right)
+else:
+return _invoke_function("levenshtein", _to_col(left), _to_col(right), 
lit(threshold))
 
 
 levenshtein.__doc__ = pysparkfuncs.levenshtein.__doc__
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index e9b71f7d617..fe35f12c402 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -6594,7 +6594,9 @@ def substring_index(str: "ColumnOrName", delim: str, 
count: int) -> Column:
 
 
 @try_remote_functions
-def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
+def levenshtein(
+left: "ColumnOrName", right: "ColumnOrName", threshold: Optional[int] = 
None
+) -> Column:
 """Computes the Levenshtein distance of the two given strings.
 
 .. versionadded:: 1.5.0
@@ -6608,6 +6610,12 @@ def levenshtein(left: "ColumnOrName", right: 
"ColumnOrName") -> Column:
 first column value.
 right : :class:`~pyspark.sql.Column` or str
 second column value.
+threshold : int, optional
+if set when the levenshtein distance of the two given strings
+less than or equal to a given threshold then return result distance, 
or -1
+
+.. versionchanged: 3.5.0
+Added ``threshold`` argument.
 
 Returns
 ---
@@ -6619,8 +6627,15 @@ def levenshtein(left: "ColumnOrName", right: 
"ColumnOrName") -> Column:
 >>> df0 = spark.createDataFrame([('kitten', 'sitting',)], ['l', 'r'])
 >>> df0.select(levenshtein('l', 'r').alias('d')).collect()
 [Row(d=3)]
+>>> df0.select(levenshtein('l', 'r', 2).alias('d')).collect()
+[Row(d=-1)]
 """
-return _invoke_function_over_columns("levenshtein", left, right)
+if threshold is None:
+return _invoke_function_over_columns("levenshtein", left, right)
+else:
+return _invoke_function(
+"levenshtein", _to_java_column(left), _to_java_column(right), 
threshold
+)
 
 
 @try_remote_functions
diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py 
b/python/pyspark/sql/tests/connect/test_connect_function.py
index e274635d3c6..3e3b4dd5b16 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -1924,6 +1924,11 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
PandasOnSparkTestUtils, S
 cdf.select(CF.levenshtein(cdf.b, cdf.c)).toPandas(),
 sdf.select(SF.levenshtein(sdf.b, 

[spark] branch master updated: [SPARK-43830][BUILD] Update scalatest and scalatestplus related dependencies to newest version

2023-05-27 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4c800478d5a [SPARK-43830][BUILD] Update scalatest and scalatestplus 
related dependencies to newest version
4c800478d5a is described below

commit 4c800478d5a6c76dca3ed1fc945de71182fd65e3
Author: panbingkun 
AuthorDate: Sat May 27 18:12:52 2023 -0500

[SPARK-43830][BUILD] Update scalatest and scalatestplus related 
dependencies to newest version

### What changes were proposed in this pull request?
The PR aims to update the scalatest- and scalatestplus-related dependencies to 
their newest versions, upgrading the `scalatest`-related test dependencies to 
3.2.16:
 - scalatest: upgrade scalatest from 3.2.15 to 3.2.16

 - mockito
   - mockito-core: upgrade from 4.6.1 to 4.11.0
   - mockito-inline: upgrade from 4.6.1 to 4.11.0

 - selenium-java: upgrade from 4.7.2 to 4.9.1

 - htmlunit-driver: upgrade from 4.7.2 to 4.9.1

 - htmlunit: upgrade from 2.67.0 to 2.70.0

 - scalatestplus
   - scalacheck-1-17: upgrade from 3.2.15.0 to 3.2.16.0
   - mockito: upgrade from `mockito-4-6` 3.2.15.0 to `mockito-4-11` 3.2.16.0
   - selenium: upgrade from `selenium-4-7` 3.2.15.0 to `selenium-4-9` 
3.2.16.0

### Why are the changes needed?
The relevant release notes are as follows:
 - scalatest:
 - 
https://github.com/scalatest/scalatest/releases/tag/release-3.2.16

 - [mockito](https://github.com/mockito/mockito)
   - https://github.com/mockito/mockito/releases/tag/v4.11.0
   - https://github.com/mockito/mockito/releases/tag/v4.10.0
   - https://github.com/mockito/mockito/releases/tag/v4.9.0
   - https://github.com/mockito/mockito/releases/tag/v4.8.1
   - https://github.com/mockito/mockito/releases/tag/v4.8.0
   - https://github.com/mockito/mockito/releases/tag/v4.7.0

 - [selenium-java](https://github.com/SeleniumHQ/selenium)
   - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.9.1
   - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.9.0
   - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.3-java
   - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.2-java
   - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.1
   - https://github.com/SeleniumHQ/selenium/releases/tag/selenium-4.8.0

 - [htmlunit-driver](https://github.com/SeleniumHQ/htmlunit-driver)
   - 
https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.9.1
   - 
https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.9.0
   - 
https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.8.3
   - 
https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/htmlunit-driver-4.8.1.1
   - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/4.8.1
   - https://github.com/SeleniumHQ/htmlunit-driver/releases/tag/4.8.0

 - [htmlunit](https://github.com/HtmlUnit/htmlunit)
   - https://github.com/HtmlUnit/htmlunit/releases/tag/2.70.0
   - Why this version: because the 4.9.1 version of Selenium relies on it. 
https://github.com/SeleniumHQ/selenium/blob/selenium-4.9.1/java/maven_deps.bzl#L83

 - 
[org.scalatestplus:scalacheck-1-17](https://github.com/scalatest/scalatestplus-scalacheck)
   - 
https://github.com/scalatest/scalatestplus-scalacheck/releases/tag/release-3.2.16.0-for-scalacheck-1.17

 - 
[org.scalatestplus:mockito-4-11](https://github.com/scalatest/scalatestplus-mockito)
   - 
https://github.com/scalatest/scalatestplus-mockito/releases/tag/release-3.2.16.0-for-mockito-4.11

 - 
[org.scalatestplus:selenium-4-9](https://github.com/scalatest/scalatestplus-selenium)
   - 
https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.16.0-for-selenium-4.9

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GitHub Actions
- Manual test:
   - ChromeUISeleniumSuite
   - RocksDBBackendChromeUIHistoryServerSuite

```
build/sbt -Dguava.version=31.1-jre 
-Dspark.test.webdriver.chrome.driver=/path/to/chromedriver 
-Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly 
org.apache.spark.ui.ChromeUISeleniumSuite"

build/sbt -Dguava.version=31.1-jre 
-Dspark.test.webdriver.chrome.driver=/path/to/chromedriver 
-Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly 
org.apache.spark.deploy.history.RocksDBBackendChromeUIHistoryServerSuite"
```
Screenshot: https://github.com/apache/spark/assets/15246973/73349ffb-4198-4371-a741-411712d14712

Closes #41341 from 

[spark] branch master updated (7ce4dc64273 -> d052a454fda)

2023-05-27 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7ce4dc64273 [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle 
instead of `cloudpickle` in torch distributor
 add d052a454fda [SPARK-43824][SPARK-43825] [SQL] Assign names to the error 
class _LEGACY_ERROR_TEMP_128[1-2]

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json | 20 ++--
 .../spark/sql/errors/QueryCompilationErrors.scala| 14 +++---
 .../apache/spark/sql/execution/command/views.scala   |  3 +--
 .../spark/sql/execution/SQLViewTestSuite.scala   | 16 
 4 files changed, 26 insertions(+), 27 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle instead of `cloudpickle` in torch distributor

2023-05-27 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7ce4dc64273 [SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle 
instead of `cloudpickle` in torch distributor
7ce4dc64273 is described below

commit 7ce4dc642736d78e79dff8e0b671cfd1b5d44166
Author: Weichen Xu 
AuthorDate: Sat May 27 09:06:39 2023 -0700

[SPARK-41775][PYTHON][FOLLOWUP] Use pyspark.cloudpickle instead of 
`cloudpickle` in torch distributor

### What changes were proposed in this pull request?

Use pyspark.cloudpickle instead of `cloudpickle` in torch distributor

### Why are the changes needed?

Make the serialization and deserialization code consistent.
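
For illustration, a minimal sketch (not from this patch) of the round trip the 
distributor relies on; the vendored `pyspark.cloudpickle` is always importable 
wherever PySpark is, unlike a separately installed `cloudpickle` package:

```py
from pyspark import cloudpickle

payload = cloudpickle.dumps(lambda x: x + 1)  # driver-side serialization
restored = cloudpickle.loads(payload)         # what the generated training script does
assert restored(1) == 2
```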

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #41337 from WeichenXu123/fix-torch-distributor-import-cloudpickle.

Authored-by: Weichen Xu 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/ml/torch/distributor.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/ml/torch/distributor.py 
b/python/pyspark/ml/torch/distributor.py
index 8a41bdcc886..ad8b4d8cc25 100644
--- a/python/pyspark/ml/torch/distributor.py
+++ b/python/pyspark/ml/torch/distributor.py
@@ -814,7 +814,7 @@ class TorchDistributor(Distributor):
 ) -> str:
 code = textwrap.dedent(
 f"""
-import cloudpickle
+from pyspark import cloudpickle
 import os
 
 if __name__ == "__main__":


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43530][PROTOBUF] Read descriptor file only once

2023-05-27 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 64642b48351 [SPARK-43530][PROTOBUF] Read descriptor file only once
64642b48351 is described below

commit 64642b48351c0c4ef8f40ce7902b85b6f953bd8f
Author: Raghu Angadi 
AuthorDate: Sat May 27 23:18:54 2023 +0800

[SPARK-43530][PROTOBUF] Read descriptor file only once

### What changes were proposed in this pull request?

Protobuf functions (`from_protobuf()` & `to_protobuf()`) take the file path of 
a descriptor file and use it to construct Protobuf descriptors.
The main problem with this is that the file is read many times (e.g. on each 
executor). This is unnecessary and error prone: the file contents may, for 
example, be updated a couple of days after a streaming query starts, which 
could lead to various errors.

**The fix**: Use the byte content (the serialized `FileDescriptorSet` proto). 
We read the content from the file once and carry the byte buffer.

This also adds a new API where we can pass the byte buffer directly. This is 
useful when users fetch the content themselves and pass it to the Protobuf 
functions, e.g. they could fetch it from S3 or extract it from Python Protobuf 
classes.
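
For illustration, a hedged sketch of the byte-based usage from PySpark, 
assuming the Python API mirrors this change; the keyword name 
`binaryDescriptorSet`, the paths, and the message name are assumptions, not 
taken from this patch:

```py
from pyspark.sql.protobuf.functions import from_protobuf

# Read the serialized FileDescriptorSet once, on the driver...
with open("/path/to/descriptor.desc", "rb") as f:
    desc_bytes = f.read()

# ...and pass the bytes instead of a file path that every executor would re-read.
# The keyword name below is an assumption; check your Spark version's signature.
parsed = df.select(
    from_protobuf("value", "MyMessage", binaryDescriptorSet=desc_bytes).alias("event")
)
```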

**Note to reviewers**: This includes a lot of updates to test files, mainly 
because of the interface change to pass the buffer. I have left a few PR 
comments to help with the review.

### Why are the changes needed?
Described above.

### Does this PR introduce _any_ user-facing change?
Yes, this adds two new versions of the `from_protobuf()` and `to_protobuf()` 
APIs that take Protobuf bytes rather than a file path.

### How was this patch tested?
 - Unit tests

Closes #41192 from rangadi/proto-file-buffer.

Authored-by: Raghu Angadi 
Signed-off-by: yangjie01 
---
 .../org/apache/spark/sql/protobuf/functions.scala  | 135 +++--
 .../org/apache/spark/sql/FunctionTestSuite.scala   |  12 +-
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   7 +-
 ..._protobuf_messageClassName_descFilePath.explain |   2 +-
 ...f_messageClassName_descFilePath_options.explain |   2 +-
 ..._protobuf_messageClassName_descFilePath.explain |   2 +-
 ...f_messageClassName_descFilePath_options.explain |   2 +-
 ...rom_protobuf_messageClassName_descFilePath.json |   2 +-
 ...rotobuf_messageClassName_descFilePath.proto.bin | Bin 156 -> 361 bytes
 ...obuf_messageClassName_descFilePath_options.json |   2 +-
 ...messageClassName_descFilePath_options.proto.bin | Bin 206 -> 409 bytes
 .../to_protobuf_messageClassName_descFilePath.json |   2 +-
 ...rotobuf_messageClassName_descFilePath.proto.bin | Bin 154 -> 359 bytes
 ...obuf_messageClassName_descFilePath_options.json |   2 +-
 ...messageClassName_descFilePath_options.proto.bin | Bin 204 -> 407 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  33 ++---
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |   4 +-
 .../sql/protobuf/CatalystDataToProtobuf.scala  |   7 +-
 .../sql/protobuf/ProtobufDataToCatalyst.scala  |  14 +--
 .../org/apache/spark/sql/protobuf/functions.scala  | 114 +++--
 .../spark/sql/protobuf/utils/ProtobufUtils.scala   |  67 +-
 .../ProtobufCatalystDataConversionSuite.scala  |  29 ++---
 .../sql/protobuf/ProtobufFunctionsSuite.scala  |  65 +-
 .../spark/sql/protobuf/ProtobufSerdeSuite.scala|  30 +++--
 core/src/main/resources/error/error-classes.json   |   7 +-
 docs/sql-error-conditions.md   |   6 -
 .../spark/sql/errors/QueryCompilationErrors.scala  |  11 +-
 27 files changed, 393 insertions(+), 164 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
index c42f8417155..57ce013065e 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/protobuf/functions.scala
@@ -16,10 +16,19 @@
  */
 package org.apache.spark.sql.protobuf
 
+import java.io.File
+import java.io.FileNotFoundException
+import java.nio.file.NoSuchFileException
+import java.util.Collections
+
 import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
+
+import org.apache.commons.io.FileUtils
 
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.sql.Column
+import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.functions.{fnWithOptions, lit}
 
 // scalastyle:off: object.name
@@ -35,7 +44,8 @@ object functions {
* @param messageName
*   the protobuf message name to look for in descriptor file.
* @param descFilePath
-   *