(spark) branch master updated: [SPARK-49784][PYTHON][TESTS] Add more test for `spark.sql`

2024-09-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 913a0f7813c5 [SPARK-49784][PYTHON][TESTS] Add more test for `spark.sql`
913a0f7813c5 is described below

commit 913a0f7813c5b2d2bf105160bf8e55e08b34513b
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 26 15:15:37 2024 +0800

[SPARK-49784][PYTHON][TESTS] Add more test for `spark.sql`

### What changes were proposed in this pull request?
Add more tests for `spark.sql`.

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no
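
For context, the new `test_sql.py` is truncated in the diff below; a minimal sketch of the kind of `spark.sql` coverage such a suite typically adds (a plain query plus a parameterized one) might look like this. This is illustrative code, not taken from the patch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain query.
assert spark.sql("SELECT 1 AS v").head().v == 1

# Parameterized query: the named marker `:bound` is bound from the `args` dict.
rows = spark.sql("SELECT * FROM range(10) WHERE id > :bound", args={"bound": 7}).collect()
assert [r.id for r in rows] == [8, 9]
```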

Closes #48246 from zhengruifeng/py_sql_test.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 dev/sparktestsupport/modules.py|   2 +
 .../pyspark/sql/tests/connect/test_parity_sql.py   |  37 +
 python/pyspark/sql/tests/test_sql.py   | 185 +
 3 files changed, 224 insertions(+)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index eda6b063350e..d2c000b702a6 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -520,6 +520,7 @@ pyspark_sql = Module(
 "pyspark.sql.tests.test_errors",
 "pyspark.sql.tests.test_functions",
 "pyspark.sql.tests.test_group",
+"pyspark.sql.tests.test_sql",
 "pyspark.sql.tests.pandas.test_pandas_cogrouped_map",
 "pyspark.sql.tests.pandas.test_pandas_grouped_map",
 "pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state",
@@ -1032,6 +1033,7 @@ pyspark_connect = Module(
 "pyspark.sql.tests.connect.test_parity_serde",
 "pyspark.sql.tests.connect.test_parity_functions",
 "pyspark.sql.tests.connect.test_parity_group",
+"pyspark.sql.tests.connect.test_parity_sql",
 "pyspark.sql.tests.connect.test_parity_dataframe",
 "pyspark.sql.tests.connect.test_parity_collection",
 "pyspark.sql.tests.connect.test_parity_creation",
diff --git a/python/pyspark/sql/tests/connect/test_parity_sql.py 
b/python/pyspark/sql/tests/connect/test_parity_sql.py
new file mode 100644
index ..4c6b11c60cbe
--- /dev/null
+++ b/python/pyspark/sql/tests/connect/test_parity_sql.py
@@ -0,0 +1,37 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import unittest
+
+from pyspark.sql.tests.test_sql import SQLTestsMixin
+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class SQLParityTests(SQLTestsMixin, ReusedConnectTestCase):
+pass
+
+
+if __name__ == "__main__":
+from pyspark.sql.tests.connect.test_parity_sql import *  # noqa: F401
+
+try:
+import xmlrunner  # type: ignore[import]
+
+testRunner = xmlrunner.XMLTestRunner(output="target/test-reports", verbosity=2)
+except ImportError:
+testRunner = None
+unittest.main(testRunner=testRunner, verbosity=2)
diff --git a/python/pyspark/sql/tests/test_sql.py 
b/python/pyspark/sql/tests/test_sql.py
new file mode 100644
index ..bf50bbc11ac3
--- /dev/null
+++ b/python/pyspark/sql/tests/test_sql.py
@@ -0,0 +1,185 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,

(spark) branch master updated: [SPARK-49609][PYTHON][FOLLOWUP] Correct the typehint for `filter` and `where`

2024-09-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0ccf53ae6faa [SPARK-49609][PYTHON][FOLLOWUP] Correct the typehint for `filter` and `where`
0ccf53ae6faa is described below

commit 0ccf53ae6faabc4420317d379da77a299794c84c
Author: Ruifeng Zheng 
AuthorDate: Wed Sep 25 19:21:36 2024 +0800

[SPARK-49609][PYTHON][FOLLOWUP] Correct the typehint for `filter` and `where`

### What changes were proposed in this pull request?
Correct the typehint for `filter` and `where`

### Why are the changes needed?
The input `str` should not be treated as a column name.

### Does this PR introduce _any_ user-facing change?
doc change

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no
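
To illustrate why the hint changes (an illustrative snippet, not part of the patch): a `str` passed to `filter`/`where` is parsed as a SQL expression rather than resolved as a column name, so `Union[Column, str]` describes the accepted inputs more accurately than `ColumnOrName`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (5, "b")], ["age", "name"])

# The string is parsed as a SQL expression, not looked up as a column name.
df.where("age > 3").show()

# Equivalent Column-based call.
df.where(sf.col("age") > 3).show()
```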

Closes #48244 from zhengruifeng/py_filter_where.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/classic/dataframe.py | 2 +-
 python/pyspark/sql/connect/dataframe.py | 2 +-
 python/pyspark/sql/dataframe.py | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/classic/dataframe.py 
b/python/pyspark/sql/classic/dataframe.py
index 23484fcf0051..0dd66a9d8654 100644
--- a/python/pyspark/sql/classic/dataframe.py
+++ b/python/pyspark/sql/classic/dataframe.py
@@ -1787,7 +1787,7 @@ class DataFrame(ParentDataFrame, PandasMapOpsMixin, 
PandasConversionMixin):
 def inputFiles(self) -> List[str]:
 return list(self._jdf.inputFiles())
 
-def where(self, condition: "ColumnOrName") -> ParentDataFrame:
+def where(self, condition: Union[Column, str]) -> ParentDataFrame:
 return self.filter(condition)
 
 # Two aliases below were added for pandas compatibility many years ago.
diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index cb37af8868aa..146cfe11bc50 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1260,7 +1260,7 @@ class DataFrame(ParentDataFrame):
 res._cached_schema = self._merge_cached_schema(other)
 return res
 
-def where(self, condition: "ColumnOrName") -> ParentDataFrame:
+def where(self, condition: Union[Column, str]) -> ParentDataFrame:
 if not isinstance(condition, (str, Column)):
 raise PySparkTypeError(
 errorClass="NOT_COLUMN_OR_STR",
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 2179a844b1e5..142034583dbd 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3351,7 +3351,7 @@ class DataFrame:
 ...
 
 @dispatch_df_method
-def filter(self, condition: "ColumnOrName") -> "DataFrame":
+def filter(self, condition: Union[Column, str]) -> "DataFrame":
 """Filters rows using the given condition.
 
 :func:`where` is an alias for :func:`filter`.
@@ -5902,7 +5902,7 @@ class DataFrame:
 ...
 
 @dispatch_df_method
-def where(self, condition: "ColumnOrName") -> "DataFrame":
+def where(self, condition: Union[Column, str]) -> "DataFrame":
 """
 :func:`where` is an alias for :func:`filter`.
 





(spark) branch master updated: [SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions

2024-09-24 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e2d2ab510632 [SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions
e2d2ab510632 is described below

commit e2d2ab510632cc1948cb6b4500e9da49036a96bd
Author: Daniel Tenedorio 
AuthorDate: Wed Sep 25 10:57:44 2024 +0800

[SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions

### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/48004 we added new SQL functions `randstr` and `uniform`. This PR adds DataFrame API support for them.

For example, in Scala:

```
sql("create table t(col int not null) using csv")
sql("insert into t values (0)")
val df = sql("select col from t")
df.select(randstr(lit(5), lit(0)).alias("x")).select(length(col("x")))
> 5

df.select(uniform(lit(10), lit(20), lit(0)).alias("x")).selectExpr("x > 5")
> true
```
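
A rough PySpark sketch of the same calls, assuming the Python signatures mirror the ones added in the diff below (`randstr(length, seed=None)`, `uniform(min, max, seed=None)`):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# randstr(length, seed): a random string of the given length per row;
# a fixed seed makes the output reproducible.
df.select(sf.length(sf.randstr(5, 0)).alias("len")).show()

# uniform(min, max, seed): a random value in the given range per row.
df.select(sf.uniform(10, 20, 0).alias("x")).selectExpr("x > 5").show()
```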

### Why are the changes needed?

This improves DataFrame parity with the SQL API.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48143 from dtenedor/dataframes-uniform-randstr.

Authored-by: Daniel Tenedorio 
Signed-off-by: Ruifeng Zheng 
---
 .../source/reference/pyspark.sql/functions.rst |   2 +
 python/pyspark/sql/connect/functions/builtin.py|  28 ++
 python/pyspark/sql/functions/builtin.py|  92 ++
 python/pyspark/sql/tests/test_functions.py |  21 -
 .../scala/org/apache/spark/sql/functions.scala |  45 +
 .../catalyst/expressions/randomExpressions.scala   |  49 --
 .../apache/spark/sql/DataFrameFunctionsSuite.scala | 104 +
 7 files changed, 331 insertions(+), 10 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql/functions.rst 
b/python/docs/source/reference/pyspark.sql/functions.rst
index 4910a5b59273..6248e7133165 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -148,6 +148,7 @@ Mathematical Functions
 try_multiply
 try_subtract
 unhex
+uniform
 width_bucket
 
 
@@ -189,6 +190,7 @@ String Functions
 overlay
 position
 printf
+randstr
 regexp_count
 regexp_extract
 regexp_extract_all
diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 6953230f5b42..27b12fff3c0a 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -1007,6 +1007,22 @@ def unhex(col: "ColumnOrName") -> Column:
 unhex.__doc__ = pysparkfuncs.unhex.__doc__
 
 
+def uniform(
+min: Union[Column, int, float],
+max: Union[Column, int, float],
+seed: Optional[Union[Column, int]] = None,
+) -> Column:
+if seed is None:
+return _invoke_function_over_columns(
+"uniform", lit(min), lit(max), lit(random.randint(0, sys.maxsize))
+)
+else:
+return _invoke_function_over_columns("uniform", lit(min), lit(max), 
lit(seed))
+
+
+uniform.__doc__ = pysparkfuncs.uniform.__doc__
+
+
 def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
 warnings.warn("Deprecated in 3.4, use approx_count_distinct instead.", 
FutureWarning)
 return approx_count_distinct(col, rsd)
@@ -2581,6 +2597,18 @@ def regexp_like(str: "ColumnOrName", regexp: 
"ColumnOrName") -> Column:
 regexp_like.__doc__ = pysparkfuncs.regexp_like.__doc__
 
 
+def randstr(length: Union[Column, int], seed: Optional[Union[Column, int]] = 
None) -> Column:
+if seed is None:
+return _invoke_function_over_columns(
+"randstr", lit(length), lit(random.randint(0, sys.maxsize))
+)
+else:
+return _invoke_function_over_columns("randstr", lit(length), lit(seed))
+
+
+randstr.__doc__ = pysparkfuncs.randstr.__doc__
+
+
 def regexp_count(str: "ColumnOrName", regexp: "ColumnOrName") -> Column:
 return _invoke_function_over_columns("regexp_count", str, regexp)
 
diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 09a286fe7c94..4ca39562cb20 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ 

(spark) branch master updated: [SPARK-49734][PYTHON] Add `seed` argument for function `shuffle`

2024-09-22 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0eeb61fb64e0 [SPARK-49734][PYTHON] Add `seed` argument for function `shuffle`
0eeb61fb64e0 is described below

commit 0eeb61fb64e0c499610c7b9a84f9e41e923251e8
Author: Ruifeng Zheng 
AuthorDate: Mon Sep 23 10:46:08 2024 +0800

[SPARK-49734][PYTHON] Add `seed` argument for function `shuffle`

### What changes were proposed in this pull request?
1. Add a `seed` argument for function `shuffle`.
2. Rewrite and enable the doctest by specifying the seed and controlling the partitioning.

### Why are the changes needed?
Feature parity: `seed` is already supported on the SQL side.

### Does this PR introduce _any_ user-facing change?
yes, new argument

### How was this patch tested?
updated doctest

### Was this patch authored or co-authored using generative AI tooling?
no
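
A short usage sketch of the new argument, assuming the signature added in the diff below (`shuffle(col, seed=None)`):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT ARRAY(1, 20, 3, 5) AS data")

# Without a seed the permutation is random; with a seed it is reproducible
# for a fixed partitioning.
df.select("*", sf.shuffle("data")).show()
df.select("*", sf.shuffle(df.data, seed=123)).show()
```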

Closes #48184 from zhengruifeng/py_func_shuffle.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/functions/builtin.py| 10 +---
 python/pyspark/sql/functions/builtin.py| 69 --
 .../scala/org/apache/spark/sql/functions.scala | 13 +++-
 3 files changed, 53 insertions(+), 39 deletions(-)

diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 7fed175cbc8e..2a39bc6bfddd 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -65,7 +65,6 @@ from pyspark.sql import functions as pysparkfuncs
 from pyspark.sql.types import (
 _from_numpy_type,
 DataType,
-LongType,
 StructType,
 ArrayType,
 StringType,
@@ -2206,12 +2205,9 @@ def schema_of_xml(xml: Union[str, Column], options: 
Optional[Mapping[str, str]]
 schema_of_xml.__doc__ = pysparkfuncs.schema_of_xml.__doc__
 
 
-def shuffle(col: "ColumnOrName") -> Column:
-return _invoke_function(
-"shuffle",
-_to_col(col),
-LiteralExpression(random.randint(0, sys.maxsize), LongType()),
-)
+def shuffle(col: "ColumnOrName", seed: Optional[Union[Column, int]] = None) -> 
Column:
+_seed = lit(random.randint(0, sys.maxsize)) if seed is None else lit(seed)
+return _invoke_function("shuffle", _to_col(col), _seed)
 
 
 shuffle.__doc__ = pysparkfuncs.shuffle.__doc__
diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 5f8d1c21a24f..2d5dbb594605 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -17723,7 +17723,7 @@ def array_sort(
 
 
 @_try_remote_functions
-def shuffle(col: "ColumnOrName") -> Column:
+def shuffle(col: "ColumnOrName", seed: Optional[Union[Column, int]] = None) -> 
Column:
 """
 Array function: Generates a random permutation of the given array.
 
@@ -17736,6 +17736,10 @@ def shuffle(col: "ColumnOrName") -> Column:
 --
 col : :class:`~pyspark.sql.Column` or str
 The name of the column or expression to be shuffled.
+seed : :class:`~pyspark.sql.Column` or int, optional
+Seed value for the random generator.
+
+.. versionadded:: 4.0.0
 
 Returns
 ---
@@ -17752,48 +17756,51 @@ def shuffle(col: "ColumnOrName") -> Column:
 Example 1: Shuffling a simple array
 
 >>> import pyspark.sql.functions as sf
->>> df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
->>> df.select(sf.shuffle(df.data)).show() # doctest: +SKIP
-+-+
-|shuffle(data)|
-+-+
-|[1, 3, 20, 5]|
-+-+
+>>> df = spark.sql("SELECT ARRAY(1, 20, 3, 5) AS data")
+>>> df.select("*", sf.shuffle(df.data, sf.lit(123))).show()
++-+-+
+| data|shuffle(data)|
++-+-+
+|[1, 20, 3, 5]|[5, 1, 20, 3]|
++-+-+
 
 Example 2: Shuffling an array with null values
 
 >>> import pyspark.sql.functions as sf
->>> df = spark.createDataFrame([([1, 20, None, 3],)], ['data'])
->>> df.select(sf.shuffle(df.data)).show() # doctest: +SKIP
-++
-|   shuffle(data)|
-++
-|[20, 3, NULL, 1]|
-++
+>>> df = spark.sql("SELECT ARRAY(1, 20, NULL, 5) AS data")
+>>> df.select("*", sf.shuffle(sf.col("data"), 234)).show()
++++
+|data|   shuffle(data)|
++

(spark) branch master updated: [SPARK-49713][PYTHON][CONNECT] Make function `count_min_sketch` accept number arguments

2024-09-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a5ac80af8e94 [SPARK-49713][PYTHON][CONNECT] Make function `count_min_sketch` accept number arguments
a5ac80af8e94 is described below

commit a5ac80af8e94afe56105c265a94d02ef878e1de9
Author: Ruifeng Zheng 
AuthorDate: Fri Sep 20 08:29:48 2024 +0800

[SPARK-49713][PYTHON][CONNECT] Make function `count_min_sketch` accept number arguments

### What changes were proposed in this pull request?
1. Make function `count_min_sketch` accept number arguments;
2. Make argument `seed` optional;
3. Fix the type hints of `eps`/`confidence`/`seed` from `ColumnOrName` to `Column`, because they require a foldable value and do not actually accept a column name:
```
In [3]: from pyspark.sql import functions as sf

In [4]: df = spark.range(1).withColumn("seed", sf.lit(1).cast("int"))

In [5]: df.select(sf.hex(sf.count_min_sketch("id", sf.lit(0.5), sf.lit(0.5), "seed")))
...
AnalysisException: [DATATYPE_MISMATCH.NON_FOLDABLE_INPUT] Cannot resolve "count_min_sketch(id, 0.5, 0.5, seed)" due to data type mismatch: the input `seed` should be a foldable "INT" expression; however, got "seed". SQLSTATE: 42K09;
'Aggregate [unresolvedalias('hex(count_min_sketch(id#1L, 0.5, 0.5, seed#2, 0, 0)))]
+- Project [id#1L, cast(1 as int) AS seed#2]
   +- Range (0, 1, step=1, splits=Some(12))
...
```
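
With this change the same aggregation can be written with plain numbers and an optional seed (a sketch based on the new signature in the diff below):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

# eps and confidence as plain floats, explicit int seed.
df.select(sf.hex(sf.count_min_sketch("id", 0.5, 0.5, 1))).show(truncate=False)

# seed omitted: a random seed is generated under the hood.
df.select(sf.hex(sf.count_min_sketch("id", 0.5, 0.5))).show(truncate=False)
```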

### Why are the changes needed?
1. `seed` is optional in other similar functions;
2. The existing type hint is `ColumnOrName`, which is misleading since a column name is not actually supported.

### Does this PR introduce _any_ user-facing change?
yes, it supports number arguments

### How was this patch tested?
updated doctests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #48157 from zhengruifeng/py_fix_count_min_sketch.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/functions/builtin.py| 10 +--
 python/pyspark/sql/functions/builtin.py| 71 ++
 .../scala/org/apache/spark/sql/functions.scala | 12 
 3 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 2870d9c408b6..7fed175cbc8e 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -71,6 +71,7 @@ from pyspark.sql.types import (
 StringType,
 )
 from pyspark.sql.utils import enum_to_value as _enum_to_value
+from pyspark.util import JVM_INT_MAX
 
 # The implementation of pandas_udf is embedded in 
pyspark.sql.function.pandas_udf
 # for code reuse.
@@ -1126,11 +1127,12 @@ grouping_id.__doc__ = pysparkfuncs.grouping_id.__doc__
 
 def count_min_sketch(
 col: "ColumnOrName",
-eps: "ColumnOrName",
-confidence: "ColumnOrName",
-seed: "ColumnOrName",
+eps: Union[Column, float],
+confidence: Union[Column, float],
+seed: Optional[Union[Column, int]] = None,
 ) -> Column:
-return _invoke_function_over_columns("count_min_sketch", col, eps, 
confidence, seed)
+_seed = lit(random.randint(0, JVM_INT_MAX)) if seed is None else lit(seed)
+return _invoke_function_over_columns("count_min_sketch", col, lit(eps), 
lit(confidence), _seed)
 
 
 count_min_sketch.__doc__ = pysparkfuncs.count_min_sketch.__doc__
diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index c0730b193bc7..5f8d1c21a24f 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -6015,9 +6015,9 @@ def grouping_id(*cols: "ColumnOrName") -> Column:
 @_try_remote_functions
 def count_min_sketch(
 col: "ColumnOrName",
-eps: "ColumnOrName",
-confidence: "ColumnOrName",
-seed: "ColumnOrName",
+eps: Union[Column, float],
+confidence: Union[Column, float],
+seed: Optional[Union[Column, int]] = None,
 ) -> Column:
 """
 Returns a count-min sketch of a column with the given esp, confidence and 
seed.
@@ -6031,13 +6031,24 @@ def count_min_sketch(
 --
 col : :class:`~pyspark.sql.Column` or str
 target column to compute on.
-eps : :class:`~pyspark.sql.Column` or str
+eps : :class:`~pyspark.sql.Column` or float
 relative error, must be positive
-confidence : :class:`~pyspark.sql.Column` or str
+
+.. versionchanged:: 4.0.0
+`eps` now accep

(spark) branch master updated: [SPARK-49693][PYTHON][CONNECT] Refine the string representation of `timedelta`

2024-09-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 94dca78c128f [SPARK-49693][PYTHON][CONNECT] Refine the string representation of `timedelta`
94dca78c128f is described below

commit 94dca78c128ff3d1571326629b4100ee092afb54
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 19 21:10:52 2024 +0800

[SPARK-49693][PYTHON][CONNECT] Refine the string representation of `timedelta`

### What changes were proposed in this pull request?
Refine the string representation of `timedelta` by following the ISO format.
Note that the units used on the JVM side (`Duration`) and in Pandas are different.

### Why are the changes needed?
We should not leak the raw data

### Does this PR introduce _any_ user-facing change?
yes

PySpark Classic:
```
In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'PT24H1S'>
```

PySpark Connect (before):
```
In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'8640100'>
```

PySpark Connect (after):
```
In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'P1DT0H0M1S'>
```

### How was this patch tested?
added test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #48159 from zhengruifeng/pc_lit_delta.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/expressions.py | 12 +++-
 python/pyspark/sql/tests/test_column.py   | 23 ++-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/expressions.py 
b/python/pyspark/sql/connect/expressions.py
index 63128ef48e38..0b5512b61925 100644
--- a/python/pyspark/sql/connect/expressions.py
+++ b/python/pyspark/sql/connect/expressions.py
@@ -489,7 +489,17 @@ class LiteralExpression(Expression):
 ts = TimestampNTZType().fromInternal(self._value)
 if ts is not None and isinstance(ts, datetime.datetime):
 return ts.strftime("%Y-%m-%d %H:%M:%S.%f")
-# TODO(SPARK-49693): Refine the string representation of timedelta
+elif isinstance(self._dataType, DayTimeIntervalType):
+delta = DayTimeIntervalType().fromInternal(self._value)
+if delta is not None and isinstance(delta, datetime.timedelta):
+import pandas as pd
+
+# Note: timedelta itself does not provide isoformat method.
+# Both Pandas and java.time.Duration provide it, but the format
+# is sightly different:
+# java.time.Duration only applies HOURS, MINUTES, SECONDS 
units,
+# while Pandas applies all supported units.
+return pd.Timedelta(delta).isoformat()  # type: 
ignore[attr-defined]
 return f"{self._value}"
 
 
diff --git a/python/pyspark/sql/tests/test_column.py 
b/python/pyspark/sql/tests/test_column.py
index 220ecd387f7e..1972dd2804d9 100644
--- a/python/pyspark/sql/tests/test_column.py
+++ b/python/pyspark/sql/tests/test_column.py
@@ -19,12 +19,13 @@
 from enum import Enum
 from itertools import chain
 import datetime
+import unittest
 
 from pyspark.sql import Column, Row
 from pyspark.sql import functions as sf
 from pyspark.sql.types import StructType, StructField, IntegerType, LongType
 from pyspark.errors import AnalysisException, PySparkTypeError, 
PySparkValueError
-from pyspark.testing.sqlutils import ReusedSQLTestCase
+from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, 
pandas_requirement_message
 
 
 class ColumnTestsMixin:
@@ -289,6 +290,26 @@ class ColumnTestsMixin:
 ts = datetime.datetime(2021, 3, 4, 12, 34, 56, 1234)
 self.assertEqual(str(sf.lit(ts)), "Column<'2021-03-04 
12:34:56.001234'>")
 
+@unittest.skipIf(not have_pandas, pandas_requirement_message)
+def test_lit_delta_representation(self):
+for delta in [
+datetime.timedelta(days=1),
+datetime.timedelta(hours=2),
+datetime.timedelta(minutes=3),
+datetime.timedelta(seconds=4),
+datetime.timedelta(microseconds=5),
+datetime.timedelta(days=2, hours=21, microseconds=908),
+datetime.timedelta(days=1, minutes=-3, microseconds=-1001),
+datetime.timedelta(days=1, hours=2, minutes=3, seconds=4, 
microseconds=5),
+]:
+ 

(spark) branch master updated: [SPARK-49717][SQL][TESTS] Function parity test ignore private[xxx] functions

2024-09-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4068fbcc0de5 [SPARK-49717][SQL][TESTS] Function parity test ignore private[xxx] functions
4068fbcc0de5 is described below

commit 4068fbcc0de59154db9bdeb1296bd24059db9f42
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 19 21:00:57 2024 +0800

[SPARK-49717][SQL][TESTS] Function parity test ignore private[xxx] functions

### What changes were proposed in this pull request?
Function parity test ignore private functions

### Why are the changes needed?
The existing test is based on `java.lang.reflect.Modifier`, which cannot properly handle `private[xxx]`.

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #48163 from zhengruifeng/df_func_test.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../org/apache/spark/sql/DataFrameFunctionsSuite.scala | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
index f16171940df2..0842b92e5d53 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
@@ -17,10 +17,10 @@
 
 package org.apache.spark.sql
 
-import java.lang.reflect.Modifier
 import java.nio.charset.StandardCharsets
 import java.sql.{Date, Timestamp}
 
+import scala.reflect.runtime.universe.runtimeMirror
 import scala.util.Random
 
 import org.apache.spark.{QueryContextType, SPARK_DOC_ROOT, SparkException, 
SparkRuntimeException}
@@ -82,7 +82,6 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSparkSession {
   "bucket", "days", "hours", "months", "years", // Datasource v2 partition 
transformations
   "product", // Discussed in https://github.com/apache/spark/pull/30745
   "unwrap_udt",
-  "collect_top_k",
   "timestamp_add",
   "timestamp_diff"
 )
@@ -92,10 +91,13 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSparkSession {
 val word_pattern = """\w*"""
 
 // Set of DataFrame functions in org.apache.spark.sql.functions
-val dataFrameFunctions = functions.getClass
-  .getDeclaredMethods
-  .filter(m => Modifier.isPublic(m.getModifiers))
-  .map(_.getName)
+val dataFrameFunctions = runtimeMirror(getClass.getClassLoader)
+  .reflect(functions)
+  .symbol
+  .typeSignature
+  .decls
+  .filter(s => s.isMethod && s.isPublic)
+  .map(_.name.toString)
   .toSet
   .filter(_.matches(word_pattern))
   .diff(excludedDataFrameFunctions)





(spark) branch master updated: [SPARK-49692][PYTHON][CONNECT] Refine the string representation of literal date and datetime

2024-09-18 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 25d6b7a280f6 [SPARK-49692][PYTHON][CONNECT] Refine the string representation of literal date and datetime
25d6b7a280f6 is described below

commit 25d6b7a280f690c1a467f65143115cce846a732a
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 19 07:46:18 2024 +0800

[SPARK-49692][PYTHON][CONNECT] Refine the string representation of literal date and datetime

### What changes were proposed in this pull request?
Refine the string representation of literal date and datetime

### Why are the changes needed?
1. We should not represent those literals with their internal values;
2. The string representation should be consistent with PySpark Classic where possible (we cannot guarantee the representations are always the same because Connect only holds an unresolved expression, but we can try our best to do so).

### Does this PR introduce _any_ user-facing change?
yes

before:
```
In [3]: lit(datetime.date(2024, 7, 10))
Out[3]: Column<'19914'>

In [4]: lit(datetime.datetime(2024, 7, 10, 1, 2, 3, 456))
Out[4]: Column<'1720544523000456'>
```

after:
```
In [3]: lit(datetime.date(2024, 7, 10))
Out[3]: Column<'2024-07-10'>

In [4]: lit(datetime.datetime(2024, 7, 10, 1, 2, 3, 456))
Out[4]: Column<'2024-07-10 01:02:03.000456'>
```

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #48137 from zhengruifeng/py_connect_lit_dt.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/expressions.py | 16 ++--
 python/pyspark/sql/tests/test_column.py   |  9 +
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/expressions.py 
b/python/pyspark/sql/connect/expressions.py
index db1cd1c013be..63128ef48e38 100644
--- a/python/pyspark/sql/connect/expressions.py
+++ b/python/pyspark/sql/connect/expressions.py
@@ -477,8 +477,20 @@ class LiteralExpression(Expression):
 def __repr__(self) -> str:
 if self._value is None:
 return "NULL"
-else:
-return f"{self._value}"
+elif isinstance(self._dataType, DateType):
+dt = DateType().fromInternal(self._value)
+if dt is not None and isinstance(dt, datetime.date):
+return dt.strftime("%Y-%m-%d")
+elif isinstance(self._dataType, TimestampType):
+ts = TimestampType().fromInternal(self._value)
+if ts is not None and isinstance(ts, datetime.datetime):
+return ts.strftime("%Y-%m-%d %H:%M:%S.%f")
+elif isinstance(self._dataType, TimestampNTZType):
+ts = TimestampNTZType().fromInternal(self._value)
+if ts is not None and isinstance(ts, datetime.datetime):
+return ts.strftime("%Y-%m-%d %H:%M:%S.%f")
+# TODO(SPARK-49693): Refine the string representation of timedelta
+return f"{self._value}"
 
 
 class ColumnReference(Expression):
diff --git a/python/pyspark/sql/tests/test_column.py 
b/python/pyspark/sql/tests/test_column.py
index 2bd66baaa2bf..220ecd387f7e 100644
--- a/python/pyspark/sql/tests/test_column.py
+++ b/python/pyspark/sql/tests/test_column.py
@@ -18,6 +18,8 @@
 
 from enum import Enum
 from itertools import chain
+import datetime
+
 from pyspark.sql import Column, Row
 from pyspark.sql import functions as sf
 from pyspark.sql.types import StructType, StructField, IntegerType, LongType
@@ -280,6 +282,13 @@ class ColumnTestsMixin:
 when_cond = sf.when(expression, sf.lit(None))
 self.assertEqual(str(when_cond), "Column<'CASE WHEN foo THEN NULL 
END'>")
 
+def test_lit_time_representation(self):
+dt = datetime.date(2021, 3, 4)
+self.assertEqual(str(sf.lit(dt)), "Column<'2021-03-04'>")
+
+ts = datetime.datetime(2021, 3, 4, 12, 34, 56, 1234)
+self.assertEqual(str(sf.lit(ts)), "Column<'2021-03-04 
12:34:56.001234'>")
+
 def test_enum_literals(self):
 class IntEnum(Enum):
 X = 1





(spark) branch master updated: [SPARK-49640][PS] Apply reservoir sampling in `SampledPlotBase`

2024-09-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a7f191ba5947 [SPARK-49640][PS] Apply reservoir sampling in `SampledPlotBase`
a7f191ba5947 is described below

commit a7f191ba5947075066154a33da7908b24c412ccb
Author: Ruifeng Zheng 
AuthorDate: Wed Sep 18 08:44:22 2024 +0800

[SPARK-49640][PS] Apply reservoir sampling in `SampledPlotBase`

### What changes were proposed in this pull request?
Apply reservoir sampling in `SampledPlotBase`

### Why are the changes needed?
The existing sampling approach has two drawbacks:

1. It needs two jobs to sample `max_rows` rows:

- `df.count()` to compute `fraction = max_rows / count`
- `df.sample(fraction).to_pandas()` to do the sampling

2. `df.sample` is based on Bernoulli sampling, which **cannot** guarantee that the sampled size equals the expected `max_rows`, e.g.
```
In [1]: df = spark.range(10000)

In [2]: [df.sample(0.01).count() for i in range(0, 10)]
Out[2]: [96, 97, 95, 97, 105, 105, 105, 87, 95, 110]
```
The size of the sampled data floats around the target size 10000 * 0.01 = 100.
This relative deviation cannot be ignored when the input dataset is large and the sampling fraction is small.
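
A condensed sketch of the exact-size sampling idea used by the patch (order rows by a random column, keep the first `max_rows`, then restore the original order via a monotonically increasing id); the column names here are illustrative, and the real implementation is in the diff below:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10000)
max_rows = 100

sampled = (
    sdf.select("*", F.rand().alias("__rand__"), F.monotonically_increasing_id().alias("__id__"))
    .sort("__rand__")                 # random order
    .limit(max_rows + 1)              # hard upper bound on the sample size
    .coalesce(1)
    .sortWithinPartitions("__id__")   # restore the original row order
    .drop("__rand__", "__id__")
)
```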

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI and manually check

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #48105 from zhengruifeng/ps_sampling.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/plot/core.py | 51 +++---
 1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/python/pyspark/pandas/plot/core.py 
b/python/pyspark/pandas/plot/core.py
index 067c7db664de..7630ecc39895 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -68,19 +68,52 @@ class SampledPlotBase:
 def get_sampled(self, data):
 from pyspark.pandas import DataFrame, Series
 
+if not isinstance(data, (DataFrame, Series)):
+raise TypeError("Only DataFrame and Series are supported for 
plotting.")
+if isinstance(data, Series):
+data = data.to_frame()
+
 fraction = get_option("plotting.sample_ratio")
-if fraction is None:
-fraction = 1 / (len(data) / get_option("plotting.max_rows"))
-fraction = min(1.0, fraction)
-self.fraction = fraction
-
-if isinstance(data, (DataFrame, Series)):
-if isinstance(data, Series):
-data = data.to_frame()
+if fraction is not None:
+self.fraction = fraction
 sampled = 
data._internal.resolved_copy.spark_frame.sample(fraction=self.fraction)
 return DataFrame(data._internal.with_new_sdf(sampled))._to_pandas()
 else:
-raise TypeError("Only DataFrame and Series are supported for 
plotting.")
+from pyspark.sql import Observation
+
+max_rows = get_option("plotting.max_rows")
+observation = Observation("ps plotting")
+sdf = data._internal.resolved_copy.spark_frame.observe(
+observation, F.count(F.lit(1)).alias("count")
+)
+
+rand_col_name = "__ps_plotting_sampled_plot_base_rand__"
+id_col_name = "__ps_plotting_sampled_plot_base_id__"
+
+sampled = (
+sdf.select(
+"*",
+F.rand().alias(rand_col_name),
+F.monotonically_increasing_id().alias(id_col_name),
+)
+.sort(rand_col_name)
+.limit(max_rows + 1)
+.coalesce(1)
+.sortWithinPartitions(id_col_name)
+.drop(rand_col_name, id_col_name)
+)
+
+pdf = DataFrame(data._internal.with_new_sdf(sampled))._to_pandas()
+
+if len(pdf) > max_rows:
+try:
+self.fraction = float(max_rows) / observation.get["count"]
+except Exception:
+pass
+return pdf[:max_rows]
+else:
+self.fraction = 1.0
+return pdf
 
 def set_result_text(self, ax):
 assert hasattr(self, "fraction")





(spark) branch master updated: [SPARK-49531][PYTHON][CONNECT] Support line plot with plotly backend

2024-09-12 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3b8dddac65bc [SPARK-49531][PYTHON][CONNECT] Support line plot with plotly backend
3b8dddac65bc is described below

commit 3b8dddac65bce6f88f51e23e777d521d65fa3373
Author: Xinrong Meng 
AuthorDate: Fri Sep 13 09:21:20 2024 +0800

[SPARK-49531][PYTHON][CONNECT] Support line plot with plotly backend

### What changes were proposed in this pull request?
Support line plot with plotly backend on both Spark Connect and Spark Classic.

### Why are the changes needed?
While Pandas on Spark supports plotting, PySpark currently lacks this feature. The proposed API will enable users to generate visualizations, such as line plots, by leveraging libraries like Plotly. This will provide users with an intuitive, interactive way to explore and understand large datasets directly from PySpark DataFrames, streamlining the data analysis workflow in distributed environments.

See more at [PySpark Plotting API Specification](https://docs.google.com/document/d/1IjOEzC8zcetG86WDvqkereQPj_NGLNW7Bdu910g30Dg/edit?usp=sharing) in progress.

Part of https://issues.apache.org/jira/browse/SPARK-49530.

### Does this PR introduce _any_ user-facing change?
Yes.

```python
>>> data = [("A", 10, 1.5), ("B", 30, 2.5), ("C", 20, 3.5)]
>>> columns = ["category", "int_val", "float_val"]
>>> sdf = spark.createDataFrame(data, columns)
>>> sdf.show()
++---+-+
|category|int_val|float_val|
++---+-+
|   A| 10|  1.5|
|   B| 30|  2.5|
|   C| 20|  3.5|
++---+-+

>>> f = sdf.plot(kind="line", x="category", y="int_val")
>>> f.show()  # see below
>>> g = sdf.plot.line(x="category", y=["int_val", "float_val"])
>>> g.show()  # see below
```
`f.show()`:

![newplot](https://github.com/user-attachments/assets/ebd50bbc-0dd1-437f-ae0c-0b4de8f3c722)

`g.show()`:
![newplot (1)](https://github.com/user-attachments/assets/46d28840-a147-428f-8d88-d424aa76ad06)

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48008 from xinrong-meng/plot_line.

Authored-by: Xinrong Meng 
Signed-off-by: Ruifeng Zheng 
---
 dev/sparktestsupport/modules.py|   4 +
 python/pyspark/errors/error-conditions.json|   5 +
 python/pyspark/sql/classic/dataframe.py|   5 +
 python/pyspark/sql/connect/dataframe.py|   5 +
 python/pyspark/sql/dataframe.py|  27 +
 python/pyspark/sql/plot/__init__.py|  21 
 python/pyspark/sql/plot/core.py| 135 +
 python/pyspark/sql/plot/plotly.py  |  30 +
 .../sql/tests/connect/test_parity_frame_plot.py|  36 ++
 .../tests/connect/test_parity_frame_plot_plotly.py |  36 ++
 python/pyspark/sql/tests/plot/__init__.py  |  16 +++
 python/pyspark/sql/tests/plot/test_frame_plot.py   |  79 
 .../sql/tests/plot/test_frame_plot_plotly.py   |  64 ++
 python/pyspark/sql/utils.py|  17 +++
 python/pyspark/testing/sqlutils.py |   7 ++
 .../org/apache/spark/sql/internal/SQLConf.scala|  27 +
 16 files changed, 514 insertions(+)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 34fbb8450d54..b9a4bed715f6 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -548,6 +548,8 @@ pyspark_sql = Module(
 "pyspark.sql.tests.test_udtf",
 "pyspark.sql.tests.test_utils",
 "pyspark.sql.tests.test_resources",
+"pyspark.sql.tests.plot.test_frame_plot",
+"pyspark.sql.tests.plot.test_frame_plot_plotly",
 ],
 )
 
@@ -1051,6 +1053,8 @@ pyspark_connect = Module(
 "pyspark.sql.tests.connect.test_parity_arrow_cogrouped_map",
 "pyspark.sql.tests.connect.test_parity_python_datasource",
 "pyspark.sql.tests.connect.test_parity_python_streaming_datasource",
+"pyspark.sql.tests.connect.test_parity_frame_plot",
+"pyspark.sql.tests.connect.test_parity_frame_plot_plotly",
 "pyspark.sql.tests.connect.test_utils",
 "pyspark.sql.tests.connect.client.test_artifact",
 "

(spark) branch master updated (e918fb65f9bc -> ab7aea144da4)

2024-09-10 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e918fb65f9bc [SPARK-49085][FOLLOWUP] Update the scope of `spark-protobuf`
 add ab7aea144da4 [MINOR][DOCS] Fix scaladoc for `FlatMapGroupsInArrowExec` and `FlatMapCoGroupsInArrowExec`

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/python/FlatMapCoGroupsInArrowExec.scala   | 8 
 .../spark/sql/execution/python/FlatMapGroupsInArrowExec.scala | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)





(spark) branch master updated: [SPARK-49540][PS] Unify the usage of `distributed_sequence_id`

2024-09-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a3b918eb94c1 [SPARK-49540][PS] Unify the usage of `distributed_sequence_id`
a3b918eb94c1 is described below

commit a3b918eb94c1ad49bf8bdfddf31d40a346e0fafb
Author: Ruifeng Zheng 
AuthorDate: Mon Sep 9 12:26:23 2024 +0800

[SPARK-49540][PS] Unify the usage of `distributed_sequence_id`

### What changes were proposed in this pull request?

In PySpark Classic, it was used via a DataFrame method `withSequenceColumn`, while in PySpark Connect, it was used as an internal function.

This PR unifies the usage of `distributed_sequence_id`.
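
After the unification, both PySpark Classic and Spark Connect can go through the same helper; a minimal sketch mirroring the diff below (`SF` is `pyspark.pandas.spark.functions`, and the helper name here is illustrative):

```python
from pyspark.pandas.spark import functions as SF

def attach_default_index(sdf, column_name):
    # Same code path on PySpark Classic and Spark Connect after this change.
    return sdf.select(SF.distributed_sequence_id().alias(column_name), "*")
```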

### Why are the changes needed?
code refactoring

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #48028 from zhengruifeng/func_withSequenceColumn.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/internal.py  | 18 +-
 python/pyspark/pandas/spark/functions.py   | 12 
 .../src/main/scala/org/apache/spark/sql/Dataset.scala  |  8 
 .../apache/spark/sql/api/python/PythonSQLUtils.scala   |  7 +++
 .../org/apache/spark/sql/DataFrameSelfJoinSuite.scala  |  5 +++--
 .../scala/org/apache/spark/sql/DataFrameSuite.scala|  4 +++-
 6 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/pandas/internal.py 
b/python/pyspark/pandas/internal.py
index 92d4a3357319..4be345201ba6 100644
--- a/python/pyspark/pandas/internal.py
+++ b/python/pyspark/pandas/internal.py
@@ -43,6 +43,7 @@ from pyspark.sql.types import (  # noqa: F401
 )
 from pyspark.sql.utils import is_timestamp_ntz_preferred, is_remote
 from pyspark import pandas as ps
+from pyspark.pandas.spark import functions as SF
 from pyspark.pandas._typing import Label
 from pyspark.pandas.spark.utils import as_nullable_spark_type, 
force_decimal_precision_scale
 from pyspark.pandas.data_type_ops.base import DataTypeOps
@@ -938,19 +939,10 @@ class InternalFrame:
 ++---+
 """
 if len(sdf.columns) > 0:
-if is_remote():
-from pyspark.sql.connect.column import Column as ConnectColumn
-from pyspark.sql.connect.expressions import 
DistributedSequenceID
-
-return sdf.select(
-ConnectColumn(DistributedSequenceID()).alias(column_name),
-"*",
-)
-else:
-return PySparkDataFrame(
-sdf._jdf.toDF().withSequenceColumn(column_name),
-sdf.sparkSession,
-)
+return sdf.select(
+SF.distributed_sequence_id().alias(column_name),
+"*",
+)
 else:
 cnt = sdf.count()
 if cnt > 0:
diff --git a/python/pyspark/pandas/spark/functions.py 
b/python/pyspark/pandas/spark/functions.py
index 6aaa63956c14..4bcf07f6f650 100644
--- a/python/pyspark/pandas/spark/functions.py
+++ b/python/pyspark/pandas/spark/functions.py
@@ -174,6 +174,18 @@ def null_index(col: Column) -> Column:
 return Column(sc._jvm.PythonSQLUtils.nullIndex(col._jc))
 
 
+def distributed_sequence_id() -> Column:
+if is_remote():
+from pyspark.sql.connect.functions.builtin import _invoke_function
+
+return _invoke_function("distributed_sequence_id")
+else:
+from pyspark import SparkContext
+
+sc = SparkContext._active_spark_context
+return Column(sc._jvm.PythonSQLUtils.distributed_sequence_id())
+
+
 def collect_top_k(col: Column, num: int, reverse: bool) -> Column:
 if is_remote():
 from pyspark.sql.connect.functions.builtin import 
_invoke_function_over_columns
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 870571b533d0..0fab60a94842 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2010,14 +2010,6 @@ class Dataset[T] private[sql](
   // For Python API
   
 
-  /**
-   * It adds a new long column with the name `name` that increases one by one.
-   * This is for 'distributed-sequence' default index in pandas API on Spark.
-   */
-  private[sql] def withSequenceColumn(name: String) = {
-select(column(DistributedSequenceID()).alias(name), col("*"))
-  }
-
   /**
* Converts a JavaR

(spark) branch master updated (39d4bd8b3d99 -> 339d1c9d9d50)

2024-09-03 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 39d4bd8b3d99 [SPARK-49504][BUILD] Add `jjwt` profile
 add 339d1c9d9d50 [SPARK-49202][PS] Apply `ArrayBinarySearch` for histogram

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/plot/core.py | 34 +-
 python/pyspark/pandas/spark/functions.py   | 13 +
 .../spark/sql/api/python/PythonSQLUtils.scala  |  3 ++
 3 files changed, 29 insertions(+), 21 deletions(-)





(spark) branch master updated: [SPARK-49203][SQL] Add expression for `java.util.Arrays.binarySearch`

2024-09-02 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 32b054c40602 [SPARK-49203][SQL] Add expression for `java.util.Arrays.binarySearch`
32b054c40602 is described below

commit 32b054c40602c7355176903fa32224774f0c1bec
Author: panbingkun 
AuthorDate: Tue Sep 3 14:47:47 2024 +0800

[SPARK-49203][SQL] Add expression for `java.util.Arrays.binarySearch`

### What changes were proposed in this pull request?
The PR aims to add an expression `array_binary_search` for `java.util.Arrays.binarySearch`.

### Why are the changes needed?
We can use it to implement the `histogram` plot on the client side (no longer needing to depend on MLlib's `Bucketizer`).
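
For intuition, a conceptual illustration in plain Python (not the new Spark expression): assigning a value to a histogram bucket is a binary search over the sorted bucket boundaries, which is the operation the new expression performs on array columns:

```python
import bisect

# Bucket boundaries for 3 buckets: [0, 1), [1, 2), [2, 3].
boundaries = [0.0, 1.0, 2.0, 3.0]
values = [0.5, 1.5, 2.5]

# Each value's bucket index is found by binary search over the boundaries.
buckets = [bisect.bisect_right(boundaries, v) - 1 for v in values]
print(buckets)  # [0, 1, 2]
```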

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add new UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47741 from panbingkun/SPARK-49203.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .../catalyst/expressions/ArrayExpressionUtils.java | 176 +
 .../sql/catalyst/analysis/FunctionRegistry.scala   |   1 +
 .../expressions/collectionOperations.scala | 136 
 .../expressions/CollectionExpressionsSuite.scala   |  79 +
 4 files changed, 392 insertions(+)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ArrayExpressionUtils.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ArrayExpressionUtils.java
new file mode 100644
index ..ff6525acbe53
--- /dev/null
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ArrayExpressionUtils.java
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.expressions;
+
+import java.util.Arrays;
+import java.util.Comparator;
+
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.SQLOrderingUtil;
+import org.apache.spark.sql.types.ByteType$;
+import org.apache.spark.sql.types.BooleanType$;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DoubleType$;
+import org.apache.spark.sql.types.FloatType$;
+import org.apache.spark.sql.types.IntegerType$;
+import org.apache.spark.sql.types.LongType$;
+import org.apache.spark.sql.types.ShortType$;
+
+public class ArrayExpressionUtils {
+
+  private static final Comparator booleanComp = (o1, o2) -> {
+if (o1 == null && o2 == null) {
+  return 0;
+} else if (o1 == null) {
+  return -1;
+} else if (o2 == null) {
+  return 1;
+}
+boolean c1 = (Boolean) o1, c2 = (Boolean) o2;
+return c1 == c2 ? 0 : (c1 ? 1 : -1);
+  };
+
+  private static final Comparator byteComp = (o1, o2) -> {
+if (o1 == null && o2 == null) {
+  return 0;
+} else if (o1 == null) {
+  return -1;
+} else if (o2 == null) {
+  return 1;
+}
+byte c1 = (Byte) o1, c2 = (Byte) o2;
+return Byte.compare(c1, c2);
+  };
+
+  private static final Comparator shortComp = (o1, o2) -> {
+if (o1 == null && o2 == null) {
+  return 0;
+} else if (o1 == null) {
+  return -1;
+} else if (o2 == null) {
+  return 1;
+}
+short c1 = (Short) o1, c2 = (Short) o2;
+return Short.compare(c1, c2);
+  };
+
+  private static final Comparator integerComp = (o1, o2) -> {
+if (o1 == null && o2 == null) {
+  return 0;
+} else if (o1 == null) {
+  return -1;
+} else if (o2 == null) {
+  return 1;
+}
+int c1 = (Integer) o1, c2 = (Integer) o2;
+return Integer.compare(c1, c2);
+  };
+
+  private static final Comparator longComp = (o1, o2) -> {
+if (o1 == null && o2 == null) {
+  return 0;
+} else if (o1 == null) {
+  return -1;
+} else if (o2 == null) {
+  return 1;
+}
+long c1 = (Long) o1, c2 = (Long) o2;
+return Long.compare(c1, c2);

(spark) branch master updated (8879df5fc12b -> 783c055d05d9)

2024-09-02 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8879df5fc12b [SPARK-49451] Allow duplicate keys in parse_json
 add 783c055d05d9 [SPARK-49441][ML] `StringIndexer` sort arrays in executors

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/ml/feature/StringIndexer.scala| 82 ++
 1 file changed, 23 insertions(+), 59 deletions(-)





(spark) branch master updated (1c9cde59ba65 -> 54edfd3cdb1b)

2024-08-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1c9cde59ba65 [SPARK-49402][PYTHON][FOLLOW-UP] Manually load ~/.profile 
in Spark Connect notebook
 add 54edfd3cdb1b [SPARK-49412][PS] Compute all box plot metrics in single 
job

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/plot/core.py | 263 +
 python/pyspark/pandas/plot/matplotlib.py   |  36 ++-
 python/pyspark/pandas/plot/plotly.py   |  94 +++-
 .../pyspark/pandas/tests/plot/test_frame_plot.py   |  52 ++--
 .../pyspark/pandas/tests/plot/test_series_plot.py  |  57 ++---
 5 files changed, 160 insertions(+), 342 deletions(-)





(spark) branch master updated: [SPARK-49357][CONNECT][PYTHON] Vertically truncate deeply nested protobuf message

2024-08-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6d8235f3b2bb [SPARK-49357][CONNECT][PYTHON] Vertically truncate deeply nested protobuf message
6d8235f3b2bb is described below

commit 6d8235f3b2bbaa88b10c35d6eecddffa4d1b04a4
Author: Changgyoo Park 
AuthorDate: Wed Aug 28 10:58:41 2024 +0800

[SPARK-49357][CONNECT][PYTHON] Vertically truncate deeply nested protobuf message

### What changes were proposed in this pull request?

Add a new message truncation strategy to limit the nesting level since the existing truncation strategies do not apply well to a deeply nested and large protobuf message.

### Why are the changes needed?

There are instances where deeply nested protobuf messages cause performance problems on the client side when the logger is turned on.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test scenario to test_truncate_message in test_connect_basic.py.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47891 from changgyoopark-db/SPARK-49357.

Authored-by: Changgyoo Park 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/client/core.py   | 21 -
 .../pyspark/sql/tests/connect/test_connect_basic.py | 10 ++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/connect/client/core.py 
b/python/pyspark/sql/connect/client/core.py
index 723a11b35c26..35dcf677fdb7 100644
--- a/python/pyspark/sql/connect/client/core.py
+++ b/python/pyspark/sql/connect/client/core.py
@@ -993,20 +993,25 @@ class SparkConnectClient(object):
 --
 p : google.protobuf.message.Message
 Generic Message type
+truncate: bool
+Indicates whether to truncate the message
 
 Returns
 ---
 Single line string of the serialized proto message.
 """
 try:
-p2 = self._truncate(p) if truncate else p
+max_level = 8 if truncate else sys.maxsize
+p2 = self._truncate(p, max_level) if truncate else p
 return text_format.MessageToString(p2, as_one_line=True)
 except RecursionError:
 return ""
 except Exception:
 return ""
 
-def _truncate(self, p: google.protobuf.message.Message) -> 
google.protobuf.message.Message:
+def _truncate(
+self, p: google.protobuf.message.Message, allowed_recursion_depth: int
+) -> google.protobuf.message.Message:
 """
 Helper method to truncate the protobuf message.
 Refer to 'org.apache.spark.sql.connect.common.Abbreviator' in the 
server side.
@@ -1029,11 +1034,17 @@ class SparkConnectClient(object):
 field_name = descriptor.name
 
 if descriptor.type == descriptor.TYPE_MESSAGE:
-if descriptor.label == descriptor.LABEL_REPEATED:
+if allowed_recursion_depth == 0:
+p2.ClearField(field_name)
+elif descriptor.label == descriptor.LABEL_REPEATED:
 p2.ClearField(field_name)
-getattr(p2, field_name).extend([self._truncate(v) for 
v in value])
+getattr(p2, field_name).extend(
+[self._truncate(v, allowed_recursion_depth - 1) 
for v in value]
+)
 else:
-getattr(p2, field_name).CopyFrom(self._truncate(value))
+getattr(p2, field_name).CopyFrom(
+self._truncate(value, allowed_recursion_depth - 1)
+)
 
 elif descriptor.type == descriptor.TYPE_STRING:
 if descriptor.label == descriptor.LABEL_REPEATED:
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 07fda95e6548..f084601d2e7b 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -1434,6 +1434,16 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 proto_string_truncated_2 = 
self.connect._client._proto_to_string(plan2, True)
 self.assertTrue(len(proto_string_truncated_2) < 8000, 
len(proto_string_truncated_2))
 
+cdf3 = cdf1.select("a" * 4096)
+for _ in range(64):
+cdf3 = cdf3.select("a" * 4096)
+plan3 = cdf3._plan.to_proto(self.connect._client)
+
+proto_string_3 = self.connect._cl

(spark) branch master updated: [SPARK-49366][CONNECT] Treat Union node as leaf in dataframe column resolution

2024-08-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d64d1f76ff28 [SPARK-49366][CONNECT] Treat Union node as leaf in 
dataframe column resolution
d64d1f76ff28 is described below

commit d64d1f76ff28f97db0ad3f3647ae2683e80095a2
Author: Ruifeng Zheng 
AuthorDate: Tue Aug 27 09:23:52 2024 +0800

[SPARK-49366][CONNECT] Treat Union node as leaf in dataframe column 
resolution

### What changes were proposed in this pull request?

Treat Union node as leaf in column resolution

### Why are the changes needed?
bug fix:
```
from pyspark.sql.functions import concat, lit, col
df1 = spark.range(10).withColumn("value", lit(1))
df2 = df1.union(df1)
df1.join(df2, df1.id == df2.id, "left").show()
```
fails with `AMBIGUOUS_COLUMN_REFERENCE`

```
resolveExpressionByPlanChildren: e = '`==`('id, 'id)
resolveExpressionByPlanChildren: q =
'[id=63]Join LeftOuter, '`==`('id, 'id)
:- [id=61]Project [id#550L, 1 AS value#553]
:  +- Range (0, 10, step=1, splits=Some(12))
+- [id=62]Union false, false
   :- [id=61]Project [id#564L, 1 AS value#565]
   :  +- Range (0, 10, step=1, splits=Some(12))
   +- [id=61]Project [id#566L, 1 AS value#567]
  +- Range (0, 10, step=1, splits=Some(12))

'id with id = 61

[id=61]Project [id#564L, 1 AS value#565]
+- Range (0, 10, step=1, splits=Some(12))

[id=61]Project [id#566L, 1 AS value#567]
+- Range (0, 10, step=1, splits=Some(12))

resolved: Vector((Some((id#564L,1)),true), (Some((id#566L,1)),true))
```

When resolving `'id with id = 61`, the existing detection fails in the second child.
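
For reference, a minimal sketch of what the added test asserts after the fix (assuming an active SparkSession bound to `spark`):
```
from pyspark.sql.functions import lit

df1 = spark.range(10).withColumn("value", lit(1))
df2 = df1.union(df1)
df3 = df1.join(df2, df1.id == df2.id, "left")

# the join now resolves; each left row matches the two copies in the union
assert df3.columns == ["id", "value", "id", "value"]
assert df3.count() == 20
```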

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47853 from zhengruifeng/fix_ambgious_union.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_dataframe.py | 14 ++
 .../sql/catalyst/analysis/ColumnResolutionHelper.scala |  7 ++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/test_dataframe.py 
b/python/pyspark/sql/tests/test_dataframe.py
index 7dd42eecde7f..4e2d3b9ba42a 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -130,6 +130,20 @@ class DataFrameTestsMixin:
 self.assertTrue(df3.columns, ["aa", "b", "a", "b"])
 self.assertTrue(df3.count() == 2)
 
+def test_self_join_III(self):
+df1 = self.spark.range(10).withColumn("value", lit(1))
+df2 = df1.union(df1)
+df3 = df1.join(df2, df1.id == df2.id, "left")
+self.assertTrue(df3.columns, ["id", "value", "id", "value"])
+self.assertTrue(df3.count() == 20)
+
+def test_self_join_IV(self):
+df1 = self.spark.range(10).withColumn("value", lit(1))
+df2 = df1.withColumn("value", lit(2)).union(df1.withColumn("value", 
lit(3)))
+df3 = df1.join(df2, df1.id == df2.id, "right")
+self.assertTrue(df3.columns, ["id", "value", "id", "value"])
+self.assertTrue(df3.count() == 20)
+
 def test_duplicated_column_names(self):
 df = self.spark.createDataFrame([(1, 2)], ["c", "c"])
 row = df.select("*").first()
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
index c10e000a098c..1947c884694b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
@@ -585,7 +585,12 @@ trait ColumnResolutionHelper extends Logging with 
DataTypeErrorsBase {
   }
   (resolved.map(r => (r, currentDepth)), true)
 } else {
-  resolveDataFrameColumnByPlanId(u, id, isMetadataAccess, p.children, 
currentDepth + 1)
+  val children = p match {
+// treat Union node as the leaf node
+case _: Union => Seq.empty[LogicalPlan]
+case _ => p.children
+  }
+  resolveDataFrameColumnByPlanId(u, id, isMetadataAccess, children, 
currentDepth + 1)
 }
 
 // In self join case like:





(spark) branch master updated: [SPARK-49391][PS] Box plot select outliers by distance from fences

2024-08-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b1ddec5757ae [SPARK-49391][PS] Box plot select outliers by distance 
from fences
b1ddec5757ae is described below

commit b1ddec5757aeef69bdd4b08f4f75096b129f5d31
Author: Ruifeng Zheng 
AuthorDate: Mon Aug 26 18:10:36 2024 +0800

[SPARK-49391][PS] Box plot select outliers by distance from fences

### What changes were proposed in this pull request?
Box plot select outliers by distance from fences

### Why are the changes needed?
If there are more than 1k outliers, the existing implementation selects the values by the distance `|value - min(non_outliers)|`, which is not reasonable because it prefers outliers above the upper fence over outliers below the lower fence.
We should order them by their distance from the fences (a small illustration follows below):
1, if value > upper fence,  value - upper fence;
2, if value < lower fence,  lower fence - value;
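
The selection rule above, written out as a plain function (illustration only, not part of the patch):
```
def fence_distance(value, lfence, ufence):
    if value > ufence:
        return value - ufence
    if value < lfence:
        return lfence - value
    return None  # inside the fences: not an outlier

assert fence_distance(12.0, 1.0, 9.0) == 3.0
assert fence_distance(-4.0, 1.0, 9.0) == 5.0
```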

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI and manually test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47870 from zhengruifeng/plot_hist_select_outlier.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/plot/core.py | 42 --
 python/pyspark/pandas/plot/matplotlib.py   |  2 +-
 python/pyspark/pandas/plot/plotly.py   |  4 +--
 .../pyspark/pandas/tests/plot/test_series_plot.py  |  2 +-
 4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/python/pyspark/pandas/plot/core.py 
b/python/pyspark/pandas/plot/core.py
index 2e188b411df1..fe5beb0e730d 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -420,14 +420,24 @@ class BoxPlotBase:
 return minmax.iloc[0][["min", "max"]].values
 
 @staticmethod
-def get_fliers(colname, outliers, min_val):
+def get_fliers(colname, outliers, lfence, ufence):
 # Filters only the outliers, should "showfliers" be True
 fliers_df = outliers.filter("`__{}_outlier`".format(colname))
 
 # If it shows fliers, take the top 1k with highest absolute values
-# Here we normalize the values by subtracting the minimum value from
-# each, and use absolute values.
-order_col = F.abs(F.col("`{}`".format(colname)) - min_val.item())
+# Here we normalize the values by subtracting the fences.
+formated_colname = "`{}`".format(colname)
+order_col = (
+F.when(
+F.col(formated_colname) > F.lit(ufence),
+F.col(formated_colname) - F.lit(ufence),
+)
+.when(
+F.col(formated_colname) < F.lit(lfence),
+F.lit(lfence) - F.col(formated_colname),
+)
+.otherwise(F.lit(None))
+)
 fliers = (
 fliers_df.select(F.col("`{}`".format(colname)))
 .orderBy(order_col)
@@ -439,15 +449,26 @@ class BoxPlotBase:
 return fliers
 
 @staticmethod
-def get_multicol_fliers(colnames, multicol_outliers, multicol_whiskers):
+def get_multicol_fliers(colnames, multicol_outliers, multicol_stats):
 scols = []
-extract_colnames = []
 for i, colname in enumerate(colnames):
 formated_colname = "`{}`".format(colname)
 outlier_colname = "__{}_outlier".format(colname)
-min_val = multicol_whiskers[colname]["min"]
+lfence, ufence = multicol_stats[colname]["lfence"], 
multicol_stats[colname]["ufence"]
+order_col = (
+F.when(
+F.col(formated_colname) > F.lit(ufence),
+F.col(formated_colname) - F.lit(ufence),
+)
+.when(
+F.col(formated_colname) < F.lit(lfence),
+F.lit(lfence) - F.col(formated_colname),
+)
+.otherwise(F.lit(None))
+)
+
 pair_col = F.struct(
-F.abs(F.col(formated_colname) - F.lit(min_val)).alias("ord"),
+order_col.alias("ord"),
 F.col(formated_colname).alias("val"),
 )
 scols.append(
@@ -457,11 +478,10 @@ class BoxPlotBase:
 .alias(f"pair_{i}"),
 1001,
 False,
-).alias(f"top_{i}")
+).alias(f"top_{i}")["val"]
 )
-extract_colnames.append(f"top_{i

(spark) branch master updated: [SPARK-49382][PS] Make frame box plot properly render the fliers/outliers

2024-08-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8409da3d1815 [SPARK-49382][PS] Make frame box plot properly render the 
fliers/outliers
8409da3d1815 is described below

commit 8409da3d1815832132cd1006290679c0bed7d9f4
Author: Ruifeng Zheng 
AuthorDate: Mon Aug 26 13:12:55 2024 +0800

[SPARK-49382][PS] Make frame box plot properly render the fliers/outliers

### What changes were proposed in this pull request?
Fliers/outliers were ignored in the initial implementation:
https://github.com/apache/spark/pull/36317

### Why are the changes needed?
feature parity between the frame and Series box plots

### Does this PR introduce _any_ user-facing change?

```
import pyspark.pandas as ps
df = ps.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1], [6.4, 3.2, 1], [5.9, 3.0, 2], [100, 200, 300]], columns=['length', 'width', 'species'])
df.boxplot()
```

`df.length.plot.box()`

![image](https://github.com/user-attachments/assets/43da563c-5f68-4305-ad27-a4f04815dfd1)

before:
`df.boxplot()`

![image](https://github.com/user-attachments/assets/e25c2760-c12a-4801-a730-3987a020f889)

after:
`df.boxplot()`

![image](https://github.com/user-attachments/assets/c19f13b1-b9e4-423e-bcec-0c47c1c8df32)

### How was this patch tested?
CI and manually check

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47866 from zhengruifeng/plot_hist_fly.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/plot/core.py | 32 ++
 python/pyspark/pandas/plot/plotly.py   | 10 ++-
 python/pyspark/pandas/spark/functions.py   | 13 +
 .../spark/sql/api/python/PythonSQLUtils.scala  |  3 ++
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/plot/core.py 
b/python/pyspark/pandas/plot/core.py
index c1dc7d2dc621..2e188b411df1 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -26,6 +26,7 @@ from pandas.core.dtypes.inference import is_integer
 
 from pyspark.sql import functions as F, Column
 from pyspark.sql.types import DoubleType
+from pyspark.pandas.spark import functions as SF
 from pyspark.pandas.missing import unsupported_function
 from pyspark.pandas.config import get_option
 from pyspark.pandas.utils import name_like_string
@@ -437,6 +438,37 @@ class BoxPlotBase:
 
 return fliers
 
+@staticmethod
+def get_multicol_fliers(colnames, multicol_outliers, multicol_whiskers):
+scols = []
+extract_colnames = []
+for i, colname in enumerate(colnames):
+formated_colname = "`{}`".format(colname)
+outlier_colname = "__{}_outlier".format(colname)
+min_val = multicol_whiskers[colname]["min"]
+pair_col = F.struct(
+F.abs(F.col(formated_colname) - F.lit(min_val)).alias("ord"),
+F.col(formated_colname).alias("val"),
+)
+scols.append(
+SF.collect_top_k(
+F.when(F.col(outlier_colname), pair_col)
+.otherwise(F.lit(None))
+.alias(f"pair_{i}"),
+1001,
+False,
+).alias(f"top_{i}")
+)
+extract_colnames.append(f"top_{i}.val")
+
+results = 
multicol_outliers.select(scols).select(extract_colnames).first()
+
+fliers = {}
+for i, colname in enumerate(colnames):
+fliers[colname] = results[i]
+
+return fliers
+
 
 class KdePlotBase(NumericPlotBase):
 @staticmethod
diff --git a/python/pyspark/pandas/plot/plotly.py 
b/python/pyspark/pandas/plot/plotly.py
index 4de313b1e831..0afcd6d7e869 100644
--- a/python/pyspark/pandas/plot/plotly.py
+++ b/python/pyspark/pandas/plot/plotly.py
@@ -199,11 +199,19 @@ def plot_box(data: Union["ps.DataFrame", "ps.Series"], 
**kwargs):
 # Computes min and max values of non-outliers - the whiskers
 whiskers = BoxPlotBase.calc_multicol_whiskers(numeric_column_names, 
outliers)
 
+fliers = None
+if boxpoints:
+fliers = BoxPlotBase.get_multicol_fliers(numeric_column_names, 
outliers, whiskers)
+
 i = 0
 for colname in numeric_column_names:
 col_stats = multicol_stats[colname]
 col_whiskers = whiskers[colname]
 
+col_fliers = None
+if fliers is not None and colname in fliers and 
len(fliers[colname]) > 0:
+ 

(spark) branch master updated: [SPARK-49367][PS] Parallelize the KDE computation for multiple columns (plotly backend)

2024-08-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4b7191d43b4b [SPARK-49367][PS] Parallelize the KDE computation for 
multiple columns (plotly backend)
4b7191d43b4b is described below

commit 4b7191d43b4b505faa1e26481311c5e83e6340e5
Author: Ruifeng Zheng 
AuthorDate: Mon Aug 26 09:11:16 2024 +0800

[SPARK-49367][PS] Parallelize the KDE computation for multiple columns 
(plotly backend)

### What changes were proposed in this pull request?
Parallelize the KDE computation for the `plotly` backend.

Note that the `matplotlib` backend is not optimized in this PR, because the computation logic differs slightly between `plotly` and `matplotlib`:
1, `plotly`: compute a global `ind` across all input columns, and then compute all curves based on it (see the sketch below);
2, `matplotlib`: for each input column, compute its own `ind` and then the curve;

I think `matplotlib`'s approach is more reasonable, but it means this optimization cannot be applied to `matplotlib` directly, so that needs more investigation.
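
A minimal sketch of the one-pass pattern used for the `plotly` backend (illustrative data; the plain `avg` here stands in for the per-point KDE expressions):
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)], ["x", "y"])

# one aggregate column per input column, evaluated in a single select,
# i.e. one Spark job instead of one job per column
agg_cols = [F.avg(F.col(c)).alias(f"kde_{i}") for i, c in enumerate(sdf.columns)]
row = sdf.select(*agg_cols).first()
print(row)
```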

### Why are the changes needed?
The existing implementation computes the curves one column at a time; this PR computes multiple columns together.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI and manually test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47854 from zhengruifeng/plot_parallelize_kde.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/plot/core.py   | 32 ++--
 python/pyspark/pandas/plot/plotly.py | 26 --
 2 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/pandas/plot/core.py 
b/python/pyspark/pandas/plot/core.py
index e5db0bd701f1..c1dc7d2dc621 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -474,7 +474,7 @@ class KdePlotBase(NumericPlotBase):
 return ind
 
 @staticmethod
-def compute_kde(sdf, bw_method=None, ind=None):
+def compute_kde_col(input_col, bw_method=None, ind=None):
 # refers to org.apache.spark.mllib.stat.KernelDensity
 assert bw_method is not None and isinstance(
 bw_method, (int, float)
@@ -497,21 +497,25 @@ class KdePlotBase(NumericPlotBase):
 log_density = -0.5 * x1 * x1 - log_std_plus_half_log2_pi
 return F.exp(log_density)
 
-dataCol = F.col(sdf.columns[0]).cast("double")
-
-estimated = [
-F.avg(
-norm_pdf(
-dataCol,
-F.lit(bandwidth),
-F.lit(log_std_plus_half_log2_pi),
-F.lit(point),
+return F.array(
+[
+F.avg(
+norm_pdf(
+input_col.cast("double"),
+F.lit(bandwidth),
+F.lit(log_std_plus_half_log2_pi),
+F.lit(point),
+)
 )
-)
-for point in points
-]
+for point in points
+]
+)
 
-row = sdf.select(F.array(estimated)).first()
+@staticmethod
+def compute_kde(sdf, bw_method=None, ind=None):
+input_col = F.col(sdf.columns[0])
+kde_col = KdePlotBase.compute_kde_col(input_col, bw_method, 
ind).alias("kde")
+row = sdf.select(kde_col).first()
 return row[0]
 
 
diff --git a/python/pyspark/pandas/plot/plotly.py 
b/python/pyspark/pandas/plot/plotly.py
index d54166a33a0a..4de313b1e831 100644
--- a/python/pyspark/pandas/plot/plotly.py
+++ b/python/pyspark/pandas/plot/plotly.py
@@ -239,22 +239,28 @@ def plot_kde(data: Union["ps.DataFrame", "ps.Series"], 
**kwargs):
 ind = KdePlotBase.get_ind(sdf.select(*data_columns), kwargs.pop("ind", 
None))
 bw_method = kwargs.pop("bw_method", None)
 
-pdfs = []
-for label in psdf._internal.column_labels:
-pdfs.append(
+kde_cols = [
+KdePlotBase.compute_kde_col(
+input_col=psdf._internal.spark_column_for(label),
+ind=ind,
+bw_method=bw_method,
+).alias(f"kde_{i}")
+for i, label in enumerate(psdf._internal.column_labels)
+]
+kde_results = sdf.select(*kde_cols).first()
+
+pdf = pd.concat(
+[
 pd.DataFrame(
 {
-"Density": KdePlotBase.compute_kde(
-sdf.select(psdf._internal.spark_column_for(label)),
-ind=ind,
-bw_method=bw_method,

(spark) branch master updated: [SPARK-49365][PS] Simplify the bucket aggregation in hist plot

2024-08-24 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c18fc072b05 [SPARK-49365][PS] Simplify the bucket aggregation in hist 
plot
0c18fc072b05 is described below

commit 0c18fc072b05671bc9c74a43de49b563a1ef7907
Author: Ruifeng Zheng 
AuthorDate: Sat Aug 24 16:34:48 2024 +0800

[SPARK-49365][PS] Simplify the bucket aggregation in hist plot

### What changes were proposed in this pull request?
Simplify the bucket aggregation in hist plot

### Why are the changes needed?
to simplify the implementation by eliminating the repeated DataFrame unions
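
A sketch of the new shape (illustrative column names): a single `posexplode` over an array of the plotted columns yields `(group_id, value)` pairs in one scan, instead of unioning one DataFrame per column.
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1.0, 10.0), (2.0, float("nan")), (None, 30.0)], ["a", "b"]
)

exploded = (
    sdf.select(
        F.posexplode(F.array(F.col("a").cast("double"), F.col("b").cast("double")))
        .alias("__group_id", "__value")
    )
    # match handleInvalid="skip": drop nulls and NaNs before bucketing
    .where(F.col("__value").isNotNull() & ~F.col("__value").isNaN())
)
exploded.show()
```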

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI and manually check

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47852 from zhengruifeng/plot_parallel_hist.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/plot/core.py | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/python/pyspark/pandas/plot/core.py 
b/python/pyspark/pandas/plot/core.py
index 3ec78100abe9..e5db0bd701f1 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -198,25 +198,18 @@ class HistogramPlotBase(NumericPlotBase):
 idx = bisect.bisect(bins, value) - 1
 return float(idx)
 
-output_df = None
-for group_id, (colname, bucket_name) in enumerate(zip(colnames, 
bucket_names)):
-# sdf.na.drop to match handleInvalid="skip" in Bucketizer
-
-bucket_df = sdf.na.drop(subset=[colname]).withColumn(
-bucket_name,
-binary_search_for_buckets(F.col(colname).cast("double")),
+output_df = (
+sdf.select(
+F.posexplode(
+F.array([F.col(colname).cast("double") for colname in 
colnames])
+).alias("__group_id", "__value")
 )
-
-if output_df is None:
-output_df = bucket_df.select(
-F.lit(group_id).alias("__group_id"), 
F.col(bucket_name).alias("__bucket")
-)
-else:
-output_df = output_df.union(
-bucket_df.select(
-F.lit(group_id).alias("__group_id"), 
F.col(bucket_name).alias("__bucket")
-)
-)
+# to match handleInvalid="skip" in Bucketizer
+.where(F.col("__value").isNotNull() & 
~F.col("__value").isNaN()).select(
+F.col("__group_id"),
+binary_search_for_buckets(F.col("__value")).alias("__bucket"),
+)
+)
 
 # 2. Calculate the count based on each group and bucket.
 # +--+---+--+





(spark) branch master updated: [SPARK-49223][ML] Simplify the StringIndexer.countByValue with builtin functions

2024-08-22 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 596098b5fe61 [SPARK-49223][ML] Simplify the StringIndexer.countByValue 
with builtin functions
596098b5fe61 is described below

commit 596098b5fe61d5f4987d0a77156b7724a1a697f7
Author: Ruifeng Zheng 
AuthorDate: Thu Aug 22 16:21:53 2024 +0800

[SPARK-49223][ML] Simplify the StringIndexer.countByValue with builtin 
functions

### What changes were proposed in this pull request?
Simplify the StringIndexer.countByValue with builtin functions

### Why are the changes needed?
The custom `StringIndexerAggregator` is not necessary here; the counting can be done with built-in SQL functions.
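
The change itself is in Scala, but the counting pattern looks roughly like this PySpark sketch (made-up column names; one pass with `posexplode`, then a grouped count per column):
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "x"), ("b", "x"), ("a", None)], ["c1", "c2"])

counts = (
    df.select(F.posexplode(F.array("c1", "c2")).alias("index", "value"))
    .where(F.col("value").isNotNull())          # null values are not counted
    .groupBy("index", "value")
    .agg(F.count(F.lit(1)).alias("count"))
    .groupBy("index")
    .agg(F.collect_list(F.struct("value", "count")).alias("label_counts"))
)
counts.show(truncate=False)
```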

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47742 from zhengruifeng/sql_gouped_count.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../apache/spark/ml/feature/StringIndexer.scala| 72 +-
 1 file changed, 17 insertions(+), 55 deletions(-)

diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala
index 6c10630e7bb8..72947dc17b8e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala
@@ -27,8 +27,7 @@ import org.apache.spark.ml.attribute.{Attribute, 
NominalAttribute}
 import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util._
-import org.apache.spark.sql.{AnalysisException, Column, DataFrame, Dataset, 
Encoder, Encoders, Row}
-import org.apache.spark.sql.expressions.Aggregator
+import org.apache.spark.sql.{AnalysisException, Column, DataFrame, Dataset, 
Row}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types._
 import org.apache.spark.util.ArrayImplicits._
@@ -201,16 +200,23 @@ class StringIndexer @Since("1.4.0") (
   private def countByValue(
   dataset: Dataset[_],
   inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
-
-val aggregator = new StringIndexerAggregator(inputCols.length)
-implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
-
 val selectedCols = getSelectedCols(dataset, inputCols.toImmutableArraySeq)
-dataset.select(selectedCols: _*)
-  .toDF()
-  .agg(aggregator.toColumn)
-  .as[Array[OpenHashMap[String, Long]]]
-  .collect()(0)
+val results = Array.fill(selectedCols.size)(new OpenHashMap[String, 
Long]())
+dataset.select(posexplode(array(selectedCols: _*)).as(Seq("index", 
"value")))
+  .where(col("value").isNotNull)
+  .groupBy("index", "value")
+  .agg(count(lit(1)).as("count"))
+  .groupBy("index")
+  .agg(collect_list(struct("value", "count")))
+  .collect()
+  .foreach { row =>
+val index = row.getInt(0)
+val result = results(index)
+row.getSeq[Row](1).foreach { case Row(label: String, count: Long) =>
+  result.update(label, count)
+}
+  }
+results
   }
 
   private def sortByFreq(dataset: Dataset[_], ascending: Boolean): 
Array[Array[String]] = {
@@ -642,47 +648,3 @@ object IndexToString extends 
DefaultParamsReadable[IndexToString] {
   @Since("1.6.0")
   override def load(path: String): IndexToString = super.load(path)
 }
-
-/**
- * A SQL `Aggregator` used by `StringIndexer` to count labels in string 
columns during fitting.
- */
-private class StringIndexerAggregator(numColumns: Int)
-  extends Aggregator[Row, Array[OpenHashMap[String, Long]], 
Array[OpenHashMap[String, Long]]] {
-
-  override def zero: Array[OpenHashMap[String, Long]] =
-Array.fill(numColumns)(new OpenHashMap[String, Long]())
-
-  def reduce(
-  array: Array[OpenHashMap[String, Long]],
-  row: Row): Array[OpenHashMap[String, Long]] = {
-for (i <- 0 until numColumns) {
-  val stringValue = row.getString(i)
-  // We don't count for null values.
-  if (stringValue != null) {
-array(i).changeValue(stringValue, 1L, _ + 1)
-  }
-}
-array
-  }
-
-  def merge(
-  array1: Array[OpenHashMap[String, Long]],
-  array2: Array[OpenHashMap[String, Long]]): Array[OpenHashMap[String, 
Long]] = {
-for (i <- 0 until numColumns) {
-  array2(i).foreach { case (key: String, count: Long) =>
-array1(i).changeValue(key, count, _ + count)
-  }
-}
-array1
-  }
-
-  def finish(array: Array[OpenHashMap[String, Long]]): 
Array[OpenHashMap[String, Long]] = array
-
-  override def buffer

(spark) branch master updated: [SPARK-49185][PS][PYTHON][CONNECT] Reimplement `kde` plot with Spark SQL

2024-08-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5c1222511cd1 [SPARK-49185][PS][PYTHON][CONNECT] Reimplement `kde` plot 
with Spark SQL
5c1222511cd1 is described below

commit 5c1222511cd1b713be928535036613ea3e697234
Author: Ruifeng Zheng 
AuthorDate: Mon Aug 12 11:20:21 2024 +0800

[SPARK-49185][PS][PYTHON][CONNECT] Reimplement `kde` plot with Spark SQL

### What changes were proposed in this pull request?
Reimplement kde plot with Spark SQL

### Why are the changes needed?
The existing `kde` plot is not supported with Spark Connect, because it is based on MLlib, which is not compatible with Spark Connect.
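
Roughly, the SQL-based estimate averages a Gaussian pdf centred at every sample for each evaluation point. A minimal sketch (arbitrary bandwidth and evaluation points, not the plotting code itself):
```
import math
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0,), (2.0,), (2.5,), (3.0,)], ["x"])

bandwidth = 0.3
log_norm = math.log(bandwidth) + 0.5 * math.log(2 * math.pi)
points = [1.0, 2.0, 3.0]

def gaussian_pdf(sample, point):
    # log density of N(point, bandwidth) evaluated at the sample
    z = (sample - F.lit(point)) / F.lit(bandwidth)
    return F.exp(-0.5 * z * z - F.lit(log_norm))

kde_col = F.array([F.avg(gaussian_pdf(F.col("x").cast("double"), p)) for p in points])
densities = sdf.select(kde_col.alias("kde")).first()["kde"]
print(densities)
```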

### Does this PR introduce _any_ user-facing change?
yes, the following APIs are enabled:

- `{Frame, Series}.plot.kde`
- `{Frame, Series}.plot.density`
- `{Frame, Series}.plot(kind="kde", ...)`
- `{Frame, Series}.plot(kind="density", ...)`

### How was this patch tested?
1, enabled tests
2, manually check
```
import pyspark.pandas as ps
df = ps.DataFrame({'x': [1, 2, 2.5, 3, 3.5, 4, 5], 'y': [4, 4, 4.5, 5, 5.5, 6, 6],})
df.plot.kde(bw_method=0.3)
```

before (Spark Classic):

![image](https://github.com/user-attachments/assets/01d5c180-e84c-4e31-b5ed-071a8b4d1227)

after (Spark Connect):

![image](https://github.com/user-attachments/assets/472b0fec-2029-4250-a623-1fe345e4bde8)

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47685 from zhengruifeng/reimpl_kde.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/plot/core.py | 62 ++
 .../plot/test_parity_frame_plot_matplotlib.py  |  4 --
 .../connect/plot/test_parity_frame_plot_plotly.py  |  4 --
 .../plot/test_parity_series_plot_matplotlib.py |  4 --
 .../connect/plot/test_parity_series_plot_plotly.py |  4 --
 .../pandas/tests/connect/test_connect_plotting.py  | 48 -
 6 files changed, 40 insertions(+), 86 deletions(-)

diff --git a/python/pyspark/pandas/plot/core.py 
b/python/pyspark/pandas/plot/core.py
index 91e20295ba7c..1924dee5a12e 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -16,13 +16,14 @@
 #
 
 import importlib
+import math
 
 import pandas as pd
 import numpy as np
 from pandas.core.base import PandasObject
 from pandas.core.dtypes.inference import is_integer
 
-from pyspark.sql import functions as F
+from pyspark.sql import functions as F, Column
 from pyspark.sql.utils import is_remote
 from pyspark.pandas.missing import unsupported_function
 from pyspark.pandas.config import get_option
@@ -464,22 +465,44 @@ class KdePlotBase(NumericPlotBase):
 
 @staticmethod
 def compute_kde(sdf, bw_method=None, ind=None):
-from pyspark.mllib.stat import KernelDensity
-
-# 'sdf' is a Spark DataFrame that selects one column.
-
-# Using RDD is slow so we might have to change it to Dataset based 
implementation
-# once Spark has that implementation.
-sample = sdf.rdd.map(lambda x: float(x[0]))
-kd = KernelDensity()
-kd.setSample(sample)
-
-assert isinstance(bw_method, (int, float)), "'bw_method' must be set 
as a scalar number."
+# refers to org.apache.spark.mllib.stat.KernelDensity
+assert bw_method is not None and isinstance(
+bw_method, (int, float)
+), "'bw_method' must be set as a scalar number."
+
+assert ind is not None, "'ind' must be a scalar array."
+
+bandwidth = float(bw_method)
+points = [float(i) for i in ind]
+log_std_plus_half_log2_pi = math.log(bandwidth) + 0.5 * math.log(2 * 
math.pi)
+
+def norm_pdf(
+mean: Column,
+std: Column,
+log_std_plus_half_log2_pi: Column,
+x: Column,
+) -> Column:
+x0 = x - mean
+x1 = x0 / std
+log_density = -0.5 * x1 * x1 - log_std_plus_half_log2_pi
+return F.exp(log_density)
+
+dataCol = F.col(sdf.columns[0]).cast("double")
+
+estimated = [
+F.avg(
+norm_pdf(
+dataCol,
+F.lit(bandwidth),
+F.lit(log_std_plus_half_log2_pi),
+F.lit(point),
+)
+)
+for point in points
+]
 
-if bw_method is not None:
-# Match the bandwidth with Spark.
-kd.setBandwidth(float(bw_method))
-return kd.estimate(list(map(float, ind)))
+r

(spark) branch master updated: [SPARK-49170][BUILD] Upgrade snappy to 1.1.10.6

2024-08-09 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8a0986fc4e25 [SPARK-49170][BUILD] Upgrade snappy to 1.1.10.6
8a0986fc4e25 is described below

commit 8a0986fc4e254d2ffe141d17c68fcf83b69a2cb5
Author: panbingkun 
AuthorDate: Fri Aug 9 17:10:41 2024 +0800

[SPARK-49170][BUILD] Upgrade snappy to 1.1.10.6

### What changes were proposed in this pull request?
This PR upgrades `snappy` from `1.1.10.5` to `1.1.10.6`.

### Why are the changes needed?
Full release notes:
https://github.com/xerial/snappy-java/releases

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47675 from panbingkun/SPARK-49170.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 88ce34b82213..264fc2ac0c3c 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -256,7 +256,7 @@ scala-xml_2.13/2.3.0//scala-xml_2.13-2.3.0.jar
 slf4j-api/2.0.14//slf4j-api-2.0.14.jar
 snakeyaml-engine/2.7//snakeyaml-engine-2.7.jar
 snakeyaml/2.2//snakeyaml-2.2.jar
-snappy-java/1.1.10.5//snappy-java-1.1.10.5.jar
+snappy-java/1.1.10.6//snappy-java-1.1.10.6.jar
 spire-macros_2.13/0.18.0//spire-macros_2.13-0.18.0.jar
 spire-platform_2.13/0.18.0//spire-platform_2.13-0.18.0.jar
 spire-util_2.13/0.18.0//spire-util_2.13-0.18.0.jar
diff --git a/pom.xml b/pom.xml
index 6498c65d9632..54c32e9bb5bf 100644
--- a/pom.xml
+++ b/pom.xml
@@ -184,7 +184,7 @@
 
2.17.2
 2.3.1
 3.0.2
-1.1.10.5
+1.1.10.6
 3.0.3
 1.17.1
 1.26.2





(spark) branch master updated (44a84dd507ac -> e0d435d59162)

2024-08-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 44a84dd507ac [SPARK-47410][SQL][FOLLOWUP] Limit part of StringType API 
to private[sql]
 add e0d435d59162 [SPARK-49172][PYTHON][DOCS] Refine the type hints in 
functions

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions/builtin.py | 34 -
 python/pyspark/sql/functions/builtin.py | 33 
 2 files changed, 33 insertions(+), 34 deletions(-)





(spark) branch master updated: [SPARK-49047][PYTHON][CONNECT] Truncate the message for logging

2024-08-04 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b76f0b9c5d91 [SPARK-49047][PYTHON][CONNECT] Truncate the message for 
logging
b76f0b9c5d91 is described below

commit b76f0b9c5d91651488d54a27e390c1a734b1a42d
Author: Ruifeng Zheng 
AuthorDate: Mon Aug 5 08:23:04 2024 +0800

[SPARK-49047][PYTHON][CONNECT] Truncate the message for logging

### What changes were proposed in this pull request?
Truncate the message for logging, by truncating the bytes and string fields

### Why are the changes needed?
The existing implementation generates excessively large log messages.

### Does this PR introduce _any_ user-facing change?
No, logging only

```
In [7]: df = spark.createDataFrame([('a B c'), ('X y Z'), ], ['abc'])

In [8]: plan = df._plan.to_proto(spark._client)

In [9]: spark._client._proto_to_string(plan, False)
Out[9]: 'root { common { plan_id: 4 } to_df { input { common { plan_id: 3 } 
local_relation { data: 
"\\377\\377\\377\\377p\\000\\000\\000\\020\\000\\000\\000\\000\\000\\n\\000\\014\\000\\006\\000\\005\\000\\010\\000\\n\\000\\000\\000\\000\\001\\004\\000\\014\\000\\000\\000\\010\\000\\010\\000\\000\\000\\004\\000\\010\\000\\000\\000\\004\\000\\000\\000\\001\\000\\000\\000\\024\\000\\000\\000\\020\\000\\024\\000\\010\\000\\006\\000\\007\\000\\014\\000\\000\\000\\020\\000\\020\\000\\000\\
 [...]

In [10]: spark._client._proto_to_string(plan, True)
Out[10]: 'root { common { plan_id: 4 } to_df { input { common { plan_id: 3 
} local_relation { data: "\\377\\377\\377\\377p\\000\\000\\000[truncated]" 
schema: "{\\"fields\\":[{\\"metadata\\":{},\\"name\\"[truncated]" } } 
column_names: "abc" } }'
```

### How was this patch tested?
added UT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47554 from zhengruifeng/py_client_truncate.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/client/core.py  | 69 +++---
 .../sql/tests/connect/test_connect_basic.py| 23 
 2 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/python/pyspark/sql/connect/client/core.py 
b/python/pyspark/sql/connect/client/core.py
index 8992ad1c4310..2acb0d2cb01d 100644
--- a/python/pyspark/sql/connect/client/core.py
+++ b/python/pyspark/sql/connect/client/core.py
@@ -27,6 +27,7 @@ check_dependencies(__name__)
 import logging
 import threading
 import os
+import copy
 import platform
 import urllib.parse
 import uuid
@@ -864,7 +865,7 @@ class SparkConnectClient(object):
 Return given plan as a PyArrow Table iterator.
 """
 if logger.isEnabledFor(logging.INFO):
-logger.info(f"Executing plan {self._proto_to_string(plan)}")
+logger.info(f"Executing plan {self._proto_to_string(plan, True)}")
 req = self._execute_plan_request_with_metadata()
 req.plan.CopyFrom(plan)
 with Progress(handlers=self._progress_handlers, 
operation_id=req.operation_id) as progress:
@@ -881,7 +882,7 @@ class SparkConnectClient(object):
 Return given plan as a PyArrow Table.
 """
 if logger.isEnabledFor(logging.INFO):
-logger.info(f"Executing plan {self._proto_to_string(plan)}")
+logger.info(f"Executing plan {self._proto_to_string(plan, True)}")
 req = self._execute_plan_request_with_metadata()
 req.plan.CopyFrom(plan)
 table, schema, metrics, observed_metrics, _ = 
self._execute_and_fetch(req, observations)
@@ -898,7 +899,7 @@ class SparkConnectClient(object):
 Return given plan as a pandas DataFrame.
 """
 if logger.isEnabledFor(logging.INFO):
-logger.info(f"Executing plan {self._proto_to_string(plan)}")
+logger.info(f"Executing plan {self._proto_to_string(plan, True)}")
 req = self._execute_plan_request_with_metadata()
 req.plan.CopyFrom(plan)
 (self_destruct_conf,) = self.get_config_with_defaults(
@@ -978,7 +979,7 @@ class SparkConnectClient(object):
 pdf.attrs["observed_metrics"] = observed_metrics
 return pdf, ei
 
-def _proto_to_string(self, p: google.protobuf.message.Message) -> str:
+def _proto_to_string(self, p: google.protobuf.message.Message, truncate: 
bool = False) -> str:
 """
 Helper method to generate a one line string representation of the plan.
 
@@ -992,16 +993,62 @@ class SparkConnectClient(object):
 Single 

(spark) branch master updated: [SPARK-48998][ML] Meta algorithms save/load model with SparkSession

2024-07-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5ccf9ba958f4 [SPARK-48998][ML] Meta algorithms save/load model with 
SparkSession
5ccf9ba958f4 is described below

commit 5ccf9ba958f492c1eb4dde22a647ba75aba63d8e
Author: Ruifeng Zheng 
AuthorDate: Fri Jul 26 18:17:52 2024 +0800

[SPARK-48998][ML] Meta algorithms save/load model with SparkSession

### What changes were proposed in this pull request?

1. add overloads of the following helper functions that take a SparkSession:

- SharedReadWrite.saveImpl
- SharedReadWrite.load
- DefaultParamsWriter.getMetadataToSave
- DefaultParamsReader.loadParamsInstance
- DefaultParamsReader.loadParamsInstanceReader

2. deprecate old functions
3. apply the new functions in ML

### Why are the changes needed?
Meta algorithms save/load model with SparkSession

After this PR, all `.ml` implementations save and load models with SparkSession, while the old helper functions taking `sc` remain available (just deprecated) for the ecosystem.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47477 from zhengruifeng/ml_meta_spark.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../main/scala/org/apache/spark/ml/Pipeline.scala  | 40 ++
 .../apache/spark/ml/classification/OneVsRest.scala | 23 +-
 .../org/apache/spark/ml/feature/Imputer.scala  |  2 +-
 .../org/apache/spark/ml/tree/treeModels.scala  |  2 +-
 .../apache/spark/ml/tuning/CrossValidator.scala| 12 +++---
 .../spark/ml/tuning/TrainValidationSplit.scala | 12 +++---
 .../apache/spark/ml/tuning/ValidatorParams.scala   | 16 +++
 .../scala/org/apache/spark/ml/util/ReadWrite.scala | 49 +++---
 8 files changed, 108 insertions(+), 48 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
index 42106372a203..807648545fc6 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
@@ -32,7 +32,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.ml.param.{Param, ParamMap, Params}
 import org.apache.spark.ml.util._
 import org.apache.spark.ml.util.Instrumentation.instrumented
-import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.ArrayImplicits._
 
@@ -204,7 +204,7 @@ object Pipeline extends MLReadable[Pipeline] {
 override def save(path: String): Unit =
   instrumented(_.withSaveInstanceEvent(this, path)(super.save(path)))
 override protected def saveImpl(path: String): Unit =
-  SharedReadWrite.saveImpl(instance, instance.getStages, sc, path)
+  SharedReadWrite.saveImpl(instance, instance.getStages, sparkSession, 
path)
   }
 
   private class PipelineReader extends MLReader[Pipeline] {
@@ -213,7 +213,8 @@ object Pipeline extends MLReadable[Pipeline] {
 private val className = classOf[Pipeline].getName
 
 override def load(path: String): Pipeline = 
instrumented(_.withLoadInstanceEvent(this, path) {
-  val (uid: String, stages: Array[PipelineStage]) = 
SharedReadWrite.load(className, sc, path)
+  val (uid: String, stages: Array[PipelineStage]) =
+SharedReadWrite.load(className, sparkSession, path)
   new Pipeline(uid).setStages(stages)
 })
   }
@@ -241,14 +242,26 @@ object Pipeline extends MLReadable[Pipeline] {
  *  - save metadata to path/metadata
  *  - save stages to stages/IDX_UID
  */
+@deprecated("use saveImpl with SparkSession", "4.0.0")
 def saveImpl(
 instance: Params,
 stages: Array[PipelineStage],
 sc: SparkContext,
+path: String): Unit =
+  saveImpl(
+instance,
+stages,
+SparkSession.builder().sparkContext(sc).getOrCreate(),
+path)
+
+def saveImpl(
+instance: Params,
+stages: Array[PipelineStage],
+spark: SparkSession,
 path: String): Unit = instrumented { instr =>
   val stageUids = stages.map(_.uid)
   val jsonParams = List("stageUids" -> 
parse(compact(render(stageUids.toImmutableArraySeq
-  DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap = 
Some(jsonParams))
+  DefaultParamsWriter.saveMetadata(instance, path, spark, None, 
Some(jsonParams))
 
   // Save stages
   val stagesDir = new Path(path, "stages").toString
@@ -263,18 +276,28 

(spark) branch master updated: [SPARK-48954] try_mod() replaces try_remainder()

2024-07-21 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 15c98e0e5e07 [SPARK-48954] try_mod() replaces try_remainder()
15c98e0e5e07 is described below

commit 15c98e0e5e070d61e32a8eec935488efd9605480
Author: Serge Rielau 
AuthorDate: Sun Jul 21 16:47:46 2024 +0800

[SPARK-48954] try_mod() replaces try_remainder()

### What changes were proposed in this pull request?

For consistency, `try_remainder()` is renamed to `try_mod()`.
This change is Spark 4.0.0 only, so no config flag is needed.
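
A quick usage sketch with the renamed function (like `%`, but returning NULL instead of raising on division by zero):
```
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).select(
    sf.try_mod(sf.lit(10), sf.lit(3)).alias("ok"),           # -> 1
    sf.try_mod(sf.lit(10), sf.lit(0)).alias("div_by_zero"),  # -> NULL
)
df.show()
```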

### Why are the changes needed?

To keep consistent naming.

### Does this PR introduce _any_ user-facing change?

Yes, replaces try_remainder() with try_mod()

### How was this patch tested?

Existing try_remainder() tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47427 from srielau/SPARK-48954-try-mod.

Authored-by: Serge Rielau 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala |  2 +-
 docs/sql-ref-ansi-compliance.md|  2 +-
 .../source/reference/pyspark.sql/functions.rst |  2 +-
 python/pyspark/sql/connect/functions/builtin.py|  6 ++--
 python/pyspark/sql/functions/builtin.py| 32 +++---
 .../sql/tests/connect/test_connect_column.py   |  8 ++
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  2 +-
 .../spark/sql/catalyst/expressions/TryEval.scala   |  4 +--
 .../sql/catalyst/expressions/arithmetic.scala  |  2 +-
 .../sql/catalyst/expressions/TryEvalSuite.scala|  2 +-
 .../scala/org/apache/spark/sql/functions.scala |  2 +-
 .../sql-functions/sql-expression-schema.md |  2 +-
 .../org/apache/spark/sql/MathFunctionsSuite.scala  |  6 ++--
 13 files changed, 34 insertions(+), 38 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index 02b25dd6cbb5..c0bf9c9d013c 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1947,7 +1947,7 @@ object functions {
* @group math_funcs
* @since 4.0.0
*/
-  def try_remainder(left: Column, right: Column): Column = 
Column.fn("try_remainder", left, right)
+  def try_mod(left: Column, right: Column): Column = Column.fn("try_mod", 
left, right)
 
   /**
* Returns `left``*``right` and the result is null on overflow. The 
acceptable input types are
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 54f9fd439548..443bc8409efc 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -374,7 +374,7 @@ When ANSI mode is on, it throws exceptions for invalid 
operations. You can use t
   - `try_subtract`: identical to the add operator `-`, except that it returns 
`NULL` result instead of throwing an exception on integral value overflow.
   - `try_multiply`: identical to the add operator `*`, except that it returns 
`NULL` result instead of throwing an exception on integral value overflow.
   - `try_divide`: identical to the division operator `/`, except that it 
returns `NULL` result instead of throwing an exception on dividing 0.
-  - `try_remainder`: identical to the remainder operator `%`, except that it 
returns `NULL` result instead of throwing an exception on dividing 0.
+  - `try_mod`: identical to the remainder operator `%`, except that it returns 
`NULL` result instead of throwing an exception on dividing 0.
   - `try_sum`: identical to the function `sum`, except that it returns `NULL` 
result instead of throwing an exception on integral/decimal/interval value 
overflow.
   - `try_avg`: identical to the function `avg`, except that it returns `NULL` 
result instead of throwing an exception on decimal/interval value overflow.
   - `try_element_at`: identical to the function `element_at`, except that it 
returns `NULL` result instead of throwing an exception on array's index out of 
bound.
diff --git a/python/docs/source/reference/pyspark.sql/functions.rst 
b/python/docs/source/reference/pyspark.sql/functions.rst
index c7ae525429ca..7585448204f6 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -142,8 +142,8 @@ Mathematical Functions
 tanh
 try_add
 try_divide
+try_mod
 try_multiply
-try_remainder
 try_subtract
 unhex
 width_bucket
diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 7f7ea3c6f45d..

(spark) branch master updated: [SPARK-48892][ML] Avoid per-row param read in `Tokenizer`

2024-07-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3755d51eb5b8 [SPARK-48892][ML] Avoid per-row param read in `Tokenizer`
3755d51eb5b8 is described below

commit 3755d51eb5b8ab17f2e68ff4114aa488e2815fdc
Author: Ruifeng Zheng 
AuthorDate: Wed Jul 17 07:18:18 2024 +0800

[SPARK-48892][ML] Avoid per-row param read in `Tokenizer`

### What changes were proposed in this pull request?
Inspired by https://github.com/apache/spark/pull/47258, I checked other ML implementations and found that `Tokenizer` can be optimized in the same way.

### Why are the changes needed?
The function `createTransformFunc` builds the UDF used by `UnaryTransformer.transform`:

https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L118

The existing implementation reads the params for each row; this PR reads them once when the function is created.
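
The fix itself is in Scala, but the idea is the usual "hoist configuration reads out of the per-row closure" pattern; a rough Python analogue (not the Spark code):
```
import re

class RegexTokenizerLike:
    def __init__(self, pattern=r"\s+", to_lowercase=True, min_token_length=1):
        self.pattern = pattern
        self.to_lowercase = to_lowercase
        self.min_token_length = min_token_length

    def create_transform_func(self):
        # read the "params" once here, not inside the per-row function
        regex = re.compile(self.pattern)
        lower = self.to_lowercase
        min_len = self.min_token_length

        def transform(s):
            s = s.lower() if lower else s
            return [t for t in regex.split(s) if len(t) >= min_len]

        return transform

tokenize = RegexTokenizerLike(pattern="-").create_transform_func()
assert tokenize("A-B-C") == ["a", "b", "c"]
```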

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI and manually tests:

create test dataset
```

spark.range(100).select(uuid().as("uuid")).write.mode("overwrite").parquet("/tmp/regex_tokenizer.parquet")
```

duration
```
val df = spark.read.parquet("/tmp/regex_tokenizer.parquet")
import org.apache.spark.ml.feature._
val tokenizer = new RegexTokenizer().setPattern("-").setInputCol("uuid")
Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()) // warm up
val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => 
tokenizer.transform(df).count()); System.currentTimeMillis - tic
```

result (before this PR)
```
scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => 
tokenizer.transform(df).count()); System.currentTimeMillis - tic
val tic: Long = 1720613235068
val res5: Long = 50397
```

result (after this PR)
```
scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => 
tokenizer.transform(df).count()); System.currentTimeMillis - tic
val tic: Long = 1720612871256
val res5: Long = 43748
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47342 from zhengruifeng/opt_tokenizer.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/ml/feature/Tokenizer.scala | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
index e7b3ff76a8d8..1acbfd781820 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
@@ -141,14 +141,19 @@ class RegexTokenizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
 
   setDefault(minTokenLength -> 1, gaps -> true, pattern -> "\\s+", toLowercase 
-> true)
 
-  override protected def createTransformFunc: String => Seq[String] = { 
originStr =>
+  override protected def createTransformFunc: String => Seq[String] = {
 val re = $(pattern).r
-// scalastyle:off caselocale
-val str = if ($(toLowercase)) originStr.toLowerCase() else originStr
-// scalastyle:on caselocale
-val tokens = if ($(gaps)) re.split(str).toImmutableArraySeq else 
re.findAllIn(str).toSeq
-val minLength = $(minTokenLength)
-tokens.filter(_.length >= minLength)
+val localToLowercase = $(toLowercase)
+val localGaps = $(gaps)
+val localMinTokenLength = $(minTokenLength)
+
+(originStr: String) => {
+  // scalastyle:off caselocale
+  val str = if (localToLowercase) originStr.toLowerCase() else originStr
+  // scalastyle:on caselocale
+  val tokens = if (localGaps) re.split(str).toImmutableArraySeq else 
re.findAllIn(str).toSeq
+  tokens.filter(_.length >= localMinTokenLength)
+}
   }
 
   override protected def validateInputType(inputType: DataType): Unit = {





(spark) branch master updated: [SPARK-48884][PYTHON] Remove unused helper function `PythonSQLUtils.makeInterval`

2024-07-15 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4a44a9ec4a44 [SPARK-48884][PYTHON] Remove unused helper function 
`PythonSQLUtils.makeInterval`
4a44a9ec4a44 is described below

commit 4a44a9ec4a442e49220a1a4ca19858c2babd33bf
Author: Ruifeng Zheng 
AuthorDate: Tue Jul 16 10:31:50 2024 +0800

[SPARK-48884][PYTHON] Remove unused helper function 
`PythonSQLUtils.makeInterval`

### What changes were proposed in this pull request?
Remove unused helper function `PythonSQLUtils.makeInterval`

### Why are the changes needed?
As a followup cleanup of 
https://github.com/apache/spark/commit/bd14d6412a3124eecce1493fcad436280915ba71

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #47330 from zhengruifeng/py_sql_utils_cleanup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../apache/spark/sql/api/python/PythonSQLUtils.scala   | 18 --
 1 file changed, 18 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
index eb8c1d65a8b5..79c5249b3669 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
@@ -20,11 +20,9 @@ package org.apache.spark.sql.api.python
 import java.io.InputStream
 import java.net.Socket
 import java.nio.channels.Channels
-import java.util.Locale
 
 import net.razorvine.pickle.{Pickler, Unpickler}
 
-import org.apache.spark.SparkException
 import org.apache.spark.api.python.DechunkedInputStream
 import org.apache.spark.internal.{Logging, MDC}
 import org.apache.spark.internal.LogKeys.CLASS_LOADER
@@ -149,22 +147,6 @@ private[sql] object PythonSQLUtils extends Logging {
 
   def nullIndex(e: Column): Column = Column(NullIndex(e.expr))
 
-  def makeInterval(unit: String, e: Column): Column = {
-val zero = MakeInterval(years = Literal(0), months = Literal(0), weeks = 
Literal(0),
-  days = Literal(0), hours = Literal(0), mins = Literal(0), secs = 
Literal(0))
-
-unit.toUpperCase(Locale.ROOT) match {
-  case "YEAR" => Column(zero.copy(years = e.expr))
-  case "MONTH" => Column(zero.copy(months = e.expr))
-  case "WEEK" => Column(zero.copy(weeks = e.expr))
-  case "DAY" => Column(zero.copy(days = e.expr))
-  case "HOUR" => Column(zero.copy(hours = e.expr))
-  case "MINUTE" => Column(zero.copy(mins = e.expr))
-  case "SECOND" => Column(zero.copy(secs = e.expr))
-  case _ => throw SparkException.internalError(s"Got the unexpected unit 
'$unit'.")
-}
-  }
-
   def pandasProduct(e: Column, ignoreNA: Boolean): Column = {
 Column(PandasProduct(e.expr, ignoreNA).toAggregateExpression(false))
   }





(spark) branch master updated: [SPARK-48878][PYTHON][DOCS] Add doctests for `options` in json functions

2024-07-12 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 918ca333a900 [SPARK-48878][PYTHON][DOCS] Add doctests for `options` in 
json functions
918ca333a900 is described below

commit 918ca333a900ac999351ee06855f17cc7b7d9ad5
Author: Kent Yao 
AuthorDate: Fri Jul 12 17:52:55 2024 +0800

[SPARK-48878][PYTHON][DOCS] Add doctests for `options` in json functions

### What changes were proposed in this pull request?
Add doctests for `options` in json functions

### Why are the changes needed?
Test coverage: we never tested `options` in `from_json` and `to_json` before.

Since the underlying implementation is new in Spark Connect, we should test it explicitly.

### Does this PR introduce _any_ user-facing change?
doc changes

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47319 from zhengruifeng/from_json_option.

Lead-authored-by: Kent Yao 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions/builtin.py | 42 -
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 0b464aa20710..9e0c0700ae04 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -15793,6 +15793,20 @@ def from_json(
 +-+
 |[1, 2, 3]|
 +-+
+
+Example 6: Parsing JSON with specified options
+
+>>> import pyspark.sql.functions as sf
+>>> df = spark.createDataFrame([(1, '''{a:123}'''), (2, '''{"a":456}''')], 
("key", "value"))
+>>> parsed1 = sf.from_json(df.value, "a INT")
+>>> parsed2 = sf.from_json(df.value, "a INT", {"allowUnquotedFieldNames": 
"true"})
+>>> df.select("value", parsed1, parsed2).show()
++-+++
+|value|from_json(value)|from_json(value)|
++-+++
+|  {a:123}|  {NULL}|   {123}|
+|{"a":456}|   {456}|   {456}|
++-+++
 """
 from pyspark.sql.classic.column import _to_java_column
 
@@ -16113,6 +16127,19 @@ def to_json(col: "ColumnOrName", options: 
Optional[Dict[str, str]] = None) -> Co
 +---+
 |["Alice","Bob"]|
 +---+
+
+Example 6: Converting to JSON with specified options
+
+>>> import pyspark.sql.functions as sf
+>>> df = spark.sql("SELECT (DATE('2022-02-22'), 1) AS date")
+>>> json1 = sf.to_json(df.date)
+>>> json2 = sf.to_json(df.date, {"dateFormat": "/MM/dd"})
+>>> df.select("date", json1, json2).show(truncate=False)
++---------------+------------------------------+------------------------------+
+|date           |to_json(date)                 |to_json(date)                 |
++---------------+------------------------------+------------------------------+
+|{2022-02-22, 1}|{"col1":"2022-02-22","col2":1}|{"col1":"2022/02/22","col2":1}|
++---------------+------------------------------+------------------------------+
 """
 from pyspark.sql.classic.column import _to_java_column
 
@@ -16150,12 +16177,15 @@ def schema_of_json(json: Union[Column, str], options: 
Optional[Dict[str, str]] =
 
 Examples
 
->>> df = spark.range(1)
->>> df.select(schema_of_json(lit('{"a": 0}')).alias("json")).collect()
-[Row(json='STRUCT<a: BIGINT>')]
->>> schema = schema_of_json('{a: 1}', {'allowUnquotedFieldNames':'true'})
->>> df.select(schema.alias("json")).collect()
-[Row(json='STRUCT<a: BIGINT>')]
+>>> import pyspark.sql.functions as sf
+>>> parsed1 = sf.schema_of_json(sf.lit('{"a": 0}'))
+>>> parsed2 = sf.schema_of_json('{a: 1}', {'allowUnquotedFieldNames':'true'})
+>>> spark.range(1).select(parsed1, parsed2).show()
++------------------------+----------------------+
+|schema_of_json({"a": 0})|schema_of_json({a: 1})|
++------------------------+----------------------+
+|       STRUCT<a: BIGINT>|     STRUCT<a: BIGINT>|
++------------------------+----------------------+
 """
 from pyspark.sql.classic.column import _create_column_from_literal, 
_to_java_column
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48842][DOCS] Document non-determinism of max_by and min_by

2024-07-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5bbe9c850aaa [SPARK-48842][DOCS] Document non-determinism of max_by 
and min_by
5bbe9c850aaa is described below

commit 5bbe9c850aaaf31327b81d893ed513033a129e08
Author: Ruifeng Zheng 
AuthorDate: Fri Jul 12 12:41:07 2024 +0800

[SPARK-48842][DOCS] Document non-determinism of max_by and min_by

### What changes were proposed in this pull request?
Document non-determinism of max_by and min_by

### Why are the changes needed?
I have been confused by this non-determinism twice; it looked like a correctness bug to me.
So I think we need to document it.
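
A minimal sketch of the ambiguity being documented (assuming an active `spark` session;
the data is illustrative):

```
from pyspark.sql import functions as sf

# Both rows tie on the ordering column, so max_by may return either 'a' or 'b'
# depending on how the aggregation is executed; the result is not deterministic.
df = spark.createDataFrame([("a", 10), ("b", 10)], ["x", "ord"])
df.select(sf.max_by("x", "ord")).show()
```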

### Does this PR introduce _any_ user-facing change?
doc change only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47266 from zhengruifeng/py_doc_max_by.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 R/pkg/R/functions.R|  6 ++
 .../jvm/src/main/scala/org/apache/spark/sql/functions.scala|  8 
 python/pyspark/sql/functions/builtin.py| 10 ++
 .../sql/catalyst/expressions/aggregate/MaxByAndMinBy.scala |  8 
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala   |  6 ++
 5 files changed, 38 insertions(+)

diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index a7e337d3f9af..b91124f96a6f 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -1558,6 +1558,9 @@ setMethod("max",
 #' @details
 #' \code{max_by}: Returns the value associated with the maximum value of ord.
 #'
+#' Note: The function is non-deterministic so the output order can be different
+#' for those associated the same values of `x`.
+#'
 #' @rdname column_aggregate_functions
 #' @aliases max_by max_by,Column-method
 #' @note max_by since 3.3.0
@@ -1633,6 +1636,9 @@ setMethod("min",
 #' @details
 #' \code{min_by}: Returns the value associated with the minimum value of ord.
 #'
+#' Note: The function is non-deterministic so the output order can be different
+#' for those associated the same values of `x`.
+#'
 #' @rdname column_aggregate_functions
 #' @aliases min_by min_by,Column-method
 #' @note min_by since 3.3.0
diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index 92e7bc9da590..81f25b3d743f 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -884,6 +884,10 @@ object functions {
   /**
* Aggregate function: returns the value associated with the maximum value 
of ord.
*
+   * @note
+   *   The function is non-deterministic so the output order can be different 
for those associated
+   *   the same values of `e`.
+   *
* @group agg_funcs
* @since 3.4.0
*/
@@ -932,6 +936,10 @@ object functions {
   /**
* Aggregate function: returns the value associated with the minimum value 
of ord.
*
+   * @note
+   *   The function is non-deterministic so the output order can be different 
for those associated
+   *   the same values of `e`.
+   *
* @group agg_funcs
* @since 3.4.0
*/
diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 1ca522313f24..446ff2b1be93 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -1271,6 +1271,11 @@ def max_by(col: "ColumnOrName", ord: "ColumnOrName") -> 
Column:
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
+Notes
+-
+The function is non-deterministic so the output order can be different for 
those
+associated the same values of `col`.
+
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
@@ -1352,6 +1357,11 @@ def min_by(col: "ColumnOrName", ord: "ColumnOrName") -> 
Column:
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
+Notes
+-
+The function is non-deterministic so the output order can be different for 
those
+associated the same values of `col`.
+
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/MaxByAndMinBy.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/MaxByAndMinBy.scala
index 56941c9de451..b33142ed2

(spark) branch master updated: [SPARK-48822][DOCS] Add examples section header to `format_number` docstring

2024-07-09 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f76bc3a843f4 [SPARK-48822][DOCS] Add examples section header to 
`format_number` docstring
f76bc3a843f4 is described below

commit f76bc3a843f4216588a29ff14cafdd870fd0254c
Author: thomas.hart 
AuthorDate: Wed Jul 10 12:14:05 2024 +0800

[SPARK-48822][DOCS] Add examples section header to `format_number` docstring

### What changes were proposed in this pull request?
This PR adds an "Examples" section header to the `format_number` docstring.

### Why are the changes needed?
To improve the documentation.

### Does this PR introduce any user-facing change?
No changes in behavior are introduced.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47237 from thomhart31/docs-format_number.

Lead-authored-by: thomas.hart 
Co-authored-by: Thomas Hart 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions/builtin.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 0ff830e8d48d..6fd8fdfec8ea 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -10654,6 +10654,8 @@ def format_number(col: "ColumnOrName", d: int) -> 
Column:
 :class:`~pyspark.sql.Column`
 the column of formatted results.
 
+Examples
+
 >>> spark.createDataFrame([(5,)], ['a']).select(format_number('a', 
4).alias('v')).collect()
 [Row(v='5.')]
 """


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (65daff55f556 -> f0cc86d04aeb)

2024-07-09 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 65daff55f556 [SPARK-48826][BUILD] Upgrade `fasterxml.jackson` to 2.17.2
 add f0cc86d04aeb [SPARK-48840][INFRA] Remove unnecessary existence check 
for `./dev/free_disk_space_container`

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 20 
 1 file changed, 4 insertions(+), 16 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-connect-go) branch master updated: [SPARK-48754] Address comments (#31)

2024-07-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new e0cde67  [SPARK-48754] Address comments (#31)
e0cde67 is described below

commit e0cde671095f499881aff688224910238b860c9f
Author: Martin Grund 
AuthorDate: Tue Jul 9 04:31:54 2024 +0200

[SPARK-48754] Address comments (#31)
---
 spark/client/channel/channel.go   | 10 +-
 spark/sql/session/sparksession.go |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/spark/client/channel/channel.go b/spark/client/channel/channel.go
index d0b0394..3d2e246 100644
--- a/spark/client/channel/channel.go
+++ b/spark/client/channel/channel.go
@@ -62,23 +62,23 @@ type BaseBuilder struct {
headers map[string]string
 }
 
-func (cb BaseBuilder) Host() string {
+func (cb *BaseBuilder) Host() string {
return cb.host
 }
 
-func (cb BaseBuilder) Port() int {
+func (cb *BaseBuilder) Port() int {
return cb.port
 }
 
-func (cb BaseBuilder) Token() string {
+func (cb *BaseBuilder) Token() string {
return cb.token
 }
 
-func (cb BaseBuilder) User() string {
+func (cb *BaseBuilder) User() string {
return cb.user
 }
 
-func (cb BaseBuilder) Headers() map[string]string {
+func (cb *BaseBuilder) Headers() map[string]string {
return cb.headers
 }
 
diff --git a/spark/sql/session/sparksession.go 
b/spark/sql/session/sparksession.go
index 8a45fb0..a68b69b 100644
--- a/spark/sql/session/sparksession.go
+++ b/spark/sql/session/sparksession.go
@@ -51,7 +51,7 @@ func (s *SparkSessionBuilder) Remote(connectionString string) 
*SparkSessionBuild
return s
 }
 
-func (s *SparkSessionBuilder) ChannelBuilder(cb channel.Builder) 
*SparkSessionBuilder {
+func (s *SparkSessionBuilder) WithChannelBuilder(cb channel.Builder) 
*SparkSessionBuilder {
s.channelBuilder = cb
return s
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark-connect-go) branch master updated: [SPARK-48777][BUILD] Properly lint, vet and check for license headers. (#32)

2024-07-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new a1b4f12  [SPARK-48777][BUILD] Properly lint, vet and check for license 
headers. (#32)
a1b4f12 is described below

commit a1b4f12ed6b2dedf0d752118b0ed9f96f5a3fa2c
Author: Martin Grund 
AuthorDate: Tue Jul 9 04:31:36 2024 +0200

[SPARK-48777][BUILD] Properly lint, vet and check for license headers. (#32)

* [SPARK-48777][BUILD] Making sure that style, format, and license headers 
are present

* add wf

* adding missing files

* comments
---
 .github/workflows/build.yml   | 14 +++-
 .gitignore|  5 +-
 .gitignore => .golangci.yml   | 15 +---
 CONTRIBUTING.md   | 18 -
 Makefile  |  9 ++-
 cmd/spark-connect-example-raw-grpc-client/main.go |  3 +-
 cmd/spark-connect-example-spark-session/main.go   |  6 +-
 dev/.rat-excludes | 15 
 dev/check-license | 86 +++
 spark/client/channel/channel.go   |  8 +--
 spark/client/channel/channel_test.go  |  4 +-
 spark/client/channel/compat.go| 15 
 spark/mocks/mocks.go  | 16 +
 spark/sparkerrors/errors_test.go  | 15 
 spark/sql/dataframe.go|  3 +-
 spark/sql/dataframe_test.go   |  6 +-
 spark/sql/dataframereader.go  | 16 +
 spark/sql/dataframereader_test.go | 15 
 spark/sql/dataframewriter.go  | 16 +
 spark/sql/dataframewriter_test.go | 15 
 spark/sql/executeplanclient.go| 16 +
 spark/sql/mocks_test.go   | 16 +
 spark/sql/plan_test.go| 16 +
 spark/sql/row_test.go | 16 +
 spark/sql/session/sparksession_test.go| 30 ++--
 spark/sql/utils/check.go  | 23 ++
 26 files changed, 380 insertions(+), 37 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index dc0eade..877d768 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -31,6 +31,14 @@ on:
 branches:
   - master
 
+permissions:
+  # Required: allow read access to the content for analysis.
+  contents: read
+  # Optional: allow read access to pull request. Use with `only-new-issues` 
option.
+  pull-requests: read
+  # Optional: allow write access to checks to allow the action to annotate 
code in the PR.
+  checks: write
+
 
 jobs:
   build:
@@ -59,4 +67,8 @@ jobs:
   go mod download -x
   make gen
   make
-  make test
\ No newline at end of file
+  make test
+  - name: golangci-lint
+uses: golangci/golangci-lint-action@v6
+with:
+  version: v1.59
diff --git a/.gitignore b/.gitignore
index e76d6f0..8381e8d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -26,4 +26,7 @@ coverage*
 
 # Ignore binaries
 cmd/spark-connect-example-raw-grpc-client/spark-connect-example-raw-grpc-client
-cmd/spark-connect-example-spark-session/spark-connect-example-spark-session
\ No newline at end of file
+cmd/spark-connect-example-spark-session/spark-connect-example-spark-session
+
+target
+lib
\ No newline at end of file
diff --git a/.gitignore b/.golangci.yml
similarity index 73%
copy from .gitignore
copy to .golangci.yml
index e76d6f0..05a64f5 100644
--- a/.gitignore
+++ b/.golangci.yml
@@ -15,15 +15,6 @@
 # limitations under the License.
 #
 
-# All generated files
-internal/generated.out
-
-# Ignore Coverage Files
-coverage*
-
-# Ignore IDE files
-.idea/
-
-# Ignore binaries
-cmd/spark-connect-example-raw-grpc-client/spark-connect-example-raw-grpc-client
-cmd/spark-connect-example-spark-session/spark-connect-example-spark-session
\ No newline at end of file
+linters:
+  enable:
+- gofumpt
\ No newline at end of file
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 4e5a578..995f799 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -13,4 +13,20 @@ When you contribute code, you affirm that the contribution 
is your original work
 license the work to the project under the project's open source license. 
Whether or not you
 state this explicitly, by submitting any copyrighted material via pull 
request, email, or
 other means you agree to license the material under the project's open source 
license and
-warrant that you have the legal authority to do so.
\ No newline at end of file
+warrant that you have the legal authority to do so.
+
+
+### Code Style and Checks
+
+Wh

(spark) branch master updated (b062d4436f2b -> 3c3b1129fb6b)

2024-07-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b062d4436f2b [SPARK-48798][PYTHON] Introduce `spark.profile.render` 
for SparkSession-based profiling
 add 3c3b1129fb6b [MINOR][PYTHON] Eliminating warnings for panda

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/datetimes.py | 4 ++--
 python/pyspark/pandas/frame.py | 6 +++---
 python/pyspark/pandas/generic.py   | 2 +-
 python/pyspark/pandas/indexes/datetimes.py | 2 +-
 python/pyspark/pandas/namespace.py | 2 +-
 python/pyspark/pandas/plot/core.py | 4 ++--
 python/pyspark/pandas/spark/accessors.py   | 4 ++--
 7 files changed, 12 insertions(+), 12 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48825][DOCS] Unify the 'See Also' section formatting across PySpark docstrings

2024-07-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cd1d687ac0e6 [SPARK-48825][DOCS] Unify the 'See Also' section 
formatting across PySpark docstrings
cd1d687ac0e6 is described below

commit cd1d687ac0e6740a504bc15673d827ae9f1cd1f1
Author: allisonwang-db 
AuthorDate: Mon Jul 8 18:31:11 2024 +0800

[SPARK-48825][DOCS] Unify the 'See Also' section formatting across PySpark 
docstrings

### What changes were proposed in this pull request?

This PR unifies the 'See Also' section formatting across PySpark docstrings 
and fixes some invalid references.

### Why are the changes needed?

To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doctest

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47240 from allisonwang-db/spark-48825-also-see-docs.

Authored-by: allisonwang-db 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/dataframe.py | 17 +
 python/pyspark/sql/functions/builtin.py | 24 
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 8d16604879bf..d31d8fa85ea1 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1887,7 +1887,7 @@ class DataFrame:
 
 See Also
 
-DataFrame.dropDuplicates
+DataFrame.dropDuplicates : Remove duplicate rows from this DataFrame.
 
 Examples
 
@@ -2951,7 +2951,7 @@ class DataFrame:
 
 See Also
 
-DataFrame.summary
+DataFrame.summary : Computes summary statistics for numeric and string 
columns.
 """
 ...
 
@@ -3022,7 +3022,7 @@ class DataFrame:
 
 See Also
 
-DataFrame.display
+DataFrame.describe : Computes basic statistics for numeric and string 
columns.
 """
 ...
 
@@ -3790,7 +3790,7 @@ class DataFrame:
 self, groupingSets: Sequence[Sequence["ColumnOrName"]], *cols: 
"ColumnOrName"
 ) -> "GroupedData":
 """
-Create multi-dimensional aggregation for the current `class`:DataFrame 
using the specified
+Create multi-dimensional aggregation for the current 
:class:`DataFrame` using the specified
 grouping sets, so we can run aggregation on them.
 
 .. versionadded:: 4.0.0
@@ -3873,7 +3873,7 @@ class DataFrame:
 
 See Also
 
-GroupedData
+DataFrame.rollup : Compute hierarchical summaries at multiple levels.
 """
 ...
 
@@ -5420,7 +5420,7 @@ class DataFrame:
 
 See Also
 
-:meth:`withColumnsRenamed`
+DataFrame.withColumnsRenamed
 
 Examples
 
@@ -5480,7 +5480,7 @@ class DataFrame:
 
 See Also
 
-:meth:`withColumnRenamed`
+DataFrame.withColumnRenamed
 
 Examples
 
@@ -6183,6 +6183,7 @@ class DataFrame:
 See Also
 
 pyspark.sql.functions.pandas_udf
+DataFrame.mapInArrow
 """
 ...
 
@@ -6259,7 +6260,7 @@ class DataFrame:
 See Also
 
 pyspark.sql.functions.pandas_udf
-pyspark.sql.DataFrame.mapInPandas
+DataFrame.mapInPandas
 """
 ...
 
diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index a2f4523a3f24..1508b042b61a 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -14040,8 +14040,8 @@ def element_at(col: "ColumnOrName", extraction: Any) -> 
Column:
 
 See Also
 
-:meth:`get`
-:meth:`try_element_at`
+:meth:`pyspark.sql.functions.get`
+:meth:`pyspark.sql.functions.try_element_at`
 
 Examples
 
@@ -14131,8 +14131,8 @@ def try_element_at(col: "ColumnOrName", extraction: 
"ColumnOrName") -> Column:
 
 See Also
 
-:meth:`get`
-:meth:`element_at`
+:meth:`pyspark.sql.functions.get`
+:meth:`pyspark.sql.functions.element_at`
 
 Examples
 
@@ -14233,7 +14233,7 @@ def get(col: "ColumnOrName", index: 
Union["ColumnOrName", int]) -> Column:
 
 See Also
 
-:meth:`element_at`
+:meth:`pyspark.sql.functions.element_at`
 
 Examples
 
@@ -15153,9 +15153,9 @@ def exp

(spark) branch master updated (0b0bf4f424c0 -> 30055f7059b5)

2024-07-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0b0bf4f424c0 [SPARK-48810][CONNECT] Session stop() API should be 
idempotent and not fail if the session is already closed by the server
 add 30055f7059b5 [SPARK-48818][PYTHON] Simplify `percentile` functions

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions/builtin.py | 18 ++
 python/pyspark/sql/functions/builtin.py | 75 ++---
 2 files changed, 12 insertions(+), 81 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48766][PYTHON] Document the behavior difference of `extraction` between `element_at` and `try_element_at`

2024-07-01 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5ac7c9bdb6ca [SPARK-48766][PYTHON] Document the behavior difference of 
`extraction` between `element_at` and `try_element_at`
5ac7c9bdb6ca is described below

commit 5ac7c9bdb6ca572f80ecda0d4c97856402a7754b
Author: Ruifeng Zheng 
AuthorDate: Tue Jul 2 07:40:38 2024 +0800

[SPARK-48766][PYTHON] Document the behavior difference of `extraction` 
between `element_at` and `try_element_at`

### What changes were proposed in this pull request?
Document the behavior difference of `extraction` between `element_at` and 
`try_element_at`

### Why are the changes needed?
When the function `try_element_at` was introduced in 3.5, its `extraction` handling was
unintentionally inconsistent with that of `element_at`, which causes confusion.

This PR documents the behavior difference (I don't think we can fix it, since that would
be a breaking change).
```
In [1]: from pyspark.sql import functions as sf

In [2]: df = spark.createDataFrame([({"a": 1.0, "b": 2.0}, "a")], ['data', 
'b'])

In [3]: df.select(sf.try_element_at(df.data, 'b')).show()
+-----------------------+
|try_element_at(data, b)|
+-----------------------+
|                    1.0|
+-----------------------+

In [4]: df.select(sf.element_at(df.data, 'b')).show()
+-------------------+
|element_at(data, b)|
+-------------------+
|                2.0|
+-------------------+
```
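
If literal-key behavior is wanted with `try_element_at`, one workaround (an illustrative
sketch, not part of this commit) is to wrap the key in `lit` so it is not resolved as a
column name:

```
from pyspark.sql import functions as sf

df = spark.createDataFrame([({"a": 1.0, "b": 2.0}, "a")], ["data", "b"])
# lit('b') is a literal key, so this looks up 'b' in the map and should return 2.0,
# matching element_at's literal-string behavior.
df.select(sf.try_element_at(df.data, sf.lit("b"))).show()
```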

### Does this PR introduce _any_ user-facing change?
doc changes

### How was this patch tested?
ci, added doctests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47161 from zhengruifeng/doc_element_at_extraction.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions/builtin.py | 36 +
 1 file changed, 36 insertions(+)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index d82927b7af04..2a302d1e5112 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -14098,10 +14098,13 @@ def element_at(col: "ColumnOrName", extraction: Any) 
-> Column:
 Notes
 -
 The position is not zero based, but 1 based index.
+If extraction is a string, :meth:`element_at` treats it as a literal 
string,
+while :meth:`try_element_at` treats it as a column name.
 
 See Also
 
 :meth:`get`
+:meth:`try_element_at`
 
 Examples
 
@@ -14148,6 +14151,17 @@ def element_at(col: "ColumnOrName", extraction: Any) 
-> Column:
 +---+
 |   NULL|
 +---+
+
+Example 5: Getting a value from a map using a literal string as the key
+
+>>> from pyspark.sql import functions as sf
+>>> df = spark.createDataFrame([({"a": 1.0, "b": 2.0}, "a")], ['data', 
'b'])
+>>> df.select(sf.element_at(df.data, 'b')).show()
++-------------------+
+|element_at(data, b)|
++-------------------+
+|                2.0|
++-------------------+
 """
 return _invoke_function_over_columns("element_at", col, lit(extraction))
 
@@ -14172,6 +14186,17 @@ def try_element_at(col: "ColumnOrName", extraction: 
"ColumnOrName") -> Column:
 extraction :
 index to check for in array or key to check for in map
 
+Notes
+-
+The position is not zero based, but 1 based index.
+If extraction is a string, :meth:`try_element_at` treats it as a column 
name,
+while :meth:`element_at` treats it as a literal string.
+
+See Also
+
+:meth:`get`
+:meth:`element_at`
+
 Examples
 
 Example 1: Getting the first element of an array
@@ -14228,6 +14253,17 @@ def try_element_at(col: "ColumnOrName", extraction: 
"ColumnOrName") -> Column:
 +---+
 |   NULL|
 +---+
+
+Example 6: Getting a value from a map using a column name as the key
+
+>>> from pyspark.sql import functions as sf
+>>> df = spark.createDataFrame([({"a": 1.0, "b": 2.0}, "a")], ['data', 
'b'])
+>>> df.select(sf.try_element_at(df.data, 'b')).show()
++-----------------------+
+|try_element_at(data, b)|
++-----------------------+
+|                    1.0|
++-----------------------+
 """
 return _invoke_function_over_columns("try_element_at", col, extraction)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48768][PYTHON][CONNECT] Should not cache `explain`

2024-07-01 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5c29d8d505a9 [SPARK-48768][PYTHON][CONNECT] Should not cache `explain`
5c29d8d505a9 is described below

commit 5c29d8d505a9167099c7113af58dca8fe09d2323
Author: Ruifeng Zheng 
AuthorDate: Tue Jul 2 07:37:00 2024 +0800

[SPARK-48768][PYTHON][CONNECT] Should not cache `explain`

### What changes were proposed in this pull request?
Should not cache `explain`

### Why are the changes needed?
The plans can be affected by `DataFrame.cache`, so the `explain` output should not be memoized.
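
A rough illustration of why memoizing `explain` would go stale (a sketch assuming an
active `spark` session; the exact plan text varies by version):

```
df = spark.range(10).selectExpr("id * 2 AS v")

df.explain()   # physical plan before caching
df.cache()
df.explain()   # the plan now typically contains InMemoryRelation/InMemoryTableScan;
               # a cached explain() would still print the old plan
```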

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47163 from zhengruifeng/should_not_cache_explain.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 1aa8fc00cfcc..46698c2530ea 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1975,7 +1975,6 @@ class DataFrame(ParentDataFrame):
 query = self._plan.to_proto(self._session.client)
 return self._session.client.explain_string(query, explain_mode)
 
-@functools.cache
 def explain(
 self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] 
= None
 ) -> None:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (f70ce135ba1e -> 399980edaa81)

2024-06-30 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f70ce135ba1e [SPARK-48638][INFRA][FOLLOW-UP] Add graphviz into CI to 
run the related tests
 add 399980edaa81 [MINOR][DOCS] Fix the type hints of `functions.first(..., 
ignorenulls)` and `functions.last(..., ignorenulls)`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/functions/builtin.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48555][PYTHON][FOLLOW-UP] Simplify the support of `Any` parameters

2024-06-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 58d1a892faf8 [SPARK-48555][PYTHON][FOLLOW-UP] Simplify the support of 
`Any` parameters
58d1a892faf8 is described below

commit 58d1a892faf87939edd85c5dc39a96db95813dde
Author: Ruifeng Zheng 
AuthorDate: Thu Jun 27 12:37:19 2024 +0800

[SPARK-48555][PYTHON][FOLLOW-UP] Simplify the support of `Any` parameters

### What changes were proposed in this pull request?
Simplify the support of column type `Any`

### Why are the changes needed?
I checked all the `Any` parameters, and all of them support the Column type now,
but there are two kinds of implementations. The approach of `array_append` is much
simpler, so this PR unifies the implementations on it:
```
@_try_remote_functions
def array_append(col: "ColumnOrName", value: Any) -> Column:
return _invoke_function_over_columns("array_append", col, lit(value))
```
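
As a usage-level illustration of why `lit(value)` covers both cases (a sketch assuming an
active `spark` session): `lit` wraps a plain Python value into a literal Column and passes
an existing Column through unchanged, so one code path handles both.

```
from pyspark.sql import functions as sf

df = spark.createDataFrame([([1, 2], 3)], ["arr", "x"])
df.select(
    sf.array_append(df.arr, 5),     # plain value: lit(5) builds a literal Column
    sf.array_append(df.arr, df.x),  # Column: lit() passes it through unchanged
).show()
```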

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47110 from zhengruifeng/py_func_any.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions/builtin.py | 34 -
 1 file changed, 12 insertions(+), 22 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index ed66ca8684ef..b496cdaf0955 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -10938,11 +10938,15 @@ def substring(
 target column to work on.
 pos : :class:`~pyspark.sql.Column` or str or int
 starting position in str.
+
+.. versionchanged:: 4.0.0
+`pos` now accepts column and column name.
+
 len : :class:`~pyspark.sql.Column` or str or int
 length of chars.
 
 .. versionchanged:: 4.0.0
-`pos` and `len` now also accept Columns or names of Columns.
+`len` now accepts column and column name.
 
 Returns
 ---
@@ -10962,11 +10966,9 @@ def substring(
 >>> df.select(substring(df.s, df.p, df.l).alias('s')).collect()
 [Row(s='par')]
 """
-from pyspark.sql.classic.column import _to_java_column
-
-pos = _to_java_column(lit(pos) if isinstance(pos, int) else pos)
-len = _to_java_column(lit(len) if isinstance(len, int) else len)
-return _invoke_function("substring", _to_java_column(str), pos, len)
+pos = lit(pos) if isinstance(pos, int) else pos
+len = lit(len) if isinstance(len, int) else len
+return _invoke_function_over_columns("substring", str, pos, len)
 
 
 @_try_remote_functions
@@ -13618,10 +13620,7 @@ def array_contains(col: "ColumnOrName", value: Any) -> 
Column:
 |  true|
 +--+
 """
-from pyspark.sql.classic.column import _to_java_column
-
-value = value._jc if isinstance(value, Column) else value
-return _invoke_function("array_contains", _to_java_column(col), value)
+return _invoke_function_over_columns("array_contains", col, lit(value))
 
 
 @_try_remote_functions
@@ -14064,10 +14063,7 @@ def array_position(col: "ColumnOrName", value: Any) -> 
Column:
 +-+
 
 """
-from pyspark.sql.classic.column import _to_java_column
-
-value = _to_java_column(value) if isinstance(value, Column) else value
-return _invoke_function("array_position", _to_java_column(col), value)
+return _invoke_function_over_columns("array_position", col, lit(value))
 
 
 @_try_remote_functions
@@ -14515,10 +14511,7 @@ def array_remove(col: "ColumnOrName", element: Any) -> 
Column:
 | [2, 3]|
 +---+
 """
-from pyspark.sql.classic.column import _to_java_column
-
-element = _to_java_column(element) if isinstance(element, Column) else 
element
-return _invoke_function("array_remove", _to_java_column(col), element)
+return _invoke_function_over_columns("array_remove", col, lit(element))
 
 
 @_try_remote_functions
@@ -17327,10 +17320,7 @@ def map_contains_key(col: "ColumnOrName", value: Any) 
-> Column:
 |   true|
 +---+
 """
-from pyspark.sql.classic.column import _to_java_column
-
-value = _to_java_column(value) if isinstance(value, Column) else value
-return _invoke_function("map_contains_key", _to_java_colum

(spark) branch master updated: [SPARK-48695][PYTHON] `TimestampNTZType.fromInternal` not use the deprecated methods

2024-06-24 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8e02a6493ef5 [SPARK-48695][PYTHON] `TimestampNTZType.fromInternal` not 
use the deprecated methods
8e02a6493ef5 is described below

commit 8e02a6493ef5dc5949e161179a7c081c5ca58ff2
Author: Ruifeng Zheng 
AuthorDate: Mon Jun 24 20:15:26 2024 +0800

[SPARK-48695][PYTHON] `TimestampNTZType.fromInternal` not use the 
deprecated methods

### What changes were proposed in this pull request?
`TimestampNTZType.fromInternal` not use the deprecated methods

### Why are the changes needed?
```
In [2]: ts = 

In [3]: datetime.datetime.utcfromtimestamp(ts // 1000000).replace(
   ...: microsecond=ts % 1000000
   ...: )
:1: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects to represent
datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
  datetime.datetime.utcfromtimestamp(ts // 1000000).replace(
Out[3]: datetime.datetime(1970, 1, 2, 6, 51, 51, 11)
```
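
The replacement used in the patch, shown standalone (plain Python; the timestamp value is
illustrative, chosen to be consistent with the Out[3] above):

```
import datetime

ts = 111111000011  # microseconds since the epoch

# Deprecated form:
#   datetime.datetime.utcfromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

# New form: timezone-aware conversion, then drop tzinfo to keep a naive NTZ value.
dt = datetime.datetime.fromtimestamp(ts // 1000000, datetime.timezone.utc).replace(
    microsecond=ts % 1000000, tzinfo=None
)
print(dt)  # 1970-01-02 06:51:51.000011
```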

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47068 from zhengruifeng/fix_ntz_conversion.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_serde.py | 8 
 python/pyspark/sql/types.py| 4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/tests/test_serde.py 
b/python/pyspark/sql/tests/test_serde.py
index ef8bbd2c370f..01cf3c51d7de 100644
--- a/python/pyspark/sql/tests/test_serde.py
+++ b/python/pyspark/sql/tests/test_serde.py
@@ -95,6 +95,14 @@ class SerdeTestsMixin:
 self.assertEqual(now, now1)
 self.assertEqual(now, utcnow1)
 
+def test_ntz_from_internal(self):
+for ts in [1, 22, 333, , 55]:
+t1 = datetime.datetime.utcfromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+t2 = datetime.datetime.fromtimestamp(ts // 1000000, datetime.timezone.utc).replace(
+microsecond=ts % 1000000, tzinfo=None
+)
+self.assertEqual(t1, t2)
+
 # regression test for SPARK-19561
 def test_datetime_at_epoch(self):
 epoch = datetime.datetime.fromtimestamp(0)
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 69074a17ca6c..d2adc53a3618 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -434,8 +434,8 @@ class TimestampNTZType(AtomicType, 
metaclass=DataTypeSingleton):
 def fromInternal(self, ts: int) -> datetime.datetime:
 if ts is not None:
 # using int to avoid precision loss in float
-return datetime.datetime.utcfromtimestamp(ts // 1000000).replace(
-microsecond=ts % 1000000
+return datetime.datetime.fromtimestamp(ts // 1000000, datetime.timezone.utc).replace(
+microsecond=ts % 1000000, tzinfo=None
 )
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48620][PYTHON][FOLLOW-UP] Correct the error message for `CalendarIntervalType`

2024-06-21 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c8d75c1f9f61 [SPARK-48620][PYTHON][FOLLOW-UP] Correct the error 
message for `CalendarIntervalType`
c8d75c1f9f61 is described below

commit c8d75c1f9f610be72e1052116b50abc6107e1dd4
Author: Ruifeng Zheng 
AuthorDate: Sat Jun 22 11:25:45 2024 +0800

[SPARK-48620][PYTHON][FOLLOW-UP] Correct the error message for 
`CalendarIntervalType`

### What changes were proposed in this pull request?
Correct the error message for `CalendarIntervalType`

### Why are the changes needed?
the message is incorrect

### Does this PR introduce _any_ user-facing change?
no, this error was just added, not yet released

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47041 from zhengruifeng/fail_interval_followup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/types.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index b7b0a977ec08..69074a17ca6c 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -675,13 +675,13 @@ class CalendarIntervalType(DataType, 
metaclass=DataTypeSingleton):
 def toInternal(self, obj: Any) -> Any:
 raise PySparkNotImplementedError(
 error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": "YearMonthIntervalType.toInternal"},
+message_parameters={"feature": "CalendarIntervalType.toInternal"},
 )
 
 def fromInternal(self, obj: Any) -> Any:
 raise PySparkNotImplementedError(
 error_class="NOT_IMPLEMENTED",
-message_parameters={"feature": 
"YearMonthIntervalType.fromInternal"},
+message_parameters={"feature": 
"CalendarIntervalType.fromInternal"},
 )
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48630][INFRA] Make `merge_spark_pr` keep the format of revert PR

2024-06-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b99bb00ea11b [SPARK-48630][INFRA] Make `merge_spark_pr` keep the 
format of revert PR
b99bb00ea11b is described below

commit b99bb00ea11b3ff844629d60b2b57309dd9f2d81
Author: Ruifeng Zheng 
AuthorDate: Fri Jun 21 12:44:15 2024 +0800

[SPARK-48630][INFRA] Make `merge_spark_pr` keep the format of revert PR

### What changes were proposed in this pull request?
Make `merge_spark_pr` keep the format of revert PR

### Why are the changes needed?
existing script format revert PR in this way:
```
Original: Revert "[SPARK-48591][PYTHON] Simplify the if-else branches with `F.lit`"
Modified: [SPARK-48591][PYTHON] Revert "[] Simplify the if-else branches with `F.lit`"
```

another example:
```
Revert "[SPARK-46937][SQL] Improve concurrency performance for 
FunctionRegistry"
```
was modified to
```
[SPARK-46937][SQL] Revert "[] Improve concurrency performance for 
FunctionRegistry"
```
see 
https://github.com/apache/spark/commit/82a84ede6a47232fe3af86672ceea97f703b3e8a
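
The fix boils down to an early-exit check before the JIRA-reference rewriting; a
standalone sketch of that check (the function name is illustrative):

```
def is_revert_title(text: str) -> bool:
    # A revert PR keeps GitHub's generated title, e.g.
    #   Revert "[SPARK-48591][PYTHON] Simplify the if-else branches with F.lit"
    # and should be passed through unchanged.
    return text.startswith('Revert "') and text.endswith('"')
```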

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
Manually tested; I will use this script to merge some PRs to check.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46988 from zhengruifeng/fix_merge_spark_pr_for_revert.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 dev/merge_spark_pr.py | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index c9893fd7e5a9..5d014a6375cb 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -501,12 +501,19 @@ def standardize_jira_ref(text):
 >>> standardize_jira_ref(
 ... "[SPARK-6250][SPARK-6146][SPARK-5911][SQL] Types are now reserved 
words in DDL parser.")
 '[SPARK-6250][SPARK-6146][SPARK-5911][SQL] Types are now reserved words in 
DDL parser.'
+>>> standardize_jira_ref(
+... 'Revert "[SPARK-48591][PYTHON] Simplify the if-else branches with 
F.lit"')
+'Revert "[SPARK-48591][PYTHON] Simplify the if-else branches with F.lit"'
 >>> standardize_jira_ref("Additional information for users building from 
source code")
 'Additional information for users building from source code'
 """
 jira_refs = []
 components = []
 
+# If this is a Revert PR, no need to process any further
+if text.startswith('Revert "') and text.endswith('"'):
+return text
+
 # If the string is compliant, no need to process any further
 if re.search(r"^\[SPARK-[0-9]{3,6}\](\[[A-Z0-9_\s,]+\] )+\S+", text):
 return text


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48620][PYTHON] Fix internal raw data leak in `YearMonthIntervalType` and `CalendarIntervalType`

2024-06-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 955349f6d970 [SPARK-48620][PYTHON] Fix internal raw data leak in 
`YearMonthIntervalType` and `CalendarIntervalType`
955349f6d970 is described below

commit 955349f6d970b64b496034087d2f2ea5fc0c161d
Author: Ruifeng Zheng 
AuthorDate: Thu Jun 20 13:44:54 2024 +0800

[SPARK-48620][PYTHON] Fix internal raw data leak in `YearMonthIntervalType` 
and `CalendarIntervalType`

### What changes were proposed in this pull request?
Fix internal raw data leak in `YearMonthIntervalType/CalendarIntervalType`:

PySpark Classic: collecting `YearMonthIntervalType`/`CalendarIntervalType` values now
fails with a clear error instead of returning the internal raw representation.

### Why are the changes needed?
the raw data should not be leaked

### Does this PR introduce _any_ user-facing change?
**PySpark Classic** (before):
```
In [4]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").first()[0]
Out[4]: 128

In [5]: spark.sql("SELECT make_interval(100, 11, 1, 1, 12, 30, 
01.001001)").first()[0]
Out[5]: {'__class__': 'org.apache.spark.unsafe.types.CalendarInterval'}
```

**PySpark Classic** (after):
```
In [1]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").first()
---
PySparkNotImplementedErrorTraceback (most recent call last)
Cell In[1], line 1
> 1 spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").first()

...

PySparkNotImplementedError: [NOT_IMPLEMENTED] 
YearMonthIntervalType.fromInternal is not implemented.

In [2]: import os

In [3]: os.environ['PYSPARK_YM_INTERVAL_LEGACY'] = "1"

In [4]: spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH AS 
interval").first()
Out[4]: Row(interval=128)
```

### How was this patch tested?
Added test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46975 from zhengruifeng/fail_ym_interval_collect.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../source/migration_guide/pyspark_upgrade.rst |  2 +-
 .../pyspark/sql/tests/connect/test_parity_types.py |  8 +
 python/pyspark/sql/tests/test_types.py | 15 
 python/pyspark/sql/types.py| 41 +-
 4 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst 
b/python/docs/source/migration_guide/pyspark_upgrade.rst
index 227293d83ada..529253042002 100644
--- a/python/docs/source/migration_guide/pyspark_upgrade.rst
+++ b/python/docs/source/migration_guide/pyspark_upgrade.rst
@@ -73,7 +73,7 @@ Upgrading from PySpark 3.5 to 4.0
 * In Spark 4.0, the aliases ``Y``, ``M``, ``H``, ``T``, ``S`` have been 
deprecated from Pandas API on Spark, use ``YE``, ``ME``, ``h``, ``min``, ``s`` 
instead respectively.
 * In Spark 4.0, the schema of a map column is inferred by merging the schemas 
of all pairs in the map. To restore the previous behavior where the schema is 
only inferred from the first non-null pair, you can set 
``spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled`` to ``true``.
 * In Spark 4.0, `compute.ops_on_diff_frames` is on by default. To restore the 
previous behavior, set `compute.ops_on_diff_frames` to `false`.
-
+* In Spark 4.0, the data type `YearMonthIntervalType` in ``DataFrame.collect`` 
no longer returns the underlying integers. To restore the previous behavior, 
set ``PYSPARK_YM_INTERVAL_LEGACY`` environment variable to ``1``.
 
 
 Upgrading from PySpark 3.3 to 3.4
diff --git a/python/pyspark/sql/tests/connect/test_parity_types.py 
b/python/pyspark/sql/tests/connect/test_parity_types.py
index fd75595b3873..6d06611def6a 100644
--- a/python/pyspark/sql/tests/connect/test_parity_types.py
+++ b/python/pyspark/sql/tests/connect/test_parity_types.py
@@ -94,6 +94,14 @@ class TypesParityTests(TypesTestsMixin, 
ReusedConnectTestCase):
 def test_schema_with_collations_json_ser_de(self):
 super().test_schema_with_collations_json_ser_de()
 
+@unittest.skip("This test is dedicated for PySpark Classic.")
+def test_ym_interval_in_collect(self):
+super().test_ym_interval_in_collect()
+
+@unittest.skip("This test is dedicated for PySpark Classic.")
+def test_cal_interval_in_collect(self):
+super().test_cal_interval_in_collect()
+
 
 if __name__ == "__main__":
 import unittest
diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/tes

(spark) branch master updated: [SPARK-48591][PYTHON] Add a helper function to simplify `Column.py`

2024-06-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 692d8692ef08 [SPARK-48591][PYTHON] Add a helper function to simplify 
`Column.py`
692d8692ef08 is described below

commit 692d8692ef0816e00b303df94609fd58c8fe7045
Author: Ruifeng Zheng 
AuthorDate: Thu Jun 20 13:43:39 2024 +0800

[SPARK-48591][PYTHON] Add a helper function to simplify `Column.py`

### What changes were proposed in this pull request?
Add a helper function to simplify `Column.py`

### Why are the changes needed?
code clean up

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47023 from zhengruifeng/column_to_expr.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/column.py | 58 ++--
 1 file changed, 22 insertions(+), 36 deletions(-)

diff --git a/python/pyspark/sql/connect/column.py 
b/python/pyspark/sql/connect/column.py
index b63e06bccae1..ef48091a35b0 100644
--- a/python/pyspark/sql/connect/column.py
+++ b/python/pyspark/sql/connect/column.py
@@ -96,6 +96,10 @@ def _unary_op(name: str, self: ParentColumn) -> ParentColumn:
 return Column(UnresolvedFunction(name, [self._expr]))  # type: 
ignore[list-item]
 
 
+def _to_expr(v: Any) -> Expression:
+return v._expr if isinstance(v, Column) else 
LiteralExpression._from_value(v)
+
+
 @with_origin_to_class
 class Column(ParentColumn):
 def __new__(
@@ -310,14 +314,12 @@ class Column(ParentColumn):
 message_parameters={},
 )
 
-if isinstance(value, Column):
-_value = value._expr
-else:
-_value = LiteralExpression._from_value(value)
-
-_branches = self._expr._branches + [(condition._expr, _value)]
-
-return Column(CaseWhen(branches=_branches, else_value=None))
+return Column(
+CaseWhen(
+branches=self._expr._branches + [(condition._expr, 
_to_expr(value))],
+else_value=None,
+)
+)
 
 def otherwise(self, value: Any) -> ParentColumn:
 if not isinstance(self._expr, CaseWhen):
@@ -330,12 +332,12 @@ class Column(ParentColumn):
 "otherwise() can only be applied once on a Column previously 
generated by when()"
 )
 
-if isinstance(value, Column):
-_value = value._expr
-else:
-_value = LiteralExpression._from_value(value)
-
-return Column(CaseWhen(branches=self._expr._branches, 
else_value=_value))
+return Column(
+CaseWhen(
+branches=self._expr._branches,
+else_value=_to_expr(value),
+)
+)
 
 def like(self: ParentColumn, other: str) -> ParentColumn:
 return _bin_op("like", self, other)
@@ -360,22 +362,15 @@ class Column(ParentColumn):
 },
 )
 
-if isinstance(length, Column):
-length_expr = length._expr
-start_expr = startPos._expr  # type: ignore[union-attr]
-elif isinstance(length, int):
-length_expr = LiteralExpression._from_value(length)
-start_expr = LiteralExpression._from_value(startPos)
+if isinstance(length, (Column, int)):
+length_expr = _to_expr(length)
+start_expr = _to_expr(startPos)
 else:
 raise PySparkTypeError(
 error_class="NOT_COLUMN_OR_INT",
 message_parameters={"arg_name": "startPos", "arg_type": 
type(length).__name__},
 )
-return Column(
-UnresolvedFunction(
-"substr", [self._expr, start_expr, length_expr]  # type: 
ignore[list-item]
-)
-)
+return Column(UnresolvedFunction("substr", [self._expr, start_expr, 
length_expr]))
 
 def __eq__(self, other: Any) -> ParentColumn:  # type: ignore[override]
 if other is None or isinstance(
@@ -459,14 +454,7 @@ class Column(ParentColumn):
 else:
 _cols = list(cols)
 
-_exprs = [self._expr]
-for c in _cols:
-if isinstance(c, Column):
-_exprs.append(c._expr)
-else:
-_exprs.append(LiteralExpression._from_value(c))
-
-return Column(UnresolvedFunction("in", _exprs))
+return Column(UnresolvedFunction("in", [self._expr] + [_to_expr(c) for 
c in _cols]))
 
 def between(
 self,
@@ -556,10 +544,8 @@ class Column(ParentColumn):
 me

(spark) branch master updated: [MINOR][PYTHON][DOCS] Fix pyspark.sql.functions.reduce docstring typo

2024-06-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 33a9c5d7f478 [MINOR][PYTHON][DOCS] Fix pyspark.sql.functions.reduce 
docstring typo
33a9c5d7f478 is described below

commit 33a9c5d7f478a476fe9882ad8fe101fd60756a98
Author: kaashif 
AuthorDate: Mon Jun 17 09:12:01 2024 +0800

[MINOR][PYTHON][DOCS] Fix pyspark.sql.functions.reduce docstring typo

### What changes were proposed in this pull request?

This PR fixes a mistake in the docstring for 
`pyspark.sql.functions.reduce`. The parameter to the function is called 
`initialValue` not `zero` - there is no other mention of `zero` on the page, so 
it must be a mistake and should be `initialValue`.

### Why are the changes needed?

The docstring is incorrect.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46923 from kaashif/patch-1.

Authored-by: kaashif 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions/builtin.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 2edbc9f5abe1..ed66ca8684ef 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -18449,7 +18449,7 @@ def aggregate(
 initial value. Name of column or expression
 merge : function
 a binary function ``(acc: Column, x: Column) -> Column...`` returning 
expression
-of the same type as ``zero``
+of the same type as ``initialValue``
 finish : function, optional
 an optional unary function ``(x: Column) -> Column: ...``
 used to convert accumulated value.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (2d2bedf4aa16 -> aa4bfb05a0c3)

2024-06-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2d2bedf4aa16 [SPARK-48056][CONNECT][FOLLOW-UP] Scala Client re-execute 
plan if a SESSION_NOT_FOUND error is raised and no partial response was received
 add aa4bfb05a0c3 Revert "[SPARK-48591][PYTHON] Simplify the if-else 
branches with `F.lit`"

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/column.py | 45 
 1 file changed, 25 insertions(+), 20 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48056][CONNECT][FOLLOW-UP] Scala Client re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received

2024-06-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d2bedf4aa16 [SPARK-48056][CONNECT][FOLLOW-UP] Scala Client re-execute 
plan if a SESSION_NOT_FOUND error is raised and no partial response was received
2d2bedf4aa16 is described below

commit 2d2bedf4aa16adfc2f45c192c4b7b954788b3acd
Author: Changgyoo Park 
AuthorDate: Fri Jun 14 16:31:37 2024 +0800

[SPARK-48056][CONNECT][FOLLOW-UP] Scala Client re-execute plan if a 
SESSION_NOT_FOUND error is raised and no partial response was received

### What changes were proposed in this pull request?

This change lets a Scala Spark Connect client reattempt execution of a plan 
when it receives a SESSION_NOT_FOUND error from the Spark Connect service if it 
has not received any partial responses.

This is a Scala version of the previous fix of the same issue - 
https://github.com/apache/spark/pull/46297.

### Why are the changes needed?

Spark Connect clients often get a spurious error from the Spark Connect 
service if the service is busy or the network is congested. This error leads to 
a situation where the client immediately attempts to reattach without the 
service being aware of the client; this leads to a query failure.

### Does this PR introduce _any_ user-facing change?

Previously, a Scala Spark Connect client would fail with the error code
"INVALID_HANDLE.SESSION_NOT_FOUND" on the very first attempt to make a request
to the service, but with this change, the client will automatically retry.

### How was this patch tested?

Attached unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46971 from changgyoopark-db/SPARK-48056.

Authored-by: Changgyoo Park 
Signed-off-by: Ruifeng Zheng 
---
 .../connect/client/SparkConnectClientSuite.scala   | 28 ++
 .../ExecutePlanResponseReattachableIterator.scala  | 14 +++
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
index 55f962b2a52c..46aeaeff43d2 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
@@ -530,6 +530,25 @@ class SparkConnectClientSuite extends ConnectFunSuite with 
BeforeAndAfterEach {
 assert(reattachableIter.resultComplete)
   }
 
+  test("SPARK-48056: Client execute gets INVALID_HANDLE.SESSION_NOT_FOUND and 
proceeds") {
+startDummyServer(0)
+client = SparkConnectClient
+  .builder()
+  .connectionString(s"sc://localhost:${server.getPort}")
+  .enableReattachableExecute()
+  .build()
+service.errorToThrowOnExecute = Some(
+  new StatusRuntimeException(
+Status.INTERNAL.withDescription("INVALID_HANDLE.SESSION_NOT_FOUND")))
+
+val plan = buildPlan("select * from range(1)")
+val iter = client.execute(plan)
+val reattachableIter =
+  ExecutePlanResponseReattachableIterator.fromIterator(iter)
+reattachableIter.foreach(_ => ())
+assert(reattachableIter.resultComplete)
+  }
+
   test("GRPC stub unary call throws error immediately") {
 // Spark Connect error retry handling depends on the error being returned 
from the unary
 // call immediately.
@@ -609,6 +628,8 @@ class DummySparkConnectService() extends 
SparkConnectServiceGrpc.SparkConnectSer
   private val inputArtifactRequests: mutable.ListBuffer[AddArtifactsRequest] =
 mutable.ListBuffer.empty
 
+  var errorToThrowOnExecute: Option[Throwable] = None
+
   private[sql] def getAndClearLatestInputPlan(): proto.Plan = {
 val plan = inputPlan
 inputPlan = null
@@ -624,6 +645,13 @@ class DummySparkConnectService() extends 
SparkConnectServiceGrpc.SparkConnectSer
   override def executePlan(
   request: ExecutePlanRequest,
   responseObserver: StreamObserver[ExecutePlanResponse]): Unit = {
+if (errorToThrowOnExecute.isDefined) {
+  val error = errorToThrowOnExecute.get
+  errorToThrowOnExecute = None
+  responseObserver.onError(error)
+  return
+}
+
 // Reply with a dummy response using the same client ID
 val requestSessionId = request.getSessionId
 val operationId = if (request.hasOperationId) {
diff --git 
a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala
 
b/connector/co

(spark) branch master updated: [SPARK-48372][SPARK-45716][PYTHON][FOLLOW-UP] Remove unused helper method

2024-06-11 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new df4156aa3217 [SPARK-48372][SPARK-45716][PYTHON][FOLLOW-UP] Remove 
unused helper method
df4156aa3217 is described below

commit df4156aa3217cf0f58b4c6cbf33c967bb43f7155
Author: Ruifeng Zheng 
AuthorDate: Tue Jun 11 18:45:02 2024 +0800

[SPARK-48372][SPARK-45716][PYTHON][FOLLOW-UP] Remove unused helper method

### What changes were proposed in this pull request?
followup of https://github.com/apache/spark/pull/46685, to remove unused 
helper method

### Why are the changes needed?
method `_tree_string` is no longer needed

### Does this PR introduce _any_ user-facing change?
No, internal change only

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46936 from zhengruifeng/tree_string_followup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py| 8 
 python/pyspark/sql/tests/connect/test_connect_basic.py | 2 +-
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 6fbb57f3ec61..baac1523c709 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1844,14 +1844,6 @@ class DataFrame(ParentDataFrame):
 assert result is not None
 return result
 
-def _tree_string(self, level: Optional[int] = None) -> str:
-query = self._plan.to_proto(self._session.client)
-result = self._session.client._analyze(
-method="tree_string", plan=query, level=level
-).tree_string
-assert result is not None
-return result
-
 def printSchema(self, level: Optional[int] = None) -> None:
 if level:
 print(self.schema.treeString(level))
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 74d08424cafc..598c76a5b25f 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -568,7 +568,7 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 
 def test_print_schema(self):
 # SPARK-41216: Test print schema
-tree_str = self.connect.sql("SELECT 1 AS X, 2 AS Y")._tree_string()
+tree_str = self.connect.sql("SELECT 1 AS X, 2 AS 
Y").schema.treeString()
 # root
 #  |-- X: integer (nullable = false)
 #  |-- Y: integer (nullable = false)
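
For reference, a standalone sketch of the public API the updated test now 
exercises, assuming an active session named `spark`:

```
# Same rendering as the removed private helper, but via the public schema API.
tree_str = spark.sql("SELECT 1 AS X, 2 AS Y").schema.treeString()
print(tree_str)
# root
#  |-- X: integer (nullable = false)
#  |-- Y: integer (nullable = false)
```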


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [MINOR][PYTHON][TESTS] Move a test out of parity tests

2024-06-07 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 201df0d7ac81 [MINOR][PYTHON][TESTS] Move a test out of parity tests
201df0d7ac81 is described below

commit 201df0d7ac81f6bd5c39f513b0a06cb659dc9a3f
Author: Ruifeng Zheng 
AuthorDate: Sat Jun 8 07:49:15 2024 +0800

[MINOR][PYTHON][TESTS] Move a test out of parity tests

### What changes were proposed in this pull request?
Move a test out of parity tests

### Why are the changes needed?
it is not tested in Spark Classic, so it is not a parity test

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46914 from zhengruifeng/move_a_non_parity_test.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../connect/test_connect_dataframe_property.py | 23 +
 .../sql/tests/connect/test_parity_dataframe.py | 24 --
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git 
a/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py 
b/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py
index f80f4509a7ce..c87c44760256 100644
--- a/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py
+++ b/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py
@@ -37,6 +37,29 @@ if have_pandas:
 
 
 class SparkConnectDataFramePropertyTests(SparkConnectSQLTestCase):
+def test_cached_property_is_copied(self):
+schema = StructType(
+[
+StructField("id", IntegerType(), True),
+StructField("name", StringType(), True),
+StructField("age", IntegerType(), True),
+StructField("city", StringType(), True),
+]
+)
+# Create some dummy data
+data = [
+(1, "Alice", 30, "New York"),
+(2, "Bob", 25, "San Francisco"),
+(3, "Cathy", 29, "Los Angeles"),
+(4, "David", 35, "Chicago"),
+]
+df = self.spark.createDataFrame(data, schema)
+df_columns = df.columns
+assert len(df.columns) == 4
+for col in ["id", "name"]:
+df_columns.remove(col)
+assert len(df.columns) == 4
+
 def test_cached_schema_to(self):
 cdf = self.connect.read.table(self.tbl_name)
 sdf = self.spark.read.table(self.tbl_name)
diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py 
b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
index c9888a6a8f1a..343f485553a9 100644
--- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py
+++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
@@ -19,7 +19,6 @@ import unittest
 
 from pyspark.sql.tests.test_dataframe import DataFrameTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
-from pyspark.sql.types import StructType, StructField, IntegerType, StringType
 
 
 class DataFrameParityTests(DataFrameTestsMixin, ReusedConnectTestCase):
@@ -27,29 +26,6 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedConnectTestCase):
 df = self.spark.createDataFrame(data=[{"foo": "bar"}, {"foo": "baz"}])
 super().check_help_command(df)
 
-def test_cached_property_is_copied(self):
-schema = StructType(
-[
-StructField("id", IntegerType(), True),
-StructField("name", StringType(), True),
-StructField("age", IntegerType(), True),
-StructField("city", StringType(), True),
-]
-)
-# Create some dummy data
-data = [
-(1, "Alice", 30, "New York"),
-(2, "Bob", 25, "San Francisco"),
-(3, "Cathy", 29, "Los Angeles"),
-(4, "David", 35, "Chicago"),
-]
-df = self.spark.createDataFrame(data, schema)
-df_columns = df.columns
-assert len(df.columns) == 4
-for col in ["id", "name"]:
-df_columns.remove(col)
-assert len(df.columns) == 4
-
 @unittest.skip("Spark Connect does not support RDD but the tests depend on 
them.")
 def test_toDF_with_schema_string(self):
 super().test_toDF_with_schema_string()
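
A condensed sketch of what the relocated test verifies, assuming an active Spark 
Connect session named `spark`: `DataFrame.columns` returns a copy, so mutating the 
returned list does not change the DataFrame.

```
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
cols = df.columns                      # a copy of the (cached) column list
cols.remove("id")                      # mutates only the local copy
assert df.columns == ["id", "name"]    # the DataFrame itself is unchanged
```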


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48561][PS][CONNECT] Throw `PandasNotImplementedError` for unsupported plotting functions

2024-06-07 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 87b0f5995383 [SPARK-48561][PS][CONNECT] Throw 
`PandasNotImplementedError` for unsupported plotting functions
87b0f5995383 is described below

commit 87b0f5995383173f6736695211994a1a26995192
Author: Ruifeng Zheng 
AuthorDate: Fri Jun 7 16:36:58 2024 +0800

[SPARK-48561][PS][CONNECT] Throw `PandasNotImplementedError` for 
unsupported plotting functions

### What changes were proposed in this pull request?
Throw `PandasNotImplementedError` for unsupported plotting functions (see the 
sketch after this list):
- {Frame, Series}.plot.hist
- {Frame, Series}.plot.kde
- {Frame, Series}.plot.density
- {Frame, Series}.plot(kind="hist", ...)
- {Frame, Series}.plot(kind="kde", ...)
- {Frame, Series}.plot(kind="density", ...)

### Why are the changes needed?
the previous error message is confusing:
```
In [3]: psdf.plot.hist()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1017: 
PandasAPIOnSparkAdviceWarning: The config 'spark.sql.ansi.enabled' is set to 
True. This can cause unexpected behavior from pandas API on Spark since pandas 
API on Spark follows the behavior of pandas, not SQL.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)

[*---]  57.14% Complete (0 Tasks running, 1s, Scanned ...)
[...]
PySparkAttributeError Traceback (most recent call last)
Cell In[3], line 1
> 1 psdf.plot.hist()

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:951, in 
PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
903 def hist(self, bins=10, **kwds):
904 """
905 Draw one histogram of the DataFrame’s columns.
906 A `histogram`_ is a representation of the distribution of data.
   (...)
949 >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
950 """
--> 951 return self(kind="hist", bins=bins, **kwds)

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:580, in 
PandasOnSparkPlotAccessor.__call__(self, kind, backend, **kwargs)
577 kind = {"density": "kde"}.get(kind, kind)
578 if hasattr(plot_backend, "plot_pandas_on_spark"):
579 # use if there's pandas-on-Spark specific method.
--> 580 return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, 
**kwargs)
581 else:
582 # fallback to use pandas'
583 if not PandasOnSparkPlotAccessor.pandas_plot_data_map[kind]:

File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:41, in 
plot_pandas_on_spark(data, kind, **kwargs)
 39 return plot_pie(data, **kwargs)
 40 if kind == "hist":
---> 41 return plot_histogram(data, **kwargs)
 42 if kind == "box":
 43 return plot_box(data, **kwargs)

File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:87, in 
plot_histogram(data, **kwargs)
 85 psdf, bins = HistogramPlotBase.prepare_hist_data(data, bins)
 86 assert len(bins) > 2, "the number of buckets must be higher than 2."
---> 87 output_series = HistogramPlotBase.compute_hist(psdf, bins)
 88 prev = float("%.9f" % bins[0])  # to make it prettier, truncate.
 89 text_bins = []

File ~/Dev/spark/python/pyspark/pandas/plot/core.py:189, in 
HistogramPlotBase.compute_hist(psdf, bins)
183 for group_id, (colname, bucket_name) in enumerate(zip(colnames, 
bucket_names)):
184 # creates a Bucketizer to get corresponding bin of each value
185 bucketizer = Bucketizer(
186 splits=bins, inputCol=colname, outputCol=bucket_name, 
handleInvalid="skip"
187 )
--> 189 bucket_df = bucketizer.transform(sdf)
191 if output_df is None:
192 output_df = bucket_df.select(
193 F.lit(group_id).alias("__group_id"), 
F.col(bucket_name).alias("__bucket")
194 )

File ~/Dev/spark/python/pyspark/ml/base.py:260, in 
Transformer.transform(self, dataset, params)
258 return self.copy(params)._transform(dataset)
259 

(spark) branch master updated (ce1b08f6e30b -> edb9236ea688)

2024-06-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ce1b08f6e30b [SPARK-48553][PYTHON][CONNECT] Cache more properties
 add edb9236ea688 [SPARK-48504][PYTHON][CONNECT][FOLLOW-UP] Code clean up

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/window.py | 106 +--
 1 file changed, 14 insertions(+), 92 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (0f21df0b29cc -> ce1b08f6e30b)

2024-06-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0f21df0b29cc [SPARK-48286] Fix analysis of column with exists default 
expression - Add user facing error
 add ce1b08f6e30b [SPARK-48553][PYTHON][CONNECT] Cache more properties

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 9 +++--
 python/pyspark/sql/connect/session.py   | 6 +++---
 2 files changed, 10 insertions(+), 5 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (8cb78a7811f3 -> ab00533221e2)

2024-06-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8cb78a7811f3 [SPARK-48550][PS] Directly use the parent Window class
 add ab00533221e2 [SPARK-47933][PYTHON][TESTS][FOLLOW-UP] Enable doctest 
`pyspark.sql.connect.column`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/column.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48536][PYTHON][CONNECT] Cache user specified schema in applyInPandas and applyInArrow

2024-06-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 34ac7de89711 [SPARK-48536][PYTHON][CONNECT] Cache user specified 
schema in applyInPandas and applyInArrow
34ac7de89711 is described below

commit 34ac7de897115caada7330aed32f03aca4796299
Author: Ruifeng Zheng 
AuthorDate: Wed Jun 5 20:42:00 2024 +0800

[SPARK-48536][PYTHON][CONNECT] Cache user specified schema in applyInPandas 
and applyInArrow

### What changes were proposed in this pull request?
Cache user specified schema in applyInPandas and applyInArrow

### Why are the changes needed?
to avoid extra RPCs
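
For context, a rough sketch of the call path this optimizes, assuming an active 
Spark Connect session named `spark` and pandas installed; the explicit `schema` 
argument is what gets cached on the result:

```
import pandas as pd
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([StructField("id", LongType()), StructField("v", DoubleType())])

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["v"] = pdf["v"] - pdf["v"].mean()
    return pdf

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])
result = df.groupBy("id").applyInPandas(subtract_mean, schema=schema)
# result.schema can now be answered from the cached StructType instead of
# triggering an extra analysis RPC.
print(result.schema)
```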

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46877 from zhengruifeng/cache_schema_apply_in_x.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/group.py|  20 ++-
 .../connect/test_connect_dataframe_property.py | 145 -
 2 files changed, 160 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/connect/group.py 
b/python/pyspark/sql/connect/group.py
index 2a5bb5939a3f..85806b1a265b 100644
--- a/python/pyspark/sql/connect/group.py
+++ b/python/pyspark/sql/connect/group.py
@@ -301,7 +301,7 @@ class GroupedData:
 evalType=PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF,
 )
 
-return DataFrame(
+res = DataFrame(
 plan.GroupMap(
 child=self._df._plan,
 grouping_cols=self._grouping_cols,
@@ -310,6 +310,9 @@ class GroupedData:
 ),
 session=self._df._session,
 )
+if isinstance(schema, StructType):
+res._cached_schema = schema
+return res
 
 applyInPandas.__doc__ = PySparkGroupedData.applyInPandas.__doc__
 
@@ -370,7 +373,7 @@ class GroupedData:
 evalType=PythonEvalType.SQL_GROUPED_MAP_ARROW_UDF,
 )
 
-return DataFrame(
+res = DataFrame(
 plan.GroupMap(
 child=self._df._plan,
 grouping_cols=self._grouping_cols,
@@ -379,6 +382,9 @@ class GroupedData:
 ),
 session=self._df._session,
 )
+if isinstance(schema, StructType):
+res._cached_schema = schema
+return res
 
 applyInArrow.__doc__ = PySparkGroupedData.applyInArrow.__doc__
 
@@ -410,7 +416,7 @@ class PandasCogroupedOps:
 evalType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF,
 )
 
-return DataFrame(
+res = DataFrame(
 plan.CoGroupMap(
 input=self._gd1._df._plan,
 input_grouping_cols=self._gd1._grouping_cols,
@@ -420,6 +426,9 @@ class PandasCogroupedOps:
 ),
 session=self._gd1._df._session,
 )
+if isinstance(schema, StructType):
+res._cached_schema = schema
+return res
 
 applyInPandas.__doc__ = PySparkPandasCogroupedOps.applyInPandas.__doc__
 
@@ -436,7 +445,7 @@ class PandasCogroupedOps:
 evalType=PythonEvalType.SQL_COGROUPED_MAP_ARROW_UDF,
 )
 
-return DataFrame(
+res = DataFrame(
 plan.CoGroupMap(
 input=self._gd1._df._plan,
 input_grouping_cols=self._gd1._grouping_cols,
@@ -446,6 +455,9 @@ class PandasCogroupedOps:
 ),
 session=self._gd1._df._session,
 )
+if isinstance(schema, StructType):
+res._cached_schema = schema
+return res
 
 applyInArrow.__doc__ = PySparkPandasCogroupedOps.applyInArrow.__doc__
 
diff --git 
a/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py 
b/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py
index 6abf6303b7ca..f80f4509a7ce 100644
--- a/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py
+++ b/python/pyspark/sql/tests/connect/test_connect_dataframe_property.py
@@ -17,7 +17,7 @@
 
 import unittest
 
-from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+from pyspark.sql.types import StructType, StructField, StringType, 
IntegerType, LongType, DoubleType
 from pyspark.sql.utils import is_remote
 
 from pyspark.sql.tests.connect.test_connect_basic import 
SparkConnectSQLTestCase
@@ -30,6 +30,7 @@ from pyspark.testing.sqlutils import (
 
 if have_pyarrow:
 import pyarrow as pa
+import pyarrow.compute as pc
 
 if have_pandas:
 import pandas as pd
@@ -127,6 +128,148 @@ class 
SparkConnectDataFramePropertyTests(SparkConnectSQLTestCase):
 self.assertEqual(cdf1.schema, sdf1.schema)
 self.assertEqual(cdf1.collect(), sdf1

(spark) branch master updated: [SPARK-48512][PYTHON][TESTS] Refactor Python tests

2024-06-04 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 02c645607f43 [SPARK-48512][PYTHON][TESTS] Refactor Python tests
02c645607f43 is described below

commit 02c645607f4353df573cdba568e092c3ff4c359a
Author: Rui Wang 
AuthorDate: Tue Jun 4 17:50:29 2024 +0800

[SPARK-48512][PYTHON][TESTS] Refactor Python tests

### What changes were proposed in this pull request?

Use withSQLConf in tests when it is appropriate.

### Why are the changes needed?

Enforce good practice for setting config in test cases.
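
A small sketch of the enforced pattern, assuming a test derived from 
`ReusedSQLTestCase` (which provides the `sql_conf` context manager):

```
from pyspark.testing.sqlutils import ReusedSQLTestCase


class DefaultSourceConfTest(ReusedSQLTestCase):
    def test_scoped_conf(self):
        # sql_conf sets the config on entry and restores the previous value on
        # exit, even if the body raises.
        with self.sql_conf({"spark.sql.sources.default": "org.apache.spark.sql.json"}):
            self.assertEqual(
                self.spark.conf.get("spark.sql.sources.default"),
                "org.apache.spark.sql.json",
            )
```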

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

existing UT

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #46852 from amaliujia/refactor_pyspark.

Authored-by: Rui Wang 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_context.py| 39 +
 python/pyspark/sql/tests/test_readwriter.py | 10 ++--
 python/pyspark/sql/tests/test_types.py  |  5 +---
 3 files changed, 21 insertions(+), 33 deletions(-)

diff --git a/python/pyspark/sql/tests/test_context.py 
b/python/pyspark/sql/tests/test_context.py
index b38183331486..f363b8748c0b 100644
--- a/python/pyspark/sql/tests/test_context.py
+++ b/python/pyspark/sql/tests/test_context.py
@@ -26,13 +26,13 @@ import py4j
 from pyspark import SparkContext, SQLContext
 from pyspark.sql import Row, SparkSession
 from pyspark.sql.types import StructType, StringType, StructField
-from pyspark.testing.utils import ReusedPySparkTestCase
+from pyspark.testing.sqlutils import ReusedSQLTestCase
 
 
-class HiveContextSQLTests(ReusedPySparkTestCase):
+class HiveContextSQLTests(ReusedSQLTestCase):
 @classmethod
 def setUpClass(cls):
-ReusedPySparkTestCase.setUpClass()
+ReusedSQLTestCase.setUpClass()
 cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
 cls.hive_available = True
 cls.spark = None
@@ -58,7 +58,7 @@ class HiveContextSQLTests(ReusedPySparkTestCase):
 
 @classmethod
 def tearDownClass(cls):
-ReusedPySparkTestCase.tearDownClass()
+ReusedSQLTestCase.tearDownClass()
 shutil.rmtree(cls.tempdir.name, ignore_errors=True)
 if cls.spark is not None:
 cls.spark.stop()
@@ -100,23 +100,20 @@ class HiveContextSQLTests(ReusedPySparkTestCase):
 self.spark.sql("DROP TABLE savedJsonTable")
 self.spark.sql("DROP TABLE externalJsonTable")
 
-defaultDataSourceName = self.spark.conf.get(
-"spark.sql.sources.default", "org.apache.spark.sql.parquet"
-)
-self.spark.sql("SET 
spark.sql.sources.default=org.apache.spark.sql.json")
-df.write.saveAsTable("savedJsonTable", path=tmpPath, mode="overwrite")
-actual = self.spark.catalog.createTable("externalJsonTable", 
path=tmpPath)
-self.assertEqual(
-sorted(df.collect()), sorted(self.spark.sql("SELECT * FROM 
savedJsonTable").collect())
-)
-self.assertEqual(
-sorted(df.collect()),
-sorted(self.spark.sql("SELECT * FROM 
externalJsonTable").collect()),
-)
-self.assertEqual(sorted(df.collect()), sorted(actual.collect()))
-self.spark.sql("DROP TABLE savedJsonTable")
-self.spark.sql("DROP TABLE externalJsonTable")
-self.spark.sql("SET spark.sql.sources.default=" + 
defaultDataSourceName)
+with self.sql_conf({"spark.sql.sources.default": 
"org.apache.spark.sql.json"}):
+df.write.saveAsTable("savedJsonTable", path=tmpPath, 
mode="overwrite")
+actual = self.spark.catalog.createTable("externalJsonTable", 
path=tmpPath)
+self.assertEqual(
+sorted(df.collect()),
+sorted(self.spark.sql("SELECT * FROM 
savedJsonTable").collect()),
+)
+self.assertEqual(
+sorted(df.collect()),
+sorted(self.spark.sql("SELECT * FROM 
externalJsonTable").collect()),
+)
+self.assertEqual(sorted(df.collect()), sorted(actual.collect()))
+self.spark.sql("DROP TABLE savedJsonTable")
+self.spark.sql("DROP TABLE externalJsonTable")
 
 shutil.rmtree(tmpPath)
 
diff --git a/python/pyspark/sql/tests/test_readwriter.py 
b/python/pyspark/sql/tests/test_readwriter.py
index e752856d0316..8060a9ae8bc7 100644
--- a/python/pyspark/sql/tests/test_readwriter.py
+++ b/python/pyspark/sql/tests/test_readwriter.py
@@ 

(spark) branch master updated: Revert "[SPARK-48415][PYTHON] Refactor TypeName to support parameterized datatypes"

2024-05-30 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 910c3733bfdd Revert "[SPARK-48415][PYTHON] Refactor TypeName to 
support parameterized datatypes"
910c3733bfdd is described below

commit 910c3733bfdd1a0f386137d48796e317f64f7f50
Author: Ruifeng Zheng 
AuthorDate: Thu May 30 16:21:22 2024 +0800

Revert "[SPARK-48415][PYTHON] Refactor TypeName to support parameterized 
datatypes"

revert https://github.com/apache/spark/pull/46738

Closes #46804 from zhengruifeng/revert_typename_oss.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_types.py | 133 -
 python/pyspark/sql/types.py|  74 +++---
 2 files changed, 47 insertions(+), 160 deletions(-)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index cc482b886e3a..80f2c0fcbc03 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -81,139 +81,6 @@ from pyspark.testing.utils import PySparkErrorTestUtils
 
 
 class TypesTestsMixin:
-def test_class_method_type_name(self):
-for dataType, expected in [
-(StringType, "string"),
-(CharType, "char"),
-(VarcharType, "varchar"),
-(BinaryType, "binary"),
-(BooleanType, "boolean"),
-(DecimalType, "decimal"),
-(FloatType, "float"),
-(DoubleType, "double"),
-(ByteType, "byte"),
-(ShortType, "short"),
-(IntegerType, "integer"),
-(LongType, "long"),
-(DateType, "date"),
-(TimestampType, "timestamp"),
-(TimestampNTZType, "timestamp_ntz"),
-(NullType, "void"),
-(VariantType, "variant"),
-(YearMonthIntervalType, "yearmonthinterval"),
-(DayTimeIntervalType, "daytimeinterval"),
-(CalendarIntervalType, "interval"),
-]:
-self.assertEqual(dataType.typeName(), expected)
-
-def test_instance_method_type_name(self):
-for dataType, expected in [
-(StringType(), "string"),
-(CharType(5), "char(5)"),
-(VarcharType(10), "varchar(10)"),
-(BinaryType(), "binary"),
-(BooleanType(), "boolean"),
-(DecimalType(), "decimal(10,0)"),
-(DecimalType(10, 2), "decimal(10,2)"),
-(FloatType(), "float"),
-(DoubleType(), "double"),
-(ByteType(), "byte"),
-(ShortType(), "short"),
-(IntegerType(), "integer"),
-(LongType(), "long"),
-(DateType(), "date"),
-(TimestampType(), "timestamp"),
-(TimestampNTZType(), "timestamp_ntz"),
-(NullType(), "void"),
-(VariantType(), "variant"),
-(YearMonthIntervalType(), "interval year to month"),
-(YearMonthIntervalType(YearMonthIntervalType.YEAR), "interval 
year"),
-(
-YearMonthIntervalType(YearMonthIntervalType.YEAR, 
YearMonthIntervalType.MONTH),
-"interval year to month",
-),
-(DayTimeIntervalType(), "interval day to second"),
-(DayTimeIntervalType(DayTimeIntervalType.DAY), "interval day"),
-(
-DayTimeIntervalType(DayTimeIntervalType.HOUR, 
DayTimeIntervalType.SECOND),
-"interval hour to second",
-),
-(CalendarIntervalType(), "interval"),
-]:
-self.assertEqual(dataType.typeName(), expected)
-
-def test_simple_string(self):
-for dataType, expected in [
-(StringType(), "string"),
-(CharType(5), "char(5)"),
-(VarcharType(10), "varchar(10)"),
-(BinaryType(), "binary"),
-(BooleanType(), "boolean"),
-(DecimalType(), "decimal(10,0)"),
-(DecimalType(10, 2), "decimal(10,2)"),
-(FloatType(), "float"),
-(DoubleType(), "double"),
-(ByteType(), "tinyint"),
-(ShortType(), "smallint"),
-

(spark) branch master updated (cfbed998530e -> f5d9b8098815)

2024-05-29 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from cfbed998530e Revert "[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] 
Drop internal metadata in `DataFrame.schema`"
 add f5d9b8098815 [MINOR][PS] Fallback code clean up

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py | 8 
 1 file changed, 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: Revert "[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] Drop internal metadata in `DataFrame.schema`"

2024-05-29 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cfbed998530e Revert "[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] 
Drop internal metadata in `DataFrame.schema`"
cfbed998530e is described below

commit cfbed998530efaaf17f36d99a9462376eaa7d2ad
Author: Ruifeng Zheng 
AuthorDate: Wed May 29 20:44:36 2024 +0800

Revert "[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] Drop internal 
metadata in `DataFrame.schema`"

revert https://github.com/apache/spark/pull/46636

https://github.com/apache/spark/pull/46636#issuecomment-2137321359

Closes #46790 from zhengruifeng/revert_metadata_drop.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/internal.py  | 37 +-
 .../sql/tests/connect/test_connect_function.py |  4 ++-
 python/pyspark/sql/types.py| 13 
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  4 +--
 .../apache/spark/sql/DataFrameAggregateSuite.scala |  5 +--
 5 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/python/pyspark/pandas/internal.py 
b/python/pyspark/pandas/internal.py
index fd0f28e50b2f..04285aa2d879 100644
--- a/python/pyspark/pandas/internal.py
+++ b/python/pyspark/pandas/internal.py
@@ -33,6 +33,7 @@ from pyspark.sql import (
 Window,
 )
 from pyspark.sql.types import (  # noqa: F401
+_drop_metadata,
 BooleanType,
 DataType,
 LongType,
@@ -756,10 +757,20 @@ class InternalFrame:
 
 if is_testing():
 struct_fields = 
spark_frame.select(index_spark_columns).schema.fields
-assert all(
-index_field.struct_field == struct_field
-for index_field, struct_field in zip(index_fields, 
struct_fields)
-), (index_fields, struct_fields)
+if is_remote():
+# TODO(SPARK-42965): For some reason, the metadata of 
StructField is different
+# in a few tests when using Spark Connect. However, the 
function works properly.
+# Therefore, we temporarily perform Spark Connect tests by 
excluding metadata
+# until the issue is resolved.
+assert all(
+_drop_metadata(index_field.struct_field) == 
_drop_metadata(struct_field)
+for index_field, struct_field in zip(index_fields, 
struct_fields)
+), (index_fields, struct_fields)
+else:
+assert all(
+index_field.struct_field == struct_field
+for index_field, struct_field in zip(index_fields, 
struct_fields)
+), (index_fields, struct_fields)
 
 self._index_fields: List[InternalField] = index_fields
 
@@ -774,10 +785,20 @@ class InternalFrame:
 
 if is_testing():
 struct_fields = 
spark_frame.select(data_spark_columns).schema.fields
-assert all(
-data_field.struct_field == struct_field
-for data_field, struct_field in zip(data_fields, struct_fields)
-), (data_fields, struct_fields)
+if is_remote():
+# TODO(SPARK-42965): For some reason, the metadata of 
StructField is different
+# in a few tests when using Spark Connect. However, the 
function works properly.
+# Therefore, we temporarily perform Spark Connect tests by 
excluding metadata
+# until the issue is resolved.
+assert all(
+_drop_metadata(data_field.struct_field) == 
_drop_metadata(struct_field)
+for data_field, struct_field in zip(data_fields, 
struct_fields)
+), (data_fields, struct_fields)
+else:
+assert all(
+data_field.struct_field == struct_field
+for data_field, struct_field in zip(data_fields, 
struct_fields)
+), (data_fields, struct_fields)
 
 self._data_fields: List[InternalField] = data_fields
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py 
b/python/pyspark/sql/tests/connect/test_connect_function.py
index 1fb0195b5203..0f0abfd4b856 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -22,6 +22,7 @@ from pyspark.util import is_remote_only
 from pyspark.errors import PySparkTypeError, PySparkValueError
 from pyspark.sql import SparkSession as PySparkSession
 from pyspark.sql.types import (
+_drop_metadata,
 StringType,
 StructType,
 StructField,
@@ -1673,7 +1674,8 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
Panda

(spark) branch master updated: [SPARK-48415][PYTHON] Refactor `TypeName` to support parameterized datatypes

2024-05-27 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fc1435d14d09 [SPARK-48415][PYTHON] Refactor `TypeName` to support 
parameterized datatypes
fc1435d14d09 is described below

commit fc1435d14d090b792a0f19372d6b11c7ff026372
Author: Ruifeng Zheng 
AuthorDate: Tue May 28 08:39:28 2024 +0800

[SPARK-48415][PYTHON] Refactor `TypeName` to support parameterized datatypes

### What changes were proposed in this pull request?
1. Refactor the instance method `TypeName` to support parameterized datatypes.
2. Remove the redundant simpleString/jsonValue methods, since they are the type 
name by default.

### Why are the changes needed?
to be consistent with the Scala side

### Does this PR introduce _any_ user-facing change?

Type name changes (see the snippet after this list):
`CharType(10)`: `char` -> `char(10)`
`VarcharType(10)`: `varchar` -> `varchar(10)`
`DecimalType(10, 2)`: `decimal` -> `decimal(10,2)`
`DayTimeIntervalType(DAY, HOUR)`: `daytimeinterval` -> `interval day to 
hour`
`YearMonthIntervalType(YEAR, MONTH)`: `yearmonthinterval` -> `interval year 
to month`
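
A quick snippet to observe the new names described above:

```
from pyspark.sql.types import CharType, VarcharType, DecimalType

print(CharType(10).typeName())        # char(10)      (previously: char)
print(VarcharType(10).typeName())     # varchar(10)   (previously: varchar)
print(DecimalType(10, 2).typeName())  # decimal(10,2) (previously: decimal)
```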

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46738 from zhengruifeng/py_type_name.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_types.py | 133 +
 python/pyspark/sql/types.py|  74 +++---
 2 files changed, 160 insertions(+), 47 deletions(-)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index 80f2c0fcbc03..cc482b886e3a 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -81,6 +81,139 @@ from pyspark.testing.utils import PySparkErrorTestUtils
 
 
 class TypesTestsMixin:
+def test_class_method_type_name(self):
+for dataType, expected in [
+(StringType, "string"),
+(CharType, "char"),
+(VarcharType, "varchar"),
+(BinaryType, "binary"),
+(BooleanType, "boolean"),
+(DecimalType, "decimal"),
+(FloatType, "float"),
+(DoubleType, "double"),
+(ByteType, "byte"),
+(ShortType, "short"),
+(IntegerType, "integer"),
+(LongType, "long"),
+(DateType, "date"),
+(TimestampType, "timestamp"),
+(TimestampNTZType, "timestamp_ntz"),
+(NullType, "void"),
+(VariantType, "variant"),
+(YearMonthIntervalType, "yearmonthinterval"),
+(DayTimeIntervalType, "daytimeinterval"),
+(CalendarIntervalType, "interval"),
+]:
+self.assertEqual(dataType.typeName(), expected)
+
+def test_instance_method_type_name(self):
+for dataType, expected in [
+(StringType(), "string"),
+(CharType(5), "char(5)"),
+(VarcharType(10), "varchar(10)"),
+(BinaryType(), "binary"),
+(BooleanType(), "boolean"),
+(DecimalType(), "decimal(10,0)"),
+(DecimalType(10, 2), "decimal(10,2)"),
+(FloatType(), "float"),
+(DoubleType(), "double"),
+(ByteType(), "byte"),
+(ShortType(), "short"),
+(IntegerType(), "integer"),
+(LongType(), "long"),
+(DateType(), "date"),
+(TimestampType(), "timestamp"),
+(TimestampNTZType(), "timestamp_ntz"),
+(NullType(), "void"),
+(VariantType(), "variant"),
+(YearMonthIntervalType(), "interval year to month"),
+(YearMonthIntervalType(YearMonthIntervalType.YEAR), "interval 
year"),
+(
+YearMonthIntervalType(YearMonthIntervalType.YEAR, 
YearMonthIntervalType.MONTH),
+"interval year to month",
+),
+(DayTimeIntervalType(), "interval day to second"),
+(DayTimeIntervalType(DayTimeIntervalType.DAY), "interval day"),
+(
+DayTimeIntervalType(DayTimeIntervalType.HOUR, 
DayTimeIntervalType.SECOND),
+"interval hour to second",
+),
+(CalendarIntervalType(), &quo

(spark) branch master updated: [SPARK-48412][PYTHON] Refactor data type json parse

2024-05-24 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bd95040c3170 [SPARK-48412][PYTHON] Refactor data type json parse
bd95040c3170 is described below

commit bd95040c3170aaed61ee5e9090d1b8580351ee80
Author: Ruifeng Zheng 
AuthorDate: Fri May 24 17:36:46 2024 +0800

[SPARK-48412][PYTHON] Refactor data type json parse

### What changes were proposed in this pull request?
Refactor data type json parse

### Why are the changes needed?
The `_all_atomic_types` mapping causes confusion (a round-trip sketch follows this list):

- it is only used in JSON parsing, so it should use `jsonValue` instead of 
`typeName` (this also makes `typeName` inconsistent with Scala; it will be fixed 
in a separate PR);
- not all atomic types are included in it (e.g. `YearMonthIntervalType`);
- not all atomic types should be placed in it (e.g. `VarcharType` which has 
to be excluded here and there)
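
A round-trip sketch of what the new `test_parse_datatype_json_string` covers, 
using the private `_parse_datatype_json_string` helper that the test itself imports:

```
from pyspark.sql.types import (
    DecimalType,
    YearMonthIntervalType,
    _parse_datatype_json_string,
)

for dt in [DecimalType(10, 2), YearMonthIntervalType()]:
    # json() serializes the type; the parser must reconstruct an equal instance.
    assert _parse_datatype_json_string(dt.json()) == dt
```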

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci, added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46733 from zhengruifeng/refactor_json_parse.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_types.py | 42 ++---
 python/pyspark/sql/types.py| 57 --
 2 files changed, 79 insertions(+), 20 deletions(-)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index 6c64a9471363..d665053d9490 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -1136,12 +1136,46 @@ class TypesTestsMixin:
 self.assertRaises(IndexError, lambda: struct1[9])
 self.assertRaises(TypeError, lambda: struct1[9.9])
 
+def test_parse_datatype_json_string(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+for dataType in [
+StringType(),
+CharType(5),
+VarcharType(10),
+BinaryType(),
+BooleanType(),
+DecimalType(),
+DecimalType(10, 2),
+FloatType(),
+DoubleType(),
+ByteType(),
+ShortType(),
+IntegerType(),
+LongType(),
+DateType(),
+TimestampType(),
+TimestampNTZType(),
+NullType(),
+VariantType(),
+YearMonthIntervalType(),
+YearMonthIntervalType(YearMonthIntervalType.YEAR),
+YearMonthIntervalType(YearMonthIntervalType.YEAR, 
YearMonthIntervalType.MONTH),
+DayTimeIntervalType(),
+DayTimeIntervalType(DayTimeIntervalType.DAY),
+DayTimeIntervalType(DayTimeIntervalType.HOUR, 
DayTimeIntervalType.SECOND),
+CalendarIntervalType(),
+]:
+json_str = dataType.json()
+parsed = _parse_datatype_json_string(json_str)
+self.assertEqual(dataType, parsed)
+
 def test_parse_datatype_string(self):
-from pyspark.sql.types import _all_atomic_types, _parse_datatype_string
+from pyspark.sql.types import _all_mappable_types, 
_parse_datatype_string
+
+for k, t in _all_mappable_types.items():
+self.assertEqual(t(), _parse_datatype_string(k))
 
-for k, t in _all_atomic_types.items():
-if k != "varchar" and k != "char":
-self.assertEqual(t(), _parse_datatype_string(k))
 self.assertEqual(IntegerType(), _parse_datatype_string("int"))
 self.assertEqual(StringType(), _parse_datatype_string("string"))
 self.assertEqual(CharType(1), _parse_datatype_string("char(1)"))
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 17b019240f82..b9db59e0a58a 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -1756,13 +1756,45 @@ _atomic_types: List[Type[DataType]] = [
 TimestampNTZType,
 NullType,
 VariantType,
+YearMonthIntervalType,
+DayTimeIntervalType,
 ]
-_all_atomic_types: Dict[str, Type[DataType]] = dict((t.typeName(), t) for t in 
_atomic_types)
 
-_complex_types: List[Type[Union[ArrayType, MapType, StructType]]] = 
[ArrayType, MapType, StructType]
-_all_complex_types: Dict[str, Type[Union[ArrayType, MapType, StructType]]] = 
dict(
-(v.typeName(), v) for v in _complex_types
-)
+_complex_types: List[Type[Union[ArrayType, MapType, StructType]]] = [
+ArrayType,
+MapType,
+StructType,
+]
+_all_complex_types: Dict[str, Type[Union[ArrayType, MapType, StructType]]] = {
+"array": ArrayType,
+"map": MapType,
+"struct": St

(spark) branch master updated: [MINOR][TESTS] Add a helper function for `spark.table` in dsl

2024-05-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4a471cceebed [MINOR][TESTS] Add a helper function for `spark.table` in 
dsl
4a471cceebed is described below

commit 4a471cceebedd938f781eb385162d33058124092
Author: Ruifeng Zheng 
AuthorDate: Thu May 23 19:46:46 2024 +0800

[MINOR][TESTS] Add a helper function for `spark.table` in dsl

### What changes were proposed in this pull request?
Add a helper function for `spark.table` in dsl

### Why are the changes needed?
to be used in tests

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46717 from zhengruifeng/dsl_read.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/connect/dsl/package.scala  | 15 +++
 1 file changed, 15 insertions(+)

diff --git 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
index a94bbf9c8f24..3edb63ee8e81 100644
--- 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -332,6 +332,21 @@ package object dsl {
   def sql(sqlText: String): Relation = {
 
Relation.newBuilder().setSql(SQL.newBuilder().setQuery(sqlText)).build()
   }
+
+  def table(name: String): Relation = {
+proto.Relation
+  .newBuilder()
+  .setRead(
+proto.Read
+  .newBuilder()
+  .setNamedTable(
+proto.Read.NamedTable
+  .newBuilder()
+  .setUnparsedIdentifier(name)
+  .build())
+  .build())
+  .build()
+  }
 }
 
 implicit class DslNAFunctions(val logicalPlan: Relation) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48395][PYTHON] Fix `StructType.treeString` for parameterized types

2024-05-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 14d3f447360b [SPARK-48395][PYTHON] Fix `StructType.treeString` for 
parameterized types
14d3f447360b is described below

commit 14d3f447360b3c8979a8cdb4c40c480a1e04
Author: Ruifeng Zheng 
AuthorDate: Thu May 23 16:12:38 2024 +0800

[SPARK-48395][PYTHON] Fix `StructType.treeString` for parameterized types

### What changes were proposed in this pull request?
this PR is a follow up of https://github.com/apache/spark/pull/46685.

### Why are the changes needed?
`StructType.treeString` uses `DataType.typeName` to generate the tree string; 
however, `typeName` in Python is a class method and cannot return the correct 
string for parameterized types.

```
In [2]: schema = StructType().add("c", CharType(10), True).add("v", VarcharType(10), True).add("d", DecimalType(10, 2), True).add("ym00", YearMonthIntervalType(0, 0)).add("ym01", YearMonthIntervalType(0, 1)).add("ym11", YearMonthIntervalType(1, 1))

In [3]: print(schema.treeString())
root
 |-- c: char (nullable = true)
 |-- v: varchar (nullable = true)
 |-- d: decimal (nullable = true)
 |-- ym00: yearmonthinterval (nullable = true)
 |-- ym01: yearmonthinterval (nullable = true)
 |-- ym11: yearmonthinterval (nullable = true)
```

it should be
```
In [4]: print(schema.treeString())
root
 |-- c: char(10) (nullable = true)
 |-- v: varchar(10) (nullable = true)
 |-- d: decimal(10,2) (nullable = true)
 |-- ym00: interval year (nullable = true)
 |-- ym01: interval year to month (nullable = true)
 |-- ym11: interval month (nullable = true)
```

### Does this PR introduce _any_ user-facing change?
no, this feature was just added and has not been released yet.

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46711 from zhengruifeng/tree_string_fix.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_types.py | 67 ++
 python/pyspark/sql/types.py| 27 --
 2 files changed, 90 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index ec07406b1191..6c64a9471363 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -41,6 +41,7 @@ from pyspark.sql.types import (
 FloatType,
 DateType,
 TimestampType,
+TimestampNTZType,
 DayTimeIntervalType,
 YearMonthIntervalType,
 CalendarIntervalType,
@@ -1411,6 +1412,72 @@ class TypesTestsMixin:
 ],
 )
 
+def test_tree_string_for_builtin_types(self):
+schema = (
+StructType()
+.add("n", NullType())
+.add("str", StringType())
+.add("c", CharType(10))
+.add("v", VarcharType(10))
+.add("bin", BinaryType())
+.add("bool", BooleanType())
+.add("date", DateType())
+.add("ts", TimestampType())
+.add("ts_ntz", TimestampNTZType())
+.add("dec", DecimalType(10, 2))
+.add("double", DoubleType())
+.add("float", FloatType())
+.add("long", LongType())
+.add("int", IntegerType())
+.add("short", ShortType())
+.add("byte", ByteType())
+.add("ym_interval_1", YearMonthIntervalType())
+.add("ym_interval_2", 
YearMonthIntervalType(YearMonthIntervalType.YEAR))
+.add(
+"ym_interval_3",
+YearMonthIntervalType(YearMonthIntervalType.YEAR, 
YearMonthIntervalType.MONTH),
+)
+.add("dt_interval_1", DayTimeIntervalType())
+.add("dt_interval_2", DayTimeIntervalType(DayTimeIntervalType.DAY))
+.add(
+"dt_interval_3",
+DayTimeIntervalType(DayTimeIntervalType.HOUR, 
DayTimeIntervalType.SECOND),
+)
+.add("cal_interval", CalendarIntervalType())
+.add("var", VariantType())
+)
+self.assertEqual(
+schema.treeString().split("\n"),
+[
+"root",
+" |-- n: void (nullable = true)",
+" |-- str: string 

(spark) branch master updated: [SPARK-48372][SPARK-45716][PYTHON] Implement `StructType.treeString`

2024-05-22 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e55875b0bbe0 [SPARK-48372][SPARK-45716][PYTHON] Implement 
`StructType.treeString`
e55875b0bbe0 is described below

commit e55875b0bbe08c435ffcb0ea034ceb95938d8729
Author: Ruifeng Zheng 
AuthorDate: Wed May 22 15:31:27 2024 +0800

[SPARK-48372][SPARK-45716][PYTHON] Implement `StructType.treeString`

### What changes were proposed in this pull request?
Implement `StructType.treeString`

### Why are the changes needed?
feature parity, this method is Scala-only before

### Does this PR introduce _any_ user-facing change?
yes

```
In [2]: schema1 = DataType.fromDDL("c1 INT, c2 STRUCT<c3: INT, c4: STRUCT<c5: INT, c6: INT>>")

In [3]: print(schema1.treeString())
root
 |-- c1: integer (nullable = true)
 |-- c2: struct (nullable = true)
 ||-- c3: integer (nullable = true)
 ||-- c4: struct (nullable = true)
 |||-- c5: integer (nullable = true)
 |||-- c6: integer (nullable = true)
```

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46685 from zhengruifeng/py_tree_string.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_types.py | 241 +
 python/pyspark/sql/types.py|  87 +++-
 python/pyspark/sql/utils.py|  54 +++-
 3 files changed, 380 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index 4d6fc499b70b..ec07406b1191 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -1170,6 +1170,247 @@ class TypesTestsMixin:
 )
 self.assertEqual(VariantType(), _parse_datatype_string("variant"))
 
+def test_tree_string(self):
+schema1 = DataType.fromDDL("c1 INT, c2 STRUCT<c3: INT, c4: STRUCT<c5: INT, c6: INT>>")
+
+self.assertEqual(
+schema1.treeString().split("\n"),
+[
+"root",
+" |-- c1: integer (nullable = true)",
+" |-- c2: struct (nullable = true)",
+" ||-- c3: integer (nullable = true)",
+" ||-- c4: struct (nullable = true)",
+" |||-- c5: integer (nullable = true)",
+" |||-- c6: integer (nullable = true)",
+"",
+],
+)
+self.assertEqual(
+schema1.treeString(-1).split("\n"),
+[
+"root",
+" |-- c1: integer (nullable = true)",
+" |-- c2: struct (nullable = true)",
+" ||-- c3: integer (nullable = true)",
+" ||-- c4: struct (nullable = true)",
+" |||-- c5: integer (nullable = true)",
+" |||-- c6: integer (nullable = true)",
+"",
+],
+)
+self.assertEqual(
+schema1.treeString(0).split("\n"),
+[
+"root",
+" |-- c1: integer (nullable = true)",
+" |-- c2: struct (nullable = true)",
+" ||-- c3: integer (nullable = true)",
+" ||-- c4: struct (nullable = true)",
+" |||-- c5: integer (nullable = true)",
+" |||-- c6: integer (nullable = true)",
+"",
+],
+)
+self.assertEqual(
+schema1.treeString(1).split("\n"),
+[
+"root",
+" |-- c1: integer (nullable = true)",
+" |-- c2: struct (nullable = true)",
+"",
+],
+)
+self.assertEqual(
+schema1.treeString(2).split("\n"),
+[
+"root",
+" |-- c1: integer (nullable = true)",
+" |-- c2: struct (nullable = true)",
+" ||-- c3: integer (nullable = true)",
+" ||-- c4: struct (nullable = true)",
+"",
+],
+)
+self.assertEqual(
+schema1.treeString(3).split("\n"),
+[
+"root",
+  

(spark) branch master updated (e702b32656bc -> a886121aee45)

2024-05-21 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e702b32656bc [SPARK-48314][SS] Don't double cache files for 
FileStreamSource using Trigger.AvailableNow
 add a886121aee45 [MINOR][TESTS] Fix `DslLogicalPlan.as`

No new revisions were added by this update.

Summary of changes:
 .../src/test/scala/org/apache/spark/sql/connect/dsl/package.scala  | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (6e6e7a00f662 -> df50d4b309b5)

2024-05-21 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6e6e7a00f662 [SPARK-48369][SQL][PYTHON][CONNECT] Add function 
`timestamp_add`
 add df50d4b309b5 [SPARK-48336][PS][CONNECT] Implement `ps.sql` in Spark 
Connect

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/sql_formatter.py | 63 +-
 .../pandas/tests/connect/test_parity_sql.py|  8 +--
 2 files changed, 39 insertions(+), 32 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-48369][SQL][PYTHON][CONNECT] Add function `timestamp_add`

2024-05-21 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6e6e7a00f662 [SPARK-48369][SQL][PYTHON][CONNECT] Add function 
`timestamp_add`
6e6e7a00f662 is described below

commit 6e6e7a00f662ae1dc7e081c9e8ec40c30ad8d3d4
Author: Ruifeng Zheng 
AuthorDate: Tue May 21 19:35:24 2024 +0800

[SPARK-48369][SQL][PYTHON][CONNECT] Add function `timestamp_add`

### What changes were proposed in this pull request?
Add function `timestamp_add`

### Why are the changes needed?
this method is missing from the DataFrame API because it is not in 
`FunctionRegistry`

### Does this PR introduce _any_ user-facing change?
yes, new method

```
>>> import datetime
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
... [(datetime.datetime(2016, 3, 11, 9, 0, 7), 2),
...  (datetime.datetime(2024, 4, 2, 9, 0, 7), 3)], ["ts", "quantity"])
>>> df.select(sf.timestamp_add("year", "quantity", "ts")).show()
+--------------------------------+
|timestampadd(year, quantity, ts)|
+--------------------------------+
|             2018-03-11 09:00:07|
|             2027-04-02 09:00:07|
+--------------------------------+
```

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46680 from zhengruifeng/func_ts_add.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala |   9 +++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   4 ++
 .../explain-results/function_timestamp_add.explain |   2 +
 .../queries/function_timestamp_add.json|  33 +++
 .../queries/function_timestamp_add.proto.bin   | Bin 0 -> 144 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |   5 ++
 .../source/reference/pyspark.sql/functions.rst |   1 +
 python/pyspark/sql/connect/functions/builtin.py|   7 +++
 python/pyspark/sql/functions/builtin.py|  63 +
 .../scala/org/apache/spark/sql/functions.scala |  10 
 .../apache/spark/sql/DataFrameFunctionsSuite.scala |   1 +
 11 files changed, 135 insertions(+)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index e886c3998658..2f459d78362b 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -5963,6 +5963,15 @@ object functions {
   def timestamp_diff(unit: String, start: Column, end: Column): Column =
 Column.fn("timestampdiff", lit(unit), start, end)
 
+  /**
+   * Adds the specified number of units to the given timestamp.
+   *
+   * @group datetime_funcs
+   * @since 4.0.0
+   */
+  def timestamp_add(unit: String, quantity: Column, ts: Column): Column =
+Column.fn("timestampadd", lit(unit), quantity, ts)
+
   /**
* Parses the `timestamp` expression with the `format` expression to a 
timestamp without time
* zone. Returns null with invalid input.
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
index e6955805d38d..49b1a5312fda 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
@@ -2309,6 +2309,10 @@ class PlanGenerationTestSuite
 fn.timestamp_diff("year", fn.col("t"), fn.col("t"))
   }
 
+  temporalFunctionTest("timestamp_add") {
+fn.timestamp_add("week", fn.col("x"), fn.col("t"))
+  }
+
   // Array of Long
   // Array of Long
   // Array of Array of Long
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_timestamp_add.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_timestamp_add.explain
new file mode 100644
index ..36dde1393cdb
--- /dev/null
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_timestamp_add.explain
@@ -0,0 +1,2 @@
+Project [timestampadd(week, cast(x#0L as int), t#0, Some(America/Los_Angeles)) 
AS timestampadd(week, x, t)#0]
++- LocalRelation , [d#0, t#0, s#0, x#0L, 

(spark) branch master updated: [MINOR][PYTHON][TESTS] Test `test_mixed_udf_and_sql` with parent `Column` class

2024-05-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d2edefac59a7 [MINOR][PYTHON][TESTS] Test `test_mixed_udf_and_sql` with 
parent `Column` class
d2edefac59a7 is described below

commit d2edefac59a7db2e07d1defb2d876ecdcd8032aa
Author: Ruifeng Zheng 
AuthorDate: Mon May 20 15:22:33 2024 +0800

[MINOR][PYTHON][TESTS] Test `test_mixed_udf_and_sql` with parent `Column` 
class

### What changes were proposed in this pull request?
Test `test_mixed_udf_and_sql` with the parent `Column` class

No other similar cases were found in the parity tests.

### Why are the changes needed?
to make this parity test exactly the same as in Spark Classic

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46660 from zhengruifeng/test_mixed_udf_and_sql.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
index 241aae50c692..451f0f68d6ee 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
@@ -15,14 +15,12 @@
 # limitations under the License.
 #
 import unittest
-from pyspark.sql.connect.column import Column
 from pyspark.sql.tests.pandas.test_pandas_udf_scalar import 
ScalarPandasUDFTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 class PandasUDFScalarParityTests(ScalarPandasUDFTestsMixin, 
ReusedConnectTestCase):
-def test_mixed_udf_and_sql(self):
-self._test_mixed_udf_and_sql(Column)
+pass
 
 
 if __name__ == "__main__":


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (52ca921113b4 -> fa8aa571ad18)

2024-05-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 52ca921113b4 [MINOR][PYTHON][TESTS] Remove unnecessary hack imports
 add fa8aa571ad18 [SPARK-48335][PYTHON][CONNECT] Make 
`_parse_datatype_string` compatible with Spark Connect

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/udf.py  | 17 ++-
 .../pyspark/sql/tests/connect/test_parity_types.py |  4 --
 python/pyspark/sql/types.py| 55 +-
 3 files changed, 39 insertions(+), 37 deletions(-)





(spark) branch master updated: [MINOR][PYTHON][TESTS] Remove unnecessary hack imports

2024-05-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 52ca921113b4 [MINOR][PYTHON][TESTS] Remove unnecessary hack imports
52ca921113b4 is described below

commit 52ca921113b4308e52298d4a9968da7257d21d00
Author: Ruifeng Zheng 
AuthorDate: Mon May 20 14:08:41 2024 +0800

[MINOR][PYTHON][TESTS] Remove unnecessary hack imports

### What changes were proposed in this pull request?
Remove unnecessary hack imports

### Why are the changes needed?
it should no longer be needed after the introduction of the parent Column class

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46656 from zhengruifeng/test_parity_column.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/connect/test_parity_column.py | 10 --
 1 file changed, 10 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_column.py 
b/python/pyspark/sql/tests/connect/test_parity_column.py
index d02fb289b7d8..a109d2ba3b58 100644
--- a/python/pyspark/sql/tests/connect/test_parity_column.py
+++ b/python/pyspark/sql/tests/connect/test_parity_column.py
@@ -17,16 +17,6 @@
 
 import unittest
 
-from pyspark.testing.connectutils import should_test_connect
-
-if should_test_connect:
-from pyspark import sql
-from pyspark.sql.connect.column import Column
-
-# This is a hack to make the Column instance comparison works in 
`ColumnTestsMixin`.
-# e.g., `isinstance(col, pyspark.sql.Column)`.
-sql.Column = Column
-
 from pyspark.sql.tests.test_column import ColumnTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
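
As a hedged aside on why the hack is no longer needed: with the parent `Column` class, an `isinstance` check against `pyspark.sql.Column` is expected to pass for Connect columns as well, roughly:

```
from pyspark.sql import Column, functions as sf

# Works under both Spark Classic and Spark Connect now that both column
# implementations share the parent pyspark.sql.Column class.
assert isinstance(sf.col("a"), Column)
```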
 





(spark) branch master updated: [SPARK-48321][CONNECT][TESTS] Avoid using deprecated methods in dsl

2024-05-17 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 15fb4787354a [SPARK-48321][CONNECT][TESTS] Avoid using deprecated 
methods in dsl
15fb4787354a is described below

commit 15fb4787354a2d5dc97afb31010beb1f3cc3b73d
Author: Ruifeng Zheng 
AuthorDate: Fri May 17 18:06:25 2024 +0800

[SPARK-48321][CONNECT][TESTS] Avoid using deprecated methods in dsl

### What changes were proposed in this pull request?
Avoid using deprecated methods in dsl

### Why are the changes needed?
`putAllRenameColumnsMap` was deprecated

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46635 from zhengruifeng/with_col_rename_dsl.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../src/test/scala/org/apache/spark/sql/connect/dsl/package.scala | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
index da9a0865b8ca..b50c04e7540f 100644
--- 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -1019,7 +1019,13 @@ package object dsl {
 WithColumnsRenamed
   .newBuilder()
   .setInput(logicalPlan)
-  .putAllRenameColumnsMap(renameColumnsMap.asJava))
+  .addAllRenames(renameColumnsMap.toSeq.map { case (k, v) =>
+WithColumnsRenamed.Rename
+  .newBuilder()
+  .setColName(k)
+  .setNewColName(v)
+  .build()
+  }.asJava))
   .build()
   }
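
For context, a hedged sketch of the user-facing API that the `Rename` messages built above correspond to; the column names are illustrative:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
# DataFrame.withColumnsRenamed (Spark 3.4+) takes a dict of old -> new names.
df.withColumnsRenamed({"a": "x", "b": "y"}).printSchema()
```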
 





(spark) branch master updated: [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str`

2024-05-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 889820c1ff39 [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable 
`DataFrameObservationParityTests.test_observe_str`
889820c1ff39 is described below

commit 889820c1ff392983c52b55d80bd8d80be22785ab
Author: Hyukjin Kwon 
AuthorDate: Fri May 17 11:57:34 2024 +0800

[SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable 
`DataFrameObservationParityTests.test_observe_str`

### What changes were proposed in this pull request?

This PR proposes to enable 
`DataFrameObservationParityTests.test_observe_str`.

### Why are the changes needed?

To make sure of the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46630 from HyukjinKwon/SPARK-41625-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/connect/test_parity_observation.py | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_observation.py 
b/python/pyspark/sql/tests/connect/test_parity_observation.py
index a7b0009357b6..e16053d5a082 100644
--- a/python/pyspark/sql/tests/connect/test_parity_observation.py
+++ b/python/pyspark/sql/tests/connect/test_parity_observation.py
@@ -25,10 +25,7 @@ class DataFrameObservationParityTests(
 DataFrameObservationTestsMixin,
 ReusedConnectTestCase,
 ):
-# TODO(SPARK-41625): Support Structured Streaming
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_observe_str(self):
-super().test_observe_str()
+pass
 
 
 if __name__ == "__main__":





(spark) branch master updated: [SPARK-48301][SQL][FOLLOWUP] Update the error message

2024-05-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b0e535217bf8 [SPARK-48301][SQL][FOLLOWUP] Update the error message
b0e535217bf8 is described below

commit b0e535217bf891f2320f2419d213e1c700e15b41
Author: Ruifeng Zheng 
AuthorDate: Fri May 17 09:56:06 2024 +0800

[SPARK-48301][SQL][FOLLOWUP] Update the error message

### What changes were proposed in this pull request?
Update the error message

### Why are the changes needed?
we don't support `CREATE PROCEDURE` in Spark; this addresses 
https://github.com/apache/spark/pull/46608#discussion_r1604205064

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46628 from zhengruifeng/nit_error.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 common/utils/src/main/resources/error/error-conditions.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 5d750ade7867..69889435b02e 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -2677,7 +2677,7 @@
   },
   "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."
+  "Cannot create a routine with both IF NOT EXISTS and REPLACE 
specified."
 ]
   },
   "CREATE_TEMP_FUNC_WITH_DATABASE" : {





(spark) branch master updated: [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE`

2024-05-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3d3d18f14ba2 [SPARK-48301][SQL] Rename 
`CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to 
`CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE`
3d3d18f14ba2 is described below

commit 3d3d18f14ba29074ca3ff8b661449ad45d84369e
Author: Ruifeng Zheng 
AuthorDate: Thu May 16 20:58:15 2024 +0800

[SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to 
`CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE`

### What changes were proposed in this pull request?
Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to 
`CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE`

### Why are the changes needed?
The `IF NOT EXISTS` + `REPLACE` restriction is standard, not specific to functions.
Rename the error condition to make it reusable.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
updated tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46608 from zhengruifeng/sql_rename_if_not_exists_replace.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 common/utils/src/main/resources/error/error-conditions.json   | 4 ++--
 .../main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala   | 2 +-
 .../scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala   | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 75067a1920f7..5d750ade7867 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."
 ]
   },
   "CREATE_TEMP_FUNC_WITH_DATABASE" : {
diff --git 
a/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala 
b/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala
index d07aa6741a14..5eafd4d915a4 100644
--- 
a/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala
+++ 
b/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala
@@ -576,7 +576,7 @@ private[sql] object QueryParsingErrors extends 
DataTypeErrorsBase {
 
   def createFuncWithBothIfNotExistsAndReplaceError(ctx: 
CreateFunctionContext): Throwable = {
 new ParseException(
-  errorClass = 
"INVALID_SQL_SYNTAX.CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE",
+  errorClass = 
"INVALID_SQL_SYNTAX.CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE",
   ctx)
   }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
index 5babce0ddb8d..29ab6e994e42 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
@@ -288,7 +288,7 @@ class QueryParsingErrorsSuite extends QueryTest with 
SharedSparkSession with SQL
 stop = 27))
   }
 
-  test("INVALID_SQL_SYNTAX.CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE: " +
+  test("INVALID_SQL_SYNTAX.CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE: " +
 "Create function with both if not exists and replace") {
 val sqlText =
   """CREATE OR REPLACE FUNCTION IF NOT EXISTS func1 as
@@ -297,7 +297,7 @@ class QueryParsingErrorsSuite extends QueryTest with 
SharedSparkSession with SQL
 
 checkError(
   exception = parseException(sqlText),
-  errorClass = 
"INVALID_SQL_SYNTAX.CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE",
+  errorClass = 
"INVALID_SQL_SYNTAX.CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE",
   sqlState = "42000",
   context = ExpectedContext(
 fragment = sqlText,
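
A hedged sketch of how the renamed error condition surfaces from PySpark; the function class and JAR path below are placeholders:

```
from pyspark.sql import SparkSession
from pyspark.errors import ParseException

spark = SparkSession.builder.getOrCreate()
try:
    # IF NOT EXISTS combined with OR REPLACE is rejected at parse time.
    spark.sql(
        "CREATE OR REPLACE FUNCTION IF NOT EXISTS func1 AS "
        "'com.example.MyUDF' USING JAR '/tmp/udf.jar'"
    )
except ParseException as e:
    print(e.getErrorClass())
    # INVALID_SQL_SYNTAX.CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE
```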





(spark) branch master updated: [SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method

2024-05-15 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d0f4533f4e79 [SPARK-48287][PS][CONNECT] Apply the builtin 
`timestamp_diff` method
d0f4533f4e79 is described below

commit d0f4533f4e797a439eb78b8214e7bbfe06d0839a
Author: Ruifeng Zheng 
AuthorDate: Thu May 16 12:15:18 2024 +0800

[SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method

### What changes were proposed in this pull request?
Apply the builtin `timestamp_diff` method

### Why are the changes needed?
`timestamp_diff` was added as a builtin method, so there is no need to maintain 
a PS-specific helper

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46595 from zhengruifeng/ps_ts_diff.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/resample.py  |  3 +--
 python/pyspark/pandas/spark/functions.py   | 18 --
 .../apache/spark/sql/api/python/PythonSQLUtils.scala   |  4 
 3 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/python/pyspark/pandas/resample.py 
b/python/pyspark/pandas/resample.py
index 9683fc4f4e7f..5557ca2af773 100644
--- a/python/pyspark/pandas/resample.py
+++ b/python/pyspark/pandas/resample.py
@@ -56,7 +56,6 @@ from pyspark.pandas.utils import (
 scol_for,
 verify_temp_column_name,
 )
-from pyspark.pandas.spark.functions import timestampdiff
 
 
 class Resampler(Generic[FrameLike], metaclass=ABCMeta):
@@ -279,7 +278,7 @@ class Resampler(Generic[FrameLike], metaclass=ABCMeta):
 truncated_ts_scol = F.date_trunc(unit_str, ts_scol)
 if isinstance(key_type, TimestampNTZType):
 truncated_ts_scol = F.to_timestamp_ntz(truncated_ts_scol)
-diff = timestampdiff(unit_str, origin_scol, truncated_ts_scol)
+diff = F.timestamp_diff(unit_str, origin_scol, truncated_ts_scol)
 mod = F.lit(0) if n == 1 else (diff % F.lit(n))
 
 if rule_code in ["h", "H"]:
diff --git a/python/pyspark/pandas/spark/functions.py 
b/python/pyspark/pandas/spark/functions.py
index 91602ae2b2b8..db1cc423078a 100644
--- a/python/pyspark/pandas/spark/functions.py
+++ b/python/pyspark/pandas/spark/functions.py
@@ -171,21 +171,3 @@ def null_index(col: Column) -> Column:
 
 sc = SparkContext._active_spark_context
 return Column(sc._jvm.PythonSQLUtils.nullIndex(col._jc))
-
-
-def timestampdiff(unit: str, start: Column, end: Column) -> Column:
-if is_remote():
-from pyspark.sql.connect.functions.builtin import 
_invoke_function_over_columns, lit
-
-return _invoke_function_over_columns(
-"timestampdiff",
-lit(unit),
-start,
-end,
-)
-
-else:
-from pyspark import SparkContext
-
-sc = SparkContext._active_spark_context
-return Column(sc._jvm.PythonSQLUtils.timestampDiff(unit, start._jc, 
end._jc))
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
index 1e42e6a5adaa..eb8c1d65a8b5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
@@ -165,10 +165,6 @@ private[sql] object PythonSQLUtils extends Logging {
 }
   }
 
-  def timestampDiff(unit: String, start: Column, end: Column): Column = {
-Column(TimestampDiff(unit, start.expr, end.expr))
-  }
-
   def pandasProduct(e: Column, ignoreNA: Boolean): Column = {
 Column(PandasProduct(e.expr, ignoreNA).toAggregateExpression(false))
   }
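
As a hedged illustration of the resample bucketing shown in `resample.py` above: truncate the timestamp to the unit, measure the whole-unit distance from an origin with the builtin `timestamp_diff`, and use `diff % n` to assign rows to buckets. The column and origin values here are made up:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT TIMESTAMP'2024-05-16 07:30:00' AS ts")
origin = F.lit("2024-05-14 00:00:00").cast("timestamp")
truncated = F.date_trunc("day", F.col("ts"))
diff = F.timestamp_diff("day", origin, truncated)  # whole days since the origin
df.select((diff % F.lit(3)).alias("offset_in_3day_bucket")).show()  # -> 2
```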





(spark) branch master updated: [SPARK-48278][PYTHON][CONNECT] Refine the string representation of `Cast`

2024-05-15 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e97ad0a44419 [SPARK-48278][PYTHON][CONNECT] Refine the string 
representation of `Cast`
e97ad0a44419 is described below

commit e97ad0a444195a6f1db551fd652225973a517571
Author: Ruifeng Zheng 
AuthorDate: Wed May 15 15:56:45 2024 +0800

[SPARK-48278][PYTHON][CONNECT] Refine the string representation of `Cast`

### What changes were proposed in this pull request?
Refine the string representation of `Cast`

### Why are the changes needed?
to make the string representation as consistent as possible with Spark Classic

### Does this PR introduce _any_ user-facing change?
Spark Classic:
```
In [1]: from pyspark.sql import functions as sf

In [2]: sf.col("a").try_cast("int")
Out[2]: Column<'TRY_CAST(a AS INT)'>
```

Spark Connect, before this PR:
```
In [1]: from pyspark.sql import functions as sf

In [2]: sf.col("a").try_cast("int")
Out[2]: Column<'(a (int))'>
```

Spark Connect, after this PR:
```
In [1]: from pyspark.sql import functions as sf

In [2]: sf.col("a").try_cast("int")
Out[2]: Column<'TRY_CAST(a AS INT)'>
```

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46585 from zhengruifeng/cast_str_repr.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/expressions.py | 14 +-
 python/pyspark/sql/tests/test_column.py   | 13 -
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/expressions.py 
b/python/pyspark/sql/connect/expressions.py
index 92dde2f3670e..4dc54793ed81 100644
--- a/python/pyspark/sql/connect/expressions.py
+++ b/python/pyspark/sql/connect/expressions.py
@@ -848,6 +848,7 @@ class CastExpression(Expression):
 ) -> None:
 super().__init__()
 self._expr = expr
+assert isinstance(data_type, (DataType, str))
 self._data_type = data_type
 if eval_mode is not None:
 assert isinstance(eval_mode, str)
@@ -873,7 +874,18 @@ class CastExpression(Expression):
 return fun
 
 def __repr__(self) -> str:
-return f"({self._expr} ({self._data_type}))"
+# We cannot guarantee the string representations be exactly the same, 
e.g.
+# str(sf.col("a").cast("long")):
+#   Column<'CAST(a AS BIGINT)'> <- Spark Classic
+#   Column<'CAST(a AS LONG)'>   <- Spark Connect
+if isinstance(self._data_type, DataType):
+str_data_type = self._data_type.simpleString().upper()
+else:
+str_data_type = str(self._data_type).upper()
+if self._eval_mode is not None and self._eval_mode == "try":
+return f"TRY_CAST({self._expr} AS {str_data_type})"
+else:
+return f"CAST({self._expr} AS {str_data_type})"
 
 
 class UnresolvedNamedLambdaVariable(Expression):
diff --git a/python/pyspark/sql/tests/test_column.py 
b/python/pyspark/sql/tests/test_column.py
index 6e5fcde57cab..8f6adb37b9d4 100644
--- a/python/pyspark/sql/tests/test_column.py
+++ b/python/pyspark/sql/tests/test_column.py
@@ -19,7 +19,7 @@
 from itertools import chain
 from pyspark.sql import Column, Row
 from pyspark.sql import functions as sf
-from pyspark.sql.types import StructType, StructField, LongType
+from pyspark.sql.types import StructType, StructField, IntegerType, LongType
 from pyspark.errors import AnalysisException, PySparkTypeError, 
PySparkValueError
 from pyspark.testing.sqlutils import ReusedSQLTestCase
 
@@ -228,6 +228,17 @@ class ColumnTestsMixin:
 message_parameters={"arg_name": "metadata"},
 )
 
+def test_cast_str_representation(self):
+self.assertEqual(str(sf.col("a").cast("int")), "Column<'CAST(a AS 
INT)'>")
+self.assertEqual(str(sf.col("a").cast("INT")), "Column<'CAST(a AS 
INT)'>")
+self.assertEqual(str(sf.col("a").cast(IntegerType())), "Column<'CAST(a 
AS INT)'>")
+self.assertEqual(str(sf.col("a").cast(LongType())), "Column<'CAST(a AS 
BIGINT)'>")
+
+self.assertEqual(str(sf.col("a").try_cast("int")), "Column<'TRY_CAST(a 
AS INT)'>

(spark) branch master updated: [SPARK-48272][SQL][PYTHON][CONNECT] Add function `timestamp_diff`

2024-05-15 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c03ebb467ec2 [SPARK-48272][SQL][PYTHON][CONNECT] Add function 
`timestamp_diff`
c03ebb467ec2 is described below

commit c03ebb467ec268d894f3d97bea388129a840f5cf
Author: Ruifeng Zheng 
AuthorDate: Wed May 15 15:55:09 2024 +0800

[SPARK-48272][SQL][PYTHON][CONNECT] Add function `timestamp_diff`

### What changes were proposed in this pull request?
Add function `timestamp_diff`, by reusing existing proto

https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L1971-L1974

### Why are the changes needed?
this method is missing in the DataFrame API because it is not registered in 
`FunctionRegistry`

### Does this PR introduce _any_ user-facing change?
yes, new method

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46576 from zhengruifeng/df_ts_diff.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala |  10 
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   4 ++
 .../function_timestamp_diff.explain|   2 +
 .../queries/function_timestamp_diff.json   |  33 
 .../queries/function_timestamp_diff.proto.bin  | Bin 0 -> 145 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  10 ++--
 .../source/reference/pyspark.sql/functions.rst |   1 +
 python/pyspark/sql/connect/functions/builtin.py|   7 +++
 python/pyspark/sql/functions/builtin.py|  60 +
 .../scala/org/apache/spark/sql/functions.scala |  11 
 .../apache/spark/sql/DataFrameFunctionsSuite.scala |   3 +-
 11 files changed, 135 insertions(+), 6 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index bf41ada97916..c537f535c6b2 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -5953,6 +5953,16 @@ object functions {
*/
   def timestamp_micros(e: Column): Column = Column.fn("timestamp_micros", e)
 
+  /**
+   * Gets the difference between the timestamps in the specified units by 
truncating the fraction
+   * part.
+   *
+   * @group datetime_funcs
+   * @since 4.0.0
+   */
+  def timestamp_diff(unit: String, start: Column, end: Column): Column =
+Column.fn("timestampdiff", lit(unit), start, end)
+
   /**
* Parses the `timestamp` expression with the `format` expression to a 
timestamp without time
* zone. Returns null with invalid input.
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
index 144b45bdfd31..e6955805d38d 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
@@ -2305,6 +2305,10 @@ class PlanGenerationTestSuite
 fn.timestamp_micros(fn.col("x"))
   }
 
+  temporalFunctionTest("timestamp_diff") {
+fn.timestamp_diff("year", fn.col("t"), fn.col("t"))
+  }
+
   // Array of Long
   // Array of Long
   // Array of Array of Long
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_timestamp_diff.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_timestamp_diff.explain
new file mode 100644
index ..7a0a3ff8c53d
--- /dev/null
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_timestamp_diff.explain
@@ -0,0 +1,2 @@
+Project [timestampdiff(year, t#0, t#0, Some(America/Los_Angeles)) AS 
timestampdiff(year, t, t)#0L]
++- LocalRelation , [d#0, t#0, s#0, x#0L, wt#0]
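
For illustration, a minimal PySpark sketch of the new API (this change also adds `pyspark.sql.functions.timestamp_diff`); the timestamps are made up:

```
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.sql(
    "SELECT TIMESTAMP'2023-01-01 00:00:00' AS ts_start, "
    "TIMESTAMP'2024-06-01 00:00:00' AS ts_end"
)
# Whole years between the two timestamps, truncating the fractional part.
df.select(sf.timestamp_diff("year", sf.col("ts_start"), sf.col("ts_end")).alias("years")).show()
# -> 1
```
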
diff --git 
a/connector/connect/common/src/test/resources/query-tests/queries/function_timestamp_diff.json
 
b/connector/connect/common/src/test/resources/query-tests/queries/function_timestamp_diff.json
new file mode 100644
index ..635cbb45460e
--- /dev/null
+++ 
b/connector/connect/common/src/test/resources/query-tests/queries/function_timestamp_diff.json
@@ -0,0 +1,33 @@
+{
+  "common": {
+"planId": "1"
+  },
+  "project": {
+"input": {
+ 

(spark) branch master updated: [SPARK-48276][PYTHON][CONNECT] Add the missing `__repr__` method for `SQLExpression`

2024-05-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d31161b27404 [SPARK-48276][PYTHON][CONNECT] Add the missing `__repr__` 
method for `SQLExpression`
d31161b27404 is described below

commit d31161b27404219169345c716d7b7fe20356085d
Author: Ruifeng Zheng 
AuthorDate: Wed May 15 11:42:33 2024 +0800

[SPARK-48276][PYTHON][CONNECT] Add the missing `__repr__` method for 
`SQLExpression`

### What changes were proposed in this pull request?
1. Add the missing `__repr__` method for `SQLExpression`
2. Also adjust the output of `lit(None)` from `None` to `NULL`, to be more 
consistent with Spark Classic

### Why are the changes needed?
bug fix; all expressions should implement the `__repr__` method.

```
In [2]: from pyspark.sql.functions import when, lit, expr

In [3]: expression = expr("foo")

In [4]: when(expression, lit(None))
Out[4]: 
---
TypeError Traceback (most recent call last)
File 
~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/core/formatters.py:711,
 in PlainTextFormatter.__call__(self, obj)
704 stream = StringIO()
705 printer = pretty.RepresentationPrinter(stream, self.verbose,
706 self.max_width, self.newline,
707 max_seq_length=self.max_seq_length,
708 singleton_pprinters=self.singleton_printers,
709 type_pprinters=self.type_printers,
710 deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
712 printer.flush()
713 return stream.getvalue()

File 
~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:411,
 in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if cls is not object \
410 and callable(cls.__dict__.get('__repr__')):
--> 411 return _repr_pprint(obj, self, cycle)
413 return _default_pprint(obj, self, cycle)
414 finally:

File 
~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:779,
 in _repr_pprint(obj, p, cycle)
777 """A pprint that just redirects to the normal repr function."""
778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
780 lines = output.splitlines()
781 with p.group():

File ~/Dev/spark/python/pyspark/sql/connect/column.py:441, in 
Column.__repr__(self)
440 def __repr__(self) -> str:
--> 441 return "Column<'%s'>" % self._expr.__repr__()

File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:148, in 
CaseWhen.__repr__(self)
147 def __repr__(self) -> str:
--> 148 _cases = "".join([f" WHEN {c} THEN {v}" for c, v in 
self._branches])
149 _else = f" ELSE {self._else_value}" if self._else_value is not 
None else ""
150 return "CASE" + _cases + _else + " END"

TypeError: __str__ returned non-string (type NoneType)
```

### Does this PR introduce _any_ user-facing change?
yes

```
In [3]: from pyspark.sql.functions import when, lit, expr

In [4]: expression = expr("foo")

In [5]: when_cond = when(expression, lit(None))

In [6]: when_cond
Out[6]: Column<'CASE WHEN foo THEN NULL END'>

In [7]: str(when_cond)
Out[7]: "Column<'CASE WHEN foo THEN NULL END'>"
```

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46583 from zhengruifeng/expr_repr.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/expressions.py | 9 -
 python/pyspark/sql/tests/test_column.py   | 5 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/expressions.py 
b/python/pyspark/sql/connect/expressions.py
index b1735f65f520..92dde2f3670e 100644
--- a/python/pyspark/sql/connect/expressions.py
+++ b/python/pyspark/sql/connect/expressions.py
@@ -455,7 +455,10 @@ class LiteralExpression(Expression):
 return expr
 
 def __repr__(self) -> str:
-return f"{self._value}"
+if self._value is None:
+return "NULL"
+else:
+return f"{self._value}"
 
 
 class Colu

(spark) branch master updated: [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references

2024-05-13 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4cc589afe5b5 [SPARK-41794][FOLLOWUP] Add `try_remainder` to python API 
references
4cc589afe5b5 is described below

commit 4cc589afe5b5f23442fcacbe149a8ab3057889dc
Author: Ruifeng Zheng 
AuthorDate: Tue May 14 11:52:39 2024 +0800

[SPARK-41794][FOLLOWUP] Add `try_remainder` to python API references

### What changes were proposed in this pull request?
Add `try_remainder` to python API references

### Why are the changes needed?
new methods should be added to API references

### Does this PR introduce _any_ user-facing change?
doc changes

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46566 from zhengruifeng/doc_try_remainder.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/docs/source/reference/pyspark.sql/functions.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/python/docs/source/reference/pyspark.sql/functions.rst 
b/python/docs/source/reference/pyspark.sql/functions.rst
index 40af3d52e653..fb3273bf95e7 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -143,6 +143,7 @@ Mathematical Functions
 try_add
 try_divide
 try_multiply
+try_remainder
 try_subtract
 unhex
 width_bucket
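
A hedged usage sketch for the newly documented function, assuming `pyspark.sql.functions.try_remainder` behaves like `%` but returns NULL instead of failing on division by zero:

```
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10, 3), (10, 0)], ["a", "b"])
df.select(sf.try_remainder("a", "b").alias("r")).show()
# 10 % 3 -> 1, 10 % 0 -> NULL
```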





(spark) branch master updated: [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl

2024-05-13 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 28cf3db77932 [SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl
28cf3db77932 is described below

commit 28cf3db779322a487d26fa17282889e217f2d6b5
Author: Ruifeng Zheng 
AuthorDate: Tue May 14 10:16:21 2024 +0800

[SPARK-48259][CONNECT][TESTS] Add 3 missing methods in dsl

### What changes were proposed in this pull request?
Add 3 missing methods in dsl

### Why are the changes needed?
those methods could be used in tests

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46559 from zhengruifeng/missing_3_func.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../org/apache/spark/sql/connect/dsl/package.scala | 27 ++
 1 file changed, 27 insertions(+)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
index 6aadb6c34b77..da9a0865b8ca 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -513,6 +513,25 @@ package object dsl {
 freqItems(cols.toArray, support)
 
   def freqItems(cols: Seq[String]): Relation = freqItems(cols, 0.01)
+
+  def sampleBy(col: String, fractions: Map[Any, Double], seed: Long): 
Relation = {
+Relation
+  .newBuilder()
+  .setSampleBy(
+StatSampleBy
+  .newBuilder()
+  .setInput(logicalPlan)
+  .addAllFractions(fractions.toSeq.map { case (k, v) =>
+StatSampleBy.Fraction
+  .newBuilder()
+  .setStratum(toLiteralProto(k))
+  .setFraction(v)
+  .build()
+  }.asJava)
+  .setSeed(seed)
+  .build())
+  .build()
+  }
 }
 
 def select(exprs: Expression*): Relation = {
@@ -587,6 +606,10 @@ package object dsl {
   .build()
   }
 
+  def filter(condition: Expression): Relation = {
+where(condition)
+  }
+
   def deduplicate(colNames: Seq[String]): Relation =
 Relation
   .newBuilder()
@@ -641,6 +664,10 @@ package object dsl {
 join(otherPlan, joinType, usingColumns, None)
   }
 
+  def crossJoin(otherPlan: Relation): Relation = {
+join(otherPlan, JoinType.JOIN_TYPE_CROSS, Seq(), None)
+  }
+
   private def join(
   otherPlan: Relation,
   joinType: JoinType = JoinType.JOIN_TYPE_INNER,
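
For context, a hedged sketch of the public DataFrame APIs that these test-only dsl helpers mirror (`sampleBy`, `filter`, `crossJoin`); the data and fractions are illustrative:

```
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("key", (sf.col("id") % 2).cast("int"))
sampled = df.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
filtered = df.filter(sf.col("id") > 50)
crossed = df.limit(3).crossJoin(spark.range(2))
print(sampled.count(), filtered.count(), crossed.count())
```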





(spark) branch master updated: [MINOR][PYTHON][TESTS] Move test `test_named_arguments_negative` to `test_arrow_python_udf`

2024-05-12 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cae2248bc13d [MINOR][PYTHON][TESTS] Move test 
`test_named_arguments_negative` to `test_arrow_python_udf`
cae2248bc13d is described below

commit cae2248bc13d8bde7c48a1d7479df68bcd31fbf1
Author: Ruifeng Zheng 
AuthorDate: Mon May 13 11:09:44 2024 +0800

[MINOR][PYTHON][TESTS] Move test `test_named_arguments_negative` to 
`test_arrow_python_udf`

### What changes were proposed in this pull request?
Move test `test_named_arguments_negative` to `test_arrow_python_udf`

### Why are the changes needed?
it seems it was added in the wrong place; it only runs in Spark Connect, not 
Spark Classic.
After this PR, it will also run in Spark Classic

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46544 from zhengruifeng/move_test_named_arguments_negative.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../tests/connect/test_parity_arrow_python_udf.py  | 26 --
 python/pyspark/sql/tests/test_arrow_python_udf.py  | 24 +++-
 2 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py 
b/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py
index fa329b598d98..732008eb05a3 100644
--- a/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py
+++ b/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py
@@ -15,10 +15,6 @@
 # limitations under the License.
 #
 
-import unittest
-
-from pyspark.errors import AnalysisException, PythonException
-from pyspark.sql.functions import udf
 from pyspark.sql.tests.connect.test_parity_udf import UDFParityTests
 from pyspark.sql.tests.test_arrow_python_udf import PythonUDFArrowTestsMixin
 
@@ -36,28 +32,6 @@ class ArrowPythonUDFParityTests(UDFParityTests, 
PythonUDFArrowTestsMixin):
 finally:
 super(ArrowPythonUDFParityTests, cls).tearDownClass()
 
-def test_named_arguments_negative(self):
-@udf("int")
-def test_udf(a, b):
-return a + b
-
-self.spark.udf.register("test_udf", test_udf)
-
-with self.assertRaisesRegex(
-AnalysisException,
-
"DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.DOUBLE_NAMED_ARGUMENT_REFERENCE",
-):
-self.spark.sql("SELECT test_udf(a => id, a => id * 10) FROM 
range(2)").show()
-
-with self.assertRaisesRegex(AnalysisException, 
"UNEXPECTED_POSITIONAL_ARGUMENT"):
-self.spark.sql("SELECT test_udf(a => id, id * 10) FROM 
range(2)").show()
-
-with self.assertRaises(PythonException):
-self.spark.sql("SELECT test_udf(c => 'x') FROM range(2)").show()
-
-with self.assertRaises(PythonException):
-self.spark.sql("SELECT test_udf(id, a => id * 10) FROM 
range(2)").show()
-
 
 if __name__ == "__main__":
 import unittest
diff --git a/python/pyspark/sql/tests/test_arrow_python_udf.py 
b/python/pyspark/sql/tests/test_arrow_python_udf.py
index 23f302ec3c8d..5a66d61cb66a 100644
--- a/python/pyspark/sql/tests/test_arrow_python_udf.py
+++ b/python/pyspark/sql/tests/test_arrow_python_udf.py
@@ -17,7 +17,7 @@
 
 import unittest
 
-from pyspark.errors import PythonException, PySparkNotImplementedError
+from pyspark.errors import AnalysisException, PythonException, 
PySparkNotImplementedError
 from pyspark.sql import Row
 from pyspark.sql.functions import udf
 from pyspark.sql.tests.test_udf import BaseUDFTestsMixin
@@ -197,6 +197,28 @@ class PythonUDFArrowTestsMixin(BaseUDFTestsMixin):
 " without arguments.",
 )
 
+def test_named_arguments_negative(self):
+@udf("int")
+def test_udf(a, b):
+return a + b
+
+self.spark.udf.register("test_udf", test_udf)
+
+with self.assertRaisesRegex(
+AnalysisException,
+
"DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.DOUBLE_NAMED_ARGUMENT_REFERENCE",
+):
+self.spark.sql("SELECT test_udf(a => id, a => id * 10) FROM 
range(2)").show()
+
+with self.assertRaisesRegex(AnalysisException, 
"UNEXPECTED_POSITIONAL_ARGUMENT"):
+self.spark.sql("SELECT test_udf(a => id, id * 10) FROM 
range(2)").show()
+
+with self.assertRaises(PythonException):
+self.spark.sql("SELECT test_udf(c => 'x') FROM range(2)").show()
+
+with 
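
For completeness, a hedged positive counterpart to the negative cases above, showing named arguments for a registered Python UDF in SQL (Spark 3.5+):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

@udf("int")
def test_udf(a, b):
    return a + b

spark.udf.register("test_udf", test_udf)
spark.sql("SELECT test_udf(a => id, b => id * 10) FROM range(2)").show()
```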

(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX

2024-05-10 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 259760a5c5e2 [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply 
`_validate_pandas_udf` in MapInXXX
259760a5c5e2 is described below

commit 259760a5c5e26e33b2ee46282aeb63e4ea701020
Author: Ruifeng Zheng 
AuthorDate: Fri May 10 18:44:53 2024 +0800

[SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` 
in MapInXXX

### What changes were proposed in this pull request?
Also apply `_validate_pandas_udf`  in MapInXXX

### Why are the changes needed?
to make sure validation in `pandas_udf` is also applied in MapInXXX

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46524 from zhengruifeng/missing_check_map_in_xxx.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 3c9415adec2d..ccaaa15f3190 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -83,6 +83,7 @@ from pyspark.sql.connect.expressions import (
 )
 from pyspark.sql.connect.functions import builtin as F
 from pyspark.sql.pandas.types import from_arrow_schema
+from pyspark.sql.pandas.functions import _validate_pandas_udf  # type: 
ignore[attr-defined]
 
 
 if TYPE_CHECKING:
@@ -1997,6 +1998,7 @@ class DataFrame(ParentDataFrame):
 ) -> ParentDataFrame:
 from pyspark.sql.connect.udf import UserDefinedFunction
 
+_validate_pandas_udf(func, evalType)
 udf_obj = UserDefinedFunction(
 func,
 returnType=schema,





(spark) branch master updated: [SPARK-48190][PYTHON][PS][TESTS] Introduce a helper function to drop metadata

2024-05-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d7f69e7003a3 [SPARK-48190][PYTHON][PS][TESTS] Introduce a helper 
function to drop metadata
d7f69e7003a3 is described below

commit d7f69e7003a3c7e7ad22a39e6aaacd183d26d326
Author: Ruifeng Zheng 
AuthorDate: Wed May 8 18:48:21 2024 +0800

[SPARK-48190][PYTHON][PS][TESTS] Introduce a helper function to drop 
metadata

### What changes were proposed in this pull request?
Introduce a helper function to drop metadata

### Why are the changes needed?
the existing helper function `remove_metadata` in PS doesn't support nested 
types, so it cannot be reused in other places

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46466 from zhengruifeng/py_drop_meta.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/pandas/internal.py   | 17 +++--
 .../pyspark/sql/tests/connect/test_connect_function.py  | 11 +--
 python/pyspark/sql/types.py | 13 +
 3 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/python/pyspark/pandas/internal.py 
b/python/pyspark/pandas/internal.py
index 767ec9a57f9b..8ab8d79d5686 100644
--- a/python/pyspark/pandas/internal.py
+++ b/python/pyspark/pandas/internal.py
@@ -33,6 +33,7 @@ from pyspark.sql import (
 Window,
 )
 from pyspark.sql.types import (  # noqa: F401
+_drop_metadata,
 BooleanType,
 DataType,
 LongType,
@@ -761,14 +762,8 @@ class InternalFrame:
 # in a few tests when using Spark Connect. However, the 
function works properly.
 # Therefore, we temporarily perform Spark Connect tests by 
excluding metadata
 # until the issue is resolved.
-def remove_metadata(struct_field: StructField) -> StructField:
-new_struct_field = StructField(
-struct_field.name, struct_field.dataType, 
struct_field.nullable
-)
-return new_struct_field
-
 assert all(
-remove_metadata(index_field.struct_field) == 
remove_metadata(struct_field)
+_drop_metadata(index_field.struct_field) == 
_drop_metadata(struct_field)
 for index_field, struct_field in zip(index_fields, 
struct_fields)
 ), (index_fields, struct_fields)
 else:
@@ -795,14 +790,8 @@ class InternalFrame:
 # in a few tests when using Spark Connect. However, the 
function works properly.
 # Therefore, we temporarily perform Spark Connect tests by 
excluding metadata
 # until the issue is resolved.
-def remove_metadata(struct_field: StructField) -> StructField:
-new_struct_field = StructField(
-struct_field.name, struct_field.dataType, 
struct_field.nullable
-)
-return new_struct_field
-
 assert all(
-remove_metadata(data_field.struct_field) == 
remove_metadata(struct_field)
+_drop_metadata(data_field.struct_field) == 
_drop_metadata(struct_field)
 for data_field, struct_field in zip(data_fields, 
struct_fields)
 ), (data_fields, struct_fields)
 else:
diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py 
b/python/pyspark/sql/tests/connect/test_connect_function.py
index 9d4db8cf7d15..0f0abfd4b856 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -21,7 +21,14 @@ from inspect import getmembers, isfunction
 from pyspark.util import is_remote_only
 from pyspark.errors import PySparkTypeError, PySparkValueError
 from pyspark.sql import SparkSession as PySparkSession
-from pyspark.sql.types import StringType, StructType, StructField, ArrayType, 
IntegerType
+from pyspark.sql.types import (
+_drop_metadata,
+StringType,
+StructType,
+StructField,
+ArrayType,
+IntegerType,
+)
 from pyspark.testing import assertDataFrameEqual
 from pyspark.testing.pandasutils import PandasOnSparkTestUtils
 from pyspark.testing.connectutils import ReusedConnectTestCase, 
should_test_connect
@@ -1668,7 +1675,7 @@ class SparkConnectFunctionTests(ReusedConnectTestCase, 
PandasOnSparkTestUtils, S
 )
 
 # TODO: 'cdf.schema' has an extra metadata '{'__autoGeneratedAlias': 
'true'}'
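
A hedged sketch (not the exact `pyspark.sql.types._drop_metadata`) of what a nested-type-aware metadata-dropping helper looks like, which is what the assertions above rely on:

```
from pyspark.sql.types import ArrayType, DataType, MapType, StructField, StructType

def drop_metadata(dt: DataType) -> DataType:
    # Recursively rebuild the type with each StructField's metadata omitted.
    if isinstance(dt, StructType):
        return StructType(
            [StructField(f.name, drop_metadata(f.dataType), f.nullable) for f in dt.fields]
        )
    if isinstance(dt, ArrayType):
        return ArrayType(drop_metadata(dt.elementType), dt.containsNull)
    if isinstance(dt, MapType):
        return MapType(drop_metadata(dt.keyType), drop_metadata(dt.valueType), dt.valueContainsNull)
    return dt
```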

(spark) branch master updated: [SPARK-48058][SPARK-43727][PYTHON][CONNECT][TESTS][FOLLOWUP] Code clean up

2024-05-07 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 98d4ab734de0 
[SPARK-48058][SPARK-43727][PYTHON][CONNECT][TESTS][FOLLOWUP] Code clean up
98d4ab734de0 is described below

commit 98d4ab734de05eb5eec83011ed965cfb5b51e4b5
Author: Ruifeng Zheng 
AuthorDate: Tue May 7 15:35:54 2024 +0800

[SPARK-48058][SPARK-43727][PYTHON][CONNECT][TESTS][FOLLOWUP] Code clean up

### What changes were proposed in this pull request?
after https://github.com/apache/spark/pull/46300, the two tests are 
actually the same as


https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/python/pyspark/sql/tests/pandas/test_pandas_udf.py#L110-L125

and


https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/python/pyspark/sql/tests/pandas/test_pandas_udf.py#L55-L70

So no need to override them in the parity tests

### Why are the changes needed?
clean up

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46429 from zhengruifeng/return_type_followup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../sql/tests/connect/test_parity_pandas_udf.py| 35 +-
 1 file changed, 1 insertion(+), 34 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_pandas_udf.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_udf.py
index b732a875fb0a..7f280a009f78 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_udf.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_udf.py
@@ -14,8 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-from pyspark.sql.functions import pandas_udf, PandasUDFType
-from pyspark.sql.types import DoubleType, StructType, StructField
+
 from pyspark.sql.tests.pandas.test_pandas_udf import PandasUDFTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
@@ -24,38 +23,6 @@ class PandasUDFParityTests(PandasUDFTestsMixin, 
ReusedConnectTestCase):
 def test_udf_wrong_arg(self):
 self.check_udf_wrong_arg()
 
-def test_pandas_udf_decorator_with_return_type_string(self):
-@pandas_udf("v double", PandasUDFType.GROUPED_MAP)
-def foo(x):
-return x
-
-self.assertEqual(foo.returnType, StructType([StructField("v", 
DoubleType(), True)]))
-self.assertEqual(foo.evalType, PandasUDFType.GROUPED_MAP)
-
-@pandas_udf(returnType="double", functionType=PandasUDFType.SCALAR)
-def foo(x):
-return x
-
-self.assertEqual(foo.returnType, DoubleType())
-self.assertEqual(foo.evalType, PandasUDFType.SCALAR)
-
-def test_pandas_udf_basic_with_return_type_string(self):
-udf = pandas_udf(lambda x: x, "double", PandasUDFType.SCALAR)
-self.assertEqual(udf.returnType, DoubleType())
-self.assertEqual(udf.evalType, PandasUDFType.SCALAR)
-
-udf = pandas_udf(lambda x: x, "v double", PandasUDFType.GROUPED_MAP)
-self.assertEqual(udf.returnType, StructType([StructField("v", 
DoubleType(), True)]))
-self.assertEqual(udf.evalType, PandasUDFType.GROUPED_MAP)
-
-udf = pandas_udf(lambda x: x, "v double", 
functionType=PandasUDFType.GROUPED_MAP)
-self.assertEqual(udf.returnType, StructType([StructField("v", 
DoubleType(), True)]))
-self.assertEqual(udf.evalType, PandasUDFType.GROUPED_MAP)
-
-udf = pandas_udf(lambda x: x, returnType="v double", 
functionType=PandasUDFType.GROUPED_MAP)
-self.assertEqual(udf.returnType, StructType([StructField("v", 
DoubleType(), True)]))
-self.assertEqual(udf.evalType, PandasUDFType.GROUPED_MAP)
-
 
 if __name__ == "__main__":
 import unittest





(spark) branch master updated (05b22ebb3060 -> 3b9f52f7768a)

2024-05-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 05b22ebb3060 [SPARK-48141][TEST] Update the Oracle docker image 
version used for test and integration to use Oracle Database 23ai Free
 add 3b9f52f7768a [SPARK-48154][PYTHON][CONNECT][TESTS] Enable 
`PandasUDFGroupedAggParityTests.test_manual`

No new revisions were added by this update.

Summary of changes:
 .../sql/tests/connect/test_parity_pandas_udf_grouped_agg.py  | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)





(spark) branch master updated: [SPARK-48142][PYTHON][CONNECT][TESTS] Enable `CogroupedApplyInPandasTests.test_wrong_args`

2024-05-06 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2ef7246b9c5b [SPARK-48142][PYTHON][CONNECT][TESTS] Enable 
`CogroupedApplyInPandasTests.test_wrong_args`
2ef7246b9c5b is described below

commit 2ef7246b9c5b39b16cf9a37d7fc84a233362967c
Author: Ruifeng Zheng 
AuthorDate: Tue May 7 09:15:31 2024 +0800

[SPARK-48142][PYTHON][CONNECT][TESTS] Enable 
`CogroupedApplyInPandasTests.test_wrong_args`

### What changes were proposed in this pull request?
Enable `CogroupedApplyInPandasTests.test_wrong_args` by including a missing 
check

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46397 from zhengruifeng/fix_pandas_udf_check.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/group.py  | 2 ++
 python/pyspark/sql/pandas/functions.py   | 9 -
 .../sql/tests/connect/test_parity_pandas_cogrouped_map.py| 9 +
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/connect/group.py 
b/python/pyspark/sql/connect/group.py
index 699dce76c4a1..c916e8acf3e4 100644
--- a/python/pyspark/sql/connect/group.py
+++ b/python/pyspark/sql/connect/group.py
@@ -398,7 +398,9 @@ class PandasCogroupedOps:
 ) -> "DataFrame":
 from pyspark.sql.connect.udf import UserDefinedFunction
 from pyspark.sql.connect.dataframe import DataFrame
+from pyspark.sql.pandas.functions import _validate_pandas_udf  # type: 
ignore[attr-defined]
 
+_validate_pandas_udf(func, schema, 
PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF)
 udf_obj = UserDefinedFunction(
 func,
 returnType=schema,
diff --git a/python/pyspark/sql/pandas/functions.py 
b/python/pyspark/sql/pandas/functions.py
index 62d365a3b2a1..5922a5ced863 100644
--- a/python/pyspark/sql/pandas/functions.py
+++ b/python/pyspark/sql/pandas/functions.py
@@ -431,7 +431,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
 return _create_pandas_udf(f=f, returnType=return_type, 
evalType=eval_type)
 
 
-def _create_pandas_udf(f, returnType, evalType):
+# validate the pandas udf and return the adjusted eval type
+def _validate_pandas_udf(f, returnType, evalType) -> int:
 argspec = getfullargspec(f)
 
 # pandas UDF by type hints.
@@ -528,6 +529,12 @@ def _create_pandas_udf(f, returnType, evalType):
 },
 )
 
+return evalType
+
+
+def _create_pandas_udf(f, returnType, evalType):
+evalType = _validate_pandas_udf(f, returnType, evalType)
+
 if is_remote():
 from pyspark.sql.connect.udf import _create_udf as _create_connect_udf
 
diff --git 
a/python/pyspark/sql/tests/connect/test_parity_pandas_cogrouped_map.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_cogrouped_map.py
index 708960dd47d4..00d71bda2d93 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_cogrouped_map.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_cogrouped_map.py
@@ -20,10 +20,11 @@ from pyspark.sql.tests.pandas.test_pandas_cogrouped_map 
import CogroupedApplyInP
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
-class CogroupedApplyInPandasTests(CogroupedApplyInPandasTestsMixin, 
ReusedConnectTestCase):
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_wrong_args(self):
-self.check_wrong_args()
+class CogroupedApplyInPandasTests(
+CogroupedApplyInPandasTestsMixin,
+ReusedConnectTestCase,
+):
+pass
 
 
 if __name__ == "__main__":
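
For context, a hedged usage sketch of the cogrouped `applyInPandas` API that `_validate_pandas_udf` now guards on the Connect path; the data and schema are illustrative:

```
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 10), (2, 20)], ["id", "v1"])
df2 = spark.createDataFrame([(1, 100), (2, 200)], ["id", "v2"])

def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge(left, right, on="id")

df1.groupBy("id").cogroup(df2.groupBy("id")).applyInPandas(
    merge, schema="id long, v1 long, v2 long"
).show()
```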





(spark) branch master updated: [SPARK-48055][PYTHON][CONNECT][TESTS] Enable `PandasUDFScalarParityTests.{test_vectorized_udf_empty_partition, test_vectorized_udf_struct_with_empty_partition}`

2024-04-29 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ed5aa56f1200 [SPARK-48055][PYTHON][CONNECT][TESTS] Enable 
`PandasUDFScalarParityTests.{test_vectorized_udf_empty_partition, 
test_vectorized_udf_struct_with_empty_partition}`
ed5aa56f1200 is described below

commit ed5aa56f1200bc1b0a455269eeb57863b2043fa1
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 30 14:37:30 2024 +0800

[SPARK-48055][PYTHON][CONNECT][TESTS] Enable 
`PandasUDFScalarParityTests.{test_vectorized_udf_empty_partition, 
test_vectorized_udf_struct_with_empty_partition}`

### What changes were proposed in this pull request?
enable two test in `PandasUDFScalarParityTests`

### Why are the changes needed?
test coverage

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46296 from zhengruifeng/enable_test_vectorized_udf_empty_partition.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../sql/tests/connect/test_parity_pandas_udf_scalar.py| 11 ---
 python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py |  8 +---
 2 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
index b42bfaf0f58d..590ab695ee07 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_scalar.py
@@ -21,17 +21,6 @@ from pyspark.testing.connectutils import 
ReusedConnectTestCase
 
 
 class PandasUDFScalarParityTests(ScalarPandasUDFTestsMixin, 
ReusedConnectTestCase):
-def test_nondeterministic_vectorized_udf_in_aggregate(self):
-self.check_nondeterministic_analysis_exception()
-
-@unittest.skip("Spark Connect doesn't support RDD but the test depends on 
it.")
-def test_vectorized_udf_empty_partition(self):
-super().test_vectorized_udf_empty_partition()
-
-@unittest.skip("Spark Connect doesn't support RDD but the test depends on 
it.")
-def test_vectorized_udf_struct_with_empty_partition(self):
-super().test_vectorized_udf_struct_with_empty_partition()
-
 # TODO(SPARK-43727): Parity returnType check in Spark Connect
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_vectorized_udf_wrong_return_type(self):
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py 
b/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
index 9edd585da6a0..38bc633cd1ed 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py
@@ -764,15 +764,17 @@ class ScalarPandasUDFTestsMixin:
 self.assertEqual(df.collect(), res.collect())
 
 def test_vectorized_udf_empty_partition(self):
-df = self.spark.createDataFrame(self.sc.parallelize([Row(id=1)], 2))
+df = self.spark.createDataFrame([Row(id=1)]).repartition(2)
 for udf_type in [PandasUDFType.SCALAR, PandasUDFType.SCALAR_ITER]:
 f = pandas_udf(lambda x: x, LongType(), udf_type)
 res = df.select(f(col("id")))
 self.assertEqual(df.collect(), res.collect())
 
 def test_vectorized_udf_struct_with_empty_partition(self):
-df = self.spark.createDataFrame(self.sc.parallelize([Row(id=1)], 
2)).withColumn(
-"name", lit("John Doe")
+df = (
+self.spark.createDataFrame([Row(id=1)])
+.repartition(2)
+.withColumn("name", lit("John Doe"))
 )
 
 @pandas_udf("first string, last string")

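The gist of the change: `repartition(2)` on a one-row DataFrame leaves a partition empty without touching the RDD API, so the same test body also works over Spark Connect. A minimal standalone sketch:

```
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# A one-row DataFrame repartitioned into 2 partitions leaves at least one
# partition empty, with no use of sc.parallelize.
df = spark.createDataFrame([Row(id=1)]).repartition(2)

identity = pandas_udf(lambda s: s, LongType())
df.select(identity(col("id"))).show()
```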




(spark) branch branch-3.4 updated: [SPARK-47129][CONNECT][SQL][3.4] Make ResolveRelations cache connect plan properly

2024-04-29 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 5f58fa7738eb [SPARK-47129][CONNECT][SQL][3.4] Make ResolveRelations 
cache connect plan properly
5f58fa7738eb is described below

commit 5f58fa7738eb51d6319fdd6e95ced69f40241cb4
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 30 14:32:52 2024 +0800

[SPARK-47129][CONNECT][SQL][3.4] Make ResolveRelations cache connect plan 
properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

cherry-pick bugfix https://github.com/apache/spark/pull/45214 to 3.4

### Why are the changes needed?
bug fix for Spark Connect; it won't affect classic Spark SQL

before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", 
sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", 
sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve 
dataframe column "id". It's probably because of illegal references like 
`df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                      '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false          :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                           :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                    +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false       +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false          :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                     :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                     +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                        +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to 
the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, 
[], false
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46290 from zhengruifeng/connect_fix_read_join_34.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_readwriter.py| 21 +
 .../spark/sql/catalyst/analysis/Analyzer.scala | 27 --
 2 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/sql/tests/test_readwriter.py 
b/python/pyspark/sql/tests/test_readwriter.py
index f51b0ef06208..9113fb350f63 100644
--- a/python/pyspark/sql/tests/test_readwriter.py
+++ b/python/pyspark/sql/tests/test_readwriter.py
@@ -181,6 +181,27 @@ class ReadwriterTestsMixin:
 df.write.mode("overwrite").insertInto("test_table", False)
 self.assertEqual(6, self.spark.sql("select * from 
test_table").count())
 
+def test_cached_table(self):
+with self.table("test_cached_table_1"):
+self.spark.range(10).withColumn(
+"value_1",
+lit(1),
+ 

(spark) branch branch-3.5 updated: [SPARK-47129][CONNECT][SQL][3.5] Make ResolveRelations cache connect plan properly

2024-04-29 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9bc2ab01dde1 [SPARK-47129][CONNECT][SQL][3.5] Make ResolveRelations 
cache connect plan properly
9bc2ab01dde1 is described below

commit 9bc2ab01dde1eed9c4d6f4edd751f5bf0b28be3a
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 30 12:43:56 2024 +0800

[SPARK-47129][CONNECT][SQL][3.5] Make ResolveRelations cache connect plan 
properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

cherry-pick bugfix https://github.com/apache/spark/pull/45214 to 3.5

### Why are the changes needed?
bug fix for Spark Connect; it won't affect classic Spark SQL

before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", 
sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", 
sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve 
dataframe column "id". It's probably because of illegal references like 
`df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                      '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false          :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                           :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                    +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false       +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false          :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                     :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                     +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                        +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to 
the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, 
[], false
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46291 from zhengruifeng/connect_fix_read_join_35.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/tests/test_readwriter.py| 21 +
 .../spark/sql/catalyst/analysis/Analyzer.scala | 27 --
 2 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/sql/tests/test_readwriter.py 
b/python/pyspark/sql/tests/test_readwriter.py
index 528b88ca0c2d..921d2eba5ac7 100644
--- a/python/pyspark/sql/tests/test_readwriter.py
+++ b/python/pyspark/sql/tests/test_readwriter.py
@@ -181,6 +181,27 @@ class ReadwriterTestsMixin:
 df.write.mode("overwrite").insertInto("test_table", False)
 self.assertEqual(6, self.spark.sql("select * from 
test_table").count())
 
+def test_cached_table(self):
+with self.table("test_cached_table_1"):
+self.spark.range(10).withColumn(
+"value_1",
+lit(1),
+ 

(spark) branch master updated: [SPARK-47986][CONNECT][PYTHON] Unable to create a new session when the default session is closed by the server

2024-04-26 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7d04d0f043d2 [SPARK-47986][CONNECT][PYTHON] Unable to create a new 
session when the default session is closed by the server
7d04d0f043d2 is described below

commit 7d04d0f043d2af6b518c6567443a6a5bed7ae541
Author: Niranjan Jayakar 
AuthorDate: Fri Apr 26 15:24:02 2024 +0800

[SPARK-47986][CONNECT][PYTHON] Unable to create a new session when the 
default session is closed by the server

### What changes were proposed in this pull request?

When the server closes a session, usually after a cluster restart,
the client is unaware of this until it receives an error.

At this point, the client is unable to create a new session to the
same connect endpoint, since the stale session is still recorded
as the active and default session.

With this change, when the server communicates that the session
has changed via a GRPC error, the session and the respective client
are marked as stale. A new default connection can be created
via the session builder.
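
A condensed sketch of the resulting behavior, mirroring the unit test added below (the `sc://localhost` endpoint is illustrative):

```
from pyspark.sql.connect.session import SparkSession as RemoteSparkSession

session = RemoteSparkSession.builder.remote("sc://localhost").getOrCreate()
session.range(3).collect()

# If the server later reports INVALID_HANDLE.SESSION_CHANGED, the client is
# marked stale, and getOrCreate() builds a fresh session instead of handing
# back the dead default one.
session = RemoteSparkSession.builder.remote("sc://localhost").getOrCreate()
session.range(3).collect()
```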

### Why are the changes needed?

See section above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Attached unit tests

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46221 from nija-at/session-expires.

Authored-by: Niranjan Jayakar 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/client/core.py|  3 +++
 python/pyspark/sql/connect/session.py|  4 ++--
 python/pyspark/sql/tests/connect/test_connect_session.py | 14 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/client/core.py 
b/python/pyspark/sql/connect/client/core.py
index 0bdfb4bb7910..badd9a33397e 100644
--- a/python/pyspark/sql/connect/client/core.py
+++ b/python/pyspark/sql/connect/client/core.py
@@ -1763,6 +1763,9 @@ class SparkConnectClient(object):
 info = error_details_pb2.ErrorInfo()
 d.Unpack(info)
 
+if info.metadata["errorClass"] == 
"INVALID_HANDLE.SESSION_CHANGED":
+self._closed = True
+
 raise convert_exception(
 info,
 status.message,
diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index 5e677efe6ca6..eb7a546ca18d 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -237,9 +237,9 @@ class SparkSession:
 def getOrCreate(self) -> "SparkSession":
 with SparkSession._lock:
 session = SparkSession.getActiveSession()
-if session is None:
+if session is None or session.is_stopped:
 session = SparkSession._default_session
-if session is None:
+if session is None or session.is_stopped:
 session = self.create()
 self._apply_options(session)
 return session
diff --git a/python/pyspark/sql/tests/connect/test_connect_session.py 
b/python/pyspark/sql/tests/connect/test_connect_session.py
index 1caf3525cfbb..c5ce697a9561 100644
--- a/python/pyspark/sql/tests/connect/test_connect_session.py
+++ b/python/pyspark/sql/tests/connect/test_connect_session.py
@@ -242,6 +242,20 @@ class SparkConnectSessionTests(ReusedConnectTestCase):
 session = 
RemoteSparkSession.builder.channelBuilder(CustomChannelBuilder()).create()
 session.sql("select 1 + 1")
 
+def test_reset_when_server_session_changes(self):
+session = 
RemoteSparkSession.builder.remote("sc://localhost").getOrCreate()
+# run a simple query so the session id is synchronized.
+session.range(3).collect()
+
+# trigger a mismatch between client session id and server session id.
+session._client._session_id = str(uuid.uuid4())
+with self.assertRaises(SparkConnectException):
+session.range(3).collect()
+
+# assert that getOrCreate() generates a new session
+session = 
RemoteSparkSession.builder.remote("sc://localhost").getOrCreate()
+session.range(3).collect()
+
 
 class SparkConnectSessionWithOptionsTest(unittest.TestCase):
 def setUp(self) -> None:





(spark) branch master updated: [SPARK-47985][PYTHON] Simplify functions with `lit`

2024-04-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 24b0c7560718 [SPARK-47985][PYTHON] Simplify functions with `lit`
24b0c7560718 is described below

commit 24b0c75607182b284f563cad0a2c20329c5c4895
Author: Ruifeng Zheng 
AuthorDate: Thu Apr 25 20:56:37 2024 +0800

[SPARK-47985][PYTHON] Simplify functions with `lit`

### What changes were proposed in this pull request?
Simplify functions with `lit`

### Why are the changes needed?
code cleanup; there are many such `if-else` blocks in functions that can be removed:
```
if isinstance(json, Column):
_json = json
elif isinstance(json, str):
_json = lit(json)
```

because the `lit` function actually accepts Column-type input
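
A minimal sketch of why those branches are redundant:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# lit() passes Column inputs through unchanged and turns a list into an
# array literal, so a separate isinstance(..., Column) branch adds nothing.
spark.range(1).select(lit(col("id")), lit(0.5), lit([0.25, 0.75])).show()
```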

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46219 from zhengruifeng/simplify_percentile.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/functions/builtin.py | 66 +++--
 1 file changed, 17 insertions(+), 49 deletions(-)

diff --git a/python/pyspark/sql/connect/functions/builtin.py 
b/python/pyspark/sql/connect/functions/builtin.py
index 8fffb1831466..cbbad941bf29 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -1188,20 +1188,10 @@ def percentile(
 percentage: Union[Column, float, List[float], Tuple[float]],
 frequency: Union[Column, int] = 1,
 ) -> Column:
-if isinstance(percentage, Column):
-_percentage = percentage
-elif isinstance(percentage, (list, tuple)):
-# Convert tuple to list
-_percentage = lit(list(percentage))
-else:
-# Probably scalar
-_percentage = lit(percentage)
+if isinstance(percentage, (list, tuple)):
+percentage = list(percentage)
 
-if isinstance(frequency, int):
-_frequency = lit(frequency)
-elif isinstance(frequency, Column):
-_frequency = frequency
-else:
+if not isinstance(frequency, (int, Column)):
 raise PySparkTypeError(
 error_class="NOT_COLUMN_OR_INT",
 message_parameters={
@@ -1210,7 +1200,7 @@ def percentile(
 },
 )
 
-return _invoke_function("percentile", _to_col(col), _percentage, 
_frequency)
+return _invoke_function("percentile", _to_col(col), lit(percentage), 
lit(frequency))
 
 
 percentile.__doc__ = pysparkfuncs.percentile.__doc__
@@ -1221,16 +1211,10 @@ def percentile_approx(
 percentage: Union[Column, float, List[float], Tuple[float]],
 accuracy: Union[Column, float] = 1,
 ) -> Column:
-if isinstance(percentage, Column):
-percentage_col = percentage
-elif isinstance(percentage, (list, tuple)):
-# Convert tuple to list
-percentage_col = lit(list(percentage))
-else:
-# Probably scalar
-percentage_col = lit(percentage)
+if isinstance(percentage, (list, tuple)):
+percentage = lit(list(percentage))
 
-return _invoke_function("percentile_approx", _to_col(col), percentage_col, 
lit(accuracy))
+return _invoke_function("percentile_approx", _to_col(col), 
lit(percentage), lit(accuracy))
 
 
 percentile_approx.__doc__ = pysparkfuncs.percentile_approx.__doc__
@@ -1241,16 +1225,10 @@ def approx_percentile(
 percentage: Union[Column, float, List[float], Tuple[float]],
 accuracy: Union[Column, float] = 1,
 ) -> Column:
-if isinstance(percentage, Column):
-percentage_col = percentage
-elif isinstance(percentage, (list, tuple)):
-# Convert tuple to list
-percentage_col = lit(list(percentage))
-else:
-# Probably scalar
-percentage_col = lit(percentage)
+if isinstance(percentage, (list, tuple)):
+percentage = list(percentage)
 
-return _invoke_function("approx_percentile", _to_col(col), percentage_col, 
lit(accuracy))
+return _invoke_function("approx_percentile", _to_col(col), 
lit(percentage), lit(accuracy))
 
 
 approx_percentile.__doc__ = pysparkfuncs.approx_percentile.__doc__
@@ -1878,12 +1856,10 @@ def from_json(
 schema: Union[ArrayType, StructType, Column, str],
 options: Optional[Dict[str, str]] = None,
 ) -> Column:
-if isinstance(schema, Column):
-_schema = schema
+if isinstance(schema, (str, Column)):
+_schema = lit(schema)
 elif isinstance(schema, DataType):
 _schema = lit(schema.json())
-elif isinstance(schema, str):
-_schema 

(spark) branch master updated: [SPARK-47971][PYTHON][CONNECT][TESTS] Reenable `PandasUDFGroupedAggParityTests.test_grouped_with_empty_partition`

2024-04-24 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0f1b6446a25a [SPARK-47971][PYTHON][CONNECT][TESTS] Reenable 
`PandasUDFGroupedAggParityTests.test_grouped_with_empty_partition`
0f1b6446a25a is described below

commit 0f1b6446a25a022fd17661a1eba1b066c39cf3d4
Author: Ruifeng Zheng 
AuthorDate: Wed Apr 24 16:56:23 2024 +0800

[SPARK-47971][PYTHON][CONNECT][TESTS] Reenable 
`PandasUDFGroupedAggParityTests.test_grouped_with_empty_partition`

### What changes were proposed in this pull request?
Reenable `PandasUDFGroupedAggParityTests. test_grouped_with_empty_partition`

### Why are the changes needed?
for test coverage

the test needs a DataFrame with empty partitions; switch to `df.repartition` so the test can also be reused in Spark Connect

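A minimal standalone sketch of the reworked setup, mirroring the diff below:

```
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import PandasUDFType, pandas_udf

spark = SparkSession.builder.getOrCreate()

data = [Row(id=1, x=2), Row(id=1, x=3), Row(id=2, x=4)]
# More partitions than rows guarantees at least one empty partition,
# with no dependency on sc.parallelize / the RDD API.
df = spark.createDataFrame(data).repartition(len(data) + 1)

agg = pandas_udf(lambda x: x.sum(), "int", PandasUDFType.GROUPED_AGG)
# sort("id") keeps the collected result deterministic across partitionings.
df.groupBy("id").agg(agg(df["x"]).alias("sum")).sort("id").show()
```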
### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46202 from zhengruifeng/enable_udf_empty_partition.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py   | 4 
 python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py| 4 ++--
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git 
a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py 
b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py
index 6a3f8ab2569b..53e806126cd6 100644
--- a/python/pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py
+++ b/python/pyspark/sql/tests/connect/test_parity_pandas_udf_grouped_agg.py
@@ -26,10 +26,6 @@ class 
PandasUDFGroupedAggParityTests(GroupedAggPandasUDFTestsMixin, ReusedConnec
 def test_unsupported_types(self):
 super().test_unsupported_types()
 
-@unittest.skip("Spark Connect doesn't support RDD but the test depends on 
it.")
-def test_grouped_with_empty_partition(self):
-super().test_grouped_with_empty_partition()
-
 @unittest.skip("Spark Connect does not support convert UNPARSED to 
catalyst types.")
 def test_manual(self):
 super().test_manual()
diff --git a/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py 
b/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py
index a7cf45e3bcbe..70fa31fd515b 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py
@@ -538,11 +538,11 @@ class GroupedAggPandasUDFTestsMixin:
 data = [Row(id=1, x=2), Row(id=1, x=3), Row(id=2, x=4)]
 expected = [Row(id=1, sum=5), Row(id=2, x=4)]
 num_parts = len(data) + 1
-df = self.spark.createDataFrame(self.sc.parallelize(data, 
numSlices=num_parts))
+df = self.spark.createDataFrame(data).repartition(num_parts)
 
 f = pandas_udf(lambda x: x.sum(), "int", PandasUDFType.GROUPED_AGG)
 
-result = df.groupBy("id").agg(f(df["x"]).alias("sum")).collect()
+result = 
df.groupBy("id").agg(f(df["x"]).alias("sum")).sort("id").collect()
 self.assertEqual(result, expected)
 
 def test_grouped_without_group_by_clause(self):





(spark) branch master updated: [SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make `max_by, min_by` doctests deterministic

2024-04-23 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3cf0c83d29aa [SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make 
`max_by, min_by` doctests deterministic
3cf0c83d29aa is described below

commit 3cf0c83d29aa9a266f6f4802bfcf67607cc21555
Author: Ruifeng Zheng 
AuthorDate: Wed Apr 24 11:23:43 2024 +0800

[SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make `max_by, min_by` doctests 
deterministic

### What changes were proposed in this pull request?
Make `max_by, min_by` doctests deterministic

### Why are the changes needed?
https://github.com/apache/spark/pull/45939 fixed this issue by sorting the rows; unfortunately, that is not enough:

in group `department=Finance`, the two rows `("Finance", "Frank", 5)` and `("Finance", "George", 5)` have the same value `years_in_dept=5`, so `min_by("name", "years_in_dept")` and `max_by("name", "years_in_dept")` are still non-deterministic.

This test failed in some env:
```
**********************************************************************
File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 1177, in pyspark.sql.connect.functions.builtin.max_by
Failed example:
    df.groupby("department").agg(
        sf.max_by("name", "years_in_dept")
    ).sort("department").show()
Expected:
    +----------+---------------------------+
    |department|max_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                      Henry|
    |   Finance|                     George|
    +----------+---------------------------+
Got:
    +----------+---------------------------+
    |department|max_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                      Henry|
    |   Finance|                      Frank|
    +----------+---------------------------+

**********************************************************************
File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 1205, in pyspark.sql.connect.functions.builtin.min_by
Failed example:
    df.groupby("department").agg(
        sf.min_by("name", "years_in_dept")
    ).sort("department").show()
Expected:
    +----------+---------------------------+
    |department|min_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                        Eva|
    |   Finance|                     George|
    +----------+---------------------------+
Got:
    +----------+---------------------------+
    |department|min_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                        Eva|
    |   Finance|                      Frank|
    +----------+---------------------------+

**********************************************************************
```
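
With the tie broken (George's `years_in_dept` becomes 9), each department has a unique arg-max and arg-min, so the doctest output is stable. A quick sketch:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Consult", "Eva", 6), ("Finance", "Frank", 5),
     ("Finance", "George", 9), ("Consult", "Henry", 7)],
    schema=("department", "name", "years_in_dept"))

# Unique years_in_dept per department => deterministic max_by / min_by.
df.groupby("department").agg(sf.max_by("name", "years_in_dept")).sort("department").show()
df.groupby("department").agg(sf.min_by("name", "years_in_dept")).sort("department").show()
```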

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46196 from zhengruifeng/doc_max_min_by.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions/builtin.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py 
b/python/pyspark/sql/functions/builtin.py
index 96be5de0180b..b54b377aaebc 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -1275,7 +1275,7 @@ def max_by(col: "ColumnOrName", ord: "ColumnOrName") -> 
Column:
 >>> import pyspark.sql.functions as sf
 >>> df = spark.createDataFrame([
 ... ("Consult", "Eva", 6), ("Finance", "Frank", 5),
-... ("Finance", "George", 5), ("Consult", "Henry", 7)],
+... ("Finance", "George", 9), ("Consult", "Henry", 7)],
 ... schema=("department", "name", "years_in_dept"))
 >>> df.groupby("department").agg(
 ... sf.max_by("name", "years_in_dept")
@@ -1356,7 +1356,7 @@ def min_by(col: "ColumnOrName", ord: "ColumnOrName") -> 
Column:
 >>> import pyspark.sql.functions as sf
 >>> df = spark.createDataFrame([
   

(spark) branch master updated (2d9d444b122d -> 9f34b8eca2f3)

2024-04-21 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2d9d444b122d [MINOR][DOCS] Change `SPARK_ANSI_SQL_MODE`in 
PlanStabilitySuite documentation
 add 9f34b8eca2f3 [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type 
in split function for scala and python

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala |  35 +++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   8 +++
 ...xplain => function_split_using_columns.explain} |   2 +-
 ...unction_split_with_limit_using_columns.explain} |   2 +-
 ...like.json => function_split_using_columns.json} |   2 +-
 bin => function_split_using_columns.proto.bin} | Bin 181 -> 181 bytes
 ...> function_split_with_limit_using_columns.json} |   4 +-
 ...ction_split_with_limit_using_columns.proto.bin} | Bin 188 -> 188 bytes
 python/pyspark/sql/connect/functions/builtin.py|   9 ++-
 python/pyspark/sql/functions/builtin.py|  65 ++---
 .../scala/org/apache/spark/sql/functions.scala |  65 -
 .../apache/spark/sql/StringFunctionsSuite.scala|  27 +
 12 files changed, 190 insertions(+), 29 deletions(-)
 copy 
connector/connect/common/src/test/resources/query-tests/explain-results/{column_add.explain
 => function_split_using_columns.explain} (55%)
 copy 
connector/connect/common/src/test/resources/query-tests/explain-results/{column_asc_nulls_first.explain
 => function_split_with_limit_using_columns.explain} (56%)
 copy 
connector/connect/common/src/test/resources/query-tests/queries/{function_ilike.json
 => function_split_using_columns.json} (95%)
 copy 
connector/connect/common/src/test/resources/query-tests/queries/{function_split.proto.bin
 => function_split_using_columns.proto.bin} (95%)
 copy 
connector/connect/common/src/test/resources/query-tests/queries/{function_split_with_limit.json
 => function_split_with_limit_using_columns.json} (90%)
 copy 
connector/connect/common/src/test/resources/query-tests/queries/{function_ilike_with_escape.proto.bin
 => function_split_with_limit_using_columns.proto.bin} (88%)

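A minimal sketch of what the new overloads enable, assuming they accept Column arguments as the change summary states (illustrative, not taken from the patch):

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a,b,c", ","), ("x;y", ";")], ["s", "sep"])

# The delimiter (and limit) can now come from a Column instead of a literal.
df.select(sf.split(df.s, sf.col("sep")).alias("parts")).show()
```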




(spark) branch master updated (2bf43460b923 -> 0d553d06fe2f)

2024-04-19 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2bf43460b923 [SPARK-47833][SQL][CORE] Supply caller stackstrace for 
checkAndGlobPathIfNecessary AnalysisException
 add 0d553d06fe2f [SPARK-47906][PYTHON][DOCS] Fix docstring and type hint 
of `hll_union_agg`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions/builtin.py | 12 +---
 python/pyspark/sql/functions/builtin.py | 14 +-
 2 files changed, 10 insertions(+), 16 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47883][SQL] Make `CollectTailExec.doExecute` lazy with RowQueue

2024-04-18 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fe47edece059 [SPARK-47883][SQL] Make `CollectTailExec.doExecute` lazy 
with RowQueue
fe47edece059 is described below

commit fe47edece059e9189d8500b3c9b3881b44678785
Author: Ruifeng Zheng 
AuthorDate: Fri Apr 19 12:16:58 2024 +0800

[SPARK-47883][SQL] Make `CollectTailExec.doExecute` lazy with RowQueue

### What changes were proposed in this pull request?
Make `CollectTailExec.doExecute` execute lazily

### Why are the changes needed?
1. In Spark Connect, `dataframe.tail` is based on `Tail(...).collect()`.
2. Make `Tail` usable on its own; a minimal example is sketched below.

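For reference, the user-facing path is simply `DataFrame.tail`:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Over Spark Connect, tail(3) is planned as Tail(3) followed by a collect,
# which is why CollectTailExec.doExecute needs to work as a standalone,
# lazily-evaluated operator.
print(spark.range(10).tail(3))  # [Row(id=7), Row(id=8), Row(id=9)]
```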
### Does this PR introduce any user-facing change?
no

### How was this patch tested?
existing unit tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46101 from zhengruifeng/sql_tail_row_queue.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../org/apache/spark/sql/execution/limit.scala | 62 ++
 .../spark/sql/execution/python/RowQueue.scala  |  7 ++-
 2 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
index db5728d669ef..c0fb1c37b210 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.sql.execution
 
+import org.apache.spark.TaskContext
 import org.apache.spark.rdd.{ParallelCollectionRDD, RDD}
 import org.apache.spark.serializer.Serializer
 import org.apache.spark.sql.catalyst.InternalRow
@@ -26,7 +27,7 @@ import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
 import org.apache.spark.sql.execution.metric.{SQLShuffleReadMetricsReporter, 
SQLShuffleWriteMetricsReporter}
-import org.apache.spark.util.ArrayImplicits._
+import org.apache.spark.sql.execution.python.HybridRowQueue
 import org.apache.spark.util.collection.Utils
 
 /**
@@ -68,13 +69,13 @@ case class CollectLimitExec(limit: Int = -1, child: 
SparkPlan, offset: Int = 0)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
 val childRDD = child.execute()
-if (childRDD.getNumPartitions == 0) {
+if (childRDD.getNumPartitions == 0 || limit == 0) {
   new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, 
Map.empty)
 } else {
   val singlePartitionRDD = if (childRDD.getNumPartitions == 1) {
 childRDD
   } else {
-val locallyLimited = if (limit >= 0) {
+val locallyLimited = if (limit > 0) {
   childRDD.mapPartitionsInternal(_.take(limit))
 } else {
   childRDD
@@ -118,18 +119,57 @@ case class CollectLimitExec(limit: Int = -1, child: 
SparkPlan, offset: Int = 0)
  * logical plan, which happens when the user is collecting results back to the 
driver.
  */
 case class CollectTailExec(limit: Int, child: SparkPlan) extends LimitExec {
+  assert(limit >= 0)
+
   override def output: Seq[Attribute] = child.output
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = child.executeTail(limit)
+  private val serializer: Serializer = new 
UnsafeRowSerializer(child.output.size)
+  private lazy val writeMetrics =
+SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
+  private lazy val readMetrics =
+SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
+  override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-// This is a bit hacky way to avoid a shuffle and scanning all data when 
it performs
-// at `Dataset.tail`.
-// Since this execution plan and `execute` are currently called only when
-// `Dataset.tail` is invoked, the jobs are always executed when they are 
supposed to be.
-
-// If we use this execution plan separately like `Dataset.limit` without 
an actual
-// job launch, we might just have to mimic the implementation of 
`CollectLimitExec`.
-sparkContext.parallelize(executeCollect().toImmutableArraySeq, numSlices = 
1)
+val childRDD = child.execute()
+if (childRDD.getNumPartitions == 0 || limit == 0) {
+  new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, 
Map.empty)
+} else {
+  val singlePartitionRDD = if (childRDD.getNumPartitions == 1) {
+   
