[GitHub] [spark-website] xuanyuanking commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-14 Thread via GitHub


xuanyuanking commented on PR #476:
URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718927679

   Sorry for missing the ping, @Yikun! Just uploaded my key. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] xuanyuanking closed pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-14 Thread via GitHub


xuanyuanking closed pull request #476: Add News, Release Note, Download Link 
for Apache Spark 3.5.0
URL: https://github.com/apache/spark-website/pull/476





[GitHub] [spark-website] xuanyuanking commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-14 Thread via GitHub


xuanyuanking commented on PR #476:
URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718953342

   Thank you all for the review! 🥂 





[spark] branch master updated: [SPARK-45162][SQL] Support maps and array parameters constructed via `call_function`

2023-09-14 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cd672b09ac6 [SPARK-45162][SQL] Support maps and array parameters 
constructed via `call_function`
cd672b09ac6 is described below

commit cd672b09ac69724cd99dc12c9bb49dd117025be1
Author: Max Gekk 
AuthorDate: Thu Sep 14 11:31:56 2023 +0300

[SPARK-45162][SQL] Support maps and array parameters constructed via 
`call_function`

### What changes were proposed in this pull request?
In this PR, I propose to move the `BindParameters` rule from the 
`Substitution` batch to the `Resolution` batch, and to change the type of the 
`args` parameter of `NameParameterizedQuery` and `PosParameterizedQuery` to an 
`Iterable` so that argument expressions can be resolved.

### Why are the changes needed?
After an earlier change, the parameterized `sql()` allows map/array/struct 
values constructed by functions like `map()`, `array()`, and `struct()`, but the 
same functions invoked via `call_function` are not supported:
```scala
scala> sql("SELECT element_at(:mapParam, 'a')", Map("mapParam" ->
  call_function("map", lit("a"), lit(1))))
org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[UNBOUND_SQL_PARAMETER] Found the unbound parameter: mapParam. Please, fix 
`args` and provide a mapping of the parameter to a SQL literal.; line 1 pos 18;
```

### Does this PR introduce _any_ user-facing change?
No, should not since it fixes an issue. Only if user code depends on the 
error message.

After the changes:
```scala
scala> sql("SELECT element_at(:mapParam, 'a')", Map("mapParam" ->
  call_function("map", lit("a"), lit(1)))).show(false)
+------------------------+
|element_at(map(a, 1), a)|
+------------------------+
|1                       |
+------------------------+
```
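
For context, the parameterized `sql()` API that this fix targets is also exposed in
PySpark. Below is a minimal, hedged sketch of the named and positional forms with
plain literal values, assuming an active SparkSession bound to `spark`; the fix in
this commit specifically concerns Column-valued arguments such as
`call_function("map", ...)` in the Scala API:

```python
# Minimal sketch of parameterized SQL in PySpark (literal-valued arguments only).
# Assumes an active SparkSession bound to `spark`.

# Named parameters are referenced as :name and bound via a dict.
spark.sql("SELECT :x + :y AS total", args={"x": 1, "y": 2}).show()

# Positional parameters are referenced as ? and bound via a list.
spark.sql("SELECT ? * ? AS product", args=[3, 4]).show()
```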

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ParametersSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42894 from MaxGekk/fix-parameterized-sql-unresolved.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../sql/connect/planner/SparkConnectPlanner.scala  |  2 +-
 .../spark/sql/catalyst/analysis/Analyzer.scala |  2 +-
 .../spark/sql/catalyst/analysis/parameters.scala   | 28 +-
 .../sql/catalyst/analysis/AnalysisSuite.scala  |  4 ++--
 .../org/apache/spark/sql/ParametersSuite.scala | 19 ---
 5 files changed, 42 insertions(+), 13 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 24dee006f0b..74a8ff290eb 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -269,7 +269,7 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) 
extends Logging {
 if (!args.isEmpty) {
   NameParameterizedQuery(parsedPlan, 
args.asScala.mapValues(transformLiteral).toMap)
 } else if (!posArgs.isEmpty) {
-  PosParameterizedQuery(parsedPlan, 
posArgs.asScala.map(transformLiteral).toArray)
+  PosParameterizedQuery(parsedPlan, 
posArgs.asScala.map(transformLiteral).toSeq)
 } else {
   parsedPlan
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index e15b9730111..6491a4eea95 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -260,7 +260,6 @@ class Analyzer(override val catalogManager: CatalogManager) 
extends RuleExecutor
   // at the beginning of analysis.
   OptimizeUpdateFields,
   CTESubstitution,
-  BindParameters,
   WindowsSubstitution,
   EliminateUnions,
   SubstituteUnresolvedOrdinals),
@@ -322,6 +321,7 @@ class Analyzer(override val catalogManager: CatalogManager) 
extends RuleExecutor
   RewriteDeleteFromTable ::
   RewriteUpdateTable ::
   RewriteMergeIntoTable ::
+  BindParameters ::
   typeCoercionRules ++
   Seq(
 ResolveWithCTE,
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/parameters.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/parameters.scala
index 13404797490..a6072dcdd2c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/parameters.scala
+++ 
b/sql/cata

[spark] branch master updated (cd672b09ac6 -> 6653f94d489)

2023-09-14 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from cd672b09ac6 [SPARK-45162][SQL] Support maps and array parameters 
constructed via `call_function`
 add 6653f94d489 [SPARK-45156][SQL] Wrap `inputName` by backticks in the 
`NON_FOLDABLE_INPUT` error class

No new revisions were added by this update.

Summary of changes:
 .../expressions/CallMethodViaReflection.scala |  4 ++--
 .../aggregate/ApproxCountDistinctForIntervals.scala   |  2 +-
 .../expressions/aggregate/ApproximatePercentile.scala |  4 ++--
 .../expressions/aggregate/BloomFilterAggregate.scala  |  4 ++--
 .../expressions/aggregate/CountMinSketchAgg.scala |  6 +++---
 .../expressions/aggregate/HistogramNumeric.scala  |  2 +-
 .../catalyst/expressions/aggregate/percentiles.scala  |  2 +-
 .../sql/catalyst/expressions/csvExpressions.scala |  2 +-
 .../spark/sql/catalyst/expressions/generators.scala   |  2 +-
 .../sql/catalyst/expressions/jsonExpressions.scala|  2 +-
 .../sql/catalyst/expressions/maskExpressions.scala|  2 +-
 .../sql/catalyst/expressions/mathExpressions.scala|  2 +-
 .../sql/catalyst/expressions/regexpExpressions.scala  |  2 +-
 .../sql/catalyst/expressions/stringExpressions.scala  |  2 +-
 .../sql/catalyst/expressions/windowExpressions.scala  | 19 +++
 .../spark/sql/catalyst/expressions/xml/xpath.scala|  2 +-
 .../sql/catalyst/expressions/xmlExpressions.scala |  2 +-
 .../analysis/ExpressionTypeCheckingSuite.scala|  6 +++---
 .../expressions/CallMethodViaReflectionSuite.scala|  2 +-
 .../catalyst/expressions/RegexpExpressionsSuite.scala |  2 +-
 .../catalyst/expressions/StringExpressionsSuite.scala |  6 +++---
 .../ApproxCountDistinctForIntervalsSuite.scala|  2 +-
 .../aggregate/ApproximatePercentileSuite.scala|  4 ++--
 .../aggregate/CountMinSketchAggSuite.scala|  6 +++---
 .../expressions/aggregate/HistogramNumericSuite.scala |  2 +-
 .../expressions/aggregate/PercentileSuite.scala   |  2 +-
 .../expressions/xml/XPathExpressionSuite.scala|  2 +-
 .../analyzer-results/ansi/string-functions.sql.out|  2 +-
 .../sql-tests/analyzer-results/csv-functions.sql.out  |  2 +-
 .../sql-tests/analyzer-results/join-lateral.sql.out   |  2 +-
 .../sql-tests/analyzer-results/json-functions.sql.out |  2 +-
 .../sql-tests/analyzer-results/mask-functions.sql.out |  4 ++--
 .../sql-tests/analyzer-results/percentiles.sql.out|  2 +-
 .../analyzer-results/string-functions.sql.out |  2 +-
 .../sql-tests/results/ansi/string-functions.sql.out   |  2 +-
 .../resources/sql-tests/results/csv-functions.sql.out |  2 +-
 .../resources/sql-tests/results/join-lateral.sql.out  |  2 +-
 .../sql-tests/results/json-functions.sql.out  |  2 +-
 .../sql-tests/results/mask-functions.sql.out  |  4 ++--
 .../resources/sql-tests/results/percentiles.sql.out   |  2 +-
 .../sql-tests/results/string-functions.sql.out|  2 +-
 .../spark/sql/DataFrameWindowFunctionsSuite.scala |  2 +-
 .../org/apache/spark/sql/GeneratorFunctionSuite.scala |  2 +-
 43 files changed, 67 insertions(+), 64 deletions(-)





[spark] branch master updated (6653f94d489 -> 0e9ad5ad66a)

2023-09-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6653f94d489 [SPARK-45156][SQL] Wrap `inputName` by backticks in the 
`NON_FOLDABLE_INPUT` error class
 add 0e9ad5ad66a [SPARK-45056][PYTHON][SS][CONNECT][TESTS][FOLLOW-UP] 
Remove listeners only when they exists

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/connect/service/SparkConnectSessionHodlerSuite.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)





[spark] branch master updated: [MINOR][PYTHON][DOCS] Fix default value of parameter `barrier` in MapInXXX

2023-09-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1d2372b891 [MINOR][PYTHON][DOCS] Fix default value of parameter 
`barrier` in MapInXXX
e1d2372b891 is described below

commit e1d2372b8916741fe199ee7b154e53af1eb1ba5a
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 14 18:23:38 2023 +0800

[MINOR][PYTHON][DOCS] Fix default value of parameter `barrier` in MapInXXX

### What changes were proposed in this pull request?
Fix default value of parameter `barrier`

### Why are the changes needed?
they default to `False`

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
NO
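
For reference, a minimal sketch of the API whose docstring is being corrected, with
the documented default spelled out; this assumes an active SparkSession bound to
`spark` and pandas installed, and the function/column names are illustrative:

```python
# Hedged sketch: mapInPandas with barrier mode left at its default value (False).
# Assumes an active SparkSession bound to `spark` and pandas installed.
df = spark.range(4)

def double_id(batches):
    # `batches` is an iterator of pandas DataFrames; yield transformed batches.
    for pdf in batches:
        yield pdf.assign(id=pdf["id"] * 2)

# `barrier` defaults to False, which is what this patch documents;
# pass barrier=True explicitly to opt in to barrier execution.
df.mapInPandas(double_id, schema="id long", barrier=False).show()
```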

Closes #42923 from zhengruifeng/45114_followup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/pandas/map_ops.py | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/pandas/map_ops.py 
b/python/pyspark/sql/pandas/map_ops.py
index bc26fdede28..710fc8a9a37 100644
--- a/python/pyspark/sql/pandas/map_ops.py
+++ b/python/pyspark/sql/pandas/map_ops.py
@@ -60,11 +60,10 @@ class PandasMapOpsMixin:
 schema : :class:`pyspark.sql.types.DataType` or str
 the return type of the `func` in PySpark. The value can be either a
 :class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
-barrier : bool, optional, default True
+barrier : bool, optional, default False
 Use barrier mode execution.
 
-.. versionchanged: 3.5.0
-Added ``barrier`` argument.
+.. versionadded: 3.5.0
 
 Examples
 
@@ -139,11 +138,10 @@ class PandasMapOpsMixin:
 schema : :class:`pyspark.sql.types.DataType` or str
 the return type of the `func` in PySpark. The value can be either a
 :class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
-barrier : bool, optional, default True
+barrier : bool, optional, default False
 Use barrier mode execution.
 
-.. versionchanged: 3.5.0
-Added ``barrier`` argument.
+.. versionadded: 3.5.0
 
 Examples
 





[spark] branch branch-3.5 updated: [MINOR][PYTHON][DOCS] Fix default value of parameter `barrier` in MapInXXX

2023-09-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 0e1a9b65d48 [MINOR][PYTHON][DOCS] Fix default value of parameter 
`barrier` in MapInXXX
0e1a9b65d48 is described below

commit 0e1a9b65d48389e2bbed11dabfa6c61cca5f41f0
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 14 18:23:38 2023 +0800

[MINOR][PYTHON][DOCS] Fix default value of parameter `barrier` in MapInXXX

### What changes were proposed in this pull request?
Fix default value of parameter `barrier`

### Why are the changes needed?
they default to `False`

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #42923 from zhengruifeng/45114_followup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
(cherry picked from commit e1d2372b8916741fe199ee7b154e53af1eb1ba5a)
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/pandas/map_ops.py | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/pandas/map_ops.py 
b/python/pyspark/sql/pandas/map_ops.py
index bc26fdede28..710fc8a9a37 100644
--- a/python/pyspark/sql/pandas/map_ops.py
+++ b/python/pyspark/sql/pandas/map_ops.py
@@ -60,11 +60,10 @@ class PandasMapOpsMixin:
 schema : :class:`pyspark.sql.types.DataType` or str
 the return type of the `func` in PySpark. The value can be either a
 :class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
-barrier : bool, optional, default True
+barrier : bool, optional, default False
 Use barrier mode execution.
 
-.. versionchanged: 3.5.0
-Added ``barrier`` argument.
+.. versionadded: 3.5.0
 
 Examples
 
@@ -139,11 +138,10 @@ class PandasMapOpsMixin:
 schema : :class:`pyspark.sql.types.DataType` or str
 the return type of the `func` in PySpark. The value can be either a
 :class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
-barrier : bool, optional, default True
+barrier : bool, optional, default False
 Use barrier mode execution.
 
-.. versionchanged: 3.5.0
-Added ``barrier`` argument.
+.. versionadded: 3.5.0
 
 Examples
 





[spark] branch master updated: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

2023-09-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 73d3c49c97a [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with 
duplicated columns
73d3c49c97a is described below

commit 73d3c49c97ae1be3f9f96fbc86be1c91cd17a656
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 14 18:27:01 2023 +0800

[SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

### What changes were proposed in this pull request?

- Make `getitem` work with duplicated columns
- ~~Disallow bool type index~~
- ~~Disallow negative index~~

### Why are the changes needed?
1, SQL feature OrderBy ordinal works with duplicated columns
```
In [4]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), 
(4, 4.4, 'c') AS TAB(a, a, a)")

In [5]: df.createOrReplaceTempView("v")

In [6]: spark.sql("SELECT * FROM v ORDER BY 1, 2").show()
+---+---+---+
|  a|  a|  a|
+---+---+---+
|  1|1.1|  a|
|  2|2.2|  b|
|  4|4.4|  c|
+---+---+---+
```

To support it in DataFrame APIs, we need to make `getitem` work with 
duplicated columns

~~2 & 3: should be unintentional~~

### Does this PR introduce _any_ user-facing change?
YES

1, Make `getitem` work with duplicated columns
before
```
In [1]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), 
(4, 4.4, 'c') AS TAB(a, a, a)")

In [2]: df[0]
---
AnalysisException Traceback (most recent call last)
Cell In[2], line 1
> 1 df[0]
...

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could 
be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].

In [3]: df[1]
---
AnalysisException Traceback (most recent call last)
...

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could 
be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].

In [4]: df.orderBy(1, 2).show()
---
AnalysisException Traceback (most recent call last)
Cell In[7], line 1
> 1 df.orderBy(1, 2).show()

...

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could 
be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].

```

after

```
In [1]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), 
(4, 4.4, 'c') AS TAB(a, a, a)")

In [2]: df[0]
Out[2]: Column<'a'>

In [3]: df[1]
Out[3]: Column<'a'>

In [4]: df.orderBy(1, 2).show()
+---+---+---+
|  a|  a|  a|
+---+---+---+
|  1|1.1|  a|
|  2|2.2|  b|
|  4|4.4|  c|
+---+---+---+
```

### How was this patch tested?
added UTs

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #42828 from zhengruifeng/col_by_index.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../main/protobuf/spark/connect/expressions.proto  |  10 ++
 .../sql/connect/planner/SparkConnectPlanner.scala  |  14 ++-
 python/pyspark/sql/connect/dataframe.py|  21 +++-
 python/pyspark/sql/connect/expressions.py  |  34 ++
 .../pyspark/sql/connect/proto/expressions_pb2.py   | 126 +++--
 .../pyspark/sql/connect/proto/expressions_pb2.pyi  |  40 +++
 python/pyspark/sql/dataframe.py|   5 +-
 python/pyspark/sql/tests/test_dataframe.py |  47 +++-
 python/pyspark/sql/tests/test_group.py |  27 +
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  12 ++
 10 files changed, 265 insertions(+), 71 deletions(-)

diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
index 4aac2bcc612..782bcc5d1fc 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
@@ -48,6 +48,7 @@ message Expression {
 CommonInlineUserDefinedFunction common_inline_user_defined_function = 15;
 CallFunction call_function = 16;
 NamedArgumentExpression named_argument_expression = 17;
+GetColumnByOrdinal get_column_by_ordinal = 18;
 
 // This field is used to mark extensions to the protocol. When plugins 
generate arbitrary
 // relations they can add them here. During the planning the correct 
resolution is done.
@@ -228,6 +229,15 @@ message Expression {
 optional bool is_metadata_colu

[spark] branch master updated: [SPARK-45119][PYTHON][DOCS] Refine docstring of inline

2023-09-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cb298e80219 [SPARK-45119][PYTHON][DOCS] Refine docstring of inline
cb298e80219 is described below

commit cb298e8021965a83d40bc43c9c010361e62e6dbd
Author: allisonwang-db 
AuthorDate: Thu Sep 14 19:20:05 2023 +0800

[SPARK-45119][PYTHON][DOCS] Refine docstring of inline

### What changes were proposed in this pull request?

This PR improves the docstring of the function `inline` by adding more 
examples.

### Why are the changes needed?

To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doctest

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42875 from allisonwang-db/spark-45119-refine-inline.

Authored-by: allisonwang-db 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions.py | 105 +---
 1 file changed, 97 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 2d4194c98e9..31936241619 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -12471,37 +12471,126 @@ def inline(col: "ColumnOrName") -> Column:
 """
 Explodes an array of structs into a table.
 
+This function takes an input column containing an array of structs and 
returns a
+new column where each struct in the array is exploded into a separate row.
+
 .. versionadded:: 3.4.0
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
-input column of values to explode.
+Input column of values to explode.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-generator expression with the inline exploded result.
+Generator expression with the inline exploded result.
 
 See Also
 
-:meth:`explode`
-
-Notes
--
-Supports Spark Connect.
+:meth:`pyspark.functions.explode`
+:meth:`pyspark.functions.inline_outer`
 
 Examples
 
+Example 1: Using inline with a single struct array column
+
+>>> import pyspark.sql.functions as sf
+>>> from pyspark.sql import Row
+>>> df = spark.createDataFrame([Row(structlist=[Row(a=1, b=2), Row(a=3, 
b=4)])])
+>>> df.select(sf.inline(df.structlist)).show()
++---+---+
+|  a|  b|
++---+---+
+|  1|  2|
+|  3|  4|
++---+---+
+
+Example 2: Using inline with a column name
+
+>>> import pyspark.sql.functions as sf
 >>> from pyspark.sql import Row
 >>> df = spark.createDataFrame([Row(structlist=[Row(a=1, b=2), Row(a=3, 
b=4)])])
->>> df.select(inline(df.structlist)).show()
+>>> df.select(sf.inline("structlist")).show()
 +---+---+
 |  a|  b|
 +---+---+
 |  1|  2|
 |  3|  4|
 +---+---+
+
+Example 3: Using inline with an alias
+
+>>> import pyspark.sql.functions as sf
+>>> from pyspark.sql import Row
+>>> df = spark.createDataFrame([Row(structlist=[Row(a=1, b=2), Row(a=3, 
b=4)])])
+>>> df.select(sf.inline("structlist").alias("c1", "c2")).show()
++---+---+
+| c1| c2|
++---+---+
+|  1|  2|
+|  3|  4|
++---+---+
+
+Example 4: Using inline with multiple struct array columns
+
+>>> import pyspark.sql.functions as sf
+>>> from pyspark.sql import Row
+>>> df = spark.createDataFrame([
+... Row(structlist1=[Row(a=1, b=2), Row(a=3, b=4)],
+... structlist2=[Row(c=5, d=6), Row(c=7, d=8)])
+... ])
+>>> df.select(sf.inline("structlist1"), "structlist2") \\
+... .select("a", "b", sf.inline("structlist2")).show()
++---+---+---+---+
+|  a|  b|  c|  d|
++---+---+---+---+
+|  1|  2|  5|  6|
+|  1|  2|  7|  8|
+|  3|  4|  5|  6|
+|  3|  4|  7|  8|
++---+---+---+---+
+
+Example 5: Using inline with a nested struct array column
+
+>>> import pyspark.sql.functions as sf
+>>> from pyspark.sql import Row
+>>> df = spark.createDataFrame([
+... Row(structlist=Row(a=1, b=2, nested=[Row(c=3, d=4), Row(c=5, 
d=6)]))
+... ])
+>>> df.select(sf.inline("structlist.nested")).show()
++---+---+
+|  c|  d|
++---+---+
+|  3|  4|
+|  5|  6|
++---+---+
+
+Example 6: Using inline with an empty struct array column
+
+>>> import pyspark.sql.functions as sf
+>>> from pyspark.sql import Row
+>>> df = spark.createDataFrame(
+... [Row(structlist=[])], "structlist: array<struct<a: int, b: int>>")
+>>> df.select(sf.inline(df.structlist)).show()
++---+---+
+|  a|  b|
++---+---+
++---+---+
+
+Example 7: Using inline with a struct array column containin

[spark-docker] branch master updated: [SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.0

2023-09-14 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 028efd4  [SPARK-45169] Add official image Dockerfile for Apache Spark 
3.5.0
028efd4 is described below

commit 028efd4637fb2cf791d5bd9ea70b2fca472de4b7
Author: Yikun Jiang 
AuthorDate: Thu Sep 14 21:22:32 2023 +0800

[SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.0

### What changes were proposed in this pull request?
Add Apache Spark 3.5.0 Dockerfiles.

- Add 3.5.0 GPG key
- Add .github/workflows/build_3.5.0.yaml
- `./add-dockerfiles.sh 3.5.0` to generate dockerfiles
- Add version and tag info
- Backport 
https://github.com/apache/spark/commit/1d2c338c867c69987d8ed1f3666358af54a040e3 
and 
https://github.com/apache/spark/commit/0c7b4306c7c5fbdd6c54f8172f82e1d23e3b 
entrypoint changes

### Why are the changes needed?
Apache Spark 3.5.0 released

### Does this PR introduce _any_ user-facing change?
Docker image will be published.

### How was this patch tested?
Add workflow and CI passed

Closes #55 from Yikun/3.5.0.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/build_3.5.0.yaml | 41 +++
 .github/workflows/publish.yml  |  3 +-
 .github/workflows/test.yml |  3 +-
 3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile | 29 
 3.5.0/scala2.12-java11-python3-ubuntu/Dockerfile   | 26 +++
 3.5.0/scala2.12-java11-r-ubuntu/Dockerfile | 28 
 3.5.0/scala2.12-java11-ubuntu/Dockerfile   | 79 ++
 .../scala2.12-java11-ubuntu/entrypoint.sh  |  4 ++
 entrypoint.sh.template |  4 ++
 tools/template.py  |  4 +-
 versions.json  | 42 ++--
 11 files changed, 253 insertions(+), 10 deletions(-)

diff --git a/.github/workflows/build_3.5.0.yaml 
b/.github/workflows/build_3.5.0.yaml
new file mode 100644
index 000..6eb3ad6
--- /dev/null
+++ b/.github/workflows/build_3.5.0.yaml
@@ -0,0 +1,41 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Build and Test (3.5.0)"
+
+on:
+  pull_request:
+branches:
+  - 'master'
+paths:
+  - '3.5.0/**'
+
+jobs:
+  run-build:
+strategy:
+  matrix:
+image-type: ["all", "python", "scala", "r"]
+name: Run
+secrets: inherit
+uses: ./.github/workflows/main.yml
+with:
+  spark: 3.5.0
+  scala: 2.12
+  java: 11
+  image-type: ${{ matrix.image-type }}
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index d213ada..8cfa95d 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -25,9 +25,10 @@ on:
   spark:
 description: 'The Spark version of Spark image.'
 required: true
-default: '3.4.1'
+default: '3.5.0'
 type: choice
 options:
+- 3.5.0
 - 3.4.1
 - 3.4.0
 - 3.3.3
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index 4f0f741..47dac20 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -25,9 +25,10 @@ on:
   spark:
 description: 'The Spark version of Spark image.'
 required: true
-default: '3.4.1'
+default: '3.5.0'
 type: choice
 options:
+- 3.5.0
 - 3.4.1
 - 3.4.0
 - 3.3.3
diff --git a/3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile 
b/3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 000..d6faaa7
--- /dev/null
+++ b/3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2

[spark] branch master updated: [SPARK-45118][PYTHON] Refactor converters for complex types to short cut when the element types don't need converters

2023-09-14 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 090fd18f362 [SPARK-45118][PYTHON] Refactor converters for complex 
types to short cut when the element types don't need converters
090fd18f362 is described below

commit 090fd18f36242857a8d7b81ef78428775c1d1e42
Author: Takuya UESHIN 
AuthorDate: Thu Sep 14 10:44:07 2023 -0700

[SPARK-45118][PYTHON] Refactor converters for complex types to short cut 
when the element types don't need converters

### What changes were proposed in this pull request?

Refactors converters for complex types to short cut when the element types 
don't need converters.

The following refactors are done in this PR:

- Provide a shortcut when the element types in complex types don't need 
converters
- Check `None`s before calling the converter
- Remove extra type checks just for assertions

### Why are the changes needed?

When the element types in complex types don't need converters, we can 
provide a shortcut to avoid extra function calls.
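
A simplified, illustrative-only sketch of the short-cut idea described above (this is
not the actual `pyspark.sql.pandas.types` code; the function name is hypothetical):

```python
# Illustrative sketch: skip per-element conversion when no element converter is needed,
# and check for None before calling the element converter.
from typing import Any, Callable, Optional

def make_array_converter(
    element_conv: Optional[Callable[[Any], Any]]
) -> Callable[[Any], Any]:
    if element_conv is None:
        # Short cut: the element type needs no conversion, so just materialize the list.
        return lambda value: None if value is None else list(value)

    def convert(value: Any) -> Any:
        if value is None:
            return None
        # Only call the converter for non-None elements.
        return [element_conv(v) if v is not None else None for v in value]

    return convert
```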

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added related tests and existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42874 from ueshin/issues/SPARK-45118/converters.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
---
 dev/sparktestsupport/modules.py   |   1 +
 python/pyspark/sql/pandas/types.py| 442 ++--
 python/pyspark/sql/tests/pandas/test_converter.py | 595 ++
 3 files changed, 886 insertions(+), 152 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 63cbbe6003d..0a751052491 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -491,6 +491,7 @@ pyspark_sql = Module(
 "pyspark.sql.tests.pandas.test_pandas_udf_typehints",
 
"pyspark.sql.tests.pandas.test_pandas_udf_typehints_with_future_annotations",
 "pyspark.sql.tests.pandas.test_pandas_udf_window",
+"pyspark.sql.tests.pandas.test_converter",
 "pyspark.sql.tests.test_pandas_sqlmetrics",
 "pyspark.sql.tests.test_readwriter",
 "pyspark.sql.tests.test_serde",
diff --git a/python/pyspark/sql/pandas/types.py 
b/python/pyspark/sql/pandas/types.py
index b02a003e632..54cd6fa7016 100644
--- a/python/pyspark/sql/pandas/types.py
+++ b/python/pyspark/sql/pandas/types.py
@@ -47,7 +47,6 @@ from pyspark.sql.types import (
 NullType,
 DataType,
 UserDefinedType,
-Row,
 _create_row,
 )
 from pyspark.errors import PySparkTypeError, UnsupportedOperationException
@@ -580,15 +579,21 @@ def _create_converter_to_pandas(
 
 if _ndarray_as_list:
 if _element_conv is None:
-_element_conv = lambda x: x  # noqa: E731
 
-def convert_array_ndarray_as_list(value: Any) -> Any:
-if value is None:
-return None
-else:
+def convert_array_ndarray_as_list(value: Any) -> Any:
 # In Arrow Python UDF, ArrayType is converted to 
`np.ndarray`
 # whereas a list is expected.
-return [_element_conv(v) for v in value]  # type: 
ignore[misc]
+return list(value)
+
+else:
+
+def convert_array_ndarray_as_list(value: Any) -> Any:
+# In Arrow Python UDF, ArrayType is converted to 
`np.ndarray`
+# whereas a list is expected.
+return [
+_element_conv(v) if v is not None else None  # 
type: ignore[misc]
+for v in value
+]
 
 return convert_array_ndarray_as_list
 else:
@@ -596,34 +601,53 @@ def _create_converter_to_pandas(
 return None
 
 def convert_array_ndarray_as_ndarray(value: Any) -> Any:
-if value is None:
-return None
-elif isinstance(value, np.ndarray):
+if isinstance(value, np.ndarray):
 # `pyarrow.Table.to_pandas` uses `np.ndarray`.
-return np.array([_element_conv(v) for v in value])  # 
type: ignore[misc]
+return np.array(
+[
+_element_conv(v) if v is not None else None  # 
type: ignore[misc]
+for v in value
+]
+   

[spark] branch dependabot/maven/org.apache.commons-commons-compress-1.24.0 created (now e12d8824507)

2023-09-14 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/org.apache.commons-commons-compress-1.24.0
in repository https://gitbox.apache.org/repos/asf/spark.git


  at e12d8824507 Bump org.apache.commons:commons-compress from 1.23.0 to 
1.24.0

No new revisions were added by this update.





[spark] branch master updated: [SPARK-45161][INFRA] Bump previousSparkVersion to 3.5.0

2023-09-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new db5f37c1fac [SPARK-45161][INFRA] Bump previousSparkVersion to 3.5.0
db5f37c1fac is described below

commit db5f37c1fac3491e3062ebc4bcd23468f6bc9d2b
Author: yangjie01 
AuthorDate: Thu Sep 14 15:03:51 2023 -0700

[SPARK-45161][INFRA] Bump previousSparkVersion to 3.5.0

### What changes were proposed in this pull request?
The main changes of this PR are as follows:
1. Bump MiMa's `previousSparkVersion` to 3.5.0
2. Clean up expired rules and case matches:
   2.1. Remove `v35excludes` and `case v if v.startsWith("3.5")`
   2.2. The targets of SPARK-43265, SPARK-44255, SPARK-44475, SPARK-44496 and
SPARK-43997 are all Apache Spark 3.5.0, so clean them up from `defaultExcludes`
   2.3. The targets of SPARK-44705 and SPARK-44198 are all Apache Spark 4.0.0,
so move them to `v40excludes`
3. Let the `commonUtils` and `sqlApi` modules start participating in the MiMa
check
4. Increase the memory for the MiMa check from 5120m to 5632m to avoid
`java.lang.OutOfMemoryError: GC overhead limit exceeded`

### Why are the changes needed?
To ensure that MiMa checks cover new APIs added in Spark 3.5.0.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Scala 2.12

```
dev/mima -Pscala-2.12
```

Scala 2.13

```
dev/change-scala-version.sh 2.13
dev/mima -Pscala-2.13
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42921 from LuciferYang/SPARK-45161.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 dev/mima   |   2 +-
 project/MimaBuild.scala|   2 +-
 project/MimaExcludes.scala | 166 ++---
 project/SparkBuild.scala   |   3 +-
 4 files changed, 10 insertions(+), 163 deletions(-)

diff --git a/dev/mima b/dev/mima
index 4a9e343b0a7..859301b4d66 100755
--- a/dev/mima
+++ b/dev/mima
@@ -42,7 +42,7 @@ $JAVA_CMD \
   -cp "$TOOLS_CLASSPATH:$OLD_DEPS_CLASSPATH" \
   org.apache.spark.tools.GenerateMIMAIgnore
 
-echo -e "q\n" | build/sbt -mem 5120 -DcopyDependencies=false "$@" 
mimaReportBinaryIssues | grep -v -e "info.*Resolving"
+echo -e "q\n" | build/sbt -mem 5632 -DcopyDependencies=false "$@" 
mimaReportBinaryIssues | grep -v -e "info.*Resolving"
 ret_val=$?
 
 if [ $ret_val != 0 ]; then
diff --git a/project/MimaBuild.scala b/project/MimaBuild.scala
index 53612daa30e..819742ea8c6 100644
--- a/project/MimaBuild.scala
+++ b/project/MimaBuild.scala
@@ -86,7 +86,7 @@ object MimaBuild {
 
   def mimaSettings(sparkHome: File, projectRef: ProjectRef): Seq[Setting[_]] = 
{
 val organization = "org.apache.spark"
-val previousSparkVersion = "3.4.0"
+val previousSparkVersion = "3.5.0"
 val project = projectRef.project
 val id = "spark-" + project
 
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index 6bdf8d460e0..52440ca7d17 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -16,7 +16,6 @@
  */
 
 import com.typesafe.tools.mima.core._
-import com.typesafe.tools.mima.core.ProblemFilters._
 
 /**
  * Additional excludes for checking of Spark's binary compatibility.
@@ -34,56 +33,15 @@ import com.typesafe.tools.mima.core.ProblemFilters._
  */
 object MimaExcludes {
 
-  // Exclude rules for 4.0.x
-  lazy val v40excludes = v35excludes ++ Seq(
+  // Exclude rules for 4.0.x from 3.5.0
+  lazy val v40excludes = defaultExcludes ++ Seq(
 // [SPARK-44863][UI] Add a button to download thread dump as a txt in 
Spark UI
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.status.api.v1.ThreadStackTrace.*"),
-
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.status.api.v1.ThreadStackTrace$")
-  )
-
-  // Exclude rules for 3.5.x from 3.4.0
-  lazy val v35excludes = defaultExcludes ++ Seq(
-// [SPARK-44531][CONNECT][SQL] Move encoder inference to sql/api
-
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.DataTypes"),
-
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.SQLUserDefinedType"),
-// [SPARK-43165][SQL] Move canWrite to DataTypeUtils
-
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.DataType.canWrite"),
-// [SPARK-43195][CORE] Remove unnecessary serializable wrapper in 
HadoopFSUtils
-
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation"),
-
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation$"),
-
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.util.HadoopFSUtils$SerializableFileStatus"),
-
Probl

[spark] branch dependabot/maven/org.apache.commons-commons-compress-1.24.0 deleted (was e12d8824507)

2023-09-14 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/org.apache.commons-commons-compress-1.24.0
in repository https://gitbox.apache.org/repos/asf/spark.git


 was e12d8824507 Bump org.apache.commons:commons-compress from 1.23.0 to 
1.24.0

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.





[spark] branch master updated: [SPARK-45084][SS] StateOperatorProgress to use accurate effective shuffle partition number

2023-09-14 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 25c624f269b [SPARK-45084][SS] StateOperatorProgress to use accurate 
effective shuffle partition number
25c624f269b is described below

commit 25c624f269bfec027ad889c1764d2904f19a2506
Author: Siying Dong 
AuthorDate: Fri Sep 15 10:36:44 2023 +0900

[SPARK-45084][SS] StateOperatorProgress to use accurate effective shuffle 
partition number

### What changes were proposed in this pull request?
Make `StateOperatorProgress.numShufflePartitions` report the effective number 
of shuffle partitions.
The `StateStoreWriter.numShufflePartitions` SQL metric is dropped at the same 
time, as it is no longer needed as a metric.

### Why are the changes needed?
Currently, a numShufflePartitions "metric" is reported in the 
StateOperatorProgress part of the progress report. However, the number is 
aggregated across executors, so in the case of task retries or speculative 
execution the metric is higher than the number of shuffle partitions in the query 
plan. We change the metric to use the effective partition count from the query 
plan, which makes it more usable.
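
For illustration, the reported value can be inspected from a streaming query's
progress in PySpark; a hedged sketch, assuming a running stateful streaming query
bound to `query`:

```python
# Hedged sketch: reading numShufflePartitions from the last progress of a
# running stateful streaming query (assumed to be bound to `query`).
progress = query.lastProgress  # dict, or None if no progress has been reported yet
if progress and progress.get("stateOperators"):
    for op in progress["stateOperators"]:
        print(op["operatorName"], op["numShufflePartitions"])
```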

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
StreamingAggregationSuite contains a unit test that validates the value

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42822 from siying/numShufflePartitionsMetric.

Authored-by: Siying Dong 
Signed-off-by: Jungtaek Lim 
---
 .../apache/spark/sql/execution/streaming/statefulOperators.scala| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
index b31f6151fce..67d89c7f40f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
@@ -143,7 +143,6 @@ trait StateStoreWriter extends StatefulOperator with 
PythonSQLMetrics { self: Sp
 "allRemovalsTimeMs" -> SQLMetrics.createTimingMetric(sparkContext, "time 
to remove"),
 "commitTimeMs" -> SQLMetrics.createTimingMetric(sparkContext, "time to 
commit changes"),
 "stateMemory" -> SQLMetrics.createSizeMetric(sparkContext, "memory used by 
state"),
-"numShufflePartitions" -> SQLMetrics.createMetric(sparkContext, "number of 
shuffle partitions"),
 "numStateStoreInstances" -> SQLMetrics.createMetric(sparkContext,
   "number of state store instances")
   ) ++ stateStoreCustomMetrics ++ pythonMetrics
@@ -159,6 +158,8 @@ trait StateStoreWriter extends StatefulOperator with 
PythonSQLMetrics { self: Sp
 val javaConvertedCustomMetrics: java.util.HashMap[String, java.lang.Long] =
   new java.util.HashMap(customMetrics.mapValues(long2Long).toMap.asJava)
 
+// We now don't report number of shuffle partitions inside the state 
operator. Instead,
+// it will be filled when the stream query progress is reported
 new StateOperatorProgress(
   operatorName = shortName,
   numRowsTotal = longMetric("numTotalStateRows").value,
@@ -169,7 +170,7 @@ trait StateStoreWriter extends StatefulOperator with 
PythonSQLMetrics { self: Sp
   commitTimeMs = longMetric("commitTimeMs").value,
   memoryUsedBytes = longMetric("stateMemory").value,
   numRowsDroppedByWatermark = 
longMetric("numRowsDroppedByWatermark").value,
-  numShufflePartitions = longMetric("numShufflePartitions").value,
+  numShufflePartitions = 
stateInfo.map(_.numPartitions.toLong).getOrElse(-1L),
   numStateStoreInstances = longMetric("numStateStoreInstances").value,
   javaConvertedCustomMetrics
 )
@@ -183,7 +184,6 @@ trait StateStoreWriter extends StatefulOperator with 
PythonSQLMetrics { self: Sp
 assert(numStateStoreInstances >= 1, s"invalid number of stores: 
$numStateStoreInstances")
 // Shuffle partitions capture the number of tasks that have this stateful 
operator instance.
 // For each task instance this number is incremented by one.
-longMetric("numShufflePartitions") += 1
 longMetric("numStateStoreInstances") += numStateStoreInstances
   }
 





[spark] branch master updated: [SPARK-45159][PYTHON] Handle named arguments only when necessary

2023-09-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 822f58f0d26 [SPARK-45159][PYTHON] Handle named arguments only when 
necessary
822f58f0d26 is described below

commit 822f58f0d26b7d760469151a65eaf9ee863a07a1
Author: Takuya UESHIN 
AuthorDate: Fri Sep 15 11:13:02 2023 +0900

[SPARK-45159][PYTHON] Handle named arguments only when necessary

### What changes were proposed in this pull request?

Handles named arguments only when necessary.

### Why are the changes needed?

Constructing `kwargs` as `dict` could be expensive. It should be done only 
when necessary.
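
For context, a hedged sketch of a Python UDF invoked from SQL both positionally and
with named arguments, mirroring the parity tests in this patch; it assumes an active
SparkSession bound to `spark` and a build with named-argument support for Python UDFs:

```python
# Hedged sketch: positional vs. named invocation of a registered Python UDF.
# Assumes an active SparkSession bound to `spark`.
from pyspark.sql.functions import udf

@udf("int")
def test_udf(a, b):
    return a + b

spark.udf.register("test_udf", test_udf)

# Positional call: no kwargs dict needs to be built on the worker.
spark.sql("SELECT test_udf(id, id * 10) FROM range(2)").show()

# Named-argument call: the only case where kwargs handling kicks in.
spark.sql("SELECT test_udf(a => id, b => id * 10) FROM range(2)").show()
```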

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42915 from ueshin/issues/SPARK-45159/kwargs.

Authored-by: Takuya UESHIN 
Signed-off-by: Hyukjin Kwon 
---
 .../tests/connect/test_parity_arrow_python_udf.py  |  24 +++
 python/pyspark/worker.py   | 199 -
 2 files changed, 137 insertions(+), 86 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py 
b/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py
index e4a64a7d591..fa329b598d9 100644
--- a/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py
+++ b/python/pyspark/sql/tests/connect/test_parity_arrow_python_udf.py
@@ -17,6 +17,8 @@
 
 import unittest
 
+from pyspark.errors import AnalysisException, PythonException
+from pyspark.sql.functions import udf
 from pyspark.sql.tests.connect.test_parity_udf import UDFParityTests
 from pyspark.sql.tests.test_arrow_python_udf import PythonUDFArrowTestsMixin
 
@@ -34,6 +36,28 @@ class ArrowPythonUDFParityTests(UDFParityTests, 
PythonUDFArrowTestsMixin):
 finally:
 super(ArrowPythonUDFParityTests, cls).tearDownClass()
 
+def test_named_arguments_negative(self):
+@udf("int")
+def test_udf(a, b):
+return a + b
+
+self.spark.udf.register("test_udf", test_udf)
+
+with self.assertRaisesRegex(
+AnalysisException,
+
"DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.DOUBLE_NAMED_ARGUMENT_REFERENCE",
+):
+self.spark.sql("SELECT test_udf(a => id, a => id * 10) FROM 
range(2)").show()
+
+with self.assertRaisesRegex(AnalysisException, 
"UNEXPECTED_POSITIONAL_ARGUMENT"):
+self.spark.sql("SELECT test_udf(a => id, id * 10) FROM 
range(2)").show()
+
+with self.assertRaises(PythonException):
+self.spark.sql("SELECT test_udf(c => 'x') FROM range(2)").show()
+
+with self.assertRaises(PythonException):
+self.spark.sql("SELECT test_udf(id, a => id * 10) FROM 
range(2)").show()
+
 
 if __name__ == "__main__":
 import unittest
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index 92bc622775b..eea6e8fa783 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -81,15 +81,19 @@ def chain(f, g):
 return lambda *a: g(f(*a))
 
 
-def wrap_udf(f, return_type):
+def wrap_udf(f, args_offsets, kwargs_offsets, return_type):
+func, args_kwargs_offsets = wrap_kwargs_support(f, args_offsets, 
kwargs_offsets)
+
 if return_type.needConversion():
 toInternal = return_type.toInternal
-return lambda *a, **kw: toInternal(f(*a, **kw))
+return args_kwargs_offsets, lambda *a: toInternal(func(*a))
 else:
-return lambda *a, **kw: f(*a, **kw)
+return args_kwargs_offsets, lambda *a: func(*a)
+
 
+def wrap_scalar_pandas_udf(f, args_offsets, kwargs_offsets, return_type):
+func, args_kwargs_offsets = wrap_kwargs_support(f, args_offsets, 
kwargs_offsets)
 
-def wrap_scalar_pandas_udf(f, return_type):
 arrow_return_type = to_arrow_type(return_type)
 
 def verify_result_type(result):
@@ -115,17 +119,20 @@ def wrap_scalar_pandas_udf(f, return_type):
 )
 return result
 
-return lambda *a, **kw: (
-verify_result_length(
-verify_result_type(f(*a, **kw)), len((list(a) + 
list(kw.values()))[0])
+return (
+args_kwargs_offsets,
+lambda *a: (
+verify_result_length(verify_result_type(func(*a)), len(a[0])),
+arrow_return_type,
 ),
-arrow_return_type,
 )
 
 
-def wrap_arrow_batch_udf(f, return_type):
+def wrap_arrow_batch_udf(f, args_offsets, kwargs_offsets, return_type):
 import pandas as pd
 
+func, args_kwargs_offsets = wrap_kwargs_support(f, args_offsets, 
kwargs_offsets)
+
 arrow_return_type = to_arrow_type(return_type)
 
 # "result_func" ensures the result of a Python UDF to be consistent 

[spark] branch master updated: [SPARK-45165][PS] Remove `inplace` parameter from `CategoricalIndex` APIs

2023-09-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 91ccc0f8e9f [SPARK-45165][PS] Remove `inplace` parameter from 
`CategoricalIndex` APIs
91ccc0f8e9f is described below

commit 91ccc0f8e9fd59ae13ac11640c8440317121d8b7
Author: Haejoon Lee 
AuthorDate: Thu Sep 14 21:09:58 2023 -0700

[SPARK-45165][PS] Remove `inplace` parameter from `CategoricalIndex` APIs

### What changes were proposed in this pull request?

This PR proposes to remove the `inplace` parameter from `CategoricalIndex` 
APIs.

### Why are the changes needed?

Because they're also removed from Pandas.
(Screenshot: https://github.com/apache/spark/assets/44108233/ef997036-77e0-49d4-9031-7dc892ef45d2)

We should match our behavior with the latest Pandas.
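
A minimal sketch of the resulting usage pattern (variable names are illustrative):
the affected APIs now always return a new index rather than accepting `inplace=True`.

```python
# Hedged sketch: CategoricalIndex APIs return a new index; there is no inplace option.
import pyspark.pandas as ps

idx = ps.CategoricalIndex(["a", "b", "a"])

# Previously `inplace=True` was accepted here; now the result must be reassigned.
idx_with_c = idx.add_categories("c")
idx_ordered = idx_with_c.as_ordered()
print(idx_ordered.categories)
```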

### Does this PR introduce _any_ user-facing change?

Yes, the `inplace` parameter is no longer available for `CategoricalIndex` 
APIs

### How was this patch tested?

Updated UTs. The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42927 from itholic/SPARK-45165.

Authored-by: Haejoon Lee 
Signed-off-by: Dongjoon Hyun 
---
 .../source/migration_guide/pyspark_upgrade.rst |   1 +
 python/pyspark/pandas/categorical.py   |  30 ++---
 python/pyspark/pandas/indexes/category.py  | 147 +++--
 .../pyspark/pandas/tests/indexes/test_category.py  |  18 ---
 4 files changed, 37 insertions(+), 159 deletions(-)

diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst 
b/python/docs/source/migration_guide/pyspark_upgrade.rst
index d743384dee6..992101734ff 100644
--- a/python/docs/source/migration_guide/pyspark_upgrade.rst
+++ b/python/docs/source/migration_guide/pyspark_upgrade.rst
@@ -31,6 +31,7 @@ Upgrading from PySpark 3.5 to 4.0
 * In Spark 4.0, ``Series.mad`` has been removed from pandas API on Spark.
 * In Spark 4.0, ``na_sentinel`` parameter from ``Index.factorize`` and 
``Series.factorize`` has been removed from pandas API on Spark, use 
``use_na_sentinel`` instead.
 * In Spark 4.0, ``inplace`` parameter from ``Categorical.add_categories``, 
``Categorical.remove_categories``, ``Categorical.set_categories``, 
``Categorical.rename_categories``, ``Categorical.reorder_categories``, 
``Categorical.as_ordered``, ``Categorical.as_unordered`` have been removed from 
pandas API on Spark.
+* In Spark 4.0, ``inplace`` parameter from 
``CategoricalIndex.add_categories``, ``CategoricalIndex.remove_categories``, 
``CategoricalIndex.remove_unused_categories``, 
``CategoricalIndex.set_categories``, ``CategoricalIndex.rename_categories``, 
``CategoricalIndex.reorder_categories``, ``CategoricalIndex.as_ordered``, 
``CategoricalIndex.as_unordered`` have been removed from pandas API on Spark.
 * In Spark 4.0, ``closed`` parameter from ``ps.date_range`` has been removed 
from pandas API on Spark.
 * In Spark 4.0, ``include_start`` and ``include_end`` parameters from 
``DataFrame.between_time`` have been removed from pandas API on Spark, use 
``inclusive`` instead.
 * In Spark 4.0, ``include_start`` and ``include_end`` parameters from 
``Series.between_time`` have been removed from pandas API on Spark, use 
``inclusive`` instead.
diff --git a/python/pyspark/pandas/categorical.py 
b/python/pyspark/pandas/categorical.py
index 7043d1709ee..c7e6ab873f6 100644
--- a/python/pyspark/pandas/categorical.py
+++ b/python/pyspark/pandas/categorical.py
@@ -185,8 +185,8 @@ class CategoricalAccessor:
 
 Returns
 ---
-Series or None
-Categorical with new categories added or None if ``inplace=True``.
+Series
+Categorical with new categories added
 
 Raises
 --
@@ -270,8 +270,8 @@ class CategoricalAccessor:
 
 Returns
 ---
-Series or None
-Ordered Categorical or None if ``inplace=True``.
+Series
+Ordered Categorical
 
 Examples
 
@@ -304,8 +304,8 @@ class CategoricalAccessor:
 
 Returns
 ---
-Series or None
-Unordered Categorical or None if ``inplace=True``.
+Series
+Unordered Categorical
 
 Examples
 
@@ -346,8 +346,8 @@ class CategoricalAccessor:
 
 Returns
 ---
-Series or None
-Categorical with removed categories or None if ``inplace=True``.
+Series
+Categorical with removed categories
 
 Raises
 --
@@ -421,8 +421,8 @@ class CategoricalAccessor:
 
 Returns
 ---
-cat : Series or None
-Categorical with unused categories dropped or None if 
``inplace=True``.
+   

[spark] branch master updated (91ccc0f8e9f -> 5678dbecc29)

2023-09-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 91ccc0f8e9f [SPARK-45165][PS] Remove `inplace` parameter from 
`CategoricalIndex` APIs
 add 5678dbecc29 [SPARK-45174][CORE] Support `spark.deploy.maxDrivers`

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/master/Master.scala|  7 +++--
 .../org/apache/spark/internal/config/Deploy.scala  |  6 
 .../apache/spark/deploy/master/MasterSuite.scala   | 33 ++
 3 files changed, 43 insertions(+), 3 deletions(-)





[spark] branch master updated: [SPARK-45171][SQL] Initialize non-deterministic expressions in `GenerateExec`

2023-09-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e097f916a27 [SPARK-45171][SQL] Initialize non-deterministic 
expressions in `GenerateExec`
e097f916a27 is described below

commit e097f916a2769dfe82bfd216fedcd6962e8280c8
Author: Bruce Robbins 
AuthorDate: Fri Sep 15 13:22:40 2023 +0900

[SPARK-45171][SQL] Initialize non-deterministic expressions in 
`GenerateExec`

### What changes were proposed in this pull request?

Before evaluating the generator function in `GenerateExec`, initialize 
non-deterministic expressions.

### Why are the changes needed?

The following query fails:
```
select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);

23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
    at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
    at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
    at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
    at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
    at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
    at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
    at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
    at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
    at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
    at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
    at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...
```
However, this query succeeds:
```
select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);

0
1
2
3
...
801
802
803
```
The difference is that `transform` turns off whole-stage codegen, which 
exposes a bug in `GenerateExec` in which the non-deterministic expression 
passed to the generator function is not initialized before being used.

This PR fixes the bug in `GenerateExec`.
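
For reference, a hedged PySpark version of the same repro, assuming an active `SparkSession` bound to `spark`; `transform()` keeps the plan on the interpreted (non-codegen) `GenerateExec` path described above, and with this fix the query should return rows instead of raising the initialization error.

```python
# Hedged sketch: the same failing query as above, issued through PySpark.
df = spark.sql("""
    select *
    from explode(
      transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
    )
""")
df.show(5)
```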

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42933 from bersprockets/nondeterm_issue.

Lead-authored-by: Bruce Robbins 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../main/scala/org/apache/spark/sql/execution/GenerateExec.scala   | 4 
 .../test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala   | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala
index f6dbf5fda18..b99361437e0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala
@@ -78,6 +78,10 @@ case class GenerateExec(
     // boundGenerator.terminate() should be triggered after all of the rows in the partition
 val numOutputRows = longMetric("numOutputRows")
 child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
+  boundGenerator.foreach {
+case n: Nondeterministic => n.initialize(index)
+case _ =>
+  }
      val generatorNullRow = new GenericInternalRow(generator.elementSchema.length)
   val rows = if (requiredChildOutput.nonEmpty) {
 
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
index 24061b4c7c2..68f63feb5c5 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
@@ -536,6 +536,13 @@ class GeneratorFunctionSuite extends QueryTest with SharedSparkSession {
 checkAnswer(df,
   Row(1, 1) :: Row(1, 2) :

[spark] branch branch-3.5 updated: [SPARK-45171][SQL] Initialize non-deterministic expressions in `GenerateExec`

2023-09-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9c0b803ba12 [SPARK-45171][SQL] Initialize non-deterministic 
expressions in `GenerateExec`
9c0b803ba12 is described below

commit 9c0b803ba124a6e70762aec1e5559b0d66529f4d
Author: Bruce Robbins 
AuthorDate: Fri Sep 15 13:22:40 2023 +0900

[SPARK-45171][SQL] Initialize non-deterministic expressions in 
`GenerateExec`

### What changes were proposed in this pull request?

Before evaluating the generator function in `GenerateExec`, initialize 
non-deterministic expressions.

### Why are the changes needed?

The following query fails:
```
select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);

23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
    at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
    at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
    at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
    at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
    at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
    at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
    at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
    at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
    at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
    at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
    at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...
```
However, this query succeeds:
```
select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);

0
1
2
3
...
801
802
803
```
The difference is that `transform` turns off whole-stage codegen, which 
exposes a bug in `GenerateExec` in which the non-deterministic expression 
passed to the generator function is not initialized before being used.

This PR fixes the bug in `GenerateExec`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42933 from bersprockets/nondeterm_issue.

Lead-authored-by: Bruce Robbins 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit e097f916a2769dfe82bfd216fedcd6962e8280c8)
Signed-off-by: Hyukjin Kwon 
---
 .../main/scala/org/apache/spark/sql/execution/GenerateExec.scala   | 4 
 .../test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala   | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala
index f6dbf5fda18..b99361437e0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala
@@ -78,6 +78,10 @@ case class GenerateExec(
     // boundGenerator.terminate() should be triggered after all of the rows in the partition
 val numOutputRows = longMetric("numOutputRows")
 child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
+  boundGenerator.foreach {
+case n: Nondeterministic => n.initialize(index)
+case _ =>
+  }
      val generatorNullRow = new GenericInternalRow(generator.elementSchema.length)
   val rows = if (requiredChildOutput.nonEmpty) {
 
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
index abec582d43a..0746a4b92af 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
@@ -536,6 +536,13 @@ class GeneratorFunctionSuite extends QueryTest with SharedSparkSession {

[spark] branch master updated (e097f916a27 -> 5d4ca79aad1)

2023-09-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e097f916a27 [SPARK-45171][SQL] Initialize non-deterministic 
expressions in `GenerateExec`
 add 5d4ca79aad1 [SPARK-45172][BUILD] Upgrade `commons-compress` to 1.24.0

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

2023-09-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c1c710e7da7 [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible 
with PyArrow 13.0.0
c1c710e7da7 is described below

commit c1c710e7da75b989f4d14e84e85f336bc10920e0
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 14 21:26:17 2023 -0700

[SPARK-45143][PYTHON][CONNECT] Make PySpark compatible with PyArrow 13.0.0

### What changes were proposed in this pull request?
1. In PyArrow 13.0.0, the behavior of [Table#to_pandas](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas) and [ChunkedArray#to_pandas](https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html#pyarrow.ChunkedArray.to_pandas) changed, so `coerce_temporal_nanoseconds=True` is now passed to keep the previous nanosecond-based conversion.

2. There is another undocumented breaking change in the data type conversion [`TimestampType#to_pandas_dtype`](https://arrow.apache.org/docs/python/generated/pyarrow.TimestampType.html#pyarrow.TimestampType.to_pandas_dtype):

12.0.1:
```
In [1]: import pyarrow as pa

In [2]: pa.timestamp("us", tz=None).to_pandas_dtype()
Out[2]: dtype('
```
Signed-off-by: Dongjoon Hyun 
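
A minimal, hedged sketch (not part of the patch) of the two behavior changes described above; it assumes `pyarrow>=13.0.0` and pandas are installed, and the printed dtypes are expectations rather than captured output.

```python
import pandas as pd
import pyarrow as pa

# 1. to_pandas() no longer coerces temporal columns to nanoseconds by default,
#    so callers that expect datetime64[ns] must opt in explicitly.
tbl = pa.table({"ts": pa.array([pd.Timestamp("2023-09-14")], type=pa.timestamp("us"))})
print(tbl.to_pandas(coerce_temporal_nanoseconds=True).dtypes)  # expected: ts    datetime64[ns]

# 2. to_pandas_dtype() now reflects the unit of the Arrow type (microseconds here)
#    instead of always reporting nanoseconds as 12.0.1 did.
print(pa.timestamp("us", tz=None).to_pandas_dtype())
```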
---
 .github/workflows/build_and_test.yml   |  4 ++--
 dev/infra/Dockerfile   |  2 +-
 python/pyspark/pandas/typedef/typehints.py |  4 +++-
 python/pyspark/sql/connect/client/core.py  | 28 
 python/pyspark/sql/pandas/conversion.py| 16 ++--
 python/pyspark/sql/pandas/serializers.py   | 17 -
 6 files changed, 56 insertions(+), 15 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 25c95bd607d..9806a4cf415 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -263,7 +263,7 @@ jobs:
 - name: Install Python packages (Python 3.8)
   if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 
'sql-')) || contains(matrix.modules, 'connect')
   run: |
-python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 
scipy unittest-xml-reporting 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3'
+python3.8 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy 
unittest-xml-reporting 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3'
 python3.8 -m pip list
 # Run the tests.
 - name: Run tests
@@ -728,7 +728,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
-python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
pyarrow pandas 'plotly>=4.8'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 apt-get update -y
 apt-get install -y ruby ruby-dev
diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index 99423ce072c..8e5a3cb7c05 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -85,7 +85,7 @@ RUN Rscript -e "devtools::install_version('roxygen2', 
version='7.2.0', repos='ht
 ENV R_LIBS_SITE 
"/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
 
 RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy 
unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 
'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy 
unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 
'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
 
 # Add Python deps for Spark Connect.
 RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3' 'googleapis-common-protos==1.56.4'
diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 6e41395186d..413ef20dae0 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -293,7 +293,9 @@ def spark_type_to_pandas_dtype(
 ),
 ):
 return np.dtype("object")
-elif isinstance(spark_type, types.TimestampType):
+elif isinstance(spark_type, types.DayTimeIntervalType):
+return np.dtype("timedelta64[ns]")
+elif isinstance(spark_type, (types.TimestampType, types.TimestampNTZType)):
 return np.dtype("datetime64[ns]")
 else:
 return np.dtype(to_arrow_type(spark_type).to_pandas_
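
A hedged usage sketch of the mapping added in the hunk above, assuming `spark_type_to_pandas_dtype` remains importable from `pyspark.pandas.typedef.typehints`:

```python
import numpy as np
from pyspark.sql.types import DayTimeIntervalType, TimestampNTZType
from pyspark.pandas.typedef.typehints import spark_type_to_pandas_dtype

# With this change, day-time intervals and TIMESTAMP_NTZ map to concrete numpy dtypes.
assert spark_type_to_pandas_dtype(DayTimeIntervalType()) == np.dtype("timedelta64[ns]")
assert spark_type_to_pandas_dtype(TimestampNTZType()) == np.dtype("datetime64[ns]")
```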

[spark] branch master updated (c1c710e7da7 -> c3b676d818e)

2023-09-14 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c1c710e7da7 [SPARK-45143][PYTHON][CONNECT] Make PySpark compatible 
with PyArrow 13.0.0
 add c3b676d818e [SPARK-45168][PYTHON] Increase Pandas minimum version to 
1.4.4

No new revisions were added by this update.

Summary of changes:
 python/docs/source/getting_started/install.rst |   2 +-
 python/docs/source/user_guide/sql/arrow_pandas.rst |   2 +-
 python/pyspark/pandas/groupby.py   |  10 +-
 python/pyspark/pandas/resample.py  |   8 -
 .../pandas/tests/computation/test_binary_ops.py|  10 +-
 .../pandas/tests/computation/test_combine.py   |   5 +-
 .../pandas/tests/computation/test_describe.py  | 294 +++---
 .../pyspark/pandas/tests/computation/test_eval.py  |   6 +-
 .../pandas/tests/computation/test_missing_data.py  |  74 ++---
 .../tests/connect/data_type_ops/testing_utils.py   |  28 +-
 .../pandas/tests/data_type_ops/test_boolean_ops.py |   5 +-
 .../tests/data_type_ops/test_categorical_ops.py|   8 +-
 .../pandas/tests/data_type_ops/test_num_ops.py |  32 +-
 .../pandas/tests/data_type_ops/testing_utils.py|  27 +-
 python/pyspark/pandas/tests/frame/test_attrs.py|  35 +--
 .../pyspark/pandas/tests/frame/test_constructor.py |  27 +-
 .../pyspark/pandas/tests/frame/test_reindexing.py  |  27 +-
 .../pyspark/pandas/tests/frame/test_reshaping.py   |  20 +-
 python/pyspark/pandas/tests/frame/test_truncate.py |  24 +-
 .../pyspark/pandas/tests/groupby/test_groupby.py   |  50 +--
 python/pyspark/pandas/tests/groupby/test_index.py  |  14 +-
 python/pyspark/pandas/tests/indexes/test_base.py   |  13 +-
 .../pyspark/pandas/tests/indexes/test_base_slow.py | 199 
 .../pyspark/pandas/tests/indexes/test_category.py  |  41 +--
 .../pyspark/pandas/tests/indexes/test_datetime.py  |   6 +-
 python/pyspark/pandas/tests/io/test_io.py  |   6 +-
 python/pyspark/pandas/tests/series/test_as_type.py |  42 +--
 python/pyspark/pandas/tests/series/test_compute.py |  96 ++
 .../pandas/tests/series/test_missing_data.py   |  43 +--
 python/pyspark/pandas/tests/series/test_series.py  |  31 +-
 python/pyspark/pandas/tests/series/test_stat.py|   8 +-
 python/pyspark/pandas/tests/test_categorical.py|  41 +--
 python/pyspark/pandas/tests/test_expanding.py  |  73 ++---
 python/pyspark/pandas/tests/test_indexing.py   |  17 +-
 python/pyspark/pandas/tests/test_namespace.py  |  43 ++-
 .../pandas/tests/test_ops_on_diff_frames.py| 341 ++---
 .../test_ops_on_diff_frames_groupby_expanding.py   |  17 +-
 .../test_ops_on_diff_frames_groupby_rolling.py |  17 +-
 python/pyspark/pandas/tests/test_reshape.py|  37 +--
 python/pyspark/sql/pandas/utils.py |   2 +-
 python/pyspark/testing/pandasutils.py  |  31 +-
 python/setup.py|   2 +-
 42 files changed, 440 insertions(+), 1374 deletions(-)
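
A hedged sketch of how the raised floor is typically enforced at runtime; `require_minimum_pandas_version` is the helper living in the touched `python/pyspark/sql/pandas/utils.py`, and the printed messages below are illustrative, not copied from the patch.

```python
from pyspark.sql.pandas.utils import require_minimum_pandas_version

try:
    require_minimum_pandas_version()
    print("pandas meets the minimum version supported by PySpark")
except ImportError as exc:
    # Raised when pandas is missing or older than the supported floor (now 1.4.4).
    print(f"unsupported pandas installation: {exc}")
```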


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org