[jira] [Updated] (SPARK-43903) Improve ArrayType input support in Arrow-optimized Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43903:
-
Summary: Improve ArrayType input support in Arrow-optimized Python UDF  
(was: Non-atomic data type support in Arrow-optimized Python UDF)

> Improve ArrayType input support in Arrow-optimized Python UDF
> -
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>







[jira] [Updated] (SPARK-43893) Non-atomic data type support in Arrow-optimized Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43893:
-
Summary: Non-atomic data type support in Arrow-optimized Python UDF  (was: 
StructType input/output support in Arrow-optimized Python UDF)

> Non-atomic data type support in Arrow-optimized Python UDF
> --
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF

2023-06-06 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-43893:


Assignee: Xinrong Meng

> StructType input/output support in Arrow-optimized Python UDF
> -
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF

2023-06-06 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43893.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41321
[https://github.com/apache/spark/pull/41321]

> StructType input/output support in Arrow-optimized Python UDF
> -
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Updated] (SPARK-43903) Non-atomic data type support in Arrow-optimized Python UDF

2023-05-31 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43903:
-
Summary: Non-atomic data type support in Arrow-optimized Python UDF  (was: 
Standardize ArrayType conversion for Python UDF)

> Non-atomic data type support in Arrow-optimized Python UDF
> --
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>







[jira] [Created] (SPARK-43903) Standardize ArrayType conversion for Python UDF

2023-05-31 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43903:


 Summary: Standardize ArrayType conversion for Python UDF
 Key: SPARK-43903
 URL: https://issues.apache.org/jira/browse/SPARK-43903
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Created] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF

2023-05-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43893:


 Summary: StructType input/output support in Arrow-optimized Python 
UDF
 Key: SPARK-43893
 URL: https://issues.apache.org/jira/browse/SPARK-43893
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Created] (SPARK-43886) Accept generics tuple as typing hints in Pandas UDF

2023-05-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43886:


 Summary: Accept generics tuple as typing hints in Pandas UDF
 Key: SPARK-43886
 URL: https://issues.apache.org/jira/browse/SPARK-43886
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Created] (SPARK-43804) Test on nested structs support in Pandas UDF

2023-05-25 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43804:


 Summary: Test on nested structs support in Pandas UDF
 Key: SPARK-43804
 URL: https://issues.apache.org/jira/browse/SPARK-43804
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Test nested struct support in Pandas UDF. That support is newly enabled 
compared to Spark 3.4.
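
A minimal sketch of the kind of case to cover (my illustration, not taken from 
the ticket; assumes Spark 3.5+ and a running SparkSession named `spark`):

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf

# A Pandas UDF whose return type contains a nested struct; the nested
# field travels as a DataFrame column of dicts.
@pandas_udf("struct<outer: struct<x: bigint>>")
def nest(s: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"outer": [{"x": int(v)} for v in s]})

spark.range(3).select(nest("id")).show()
{code}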






[jira] [Updated] (SPARK-43545) Support Nested Timestamp Types

2023-05-25 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43545:
-
Summary: Support Nested Timestamp Types  (was: Remove outdated 
UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION)

> Support Nested Timestamp Types
> --
>
> Key: SPARK-43545
> URL: https://issues.apache.org/jira/browse/SPARK-43545
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-24 Thread Xinrong Meng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-43543 ]


Xinrong Meng deleted comment on SPARK-43543:
--

was (Author: xinrongm):
Issue resolved by pull request 41147
[https://github.com/apache/spark/pull/41147]

> Standardize Nested Complex DataTypes Support
> 
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-43544) Fix nested MapType behavior in Pandas UDF

2023-05-24 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725949#comment-17725949
 ] 

Xinrong Meng commented on SPARK-43544:
--

Resolved by https://github.com/apache/spark/pull/41147.

> Fix nested MapType behavior in Pandas UDF
> -
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Assigned] (SPARK-43544) Fix nested MapType behavior in Pandas UDF

2023-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-43544:


Assignee: Xinrong Meng

> Fix nested MapType behavior in Pandas UDF
> -
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-43544) Fix nested MapType behavior in Pandas UDF

2023-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43544.
--
Resolution: Done

> Fix nested MapType behavior in Pandas UDF
> -
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Updated] (SPARK-43546) Complete parity tests of Pandas UDF

2023-05-22 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43546:
-
Summary: Complete parity tests of Pandas UDF  (was: Complete Pandas UDF 
parity tests)

> Complete parity tests of Pandas UDF
> ---
>
> Key: SPARK-43546
> URL: https://issues.apache.org/jira/browse/SPARK-43546
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Tests as shown below should be added to Connect.
> test_pandas_udf_grouped_agg.py
> test_pandas_udf_scalar.py
> test_pandas_udf_window.py






[jira] [Created] (SPARK-43734) Expression "(v)" within a window function doesn't raise an AnalysisException

2023-05-22 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43734:


 Summary: Expression "(v)" within a window function doesn't raise an AnalysisException
 Key: SPARK-43734
 URL: https://issues.apache.org/jira/browse/SPARK-43734
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Expression "(v)" within a window function doesn't raise a 
AnalysisException

 

See PandasUDFWindowParityTests.test_invalid_args for reproduction.
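
A hypothetical repro sketch (the authoritative reproduction is 
PandasUDFWindowParityTests.test_invalid_args; this assumes a SparkSession 
`spark`):

{code:python}
from pyspark.sql import Window

w = Window.partitionBy("id").orderBy("v")
df = spark.createDataFrame([(1, 1.0), (1, 2.0)], ("id", "v"))

# A plain (non-aggregate) column over a window spec is invalid and is
# expected to raise an AnalysisException, but it does not.
df.select(df.v.over(w))
{code}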






[jira] [Created] (SPARK-43727) Parity returnType check in Spark Connect

2023-05-22 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43727:


 Summary: Parity returnType check in Spark Connect
 Key: SPARK-43727
 URL: https://issues.apache.org/jira/browse/SPARK-43727
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Resolved] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43543.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41147
[https://github.com/apache/spark/pull/41147]

> Standardize Nested Complex DataTypes Support
> 
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-43543:


Assignee: Xinrong Meng

> Standardize Nested Complex DataTypes Support
> 
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Created] (SPARK-43579) Cache the converter between Arrow and pandas for reuse

2023-05-18 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43579:


 Summary: Cache the converter between Arrow and pandas for reuse
 Key: SPARK-43579
 URL: https://issues.apache.org/jira/browse/SPARK-43579
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Updated] (SPARK-43544) Fix nested MapType behavior in Pandas UDF

2023-05-17 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43544:
-
Summary: Fix nested MapType behavior in Pandas UDF  (was: Standardize 
nested non-atomic input type support in Pandas UDF)

> Fix nested MapType behavior in Pandas UDF
> -
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>







[jira] [Created] (SPARK-43546) Complete Pandas UDF parity tests

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43546:


 Summary: Complete Pandas UDF parity tests
 Key: SPARK-43546
 URL: https://issues.apache.org/jira/browse/SPARK-43546
 Project: Spark
  Issue Type: Test
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Tests as shown below should be added to Connect.

test_pandas_udf_grouped_agg.py
test_pandas_udf_scalar.py
test_pandas_udf_window.py






[jira] [Created] (SPARK-43545) Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43545:


 Summary: Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION
 Key: SPARK-43545
 URL: https://issues.apache.org/jira/browse/SPARK-43545
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Created] (SPARK-43544) Standardize nested non-atomic input type support in Pandas UDF

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43544:


 Summary: Standardize nested non-atomic input type support in 
Pandas UDF
 Key: SPARK-43544
 URL: https://issues.apache.org/jira/browse/SPARK-43544
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Created] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43543:


 Summary: Standardize Nested Complex DataTypes Support
 Key: SPARK-43543
 URL: https://issues.apache.org/jira/browse/SPARK-43543
 Project: Spark
  Issue Type: Umbrella
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng









[jira] [Updated] (SPARK-43440) Support registration of an Arrow-optimized Python UDF

2023-05-10 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43440:
-
Description: 
Currently, when users register an Arrow-optimized Python UDF, it is registered 
as a pickled Python UDF and thus executed without Arrow optimization.
We should support registration of Arrow-optimized Python UDFs and execute them 
with Arrow optimization.

  was:Support registration of an Arrow-optimized Python UDF
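
A minimal sketch of the desired behavior (assuming Spark 3.5+ and a 
SparkSession `spark`; `add_one` is an illustrative UDF):

{code:python}
from pyspark.sql.functions import udf

@udf(returnType="long", useArrow=True)  # Arrow-optimized Python UDF
def add_one(x):
    return x + 1

# Registration should keep the Arrow optimization rather than falling
# back to a pickled Python UDF.
spark.udf.register("add_one", add_one)
spark.sql("SELECT add_one(id) FROM range(3)").show()
{code}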


> Support registration of an Arrow-optimized Python UDF 
> --
>
> Key: SPARK-43440
> URL: https://issues.apache.org/jira/browse/SPARK-43440
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, when users register an Arrow-optimized Python UDF, it is 
> registered as a pickled Python UDF and thus executed without Arrow 
> optimization.
> We should support registration of Arrow-optimized Python UDFs and execute 
> them with Arrow optimization.






[jira] [Created] (SPARK-43440) Support registration of an Arrow-optimized Python UDF

2023-05-10 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43440:


 Summary: Support registration of an Arrow-optimized Python UDF 
 Key: SPARK-43440
 URL: https://issues.apache.org/jira/browse/SPARK-43440
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Support registration of an Arrow-optimized Python UDF






[jira] [Commented] (SPARK-42523) Apache Spark 3.4 release

2023-05-10 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721549#comment-17721549
 ] 

Xinrong Meng commented on SPARK-42523:
--

I am wondering whether we should keep the ticket open for minor releases such 
as the upcoming 3.4.1.

> Apache Spark 3.4 release
> 
>
> Key: SPARK-42523
> URL: https://issues.apache.org/jira/browse/SPARK-42523
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> An umbrella for Apache Spark 3.4 release






[jira] [Resolved] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs

2023-05-10 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43412.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41053
[https://github.com/apache/spark/pull/41053]

> Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
> --
>
> Key: SPARK-43412
> URL: https://issues.apache.org/jira/browse/SPARK-43412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>
> We are about to improve nested non-atomic input/output support of an 
> Arrow-optimized Python UDF.
> However, currently, it shares the same EvalType as a pickled Python UDF but 
> the same implementation as a Pandas UDF.
> Introducing a dedicated EvalType isolates the changes to Arrow-optimized 
> Python UDFs.






[jira] [Created] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs

2023-05-08 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43412:


 Summary: Introduce `SQL_ARROW_BATCHED_UDF` EvalType for 
Arrow-optimized Python UDFs
 Key: SPARK-43412
 URL: https://issues.apache.org/jira/browse/SPARK-43412
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


We are about to improve nested non-atomic input/output support of an 
Arrow-optimized Python UDF.

However, currently, it shares the same EvalType as a pickled Python UDF but the 
same implementation as a Pandas UDF.

Introducing a dedicated EvalType isolates the changes to Arrow-optimized Python 
UDFs.
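
A sketch of the distinction (Spark 3.5+; the evalType names in the comments 
refer to constants in PythonEvalType and are illustrative):

{code:python}
from pyspark.sql.functions import pandas_udf, udf

pickled = udf(lambda x: x + 1, "long")               # SQL_BATCHED_UDF
arrow = udf(lambda x: x + 1, "long", useArrow=True)  # SQL_ARROW_BATCHED_UDF (new)
scalar = pandas_udf(lambda s: s + 1, "long")         # SQL_SCALAR_PANDAS_UDF
{code}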






[jira] [Commented] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on

2023-05-04 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719463#comment-17719463
 ] 

Xinrong Meng commented on SPARK-41971:
--

Hi [~nikj], the issue has been resolved. Feel free to pick other issues that 
you are interested in. Normally we comment on the ticket first and then file 
the pull request directly.

> `toPandas` should support duplicate field names when arrow-optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 3.5.0
>
>
> toPandas supports duplicate column names, but for a struct column, it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}






[jira] [Assigned] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on

2023-05-04 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-41971:


Assignee: Takuya Ueshin

> `toPandas` should support duplicate field names when arrow-optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 3.5.0
>
>
> toPandas supports duplicate column names, but for a struct column, it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}






[jira] [Resolved] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on

2023-05-04 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-41971.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40988
[https://github.com/apache/spark/pull/40988]

> `toPandas` should support duplicate field names when arrow-optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0
>
>
> toPandas supports duplicate column names, but for a struct column, it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}






[jira] [Resolved] (SPARK-43032) Add StreamingQueryManager API

2023-05-02 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43032.
--
  Assignee: Wei Liu
Resolution: Done

Resolved by https://github.com/apache/spark/pull/40861.

> Add StreamingQueryManager API
> -
>
> Key: SPARK-43032
> URL: https://issues.apache.org/jira/browse/SPARK-43032
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Wei Liu
>Priority: Major
>
> Add the StreamingQueryManager API. It would include the APIs that can be 
> directly supported; APIs like registering a streaming listener will be 
> handled separately.
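
A usage sketch of the APIs in question (assuming a SparkSession `spark`; the 
rate source and noop sink are illustrative choices):

{code:python}
# Start a trivial streaming query, then exercise StreamingQueryManager.
q = (spark.readStream.format("rate").load()
     .writeStream.format("noop").start())

spark.streams.active       # list the active queries
spark.streams.get(q.id)    # look a query up by id
q.stop()
{code}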






[jira] [Updated] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39892:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
> ArrowType.Decimal(precision, scale)
> 
>
> Key: SPARK-39892
> URL: https://issues.apache.org/jira/browse/SPARK-39892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> [warn] 
> /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
>  [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
> origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal.<init> | 
> version=] constructor Decimal in class Decimal is deprecated






[jira] [Updated] (SPARK-41259) Spark-sql cli query results should correspond to schema

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41259:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Spark-sql cli query results should correspond to schema
> ---
>
> Key: SPARK-41259
> URL: https://issues.apache.org/jira/browse/SPARK-41259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Priority: Minor
> Fix For: 3.5.0
>
>
> When using the spark-sql CLI, Spark outputs only one column in the `show 
> tables` and `show views` commands to be compatible with Hive output, but the 
> output schema still has Spark's three columns.






[jira] [Updated] (SPARK-39814) Use AmazonKinesisClientBuilder.withCredentials instead of new AmazonKinesisClient(credentials)

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39814:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Use AmazonKinesisClientBuilder.withCredentials instead of new 
> AmazonKinesisClient(credentials)
> --
>
> Key: SPARK-39814
> URL: https://issues.apache.org/jira/browse/SPARK-39814
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> [warn] 
> /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:108:25:
>  [deprecation @ 
> org.apache.spark.examples.streaming.KinesisWordCountASL.main.kinesisClient | 
> origin=com.amazonaws.services.kinesis.AmazonKinesisClient.<init> | version=] 
> constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated
> [warn] 
> /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:224:25:
>  [deprecation @ 
> org.apache.spark.examples.streaming.KinesisWordProducerASL.generate.kinesisClient
>  | origin=com.amazonaws.services.kinesis.AmazonKinesisClient.<init> | 
> version=] constructor AmazonKinesisClient in class AmazonKinesisClient is 
> deprecated
> [warn] 
> /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:142:24:
>  [deprecation @ 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.client | 
> origin=com.amazonaws.services.kinesis.AmazonKinesisClient.<init> | version=] 
> constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated
> [warn] 
> /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisTestUtils.scala:58:18:
>  [deprecation @ 
> org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient.client | 
> origin=com.amazonaws.services.kinesis.AmazonKinesisClient.<init> | version=] 
> constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated






[jira] [Updated] (SPARK-39136) JDBCTable support properties

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39136:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> JDBCTable support properties
> 
>
> Key: SPARK-39136
> URL: https://issues.apache.org/jira/browse/SPARK-39136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.5.0
>
>
> {code:java}
>  >
>  > desc formatted jdbc.test.people;
> NAME  string
> IDint
> # Partitioning
> Not partitioned
> # Detailed Table Information
> Name  test.people
> Table Properties  []
> Time taken: 0.048 seconds, Fetched 9 row(s)
> {code}






[jira] [Updated] (SPARK-37935) Migrate onto error classes

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37935:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Migrate onto error classes
> --
>
> Key: SPARK-37935
> URL: https://issues.apache.org/jira/browse/SPARK-37935
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0
>
>
> The PR https://github.com/apache/spark/pull/32850 introduced error classes as 
> part of the error messages framework 
> (https://issues.apache.org/jira/browse/SPARK-33539). We need to migrate all 
> exceptions from QueryExecutionErrors, QueryCompilationErrors and 
> QueryParsingErrors onto the error classes using instances of SparkThrowable, 
> and carefully test every error class by writing tests in dedicated test 
> suites:
> * QueryExecutionErrorsSuite for the errors that occur during query execution
> * QueryCompilationErrorsSuite ... query compilation or eagerly executing 
> commands
> * QueryParsingErrorsSuite ... parsing errors
> Here is an example, https://github.com/apache/spark/pull/35157, of how an 
> existing Java exception can be replaced and how the related error classes are 
> tested. At the end, we should migrate all exceptions from the files 
> Query.*Errors.scala and cover all error classes from the error-classes.json 
> file with tests.
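
From the user side, a migrated error surfaces its error class; a sketch 
(assuming PySpark 3.4+ and a SparkSession `spark`; the exact class name may 
differ by version):

{code:python}
from pyspark.errors import AnalysisException

try:
    spark.sql("SELECT no_such_col FROM range(1)")
except AnalysisException as e:
    print(e.getErrorClass())  # e.g. UNRESOLVED_COLUMN.WITH_SUGGESTION
{code}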






[jira] [Updated] (SPARK-42169) Implement code generation for `to_csv` function (StructsToCsv)

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42169:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Implement code generation for `to_csv` function (StructsToCsv)
> --
>
> Key: SPARK-42169
> URL: https://issues.apache.org/jira/browse/SPARK-42169
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Narek Karapetian
>Priority: Minor
>  Labels: csv, sql
> Fix For: 3.5.0
>
>
> Implement code generation for the `to_csv` function instead of extending it 
> from the CodegenFallback trait.
> {code:java}
> org.apache.spark.sql.catalyst.expressions.StructsToCsv.doGenCode(...){code}
>  
> This is good to have from a performance point of view.
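
For reference, the user-facing function whose evaluation path would gain 
codegen (a usage sketch assuming a SparkSession `spark`):

{code:python}
from pyspark.sql.functions import struct, to_csv

# to_csv renders a struct column as a CSV string, e.g. the row id=0 -> "0".
spark.range(1).select(to_csv(struct("id")).alias("csv")).show()
{code}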






[jira] [Updated] (SPARK-38945) Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep

2023-04-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38945:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep
> 
>
> Key: SPARK-38945
> URL: https://issues.apache.org/jira/browse/SPARK-38945
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: Qian Sun
>Priority: Minor
> Fix For: 3.5.0
>
>
> Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep, because they 
> are already imported.






[jira] [Created] (SPARK-43082) Arrow-optimized Python UDFs in Spark Connect

2023-04-10 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43082:


 Summary: Arrow-optimized Python UDFs in Spark Connect
 Key: SPARK-43082
 URL: https://issues.apache.org/jira/browse/SPARK-43082
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Implement Arrow-optimized Python UDFs in Spark Connect.






[jira] [Updated] (SPARK-39696) Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration

2023-04-06 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39696:
-
Priority: Blocker  (was: Major)

> Uncaught exception in thread executor-heartbeater 
> java.util.ConcurrentModificationException: mutation occurred during iteration
> ---
>
> Key: SPARK-39696
> URL: https://issues.apache.org/jira/browse/SPARK-39696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.4.0
> Environment: Spark 3.3.0 (spark-3.3.0-bin-hadoop3-scala2.13 
> distribution)
> Scala 2.13.8 / OpenJDK 17.0.3 application compilation
> Alpine Linux 3.14.3
> JVM OpenJDK 64-Bit Server VM Temurin-17.0.1+12
>Reporter: Stephen Mcmullan
>Priority: Blocker
> Fix For: 3.4.0
>
>
> {noformat}
> 2022-06-21 18:17:49.289Z ERROR [executor-heartbeater] 
> org.apache.spark.util.Utils - Uncaught exception in thread 
> executor-heartbeater
> java.util.ConcurrentModificationException: mutation occurred during iteration
> at 
> scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:873) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:869) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.immutable.VectorStatics$.append1IfSpace(Vector.scala:1959) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector1.appendedAll0(Vector.scala:425) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:203) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:113) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat$(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractSeq.concat(Seq.scala:1161) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus$(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.$plus$plus(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.TaskMetrics.accumulators(TaskMetrics.scala:261) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:1042)
>  ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1036) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>  ~[?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  ~[?:?]
> at 
> 

[jira] [Created] (SPARK-43041) Restore constructors of exceptions for compatibility in connector API

2023-04-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43041:


 Summary: Restore constructors of exceptions for compatibility in 
connector API
 Key: SPARK-43041
 URL: https://issues.apache.org/jira/browse/SPARK-43041
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Thanks [~aokolnychyi] for raising the issue as shown below:
{quote}
I have a question about changes to exceptions used in the public connector API, 
such as NoSuchTableException and TableAlreadyExistsException.

I consider those as part of the public Catalog API (TableCatalog uses them in 
method definitions). However, it looks like PR #37887 has changed them in an 
incompatible way. Old constructors accepting Identifier objects got removed. 
The only way to construct such exceptions is either by passing database and 
table strings or Scala Seq. Shall we add back old constructors to avoid 
breaking connectors?
{quote}
We should restore the constructors of those exceptions to preserve 
compatibility in the connector API.






[jira] [Updated] (SPARK-43011) array_insert should fail with 0 index

2023-04-04 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43011:
-
Priority: Blocker  (was: Major)

> array_insert should fail with 0 index
> -
>
> Key: SPARK-43011
> URL: https://issues.apache.org/jira/browse/SPARK-43011
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Blocker
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-43009) Parameterized sql() with constants

2023-04-04 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43009:
-
Priority: Blocker  (was: Major)

> Parameterized sql() with constants
> --
>
> Key: SPARK-43009
> URL: https://issues.apache.org/jira/browse/SPARK-43009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Blocker
>
> Change the Scala/Java/Python APIs to accept any host-language objects from 
> which literal columns can be constructed.
> The current implementation of the parameterized sql() requires arguments as 
> string values parsed to SQL literal expressions, which causes the following 
> issues:
> * SQL comments are skipped while parsing, so some fragments of the input 
> might be skipped. For example, 'Europe -- Amsterdam': in this case, -- 
> Amsterdam is excluded from the input.
> * Special chars in string values must be escaped, for instance 'E\'Twaun 
> Moore'
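
A sketch of the target usage once constants are accepted (my illustration, 
assuming the named-parameter form of SparkSession.sql and a SparkSession 
`spark`):

{code:python}
# Values bound as constants need no manual escaping, and comment-like
# fragments inside them survive intact.
spark.sql("SELECT :place AS place", args={"place": "Europe -- Amsterdam"}).show()
spark.sql("SELECT :name AS name", args={"name": "E'Twaun Moore"}).show()
{code}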






[jira] [Resolved] (SPARK-42693) API Auditing

2023-03-29 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42693.
--
Resolution: Done

> API Auditing
> 
>
> Key: SPARK-42693
> URL: https://issues.apache.org/jira/browse/SPARK-42693
> Project: Spark
>  Issue Type: Story
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Blocker
>
> Audit the user-facing API of Spark 3.4. The main goal is to ensure the public 
> API docs are ready for release, for example, that no private classes/methods 
> are leaking and marked public.
> There are 3 common ways to audit the API:
>  * build docs (into a local website) against branch-3.4 to check
>  * 'git diff' to check the source code differences between v3.3.2 and 
> branch-3.4
>  * [https://github.com/apache/spark-website/pull/443] shows most of the API 
> doc differences between v3.3.2 and the 3.4.0 RC4 (the latest RC); commits are 
> categorized by components



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42862) Review and fix issues in Core API docs

2023-03-29 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42862.
--
Resolution: Resolved

> Review and fix issues in Core API docs
> --
>
> Key: SPARK-42862
> URL: https://issues.apache.org/jira/browse/SPARK-42862
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42866) Review and fix issues in Spark Connect - Scala API docs

2023-03-29 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42866.
--
Resolution: Won't Do

There doesn't seem to be a separate API doc for Spark Connect Scala Client. So 
no API auditing is required for now.

> Review and fix issues in Spark Connect - Scala API docs
> ---
>
> Key: SPARK-42866
> URL: https://issues.apache.org/jira/browse/SPARK-42866
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42393:
-
Description: 
There are derivative APIs which depend on the implementation of Pandas UDFs: 
Pandas Function APIs and Arrow Function APIs, as shown below:

!image-2023-03-29-11-40-44-318.png|width=576,height=225!

 

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
UDFs is essential for Spark Connect to reach parity with the PySpark legacy API.

See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].
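As a minimal illustration of the two API families (using the existing 
`mapInPandas`/`mapInArrow` DataFrame methods; `spark` is an active session):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# Pandas Function API: the function consumes and yields pandas DataFrames.
def add_one(batches):
    for pdf in batches:
        yield pdf.assign(id=pdf["id"] + 1)

spark.range(3).mapInPandas(add_one, schema="id long").show()

# Arrow Function API: the function consumes and yields pyarrow RecordBatches.
def double(batches):
    for batch in batches:
        yield pa.RecordBatch.from_arrays(
            [pc.multiply(batch.column(0), 2)], names=["id"]
        )

spark.range(3).mapInArrow(double, schema="id long").show()
{code}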

  was:See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].


> Support for Pandas/Arrow Functions API
> --
>
> Key: SPARK-42393
> URL: https://issues.apache.org/jira/browse/SPARK-42393
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Attachments: image-2023-03-29-11-40-44-318.png
>
>
> There are derivative APIs which depend on the implementation of Pandas UDFs: 
> Pandas Function APIs and Arrow Function APIs, as shown below:
> !image-2023-03-29-11-40-44-318.png|width=576,height=225!
>  
> Spark Connect Python Client (SCPC), as a client and server interface for 
> PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
> UDFs is essential for Spark Connect to reach parity with the PySpark legacy 
> API.
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42393:
-
Attachment: image-2023-03-29-11-40-44-318.png

> Support for Pandas/Arrow Functions API
> --
>
> Key: SPARK-42393
> URL: https://issues.apache.org/jira/browse/SPARK-42393
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Attachments: image-2023-03-29-11-40-44-318.png
>
>
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41661:
-
Description: 
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
UDFs is essential for Spark Connect to reach parity with the PySpark legacy API.
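For illustration, a minimal sketch contrasting the two families (standard 
PySpark APIs; `spark` is an active session):

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, udf

# Pickled Python UDF: the function is applied row by row.
@udf("long")
def plus_one(v):
    return v + 1

# Arrow-optimized Pandas UDF: the function is applied to pandas Series batches.
@pandas_udf("long")
def pandas_plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(3).select(plus_one("id"), pandas_plus_one("id")).show()
{code}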

  was:
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually replace the legacy API of PySpark in OSS. Supporting 
PySpark UDFs is essential for Spark Connect to reach parity with the PySpark 
legacy API.


> Support for User-defined Functions in Python
> 
>
> Key: SPARK-41661
> URL: https://issues.apache.org/jira/browse/SPARK-41661
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Xinrong Meng
>Priority: Major
>
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].
> User-defined Functions in Python consist of (pickled) Python UDFs and 
> (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
> on top of the Apache Spark™ engine. Users only have to state "what to do"; 
> PySpark, as a sandbox, encapsulates "how to do it".
> Spark Connect Python Client (SCPC), as a client and server interface for 
> PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
> UDFs is essential for Spark Connect to reach parity with the PySpark legacy 
> API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41661:
-
Description: 
User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
UDFs is essential for Spark Connect to reach parity with the PySpark legacy API.

See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

  was:
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
UDFs is essential for Spark Connect to reach parity with the PySpark legacy API.


> Support for User-defined Functions in Python
> 
>
> Key: SPARK-41661
> URL: https://issues.apache.org/jira/browse/SPARK-41661
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Xinrong Meng
>Priority: Major
>
> User-defined Functions in Python consist of (pickled) Python UDFs and 
> (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
> on top of the Apache Spark™ engine. Users only have to state "what to do"; 
> PySpark, as a sandbox, encapsulates "how to do it".
> Spark Connect Python Client (SCPC), as a client and server interface for 
> PySpark, will eventually replace the legacy API of PySpark. Supporting PySpark 
> UDFs is essential for Spark Connect to reach parity with the PySpark legacy 
> API.
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41661:
-
Description: 
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually replace the legacy API of PySpark in OSS. Supporting 
PySpark UDFs is essential for Spark Connect to reach parity with the PySpark 
legacy API.

  was:
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually (probably in Spark 4.0) replace the legacy API of 
PySpark in OSS. Supporting PySpark UDFs is essential for Spark Connect to reach 
parity with the PySpark legacy API.


> Support for User-defined Functions in Python
> 
>
> Key: SPARK-41661
> URL: https://issues.apache.org/jira/browse/SPARK-41661
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Xinrong Meng
>Priority: Major
>
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].
> User-defined Functions in Python consist of (pickled) Python UDFs and 
> (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
> on top of the Apache Spark™ engine. Users only have to state "what to do"; 
> PySpark, as a sandbox, encapsulates "how to do it".
> Spark Connect Python Client (SCPC), as a client and server interface for 
> PySpark, will eventually replace the legacy API of PySpark in OSS. Supporting 
> PySpark UDFs is essential for Spark Connect to reach parity with the PySpark 
> legacy API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41661:
-
Description: 
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

User-defined Functions in Python consist of (pickled) Python UDFs and 
(Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
on top of the Apache Spark™ engine. Users only have to state "what to do"; 
PySpark, as a sandbox, encapsulates "how to do it".

Spark Connect Python Client (SCPC), as a client and server interface for 
PySpark, will eventually (probably in Spark 4.0) replace the legacy API of 
PySpark in OSS. Supporting PySpark UDFs is essential for Spark Connect to reach 
parity with the PySpark legacy API.

  was:
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

PySpark UDFs mainly consist of (pickled) Python UDFs and (Arrow-optimized) 
Pandas UDFs.


> Support for User-defined Functions in Python
> 
>
> Key: SPARK-41661
> URL: https://issues.apache.org/jira/browse/SPARK-41661
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Xinrong Meng
>Priority: Major
>
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].
> User-defined Functions in Python consist of (pickled) Python UDFs and 
> (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code 
> on top of the Apache Spark™ engine. Users only have to state "what to do"; 
> PySpark, as a sandbox, encapsulates "how to do it".
> Spark Connect Python Client (SCPC), as a client and server interface for 
> PySpark, will eventually (probably in Spark 4.0) replace the legacy API of 
> PySpark in OSS. Supporting PySpark UDFs is essential for Spark Connect to 
> reach parity with the PySpark legacy API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41661:
-
Description: 
See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

PySpark UDFs mainly consist of (pickled) Python UDFs and (Arrow-optimized) 
Pandas UDFs.

  was:Spark Connect should support Python UDFs


> Support for User-defined Functions in Python
> 
>
> Key: SPARK-41661
> URL: https://issues.apache.org/jira/browse/SPARK-41661
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Xinrong Meng
>Priority: Major
>
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].
> PySpark UDFs mainly consist of (pickled) Python UDFs and (Arrow-optimized) 
> Pandas UDFs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42393:
-
Description: See design doc 
[here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].

> Support for Pandas/Arrow Functions API
> --
>
> Key: SPARK-42393
> URL: https://issues.apache.org/jira/browse/SPARK-42393
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> See design doc 
> [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42393) Support for Pandas/Arrow Functions API

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42393.
--
Resolution: Resolved

> Support for Pandas/Arrow Functions API
> --
>
> Key: SPARK-42393
> URL: https://issues.apache.org/jira/browse/SPARK-42393
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API

2023-03-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42393:
-
Affects Version/s: (was: 3.5.0)

> Support for Pandas/Arrow Functions API
> --
>
> Key: SPARK-42393
> URL: https://issues.apache.org/jira/browse/SPARK-42393
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42891) Implement CoGrouped Map API

2023-03-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42891.
--
  Assignee: Xinrong Meng
Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/40487] and 
[https://github.com/apache/spark/pull/40539]

 

> Implement CoGrouped Map API
> ---
>
> Key: SPARK-42891
> URL: https://issues.apache.org/jira/browse/SPARK-42891
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Implement CoGrouped Map API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42693) API Auditing

2023-03-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42693:
-
Description: 
Audit user-facing API of Spark 3.4. The main goal is to ensure the public API 
docs are ready for release, for example, that no private classes/methods are 
leaking or marked public.

There are 3 common ways to audit the API:
 * build docs (into a local website) against branch-3.4 to check
 * 'git diff' to check the source code differences between v3.3.2 and branch-3.4
 * [https://github.com/apache/spark-website/pull/443] shows most of the API doc 
differences between v3.3.2 and the 3.4.0 RC4 (the latest RC); commits are 
categorized by components

  was:
Audit user-facing API of Spark 3.4. The main goal is to ensure the public API 
docs are ready for release, for example, that no private classes/methods are 
leaking or marked public.

There are 3 common ways to audit the API:
 * [https://github.com/apache/spark-website/pull/443] shows most of the API doc 
differences between 3.3.2 and the 3.4.0 RC4 (the latest RC); commits are 
categorized by components
 * 'git diff' to check the source code differences between v3.3.2 and branch-3.4
 * build a local website against branch-3.4 to check


> API Auditing
> 
>
> Key: SPARK-42693
> URL: https://issues.apache.org/jira/browse/SPARK-42693
> Project: Spark
>  Issue Type: Story
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Blocker
>
> Audit user-facing API of Spark 3.4. The main goal is to ensure the public API 
> docs are ready for release, for example, that no private classes/methods are 
> leaking or marked public.
> There are 3 common ways to audit the API:
>  * build docs (into a local website) against branch-3.4 to check
>  * 'git diff' to check the source code differences between v3.3.2 and 
> branch-3.4
>  * [https://github.com/apache/spark-website/pull/443] shows most of the API 
> doc differences between v3.3.2 and the 3.4.0 RC4 (the latest RC); commits are 
> categorized by components



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42693) API Auditing

2023-03-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42693:
-
Description: 
Audit user-facing API of Spark 3.4. The main goal is to ensure the public API 
docs are ready for release, for example, that no private classes/methods are 
leaking or marked public.

There are 3 common ways to audit the API:
 * [https://github.com/apache/spark-website/pull/443] shows most of the API doc 
differences between 3.3.2 and the 3.4.0 RC4 (the latest RC); commits are 
categorized by components
 * 'git diff' to check the source code differences between v3.3.2 and branch-3.4
 * build a local website against branch-3.4 to check

  was:Audit user-facing API of Spark 3.4.


> API Auditing
> 
>
> Key: SPARK-42693
> URL: https://issues.apache.org/jira/browse/SPARK-42693
> Project: Spark
>  Issue Type: Story
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Blocker
>
> Audit user-facing API of Spark 3.4. The main goal is to ensure the public API 
> docs are ready for release, for example, that no private classes/methods are 
> leaking or marked public.
> There are 3 common ways to audit the API:
>  * [https://github.com/apache/spark-website/pull/443] shows most of the API 
> doc differences between 3.3.2 and the 3.4.0 RC4 (the latest RC); commits are 
> categorized by components
>  * 'git diff' to check the source code differences between v3.3.2 and 
> branch-3.4
>  * build a local website against branch-3.4 to check



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42908) Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings

2023-03-23 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-42908:


 Summary: Raise RuntimeError if SparkContext is not initialized 
when parsing DDL-formatted type strings
 Key: SPARK-42908
 URL: https://issues.apache.org/jira/browse/SPARK-42908
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0, 3.5.0
Reporter: Xinrong Meng


Raise RuntimeError if SparkContext is not initialized when parsing 
DDL-formatted type strings.
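For context, a minimal sketch of the failure mode (`_parse_datatype_string` is 
an internal PySpark helper, named here only for illustration):

{code:python}
from pyspark.sql.types import _parse_datatype_string

# Parsing a DDL-formatted type string goes through the JVM, so it needs an
# initialized SparkContext; with this change, calling it without one raises
# a clear RuntimeError instead of an obscure failure.
_parse_datatype_string("a INT, b STRING")
{code}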



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40307) Introduce Arrow-optimized Python UDFs

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reopened SPARK-40307:
--

> Introduce Arrow-optimized Python UDFs
> -
>
> Key: SPARK-40307
> URL: https://issues.apache.org/jira/browse/SPARK-40307
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Python user-defined functions (UDFs) enable users to run arbitrary code 
> against PySpark columns. They use Pickle for (de)serialization and execute 
> row by row.
> One major performance bottleneck of Python UDFs is (de)serialization, that 
> is, the data interchange between the worker JVM and the spawned Python 
> subprocess that actually executes the UDF. We should seek an alternative for 
> the (de)serialization: Arrow, which is already used in the (de)serialization 
> of Pandas UDFs.
> There should be two ways to enable/disable the Arrow optimization for Python 
> UDFs:
> - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, 
> disabled by default.
> - the `useArrow` parameter of the `udf` function, None by default.
> The Spark configuration takes effect only when `useArrow` is None. Otherwise, 
> `useArrow` decides whether a specific user-defined function is optimized by 
> Arrow or not.
> The reason we introduce these two ways is to provide both a convenient 
> per-Spark-session control and a finer-grained per-UDF control of the Arrow 
> optimization for Python UDFs.
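A minimal sketch of the two controls (the conf name and the `useArrow` 
parameter are as stated above; `spark` is an active session):

{code:python}
from pyspark.sql.functions import udf

# Per-session control; consulted only when useArrow is None.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

# Per-UDF control; when set, it overrides the session conf.
@udf(returnType="long", useArrow=True)
def plus_one(v):
    return v + 1
{code}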



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40307) Introduce Arrow-optimized Python UDFs

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40307:
-
Affects Version/s: 3.5.0

> Introduce Arrow-optimized Python UDFs
> -
>
> Key: SPARK-40307
> URL: https://issues.apache.org/jira/browse/SPARK-40307
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Python user-defined functions (UDFs) enable users to run arbitrary code 
> against PySpark columns. They use Pickle for (de)serialization and execute 
> row by row.
> One major performance bottleneck of Python UDFs is (de)serialization, that 
> is, the data interchange between the worker JVM and the spawned Python 
> subprocess that actually executes the UDF. We should seek an alternative for 
> the (de)serialization: Arrow, which is already used in the (de)serialization 
> of Pandas UDFs.
> There should be two ways to enable/disable the Arrow optimization for Python 
> UDFs:
> - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, 
> disabled by default.
> - the `useArrow` parameter of the `udf` function, None by default.
> The Spark configuration takes effect only when `useArrow` is None. Otherwise, 
> `useArrow` decides whether a specific user-defined function is optimized by 
> Arrow or not.
> The reason we introduce these two ways is to provide both a convenient 
> per-Spark-session control and a finer-grained per-UDF control of the Arrow 
> optimization for Python UDFs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42893) Block the usage of Arrow-optimized Python UDFs

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42893:
-
Description: 
Considering the upcoming improvements to the result inconsistencies between 
traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
block the feature; otherwise, users who try it out now will face behavior 
changes in the next release.

In addition, since the Spark Connect Python Client (SCPC) has been introduced 
in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark 
and SCPC at the same time for compatibility.

  was:Considering the upcoming improvements to the result inconsistencies 
between traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd 
better block the feature; otherwise, users who try it out now will face 
behavior changes in the next release.


> Block the usage of Arrow-optimized Python UDFs
> --
>
> Key: SPARK-42893
> URL: https://issues.apache.org/jira/browse/SPARK-42893
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Considering the upcoming improvements to the result inconsistencies between 
> traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
> block the feature; otherwise, users who try it out now will face behavior 
> changes in the next release.
> In addition, since the Spark Connect Python Client (SCPC) has been introduced 
> in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark 
> and SCPC at the same time for compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42893) Block Arrow-optimized Python UDFs

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42893:
-
Summary: Block Arrow-optimized Python UDFs  (was: Block the usage of 
Arrow-optimized Python UDFs)

> Block Arrow-optimized Python UDFs
> -
>
> Key: SPARK-42893
> URL: https://issues.apache.org/jira/browse/SPARK-42893
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Considering the upcoming improvements to the result inconsistencies between 
> traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
> block the feature; otherwise, users who try it out now will face behavior 
> changes in the next release.
> In addition, since the Spark Connect Python Client (SCPC) has been introduced 
> in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark 
> and SCPC at the same time for compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42893) Block the usage of Arrow-optimized Python UDFs

2023-03-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-42893:


 Summary: Block the usage of Arrow-optimized Python UDFs
 Key: SPARK-42893
 URL: https://issues.apache.org/jira/browse/SPARK-42893
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Considering the upcoming improvements to the result inconsistencies between 
traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
block the feature; otherwise, users who try it out now will face behavior 
changes in the next release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42340) Implement Grouped Map API

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42340:
-
Summary: Implement Grouped Map API  (was: Implement 
GroupedData.applyInPandas)

> Implement Grouped Map API
> -
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42891) Implement CoGrouped Map API

2023-03-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-42891:


 Summary: Implement CoGrouped Map API
 Key: SPARK-42891
 URL: https://issues.apache.org/jira/browse/SPARK-42891
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement CoGrouped Map API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2023-03-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703123#comment-17703123
 ] 

Xinrong Meng edited comment on SPARK-40327 at 3/21/23 9:48 AM:
---

All resolved issues are moved to 
https://issues.apache.org/jira/browse/SPARK-42882 for clarity and references in 
the release note.

The version is also modified to Spark 3.5.0.


was (Author: xinrongm):
Hi, all resolved issues are moved to 
https://issues.apache.org/jira/browse/SPARK-42882 for clarity and references in 
the release note.

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40327:
-
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2023-03-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703123#comment-17703123
 ] 

Xinrong Meng commented on SPARK-40327:
--

Hi, all resolved issues are moved to 
https://issues.apache.org/jira/browse/SPARK-42882 for clarity and references in 
the release note.

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40327:
-
Fix Version/s: (was: 3.4.0)

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40340) Implement `Expanding.sem`.

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40340:
-
Parent: SPARK-40327  (was: SPARK-42882)

> Implement `Expanding.sem`.
> --
>
> Key: SPARK-40340
> URL: https://issues.apache.org/jira/browse/SPARK-40340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should implement `Expanding.sem` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.sem.html
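A minimal pandas-on-Spark sketch of the new method (semantics follow the 
pandas API linked above):

{code:python}
import pyspark.pandas as ps

# Standard error of the mean over an expanding window.
ps.Series([1.0, 2.0, 3.0, 4.0]).expanding(min_periods=2).sem()
{code}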



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40341) Implement `Rolling.median`.

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40341:
-
Parent: SPARK-40327  (was: SPARK-42882)

> Implement `Rolling.median`.
> ---
>
> Key: SPARK-40341
> URL: https://issues.apache.org/jira/browse/SPARK-40341
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `Rolling.median` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html
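A minimal pandas-on-Spark sketch (semantics follow the pandas API linked 
above):

{code:python}
import pyspark.pandas as ps

# Median over a fixed-size rolling window of two observations.
ps.Series([3.0, 1.0, 2.0, 4.0]).rolling(2).median()
{code}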



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39199) Implement pandas API missing parameters

2023-03-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703121#comment-17703121
 ] 

Xinrong Meng commented on SPARK-39199:
--

Please see https://issues.apache.org/jira/browse/SPARK-42883

> Implement pandas API missing parameters
> ---
>
> Key: SPARK-39199
> URL: https://issues.apache.org/jira/browse/SPARK-39199
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42883) Implement Pandas API Missing Parameters

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42883.
--
Resolution: Resolved

> Implement Pandas API Missing Parameters
> ---
>
> Key: SPARK-42883
> URL: https://issues.apache.org/jira/browse/SPARK-42883
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> pandas API on Spark aims to make pandas code work on Spark clusters without 
> any changes, so full API coverage has been one of our major goals. Currently, 
> most pandas functions are implemented, whereas some of them have incomplete 
> parameter support.
> Some common missing parameters have been resolved:
>  * How to handle NAs
>  * Filter data types
>  * Control result length
>  * Reindex result
> There are remaining missing parameters to implement (see the doc below).
> See the design and the current status at 
> [https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38552) Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38552:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to 
> resolve ties
> --
>
> Key: SPARK-38552
> URL: https://issues.apache.org/jira/browse/SPARK-38552
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to 
> resolve ties
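A minimal pandas-on-Spark sketch of how `keep` resolves ties (values follow 
pandas: 'first', 'last', or 'all'):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [3, 3, 2, 1]})

# With ties at the cutoff, `keep` decides which tied rows survive.
psdf.nlargest(1, columns="x", keep="first")  # one of the rows with x == 3
psdf.nlargest(1, columns="x", keep="all")    # both rows with x == 3
{code}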



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42882) Pandas API Coverage Improvements

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42882.
--
Resolution: Resolved

> Pandas API Coverage Improvements
> 
>
> Key: SPARK-42882
> URL: https://issues.apache.org/jira/browse/SPARK-42882
> Project: Spark
>  Issue Type: Epic
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Pandas API on Spark aims to make pandas code work on Spark clusters without 
> any changes. So full API coverage has been one of our major goals. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38938:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `inplace` and `columns` parameters of `Series.drop`
> -
>
> Key: SPARK-38938
> URL: https://issues.apache.org/jira/browse/SPARK-38938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `inplace` and `columns` parameters of `Series.drop`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38479) Add `Series.duplicated` to indicate duplicate Series values.

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38479:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Add `Series.duplicated` to indicate duplicate Series values.
> 
>
> Key: SPARK-38479
> URL: https://issues.apache.org/jira/browse/SPARK-38479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Add `Series.duplicated` to indicate duplicate Series values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38518) Implement `skipna` of `Series.all/Index.all` to exclude NA/null values

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38518:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
> --
>
> Key: SPARK-38518
> URL: https://issues.apache.org/jira/browse/SPARK-38518
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values.
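A minimal pandas-on-Spark sketch (per the pandas-on-Spark docs, NAs are 
excluded when `skipna=True` and participate in the check when `skipna=False`):

{code:python}
import pyspark.pandas as ps

psser = ps.Series([True, None, True])

psser.all(skipna=True)   # True: the NA value is excluded from the check
psser.all(skipna=False)  # NA participates in the evaluation
{code}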



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39189) interpolate supports limit_area

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39189:
-
Parent: SPARK-42883  (was: SPARK-39199)

> interpolate supports limit_area
> ---
>
> Key: SPARK-39189
> URL: https://issues.apache.org/jira/browse/SPARK-39189
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38903:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
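A minimal pandas-on-Spark sketch of the parameter:

{code:python}
import pyspark.pandas as ps

psser = ps.Series([3, 1, 2], index=["c", "a", "b"])

# ignore_index=True relabels the result 0, 1, ..., n - 1
# instead of carrying over the original index labels.
psser.sort_values(ignore_index=True)
{code}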



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38943) EWM support ignore_na

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38943:
-
Parent: SPARK-42883  (was: SPARK-39199)

> EWM support ignore_na
> -
>
> Key: SPARK-38943
> URL: https://issues.apache.org/jira/browse/SPARK-38943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39907) Implement axis and skipna of Series.argmin

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39907:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement axis and skipna of Series.argmin
> --
>
> Key: SPARK-39907
> URL: https://issues.apache.org/jira/browse/SPARK-39907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38765:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `inplace` parameter of `Series.clip`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38686) Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38686:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
> --
>
> Key: SPARK-38686
> URL: https://issues.apache.org/jira/browse/SPARK-38686
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38704) Support string `inclusive` parameter of `Series.between`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38704:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Support string `inclusive` parameter of `Series.between`
> 
>
> Key: SPARK-38704
> URL: https://issues.apache.org/jira/browse/SPARK-38704
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support string `inclusive` parameter of `Series.between`
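A minimal pandas-on-Spark sketch (string values follow pandas 1.3+: 'both', 
'neither', 'left', or 'right'):

{code:python}
import pyspark.pandas as ps

psser = ps.Series([1, 2, 3, 4])

# "left" keeps the left bound and excludes the right one: 2 <= x < 4.
psser.between(2, 4, inclusive="left")
{code}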



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39201:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`
> ---
>
> Key: SPARK-39201
> URL: https://issues.apache.org/jira/browse/SPARK-39201
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38387:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Support `na_action` and Series input correspondence in `Series.map`
> ---
>
> Key: SPARK-38387
> URL: https://issues.apache.org/jira/browse/SPARK-38387
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Support `na_action` and Series input correspondence in `Series.map`, in 
> order to reach parity with the pandas API.
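A minimal pandas-on-Spark sketch of both additions:

{code:python}
import pyspark.pandas as ps

psser = ps.Series(["cat", "dog", None])

# na_action="ignore" propagates NA without invoking the function.
psser.map(lambda s: s.upper(), na_action="ignore")

# A Series can also serve as the mapping: its index matches input values.
psser.map(ps.Series(["kitten"], index=["cat"]))
{code}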



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38576:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank 
> numeric columns only
> ---
>
> Key: SPARK-38576
> URL: https://issues.apache.org/jira/browse/SPARK-38576
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank 
> numeric columns only.
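
A minimal sketch, assuming the parameter behaves as in pandas, where
`numeric_only=True` drops non-numeric columns before ranking:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"a": [3, 1, 2], "b": ["z", "x", "y"]})
    # Only column "a" is ranked; the string column "b" is excluded.
    print(psdf.rank(numeric_only=True))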



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38608:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`
> -
>
> Key: SPARK-38608
> URL: https://issues.apache.org/jira/browse/SPARK-38608
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` to 
> include only boolean columns.
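
A minimal sketch, assuming pandas semantics, where `bool_only=True` restricts
the reduction to boolean columns:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"flag": [True, False], "n": [1, 2]})
    # Only the boolean column "flag" participates in the reduction.
    print(psdf.all(bool_only=True))
    print(psdf.any(bool_only=True))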



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38726) Support `how` parameter of `MultiIndex.dropna`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38726:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Support `how` parameter of `MultiIndex.dropna`
> --
>
> Key: SPARK-38726
> URL: https://issues.apache.org/jira/browse/SPARK-38726
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support `how` parameter of `MultiIndex.dropna`
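
A minimal sketch, assuming the parameter matches pandas' `Index.dropna(how)`:
`how="any"` drops entries with at least one missing level, while `how="all"`
drops only entries whose levels are all missing:

    import pyspark.pandas as ps

    psmidx = ps.MultiIndex.from_tuples([("a", 1), (None, 2), (None, None)])
    # Keeps ("a", 1) and (None, 2); only the all-missing entry is dropped.
    print(psmidx.dropna(how="all"))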



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38441:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Support string and bool `regex` in `Series.replace`
> ---
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support string and bool `regex` in `Series.replace` in order to reach parity 
> with pandas.
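
A minimal sketch, assuming the pandas signature, where `regex` may be a bool
flag or a regex string standing in for `to_replace`:

    import pyspark.pandas as ps

    psser = ps.Series(["bat", "foo", "baz"])
    # bool form: to_replace is interpreted as a regular expression.
    print(psser.replace(to_replace=r"^ba.$", value="new", regex=True))
    # string form: the pattern is passed directly via regex=.
    print(psser.replace(regex=r"^ba.$", value="new"))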



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38989:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `ignore_index` of `DataFrame/Series.sample`
> -
>
> Key: SPARK-38989
> URL: https://issues.apache.org/jira/browse/SPARK-38989
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame/Series.sample`
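
A minimal sketch, assuming pandas semantics, where `ignore_index=True`
relabels the sampled rows 0, 1, ..., n-1:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"a": range(10)})
    # The sampled subset gets a fresh default index instead of keeping
    # the original row labels.
    print(psdf.sample(frac=0.3, random_state=1, ignore_index=True))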



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38793:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
> 
>
> Key: SPARK-38793
> URL: https://issues.apache.org/jira/browse/SPARK-38793
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
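
A minimal sketch, assuming the pandas behavior, where `return_indexer=True`
makes `sort_values` return a (sorted_index, indexer) pair, the indexer giving
the positions that sort the original index:

    import pyspark.pandas as ps

    psidx = ps.Index([10, 100, 1])
    sorted_idx, indexer = psidx.sort_values(return_indexer=True)
    print(sorted_idx)   # the sorted index: 1, 10, 100
    print(indexer)      # the positions that produce it: 2, 0, 1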



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38491) Support `ignore_index` of `Series.sort_values`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38491:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Support `ignore_index` of `Series.sort_values`
> --
>
> Key: SPARK-38491
> URL: https://issues.apache.org/jira/browse/SPARK-38491
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Support `ignore_index` of `Series.sort_values`
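
A minimal sketch, assuming pandas semantics, where `ignore_index=True` resets
the sorted result's index to 0, 1, ..., n-1:

    import pyspark.pandas as ps

    psser = ps.Series([3, 1, 2], index=["c", "a", "b"])
    # The original labels "a", "b", "c" are discarded in the result.
    print(psser.sort_values(ignore_index=True))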



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`

2023-03-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38863:
-
Parent: SPARK-42883  (was: SPARK-39199)

> Implement `skipna` parameter of `DataFrame.all`
> ---
>
> Key: SPARK-38863
> URL: https://issues.apache.org/jira/browse/SPARK-38863
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `skipna` parameter of `DataFrame.all`.
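
A minimal sketch, assuming the parameter follows pandas'
`DataFrame.all(skipna=True)`, where `skipna=True` excludes missing values from
the reduction and `skipna=False` includes them:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"a": [True, None, True]})
    # With skipna=True the NULL is ignored, so column "a" reduces to True.
    print(psdf.all(skipna=True))
    print(psdf.all(skipna=False))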



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


