[jira] [Updated] (SPARK-43903) Improve ArrayType input support in Arrow-optimized Python UDF
[ https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43903:
Summary: Improve ArrayType input support in Arrow-optimized Python UDF (was: Non-atomic data type support in Arrow-optimized Python UDF)

> Improve ArrayType input support in Arrow-optimized Python UDF
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Priority: Major

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43893) Non-atomic data type support in Arrow-optimized Python UDF
[ https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43893:
Summary: Non-atomic data type support in Arrow-optimized Python UDF (was: StructType input/output support in Arrow-optimized Python UDF)

> Non-atomic data type support in Arrow-optimized Python UDF
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF
[ https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng reassigned SPARK-43893:
Assignee: Xinrong Meng

> StructType input/output support in Arrow-optimized Python UDF
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Resolved] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF
[ https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng resolved SPARK-43893.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 41321
[https://github.com/apache/spark/pull/41321]

> StructType input/output support in Arrow-optimized Python UDF
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.5.0
[jira] [Updated] (SPARK-43903) Non-atomic data type support in Arrow-optimized Python UDF
[ https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43903:
Summary: Non-atomic data type support in Arrow-optimized Python UDF (was: Standardize ArrayType conversion for Python UDF)

> Non-atomic data type support in Arrow-optimized Python UDF
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Priority: Major
[jira] [Created] (SPARK-43903) Standardize ArrayType conversion for Python UDF
Xinrong Meng created SPARK-43903:

Summary: Standardize ArrayType conversion for Python UDF
Key: SPARK-43903
URL: https://issues.apache.org/jira/browse/SPARK-43903
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Created] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF
Xinrong Meng created SPARK-43893:

Summary: StructType input/output support in Arrow-optimized Python UDF
Key: SPARK-43893
URL: https://issues.apache.org/jira/browse/SPARK-43893
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Created] (SPARK-43886) Accept generics tuple as typing hints in Pandas UDF
Xinrong Meng created SPARK-43886:

Summary: Accept generics tuple as typing hints in Pandas UDF
Key: SPARK-43886
URL: https://issues.apache.org/jira/browse/SPARK-43886
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Created] (SPARK-43804) Test on nested structs support in Pandas UDF
Xinrong Meng created SPARK-43804:

Summary: Test on nested structs support in Pandas UDF
Key: SPARK-43804
URL: https://issues.apache.org/jira/browse/SPARK-43804
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng

Test nested struct support in Pandas UDF; that support is newly enabled compared to Spark 3.4.
[jira] [Updated] (SPARK-43545) Support Nested Timestamp Types
[ https://issues.apache.org/jira/browse/SPARK-43545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43545:
Summary: Support Nested Timestamp Types (was: Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION)

> Support Nested Timestamp Types
>
> Key: SPARK-43545
> URL: https://issues.apache.org/jira/browse/SPARK-43545
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.5.0
[jira] (SPARK-43543) Standardize Nested Complex DataTypes Support
[ https://issues.apache.org/jira/browse/SPARK-43543 ]

Xinrong Meng deleted comment on SPARK-43543:
was (Author: xinrongm): Issue resolved by pull request 41147 [https://github.com/apache/spark/pull/41147]

> Standardize Nested Complex DataTypes Support
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
> Issue Type: Umbrella
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.5.0
[jira] [Commented] (SPARK-43544) Fix nested MapType behavior in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725949#comment-17725949 ]

Xinrong Meng commented on SPARK-43544:
Resolved by https://github.com/apache/spark/pull/41147.

> Fix nested MapType behavior in Pandas UDF
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Assigned] (SPARK-43544) Fix nested MapType behavior in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng reassigned SPARK-43544:
Assignee: Xinrong Meng

> Fix nested MapType behavior in Pandas UDF
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Resolved] (SPARK-43544) Fix nested MapType behavior in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng resolved SPARK-43544.
Resolution: Done

> Fix nested MapType behavior in Pandas UDF
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Updated] (SPARK-43546) Complete parity tests of Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43546:
Summary: Complete parity tests of Pandas UDF (was: Complete Pandas UDF parity tests)

> Complete parity tests of Pandas UDF
>
> Key: SPARK-43546
> URL: https://issues.apache.org/jira/browse/SPARK-43546
> Project: Spark
> Issue Type: Test
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Priority: Major
>
> Tests as shown below should be added to Connect:
> test_pandas_udf_grouped_agg.py
> test_pandas_udf_scalar.py
> test_pandas_udf_window.py
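The parity tests listed above follow a mixin pattern in PySpark's test suite: test methods are written once and re-run against Spark Connect by subclassing. A stdlib-only sketch of that pattern is below; the `classic_add_one` / `connect_add_one` backends are hypothetical stand-ins for the two execution paths, not real PySpark APIs.

```python
import unittest


# Hypothetical backends standing in for classic PySpark and Spark Connect.
def classic_add_one(x):
    return x + 1


def connect_add_one(x):
    return x + 1


class AddOneTestsMixin:
    """Test methods written once; subclasses choose the backend under test."""

    add_one = staticmethod(classic_add_one)

    def test_add_one(self):
        self.assertEqual(self.add_one(1), 2)


class ClassicAddOneTests(AddOneTestsMixin, unittest.TestCase):
    """Runs the shared test methods against the classic backend."""


class ConnectAddOneParityTests(AddOneTestsMixin, unittest.TestCase):
    """The parity suite reuses every shared test method against the other backend."""

    add_one = staticmethod(connect_add_one)
```

A file like test_pandas_udf_scalar.py is "added to Connect" by creating exactly such a parity subclass, so any behavioral divergence between the two backends fails the same assertions.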
[jira] [Created] (SPARK-43734) Expression "(v)" within a window function doesn't raise an AnalysisException
Xinrong Meng created SPARK-43734:

Summary: Expression "(v)" within a window function doesn't raise an AnalysisException
Key: SPARK-43734
URL: https://issues.apache.org/jira/browse/SPARK-43734
Project: Spark
Issue Type: Improvement
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng

Expression "(v)" within a window function doesn't raise an AnalysisException. See PandasUDFWindowParityTests.test_invalid_args for reproduction.
[jira] [Created] (SPARK-43727) Parity returnType check in Spark Connect
Xinrong Meng created SPARK-43727:

Summary: Parity returnType check in Spark Connect
Key: SPARK-43727
URL: https://issues.apache.org/jira/browse/SPARK-43727
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Resolved] (SPARK-43543) Standardize Nested Complex DataTypes Support
[ https://issues.apache.org/jira/browse/SPARK-43543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng resolved SPARK-43543.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 41147
[https://github.com/apache/spark/pull/41147]

> Standardize Nested Complex DataTypes Support
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
> Issue Type: Umbrella
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43543) Standardize Nested Complex DataTypes Support
[ https://issues.apache.org/jira/browse/SPARK-43543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng reassigned SPARK-43543:
Assignee: Xinrong Meng

> Standardize Nested Complex DataTypes Support
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
> Issue Type: Umbrella
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Created] (SPARK-43579) Cache the converter between Arrow and pandas for reuse
Xinrong Meng created SPARK-43579:

Summary: Cache the converter between Arrow and pandas for reuse
Key: SPARK-43579
URL: https://issues.apache.org/jira/browse/SPARK-43579
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Updated] (SPARK-43544) Fix nested MapType behavior in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43544:
Summary: Fix nested MapType behavior in Pandas UDF (was: Standardize nested non-atomic input type support in Pandas UDF)

> Fix nested MapType behavior in Pandas UDF
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Priority: Major
[jira] [Created] (SPARK-43546) Complete Pandas UDF parity tests
Xinrong Meng created SPARK-43546:

Summary: Complete Pandas UDF parity tests
Key: SPARK-43546
URL: https://issues.apache.org/jira/browse/SPARK-43546
Project: Spark
Issue Type: Test
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng

Tests as shown below should be added to Connect:
test_pandas_udf_grouped_agg.py
test_pandas_udf_scalar.py
test_pandas_udf_window.py
[jira] [Created] (SPARK-43545) Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION
Xinrong Meng created SPARK-43545:

Summary: Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION
Key: SPARK-43545
URL: https://issues.apache.org/jira/browse/SPARK-43545
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Created] (SPARK-43544) Standardize nested non-atomic input type support in Pandas UDF
Xinrong Meng created SPARK-43544:

Summary: Standardize nested non-atomic input type support in Pandas UDF
Key: SPARK-43544
URL: https://issues.apache.org/jira/browse/SPARK-43544
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Created] (SPARK-43543) Standardize Nested Complex DataTypes Support
Xinrong Meng created SPARK-43543:

Summary: Standardize Nested Complex DataTypes Support
Key: SPARK-43543
URL: https://issues.apache.org/jira/browse/SPARK-43543
Project: Spark
Issue Type: Umbrella
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng
[jira] [Updated] (SPARK-43440) Support registration of an Arrow-optimized Python UDF
[ https://issues.apache.org/jira/browse/SPARK-43440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-43440:
Description: Currently, when users register an Arrow-optimized Python UDF, it is registered as a pickled Python UDF and thus executed without Arrow optimization. We should support registration of Arrow-optimized Python UDFs and execute them with Arrow optimization. (was: Support registration of an Arrow-optimized Python UDF)

> Support registration of an Arrow-optimized Python UDF
>
> Key: SPARK-43440
> URL: https://issues.apache.org/jira/browse/SPARK-43440
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Priority: Major
>
> Currently, when users register an Arrow-optimized Python UDF, it is registered as a pickled Python UDF and thus executed without Arrow optimization.
> We should support registration of Arrow-optimized Python UDFs and execute them with Arrow optimization.
[jira] [Created] (SPARK-43440) Support registration of an Arrow-optimized Python UDF
Xinrong Meng created SPARK-43440:

Summary: Support registration of an Arrow-optimized Python UDF
Key: SPARK-43440
URL: https://issues.apache.org/jira/browse/SPARK-43440
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng

Support registration of an Arrow-optimized Python UDF.
[jira] [Commented] (SPARK-42523) Apache Spark 3.4 release
[ https://issues.apache.org/jira/browse/SPARK-42523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721549#comment-17721549 ]

Xinrong Meng commented on SPARK-42523:
I am wondering if we shall keep the ticket open for minor releases such as the upcoming 3.4.1.

> Apache Spark 3.4 release
>
> Key: SPARK-42523
> URL: https://issues.apache.org/jira/browse/SPARK-42523
> Project: Spark
> Issue Type: Umbrella
> Components: Build
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
>
> An umbrella for the Apache Spark 3.4 release.
[jira] [Resolved] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-43412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng resolved SPARK-43412.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 41053
[https://github.com/apache/spark/pull/41053]

> Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
>
> Key: SPARK-43412
> URL: https://issues.apache.org/jira/browse/SPARK-43412
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.5.0
>
> We are about to improve nested non-atomic input/output support of Arrow-optimized Python UDFs.
> Currently, however, an Arrow-optimized Python UDF shares the same EvalType as a pickled Python UDF but the same implementation as a Pandas UDF.
> Introducing a dedicated EvalType isolates the changes to Arrow-optimized Python UDFs.
[jira] [Created] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
Xinrong Meng created SPARK-43412:

Summary: Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
Key: SPARK-43412
URL: https://issues.apache.org/jira/browse/SPARK-43412
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng

We are about to improve nested non-atomic input/output support of Arrow-optimized Python UDFs.
Currently, however, an Arrow-optimized Python UDF shares the same EvalType as a pickled Python UDF but the same implementation as a Pandas UDF.
Introducing a dedicated EvalType isolates the changes to Arrow-optimized Python UDFs.
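The rationale above can be illustrated with a toy dispatch table. The constants below only mirror the idea of PySpark's `PythonEvalType`; their names are real but the values and the `pick_serializer` helper are assumptions for illustration, not PySpark internals.

```python
# Hypothetical constants mirroring the idea of PythonEvalType;
# values here are illustrative, not PySpark's real ones.
SQL_BATCHED_UDF = 100        # pickled Python UDF
SQL_ARROW_BATCHED_UDF = 101  # Arrow-optimized Python UDF (the new EvalType)
SQL_SCALAR_PANDAS_UDF = 200  # Pandas UDF


def pick_serializer(eval_type):
    """With a dedicated EvalType, the Arrow-optimized path can evolve
    independently of both pickled Python UDFs and Pandas UDFs."""
    if eval_type == SQL_ARROW_BATCHED_UDF:
        return "arrow-batch serializer"
    if eval_type == SQL_SCALAR_PANDAS_UDF:
        return "pandas serializer"
    return "pickle serializer"
```

Before the dedicated EvalType, Arrow-optimized UDFs were indistinguishable from pickled UDFs at this dispatch point, so any change to their handling risked affecting the pickled path too.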
[jira] [Commented] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719463#comment-17719463 ]

Xinrong Meng commented on SPARK-41971:
Hi [~nikj] , the issue has been resolved. Feel free to pick other issues that you are interested in. Normally we comment on the ticket and then file the pull request directly.

> `toPandas` should support duplicate field names when arrow-optimization is on
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Takuya Ueshin
> Priority: Minor
> Fix For: 3.5.0
>
> toPandas supports duplicate column names, but for a struct column it does not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
>
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]:
>    v  v
> 0  1  1
>
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]:
>   struct(1 AS v, 1 AS v)
> 0                 (1, 1)
>
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
>
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]:
>    v  v
> 0  1  1
>
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> Cell In[32], line 1
> ----> 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in PandasConversionMixin.toPandas(self)
>     141 tmp_column_names = ["col_{}".format(i) for i in range(len(self.columns))]
>     142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
>     144     split_batches=self_destruct
>     145 )
>     146 if len(batches) > 0:
>     147     table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in PandasConversionMixin._collect_as_arrow(self, split_batches)
>     356     results.append(batch_or_indices)
>     357 else:
> --> 358     results = list(batch_stream)
>     359 finally:
>     360     # Join serving thread and raise any exceptions from collectAsArrowToPython
>     361     jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in ArrowCollectSerializer.load_stream(self, stream)
>      50 """
>      51 Load a stream of un-ordered Arrow RecordBatches, where the last iteration yields
>      52 a list of indices that can be used to put the RecordBatches in the correct order.
>      53 """
>      54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>      56     yield batch
>      58 # load the batch order indices or propagate any error that occurred in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in ArrowStreamSerializer.load_stream(self, stream)
>      95 import pyarrow as pa
>      97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>      99     yield batch
> File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, in __iter__()
> File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, in pyarrow.lib.RecordBatchReader.read_next_batch()
> File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}
[jira] [Assigned] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng reassigned SPARK-41971:
Assignee: Takuya Ueshin

> `toPandas` should support duplicate field names when arrow-optimization is on
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Takuya Ueshin
> Priority: Minor
> Fix For: 3.5.0
[jira] [Resolved] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-41971. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40988 [https://github.com/apache/spark/pull/40988] > `toPandas` should support duplicate filed names when arrow-optimization is on > - > > Key: SPARK-41971 > URL: https://issues.apache.org/jira/browse/SPARK-41971 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0 > > > toPandas support duplicate columns name, but for a struct column, it doesnot > support duplicate field names. > {code:java} > In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False) > In [28]: spark.sql("select 1 v, 1 v").toPandas() > Out[28]: >v v > 0 1 1 > In [29]: spark.sql("select struct(1 v, 1 v)").toPandas() > Out[29]: > struct(1 AS v, 1 AS v) > 0 (1, 1) > In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True) > In [31]: spark.sql("select 1 v, 1 v").toPandas() > Out[31]: >v v > 0 1 1 > In [32]: spark.sql("select struct(1 v, 1 v)").toPandas() > /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. 
> Ran out of field metadata, likely malformed > warn(msg) > --- > ArrowInvalid Traceback (most recent call last) > Cell In[32], line 1 > > 1 spark.sql("select struct(1 v, 1 v)").toPandas() > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in > PandasConversionMixin.toPandas(self) > 141 tmp_column_names = ["col_{}".format(i) for i in > range(len(self.columns))] > 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled() > --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow( > 144 split_batches=self_destruct > 145 ) > 146 if len(batches) > 0: > 147 table = pyarrow.Table.from_batches(batches) > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in > PandasConversionMixin._collect_as_arrow(self, split_batches) > 356 results.append(batch_or_indices) > 357 else: > --> 358 results = list(batch_stream) > 359 finally: > 360 # Join serving thread and raise any exceptions from > collectAsArrowToPython > 361 jsocket_auth_server.getResult() > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in > ArrowCollectSerializer.load_stream(self, stream) > 50 """ > 51 Load a stream of un-ordered Arrow RecordBatches, where the last > iteration yields > 52 a list of indices that can be used to put the RecordBatches in the > correct order. 
> 53 """ > 54 # load the batches > ---> 55 for batch in self.serializer.load_stream(stream): > 56 yield batch > 58 # load the batch order indices or propagate any error that occurred > in the JVM > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in > ArrowStreamSerializer.load_stream(self, stream) > 95 import pyarrow as pa > 97 reader = pa.ipc.open_stream(stream) > ---> 98 for batch in reader: > 99 yield batch > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, > in __iter__() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, > in pyarrow.lib.RecordBatchReader.read_next_batch() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, > in pyarrow.lib.check_status() > ArrowInvalid: Ran out of field metadata, likely malformed > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
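The failure above comes down to duplicate field names inside a struct. The hazard is reproducible without Spark: any dict-keyed view of a struct collapses duplicate names, so an Arrow-to-pandas conversion path has to keep fields positional or rename them first (the `col_{i}` renaming is visible in the traceback). A minimal pure-Python sketch; `dedupe_names` is a hypothetical helper, not Spark code:

```python
# A struct value with duplicate field names, held positionally
# (Arrow structs are positional, so duplicates are representable).
fields = [("v", 1), ("v", 2)]

# A naive dict-based conversion silently drops one of the duplicates:
assert dict(fields) == {"v": 2}  # the first "v" is lost

# Renaming fields positionally before converting keeps every value
# (dedupe_names is a hypothetical helper, not Spark code):
def dedupe_names(pairs):
    return [("col_{}".format(i), value) for i, (_name, value) in enumerate(pairs)]

assert dict(dedupe_names(fields)) == {"col_0": 1, "col_1": 2}
```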
[jira] [Resolved] (SPARK-43032) Add StreamingQueryManager API
[ https://issues.apache.org/jira/browse/SPARK-43032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-43032. -- Assignee: Wei Liu Resolution: Done Resolved by https://github.com/apache/spark/pull/40861. > Add StreamingQueryManager API > - > > Key: SPARK-43032 > URL: https://issues.apache.org/jira/browse/SPARK-43032 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > > Add StreamingQueryManager API. It would include APIs that can be directly > supported. APIs like registering a streaming listener will be handled separately. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)
[ https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39892: - Fix Version/s: 3.5.0 (was: 3.4.0) > Use ArrowType.Decimal(precision, scale, bitWidth) instead of > ArrowType.Decimal(precision, scale) > > > Key: SPARK-39892 > URL: https://issues.apache.org/jira/browse/SPARK-39892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > > [warn] > /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49: > [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | > origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | > version=] constructor Decimal in class Decimal is deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41259) Spark-sql cli query results should correspond to schema
[ https://issues.apache.org/jira/browse/SPARK-41259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41259: - Fix Version/s: 3.5.0 (was: 3.4.0) > Spark-sql cli query results should correspond to schema > --- > > Key: SPARK-41259 > URL: https://issues.apache.org/jira/browse/SPARK-41259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Priority: Minor > Fix For: 3.5.0 > > > When using the spark-sql cli, Spark outputs only one column in the `show > tables` and `show views` commands to be compatible with Hive output, but the > output schema is still the three columns of Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39814) Use AmazonKinesisClientBuilder.withCredentials instead of new AmazonKinesisClient(credentials)
[ https://issues.apache.org/jira/browse/SPARK-39814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39814: - Fix Version/s: 3.5.0 (was: 3.4.0) > Use AmazonKinesisClientBuilder.withCredentials instead of new > AmazonKinesisClient(credentials) > -- > > Key: SPARK-39814 > URL: https://issues.apache.org/jira/browse/SPARK-39814 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:108:25: > [deprecation @ > org.apache.spark.examples.streaming.KinesisWordCountASL.main.kinesisClient | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:224:25: > [deprecation @ > org.apache.spark.examples.streaming.KinesisWordProducerASL.generate.kinesisClient > | origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | > version=] constructor AmazonKinesisClient in class AmazonKinesisClient is > deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:142:24: > [deprecation @ > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.client | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. 
| version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisTestUtils.scala:58:18: > [deprecation @ > org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient.client | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39136) JDBCTable support properties
[ https://issues.apache.org/jira/browse/SPARK-39136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39136: - Fix Version/s: 3.5.0 (was: 3.4.0) > JDBCTable support properties > > > Key: SPARK-39136 > URL: https://issues.apache.org/jira/browse/SPARK-39136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.5.0 > > > {code:java} > > > > desc formatted jdbc.test.people; > NAME string > IDint > # Partitioning > Not partitioned > # Detailed Table Information > Name test.people > Table Properties [] > Time taken: 0.048 seconds, Fetched 9 row(s) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37935) Migrate onto error classes
[ https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37935: - Fix Version/s: 3.5.0 (was: 3.4.0) > Migrate onto error classes > -- > > Key: SPARK-37935 > URL: https://issues.apache.org/jira/browse/SPARK-37935 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.5.0 > > > The PR https://github.com/apache/spark/pull/32850 introduced error classes as > a part of the error messages framework > (https://issues.apache.org/jira/browse/SPARK-33539). Need to migrate all > exceptions from QueryExecutionErrors, QueryCompilationErrors and > QueryParsingErrors on the error classes using instances of SparkThrowable, > and carefully test every error class by writing tests in dedicated test > suites: > * QueryExecutionErrorsSuite for the errors that occur during query > execution > * QueryCompilationErrorsSuite ... query compilation or eagerly executing > commands > * QueryParsingErrorsSuite ... parsing errors > Here is an example https://github.com/apache/spark/pull/35157 of how an > existing Java exception can be replaced, and testing of related error > classes. At the end, we should migrate all exceptions from the files > Query.*Errors.scala and cover all error classes from the error-classes.json > file by tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42169) Implement code generation for `to_csv` function (StructsToCsv)
[ https://issues.apache.org/jira/browse/SPARK-42169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42169: - Fix Version/s: 3.5.0 (was: 3.4.0) > Implement code generation for `to_csv` function (StructsToCsv) > -- > > Key: SPARK-42169 > URL: https://issues.apache.org/jira/browse/SPARK-42169 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Narek Karapetian >Priority: Minor > Labels: csv, sql > Fix For: 3.5.0 > > > Implement code generation for `to_csv` function instead of extending it from > CodegenFallback trait. > {code:java} > org.apache.spark.sql.catalyst.expressions.StructsToCsv.doGenCode(...){code} > > This is good to have from performance point of view. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38945) Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep
[ https://issues.apache.org/jira/browse/SPARK-38945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38945: - Fix Version/s: 3.5.0 (was: 3.4.0) > Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep > > > Key: SPARK-38945 > URL: https://issues.apache.org/jira/browse/SPARK-38945 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Qian Sun >Priority: Minor > Fix For: 3.5.0 > > > Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep, because they are already > imported -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43082) Arrow-optimized Python UDFs in Spark Connect
Xinrong Meng created SPARK-43082: Summary: Arrow-optimized Python UDFs in Spark Connect Key: SPARK-43082 URL: https://issues.apache.org/jira/browse/SPARK-43082 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng Implement Arrow-optimized Python UDFs in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
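Context for the sub-task: an Arrow-optimized Python UDF runs the same row-level function as a pickled UDF, but moves data between the JVM and the Python worker in columnar batches instead of one serialized value per row, amortizing serialization overhead. A rough pure-Python model of that amortization (pickle stands in for both transports here; this is an illustration, not Spark's serializer):

```python
import pickle

rows = list(range(1000))

# Pickled Python UDF path: every row crosses the JVM <-> Python boundary
# as its own serialized payload, so framing overhead is paid per row.
per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Arrow path, modeled with a single pickle of the whole batch: one
# columnar batch amortizes the framing overhead across all rows.
batched_bytes = len(pickle.dumps(rows))

assert batched_bytes < per_row_bytes
```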
[jira] [Updated] (SPARK-39696) Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration
[ https://issues.apache.org/jira/browse/SPARK-39696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39696: - Priority: Blocker (was: Major) > Uncaught exception in thread executor-heartbeater > java.util.ConcurrentModificationException: mutation occurred during iteration > --- > > Key: SPARK-39696 > URL: https://issues.apache.org/jira/browse/SPARK-39696 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0, 3.4.0 > Environment: Spark 3.3.0 (spark-3.3.0-bin-hadoop3-scala2.13 > distribution) > Scala 2.13.8 / OpenJDK 17.0.3 application compilation > Alpine Linux 3.14.3 > JVM OpenJDK 64-Bit Server VM Temurin-17.0.1+12 >Reporter: Stephen Mcmullan >Priority: Blocker > Fix For: 3.4.0 > > > {noformat} > 2022-06-21 18:17:49.289Z ERROR [executor-heartbeater] > org.apache.spark.util.Utils - Uncaught exception in thread > executor-heartbeater > java.util.ConcurrentModificationException: mutation occurred during iteration > at > scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:873) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:869) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:852) > ~[scala-library-2.13.8.jar:?] > at > scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:852) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) > ~[scala-library-2.13.8.jar:?] 
> at > scala.collection.immutable.VectorStatics$.append1IfSpace(Vector.scala:1959) > ~[scala-library-2.13.8.jar:?] > at scala.collection.immutable.Vector1.appendedAll0(Vector.scala:425) > ~[scala-library-2.13.8.jar:?] > at scala.collection.immutable.Vector.appendedAll(Vector.scala:203) > ~[scala-library-2.13.8.jar:?] > at scala.collection.immutable.Vector.appendedAll(Vector.scala:113) > ~[scala-library-2.13.8.jar:?] > at scala.collection.SeqOps.concat(Seq.scala:187) > ~[scala-library-2.13.8.jar:?] > at scala.collection.SeqOps.concat$(Seq.scala:187) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractSeq.concat(Seq.scala:1161) > ~[scala-library-2.13.8.jar:?] > at scala.collection.IterableOps.$plus$plus(Iterable.scala:726) > ~[scala-library-2.13.8.jar:?] > at scala.collection.IterableOps.$plus$plus$(Iterable.scala:726) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterable.$plus$plus(Iterable.scala:926) > ~[scala-library-2.13.8.jar:?] > at > org.apache.spark.executor.TaskMetrics.accumulators(TaskMetrics.scala:261) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:1042) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) > ~[scala-library-2.13.8.jar:?] > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) > ~[scala-library-2.13.8.jar:?] > at scala.collection.AbstractIterable.foreach(Iterable.scala:926) > ~[scala-library-2.13.8.jar:?] > at > org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1036) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) > ~[scala-library-2.13.8.jar:?] 
> at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) > ~[spark-core_2.13-3.3.0.jar:3.3.0] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?] > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) > ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > ~[?:?] > at >
[jira] [Created] (SPARK-43041) Restore constructors of exceptions for compatibility in connector API
Xinrong Meng created SPARK-43041: Summary: Restore constructors of exceptions for compatibility in connector API Key: SPARK-43041 URL: https://issues.apache.org/jira/browse/SPARK-43041 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Xinrong Meng Thanks [~aokolnychyi] for raising the issue as shown below: {quote} I have a question about changes to exceptions used in the public connector API, such as NoSuchTableException and TableAlreadyExistsException. I consider those as part of the public Catalog API (TableCatalog uses them in method definitions). However, it looks like PR #37887 has changed them in an incompatible way. Old constructors accepting Identifier objects got removed. The only way to construct such exceptions is either by passing database and table strings or Scala Seq. Shall we add back old constructors to avoid breaking connectors? {quote} We should restore constructors of those exceptions to preserve the compatibility in connector API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43011) array_insert should fail with 0 index
[ https://issues.apache.org/jira/browse/SPARK-43011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-43011: - Priority: Blocker (was: Major) > array_insert should fail with 0 index > - > > Key: SPARK-43011 > URL: https://issues.apache.org/jira/browse/SPARK-43011 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Blocker > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
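For reference, `array_insert` uses 1-based positions, so index 0 is meaningless and the change makes it fail fast instead of guessing. A pure-Python sketch of the positive-position semantics (not Spark's implementation; negative positions, which Spark also accepts, are elided):

```python
def array_insert(arr, pos, value):
    # Positions are 1-based: pos=1 inserts at the front. A position of 0
    # has no meaning in a 1-based scheme, so it is rejected outright.
    if pos == 0:
        raise ValueError("array_insert position must not be 0 (positions are 1-based)")
    if pos < 0:
        raise NotImplementedError("negative positions elided in this sketch")
    return arr[: pos - 1] + [value] + arr[pos - 1 :]

assert array_insert([5, 3, 2, 1], 2, 4) == [5, 4, 3, 2, 1]

try:
    array_insert([1, 2, 3], 0, 9)
except ValueError:
    pass
else:
    raise AssertionError("position 0 should fail")
```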
[jira] [Updated] (SPARK-43009) Parameterized sql() with constants
[ https://issues.apache.org/jira/browse/SPARK-43009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-43009: - Priority: Blocker (was: Major) > Parameterized sql() with constants > -- > > Key: SPARK-43009 > URL: https://issues.apache.org/jira/browse/SPARK-43009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Blocker > > Change Scala/Java/Python APIs to accept any language objects from which would > be possible to construct literal columns. > The current implementation the parameterized sql() requires arguments as > string values parsed to SQL literal expressions that causes the following > issues: > * SQL comments are skipped while parsing, so, some fragments of input might > be skipped. For example, 'Europe -- Amsterdam'. In this case, -- Amsterdam is > excluded from the input. > * Special chars in string values must be escaped, for instance 'E\'Twaun > Moore' -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
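The two issues listed (comment skipping and escaping) both follow from parsing argument strings as SQL text. A small pure-Python model of the difference between textual substitution and binding a typed literal (illustrative only, not Spark's parser):

```python
# Textual substitution re-parses the value as SQL, so "--" inside it
# starts a line comment and truncates everything after it.
def bind_by_parsing(template, value):
    sql = template.replace(":loc", value)
    return sql.split("--")[0].rstrip()  # crude model of comment stripping

# Binding as a typed literal keeps the value opaque; nothing in it is
# re-parsed, and quotes are escaped mechanically.
def bind_as_literal(template, value):
    return template.replace(":loc", "'" + value.replace("'", "''") + "'")

template = "SELECT * FROM t WHERE location = :loc"
assert bind_by_parsing(template, "Europe -- Amsterdam").endswith("= Europe")
assert bind_as_literal(template, "Europe -- Amsterdam").endswith("'Europe -- Amsterdam'")
assert bind_as_literal(template, "E'Twaun Moore").endswith("'E''Twaun Moore'")
```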
[jira] [Resolved] (SPARK-42693) API Auditing
[ https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-42693. -- Resolution: Done > API Auditing > > > Key: SPARK-42693 > URL: https://issues.apache.org/jira/browse/SPARK-42693 > Project: Spark > Issue Type: Story > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Blocker > > Audit user-facing API of Spark 3.4. The main goal is to ensure public API > docs are ready for release, for example, that no private classes/methods are > leaked or marked public. > There are 3 common ways to audit API: > * build docs (into a local website) against branch-3.4 to check > * 'git diff' to check the source code differences between v3.3.2 and > branch-3.4 > * [https://github.com/apache/spark-website/pull/443] shows most of the API > doc differences between v3.3.2 and the 3.4.0 RC4 (the latest RC); commits are > categorized by components -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42862) Review and fix issues in Core API docs
[ https://issues.apache.org/jira/browse/SPARK-42862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-42862. -- Resolution: Resolved > Review and fix issues in Core API docs > -- > > Key: SPARK-42862 > URL: https://issues.apache.org/jira/browse/SPARK-42862 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42866) Review and fix issues in Spark Connect - Scala API docs
[ https://issues.apache.org/jira/browse/SPARK-42866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-42866. -- Resolution: Won't Do There doesn't seem to be a separate API doc for Spark Connect Scala Client. So no API auditing is required for now. > Review and fix issues in Spark Connect - Scala API docs > --- > > Key: SPARK-42866 > URL: https://issues.apache.org/jira/browse/SPARK-42866 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API
[ https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42393: - Description: There are derivative APIs which depend on the implementation of Pandas UDFs: Pandas Function APIs and Arrow Function APIs, as shown below: !image-2023-03-29-11-40-44-318.png|width=576,height=225! Spark Connect Python Client (SCPC), as a client and server interface for PySpark will eventually replace the legacy API of PySpark. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. was:See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. > Support for Pandas/Arrow Functions API > -- > > Key: SPARK-42393 > URL: https://issues.apache.org/jira/browse/SPARK-42393 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Attachments: image-2023-03-29-11-40-44-318.png > > > There are derivative APIs which depend on the implementation of Pandas UDFs: > Pandas Function APIs and Arrow Function APIs, as shown below: > !image-2023-03-29-11-40-44-318.png|width=576,height=225! > > Spark Connect Python Client (SCPC), as a client and server interface for > PySpark will eventually replace the legacy API of PySpark. Supporting PySpark > UDFs is essential for Spark Connect to reach parity with the PySpark legacy > API. > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API
[ https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42393: - Attachment: image-2023-03-29-11-40-44-318.png > Support for Pandas/Arrow Functions API > -- > > Key: SPARK-42393 > URL: https://issues.apache.org/jira/browse/SPARK-42393 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Attachments: image-2023-03-29-11-40-44-318.png > > > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python
[ https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41661: - Description: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark will eventually replace the legacy API of PySpark. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. was: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark will eventually replace the legacy API of PySpark in OSS. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. > Support for User-defined Functions in Python > > > Key: SPARK-41661 > URL: https://issues.apache.org/jira/browse/SPARK-41661 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Xinrong Meng >Priority: Major > > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. 
> User-defined Functions in Python consist of (pickled) Python UDFs and > (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code > on top of the Apache Spark™ engine. Users only have to state "what to do"; > PySpark, as a sandbox, encapsulates "how to do it". > Spark Connect Python Client (SCPC), as a client and server interface for > PySpark will eventually replace the legacy API of PySpark. Supporting PySpark > UDFs is essential for Spark Connect to reach parity with the PySpark legacy > API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python
[ https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41661: - Description: User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark will eventually replace the legacy API of PySpark. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. was: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark will eventually replace the legacy API of PySpark. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. > Support for User-defined Functions in Python > > > Key: SPARK-41661 > URL: https://issues.apache.org/jira/browse/SPARK-41661 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Xinrong Meng >Priority: Major > > User-defined Functions in Python consist of (pickled) Python UDFs and > (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code > on top of the Apache Spark™ engine. 
Users only have to state "what to do"; > PySpark, as a sandbox, encapsulates "how to do it". > Spark Connect Python Client (SCPC), as a client and server interface for > PySpark will eventually replace the legacy API of PySpark. Supporting PySpark > UDFs is essential for Spark Connect to reach parity with the PySpark legacy > API. > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
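To make the "pickled" part of the description concrete: a Python UDF is an ordinary callable serialized on the driver and rebuilt in the Python worker. PySpark uses cloudpickle so that closures and lambdas work too; the sketch below uses a plain-picklable standard-library callable to show the same round-trip:

```python
import functools
import operator
import pickle

# What travels from the driver to the Python worker: a serialized callable.
add_one = functools.partial(operator.add, 1)
payload = pickle.dumps(add_one)

# What the worker rebuilds and then applies to each incoming row.
rebuilt = pickle.loads(payload)
assert rebuilt(41) == 42
```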
[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python
[ https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41661: - Description: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark, will eventually replace the legacy API of PySpark in OSS. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. was: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark, will eventually (probably Spark 4.0) replace the legacy API of PySpark in both OSS. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. > Support for User-defined Functions in Python > > > Key: SPARK-41661 > URL: https://issues.apache.org/jira/browse/SPARK-41661 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Xinrong Meng >Priority: Major > > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. 
> User-defined Functions in Python consist of (pickled) Python UDFs and > (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code > on top of the Apache Spark™ engine. Users only have to state "what to do"; > PySpark, as a sandbox, encapsulates "how to do it". > Spark Connect Python Client (SCPC), as a client and server interface for > PySpark, will eventually replace the legacy API of PySpark in OSS. Supporting > PySpark UDFs is essential for Spark Connect to reach parity with the PySpark > legacy API.
[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python
[ https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41661: - Description: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. User-defined Functions in Python consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do"; PySpark, as a sandbox, encapsulates "how to do it". Spark Connect Python Client (SCPC), as a client and server interface for PySpark, will eventually (probably Spark 4.0) replace the legacy API of PySpark in both OSS. Supporting PySpark UDFs is essential for Spark Connect to reach parity with the PySpark legacy API. was: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. PySpark UDFs mainly consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. > Support for User-defined Functions in Python > > > Key: SPARK-41661 > URL: https://issues.apache.org/jira/browse/SPARK-41661 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Xinrong Meng >Priority: Major > > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. > User-defined Functions in Python consist of (pickled) Python UDFs and > (Arrow-optimized) Pandas UDFs. They enable users to run arbitrary Python code > on top of the Apache Spark™ engine. Users only have to state "what to do"; > PySpark, as a sandbox, encapsulates "how to do it". > Spark Connect Python Client (SCPC), as a client and server interface for > PySpark, will eventually (probably Spark 4.0) replace the legacy API of > PySpark in both OSS. 
Supporting PySpark UDFs is essential for Spark Connect > to reach parity with the PySpark legacy API.
[jira] [Updated] (SPARK-41661) Support for User-defined Functions in Python
[ https://issues.apache.org/jira/browse/SPARK-41661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41661: - Description: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. PySpark UDFs mainly consist of (pickled) Python UDFs and (Arrow-optimized) Pandas UDFs. was:Spark Connect should support Python UDFs > Support for User-defined Functions in Python > > > Key: SPARK-41661 > URL: https://issues.apache.org/jira/browse/SPARK-41661 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Xinrong Meng >Priority: Major > > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. > PySpark UDFs mainly consist of (pickled) Python UDFs and (Arrow-optimized) > Pandas UDFs.
[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API
[ https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42393: - Description: See design doc [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub]. > Support for Pandas/Arrow Functions API > -- > > Key: SPARK-42393 > URL: https://issues.apache.org/jira/browse/SPARK-42393 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > See design doc > [here|https://docs.google.com/document/d/e/2PACX-1vRXF8nTdjwH0LbYyp3b6Zt6STEKWsvfKSO7_s4foOB-3zJ2h4_06JF147hUPlADJxZ_X22RFxgZ-fRS/pub].
[jira] [Resolved] (SPARK-42393) Support for Pandas/Arrow Functions API
[ https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-42393. -- Resolution: Resolved > Support for Pandas/Arrow Functions API > -- > > Key: SPARK-42393 > URL: https://issues.apache.org/jira/browse/SPARK-42393 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0, 3.5.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major >
[jira] [Updated] (SPARK-42393) Support for Pandas/Arrow Functions API
[ https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42393: - Affects Version/s: (was: 3.5.0) > Support for Pandas/Arrow Functions API > -- > > Key: SPARK-42393 > URL: https://issues.apache.org/jira/browse/SPARK-42393 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major >
[jira] [Resolved] (SPARK-42891) Implement CoGrouped Map API
[ https://issues.apache.org/jira/browse/SPARK-42891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-42891. -- Assignee: Xinrong Meng Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/40487] and [https://github.com/apache/spark/pull/40539] > Implement CoGrouped Map API > --- > > Key: SPARK-42891 > URL: https://issues.apache.org/jira/browse/SPARK-42891 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Implement CoGrouped Map API.
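As a sketch of what the cogrouped map API looks like to a user: the function passed to `applyInPandas` receives the two cogrouped pandas DataFrames and returns a single pandas DataFrame. Column names and the result schema below are illustrative, not taken from the ticket.

```python
import pandas as pd

# User function for cogrouped map: one pandas DataFrame per side of
# the cogroup, merged into a single result DataFrame.
def merge_groups(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge(left, right, on="id")

# With a live SparkSession, the call site would look roughly like:
#   df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
#       merge_groups, schema="id long, v1 double, v2 string")
```

Only the pandas-level function is exercised here; the Spark call site is shown as a comment because it needs a running cluster.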
[jira] [Updated] (SPARK-42693) API Auditing
[ https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42693: - Description: Audit user-facing API of Spark 3.4. The main goal is to ensure public API docs are ready for release, for example, that no private classes/methods are leaked or marked public. There are 3 common ways to audit API: * build docs (into a local website) against branch-3.4 to check * 'git diff' to check the source code differences between v3.3.2 and branch-3.4 * [https://github.com/apache/spark-website/pull/443] shows most of the API doc differences between v3.3.2 and the 3.4.0 RC4 (the latest RC); commits are categorized by components was: Audit user-facing API of Spark 3.4. The main goal is to ensure public API docs are ready for release, for example, that no private classes/methods are leaked or marked public. There are 3 common ways to audit API: * [https://github.com/apache/spark-website/pull/443] shows most of the API doc differences between 3.3.2 and the 3.4.0 RC4 (the latest RC); commits are categorized by components * 'git diff' to check the source code differences between v3.3.2 and branch-3.4 * build a local website against branch-3.4 to check > API Auditing > > > Key: SPARK-42693 > URL: https://issues.apache.org/jira/browse/SPARK-42693 > Project: Spark > Issue Type: Story > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Blocker > > Audit user-facing API of Spark 3.4. The main goal is to ensure public API > docs are ready for release, for example, that no private classes/methods are > leaked or marked public. 
> There are 3 common ways to audit API: > * build docs (into a local website) against branch-3.4 to check > * 'git diff' to check the source code differences between v3.3.2 and > branch-3.4 > * [https://github.com/apache/spark-website/pull/443] shows most of the API > doc differences between v3.3.2 and the 3.4.0 RC4 (the latest RC); commits are > categorized by components
[jira] [Updated] (SPARK-42693) API Auditing
[ https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42693: - Description: Audit user-facing API of Spark 3.4. The main goal is to ensure public API docs are ready for release, for example, that no private classes/methods are leaked or marked public. There are 3 common ways to audit API: * [https://github.com/apache/spark-website/pull/443] shows most of the API doc differences between 3.3.2 and the 3.4.0 RC4 (the latest RC); commits are categorized by components * 'git diff' to check the source code differences between v3.3.2 and branch-3.4 * build a local website against branch-3.4 to check was:Audit user-facing API of Spark 3.4. > API Auditing > > > Key: SPARK-42693 > URL: https://issues.apache.org/jira/browse/SPARK-42693 > Project: Spark > Issue Type: Story > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Blocker > > Audit user-facing API of Spark 3.4. The main goal is to ensure public API > docs are ready for release, for example, that no private classes/methods are > leaked or marked public. > There are 3 common ways to audit API: > * [https://github.com/apache/spark-website/pull/443] shows most of the API > doc differences between 3.3.2 and the 3.4.0 RC4 (the latest RC); commits are > categorized by components > * 'git diff' to check the source code differences between v3.3.2 and > branch-3.4 > * build a local website against branch-3.4 to check
[jira] [Created] (SPARK-42908) Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings
Xinrong Meng created SPARK-42908: Summary: Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings Key: SPARK-42908 URL: https://issues.apache.org/jira/browse/SPARK-42908 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0, 3.5.0 Reporter: Xinrong Meng Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings.
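The improvement can be sketched as a simple fail-fast guard. The function and parameter names below are hypothetical; the real check lives in PySpark's type-parsing utilities, which need a live SparkContext to parse DDL-formatted type strings such as "array<int>" or "struct<a:int,b:string>".

```python
# Illustrative sketch of the proposed guard (names are hypothetical):
# fail fast with a clear RuntimeError instead of an obscure error when
# no SparkContext is available to back the DDL parser.
def parse_ddl_type(ddl: str, active_context):
    if active_context is None:
        raise RuntimeError(
            "SparkContext must be initialized before parsing "
            f"DDL-formatted type string {ddl!r}"
        )
    return active_context.parse(ddl)
```
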
[jira] [Reopened] (SPARK-40307) Introduce Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reopened SPARK-40307: -- > Introduce Arrow-optimized Python UDFs > - > > Key: SPARK-40307 > URL: https://issues.apache.org/jira/browse/SPARK-40307 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Python user-defined function (UDF) enables users to run arbitrary code > against PySpark columns. It uses Pickle for (de)serialization and executes > row by row. > One major performance bottleneck of Python UDFs is (de)serialization, that > is, the data interchange between the worker JVM and the spawned Python > subprocess which actually executes the UDF. We should seek an alternative to > handle the (de)serialization: Arrow, which is used in the (de)serialization > of Pandas UDF already. > There should be two ways to enable/disable the Arrow optimization for Python > UDFs: > - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, > disabled by default. > - the `useArrow` parameter of the `udf` function, None by default. > The Spark configuration takes effect only when `useArrow` is None. Otherwise, > `useArrow` decides whether a specific user-defined function is optimized by > Arrow or not. > The reason why we introduce these two ways is to provide both a convenient, > per-Spark-session control and a finer-grained, per-UDF control of the Arrow > optimization for Python UDFs.
[jira] [Updated] (SPARK-40307) Introduce Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40307: - Affects Version/s: 3.5.0 > Introduce Arrow-optimized Python UDFs > - > > Key: SPARK-40307 > URL: https://issues.apache.org/jira/browse/SPARK-40307 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0, 3.5.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Python user-defined function (UDF) enables users to run arbitrary code > against PySpark columns. It uses Pickle for (de)serialization and executes > row by row. > One major performance bottleneck of Python UDFs is (de)serialization, that > is, the data interchange between the worker JVM and the spawned Python > subprocess which actually executes the UDF. We should seek an alternative to > handle the (de)serialization: Arrow, which is used in the (de)serialization > of Pandas UDF already. > There should be two ways to enable/disable the Arrow optimization for Python > UDFs: > - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, > disabled by default. > - the `useArrow` parameter of the `udf` function, None by default. > The Spark configuration takes effect only when `useArrow` is None. Otherwise, > `useArrow` decides whether a specific user-defined function is optimized by > Arrow or not. > The reason why we introduce these two ways is to provide both a convenient, > per-Spark-session control and a finer-grained, per-UDF control of the Arrow > optimization for Python UDFs.
[jira] [Updated] (SPARK-42893) Block the usage of Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42893: - Description: Considering the upcoming improvements on the result inconsistencies between traditional Pickled Python UDFs and Arrow-optimized Python UDFs, we'd better block the feature, otherwise, users who try out the feature will expect behavior changes in the next release. In addition, since Spark Connect Python Client (SCPC) has been introduced in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark and SCPC at the same time for compatibility. was:Considering the upcoming improvements on the result inconsistencies between traditional Pickled Python UDFs and Arrow-optimized Python UDFs, we'd better block the feature, otherwise, users who try out the feature will expect behavior changes in the next release. > Block the usage of Arrow-optimized Python UDFs > -- > > Key: SPARK-42893 > URL: https://issues.apache.org/jira/browse/SPARK-42893 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Considering the upcoming improvements on the result inconsistencies between > traditional Pickled Python UDFs and Arrow-optimized Python UDFs, we'd better > block the feature, otherwise, users who try out the feature will expect > behavior changes in the next release. > In addition, since Spark Connect Python Client (SCPC) has been introduced in > Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark > and SCPC at the same time for compatibility.
[jira] [Updated] (SPARK-42893) Block Arrow-optimized Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42893: - Summary: Block Arrow-optimized Python UDFs (was: Block the usage of Arrow-optimized Python UDFs) > Block Arrow-optimized Python UDFs > - > > Key: SPARK-42893 > URL: https://issues.apache.org/jira/browse/SPARK-42893 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Considering the upcoming improvements on the result inconsistencies between > traditional Pickled Python UDFs and Arrow-optimized Python UDFs, we'd better > block the feature, otherwise, users who try out the feature will expect > behavior changes in the next release. > In addition, since Spark Connect Python Client (SCPC) has been introduced in > Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark > and SCPC at the same time for compatibility.
[jira] [Created] (SPARK-42893) Block the usage of Arrow-optimized Python UDFs
Xinrong Meng created SPARK-42893: Summary: Block the usage of Arrow-optimized Python UDFs Key: SPARK-42893 URL: https://issues.apache.org/jira/browse/SPARK-42893 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Considering the upcoming improvements on the result inconsistencies between traditional Pickled Python UDFs and Arrow-optimized Python UDFs, we'd better block the feature, otherwise, users who try out the feature will expect behavior changes in the next release.
[jira] [Updated] (SPARK-42340) Implement Grouped Map API
[ https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42340: - Summary: Implement Grouped Map API (was: Implement GroupedData.applyInPandas) > Implement Grouped Map API > - > > Key: SPARK-42340 > URL: https://issues.apache.org/jira/browse/SPARK-42340 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.1 > >
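For reference, the grouped map pattern this ticket covers: a pandas function applied once per group via `GroupedData.applyInPandas`. Column names and schema below are illustrative.

```python
import pandas as pd

# User function for grouped map: receives one group as a pandas DataFrame
# and returns a pandas DataFrame with the declared result schema.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

# With a live SparkSession, the call site would look roughly like:
#   df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double")
```

As with the cogrouped variant, only the pandas-level function runs here; the Spark call site is shown as a comment because it needs a running cluster.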
[jira] [Created] (SPARK-42891) Implement CoGrouped Map API
Xinrong Meng created SPARK-42891: Summary: Implement CoGrouped Map API Key: SPARK-42891 URL: https://issues.apache.org/jira/browse/SPARK-42891 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement CoGrouped Map API.
[jira] [Comment Edited] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703123#comment-17703123 ] Xinrong Meng edited comment on SPARK-40327 at 3/21/23 9:48 AM: --- All resolved issues are moved to https://issues.apache.org/jira/browse/SPARK-42882 for clarity and references in the release note. The version is also modified to Spark 3.5.0. was (Author: xinrongm): Hi, all resolved issues are moved to https://issues.apache.org/jira/browse/SPARK-42882 for clarity and references in the release note. > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40327: - Affects Version/s: 3.5.0 (was: 3.4.0) > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703123#comment-17703123 ] Xinrong Meng commented on SPARK-40327: -- Hi, all resolved issues are moved to https://issues.apache.org/jira/browse/SPARK-42882 for clarity and references in the release note. > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40327: - Fix Version/s: (was: 3.4.0) > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Updated] (SPARK-40340) Implement `Expanding.sem`.
[ https://issues.apache.org/jira/browse/SPARK-40340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40340: - Parent: SPARK-40327 (was: SPARK-42882) > Implement `Expanding.sem`. > -- > > Key: SPARK-40340 > URL: https://issues.apache.org/jira/browse/SPARK-40340 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We should implement `Expanding.sem` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.sem.html
[jira] [Updated] (SPARK-40341) Implement `Rolling.median`.
[ https://issues.apache.org/jira/browse/SPARK-40341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40341: - Parent: SPARK-40327 (was: SPARK-42882) > Implement `Rolling.median`. > --- > > Key: SPARK-40341 > URL: https://issues.apache.org/jira/browse/SPARK-40341 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > > We should implement `Rolling.median` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html
[jira] [Commented] (SPARK-39199) Implement pandas API missing parameters
[ https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703121#comment-17703121 ] Xinrong Meng commented on SPARK-39199: -- Please see https://issues.apache.org/jira/browse/SPARK-42883 > Implement pandas API missing parameters > --- > > Key: SPARK-39199 > URL: https://issues.apache.org/jira/browse/SPARK-39199 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark, PySpark >Affects Versions: 3.3.0, 3.3.1, 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major >
[jira] [Resolved] (SPARK-42883) Implement Pandas API Missing Parameters
[ https://issues.apache.org/jira/browse/SPARK-42883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-42883. -- Resolution: Resolved > Implement Pandas API Missing Parameters > --- > > Key: SPARK-42883 > URL: https://issues.apache.org/jira/browse/SPARK-42883 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > pandas API on Spark aims to make pandas code work on Spark clusters without > any changes. So full API coverage has been one of our major goals. Currently, > most pandas functions are implemented, whereas some of them have > incomplete parameter support. > There are some common parameters missing (resolved): > * How to do with NAs > * Filter data types > * Control result length > * Reindex result > There are remaining missing parameters to implement (see doc below). > See the design and the current status at > [https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing].
[jira] [Updated] (SPARK-38552) Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties
[ https://issues.apache.org/jira/browse/SPARK-38552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38552: - Parent: SPARK-42883 (was: SPARK-39199) > Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to > resolve ties > -- > > Key: SPARK-38552 > URL: https://issues.apache.org/jira/browse/SPARK-38552 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to > resolve ties
[jira] [Resolved] (SPARK-42882) Pandas API Coverage Improvements
[ https://issues.apache.org/jira/browse/SPARK-42882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng resolved SPARK-42882.
    Resolution: Resolved

> Pandas API Coverage Improvements
>
> Key: SPARK-42882
> URL: https://issues.apache.org/jira/browse/SPARK-42882
> Project: Spark
> Issue Type: Epic
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Priority: Major
>
> Pandas API on Spark aims to make pandas code work on Spark clusters without
> any changes, so full API coverage has been one of our major goals.
[jira] [Updated] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`
[ https://issues.apache.org/jira/browse/SPARK-38938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38938:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `inplace` and `columns` parameters of `Series.drop`
>
> Key: SPARK-38938
> URL: https://issues.apache.org/jira/browse/SPARK-38938
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `inplace` and `columns` parameters of `Series.drop`
[jira] [Updated] (SPARK-38479) Add `Series.duplicated` to indicate duplicate Series values.
[ https://issues.apache.org/jira/browse/SPARK-38479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38479:
    Parent: SPARK-42883 (was: SPARK-39199)

> Add `Series.duplicated` to indicate duplicate Series values.
>
> Key: SPARK-38479
> URL: https://issues.apache.org/jira/browse/SPARK-38479
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Add `Series.duplicated` to indicate duplicate Series values.
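The contract this ticket adds mirrors pandas' `Series.duplicated(keep=...)`: a boolean mask marking repeated values. A minimal pure-Python sketch of that contract over a plain list (illustrative only, not the pyspark.pandas code):

```python
from collections import Counter

def duplicated(values, keep="first"):
    """Boolean mask of duplicates, pandas `Series.duplicated` style."""
    if keep == "first":
        # Every occurrence after the first is a duplicate.
        seen, out = set(), []
        for v in values:
            out.append(v in seen)
            seen.add(v)
        return out
    if keep == "last":
        # A value is a duplicate unless it is the last occurrence.
        return list(reversed(duplicated(list(reversed(values)), keep="first")))
    if keep is False:
        # All occurrences of any repeated value are marked.
        counts = Counter(values)
        return [counts[v] > 1 for v in values]
    raise ValueError(f"invalid keep: {keep!r}")
```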
[jira] [Updated] (SPARK-38518) Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
[ https://issues.apache.org/jira/browse/SPARK-38518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38518:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
>
> Key: SPARK-38518
> URL: https://issues.apache.org/jira/browse/SPARK-38518
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.3.0
>
> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values.
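As a rough sketch of the `skipna` semantics this ticket implements, the following pure-Python function models NA as `None` and assumes the three-valued (Kleene) logic of pandas' nullable boolean dtype, where an unskipped NA makes an otherwise-True result indeterminate; this is an assumption about the intended behavior, not the actual implementation.

```python
def all_skipna(values, skipna=True):
    """All-true reduction with NA modeled as None.

    False always wins; an NA only matters when skipna=False, in which
    case the result is indeterminate (None) rather than True.
    """
    if any(v is False for v in values):
        return False
    if not skipna and any(v is None for v in values):
        return None  # indeterminate under Kleene logic
    return True
```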
[jira] [Updated] (SPARK-39189) interpolate supports limit_area
[ https://issues.apache.org/jira/browse/SPARK-39189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-39189:
    Parent: SPARK-42883 (was: SPARK-39199)

> interpolate supports limit_area
>
> Key: SPARK-39189
> URL: https://issues.apache.org/jira/browse/SPARK-39189
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Updated] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
[ https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38903:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
[jira] [Updated] (SPARK-38943) EWM support ignore_na
[ https://issues.apache.org/jira/browse/SPARK-38943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38943:
    Parent: SPARK-42883 (was: SPARK-39199)

> EWM support ignore_na
>
> Key: SPARK-38943
> URL: https://issues.apache.org/jira/browse/SPARK-38943
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Updated] (SPARK-39907) Implement axis and skipna of Series.argmin
[ https://issues.apache.org/jira/browse/SPARK-39907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-39907:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement axis and skipna of Series.argmin
>
> Key: SPARK-39907
> URL: https://issues.apache.org/jira/browse/SPARK-39907
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Updated] (SPARK-38765) Implement `inplace` parameter of `Series.clip`
[ https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38765:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `inplace` parameter of `Series.clip`
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `inplace` parameter of `Series.clip`
[jira] [Updated] (SPARK-38686) Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-38686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38686:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
>
> Key: SPARK-38686
> URL: https://issues.apache.org/jira/browse/SPARK-38686
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
[jira] [Updated] (SPARK-38704) Support string `inclusive` parameter of `Series.between`
[ https://issues.apache.org/jira/browse/SPARK-38704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38704:
    Parent: SPARK-42883 (was: SPARK-39199)

> Support string `inclusive` parameter of `Series.between`
>
> Key: SPARK-38704
> URL: https://issues.apache.org/jira/browse/SPARK-38704
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Support string `inclusive` parameter of `Series.between`
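The string `inclusive` parameter selects which endpoints of the range count as inside, replacing the older boolean flag. A minimal pure-Python sketch of the pandas `Series.between(left, right, inclusive=...)` semantics over a plain list (illustrative, not the pyspark.pandas implementation):

```python
def between(values, left, right, inclusive="both"):
    """Elementwise range test with pandas-style string `inclusive`."""
    ops = {
        "both":    lambda v: left <= v <= right,
        "neither": lambda v: left < v < right,
        "left":    lambda v: left <= v < right,
        "right":   lambda v: left < v <= right,
    }
    if inclusive not in ops:
        raise ValueError(f"inclusive must be one of {sorted(ops)}")
    return [ops[inclusive](v) for v in values]
```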
[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-39201:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
>
> Key: SPARK-39201
> URL: https://issues.apache.org/jira/browse/SPARK-39201
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
[jira] [Updated] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`
[ https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38387:
    Parent: SPARK-42883 (was: SPARK-39199)

> Support `na_action` and Series input correspondence in `Series.map`
>
> Key: SPARK-38387
> URL: https://issues.apache.org/jira/browse/SPARK-38387
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.3.0
>
> Support `na_action` and Series input correspondence in `Series.map`, in order
> to reach parity with the pandas API.
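The `na_action="ignore"` behavior this ticket adds means NA inputs skip the mapper and propagate unchanged. A pure-Python sketch of the pandas `Series.map(arg, na_action=...)` contract, modeling NA as `None` (illustrative names, not the pyspark.pandas code):

```python
def map_values(values, mapper, na_action=None):
    """Map each element through a dict or callable, pandas-style.

    With na_action="ignore", NA (None here) bypasses the mapper;
    with a dict mapper, missing keys become NA/None.
    """
    def apply(v):
        if v is None and na_action == "ignore":
            return None  # propagate NA without invoking the mapper
        if callable(mapper):
            return mapper(v)
        return mapper.get(v)
    return [apply(v) for v in values]
```

Note the contrast: with a callable mapper and `na_action=None`, the mapper does see the NA value.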
[jira] [Updated] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38576:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
>
> Key: SPARK-38576
> URL: https://issues.apache.org/jira/browse/SPARK-38576
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only.
[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any`
[ https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38608:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`
>
> Key: SPARK-38608
> URL: https://issues.apache.org/jira/browse/SPARK-38608
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` to
> include only boolean columns.
[jira] [Updated] (SPARK-38726) Support `how` parameter of `MultiIndex.dropna`
[ https://issues.apache.org/jira/browse/SPARK-38726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38726:
    Parent: SPARK-42883 (was: SPARK-39199)

> Support `how` parameter of `MultiIndex.dropna`
>
> Key: SPARK-38726
> URL: https://issues.apache.org/jira/browse/SPARK-38726
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Support `how` parameter of `MultiIndex.dropna`
[jira] [Updated] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38441:
    Parent: SPARK-42883 (was: SPARK-39199)

> Support string and bool `regex` in `Series.replace`
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Support string and bool `regex` in `Series.replace` in order to reach parity
> with pandas.
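In pandas, `regex=True` makes a string `to_replace` act as a pattern (applied with `re.sub` inside each string), and a string may also be passed directly as `regex=...` with `value` as the replacement. A simplified pure-Python sketch of those two calling conventions over a plain list (an approximation of the semantics, not the pyspark.pandas implementation):

```python
import re

def replace(values, to_replace, value, regex=False):
    """Elementwise replace with pandas-style string/bool `regex`."""
    if isinstance(regex, str):
        # Pattern passed via `regex=` directly; `to_replace` is unused.
        to_replace, regex = regex, True
    out = []
    for v in values:
        if regex and isinstance(v, str):
            out.append(re.sub(to_replace, value, v))  # pattern substitution
        else:
            out.append(value if v == to_replace else v)  # exact match
    return out
```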
[jira] [Updated] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38989:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `ignore_index` of `DataFrame/Series.sample`
>
> Key: SPARK-38989
> URL: https://issues.apache.org/jira/browse/SPARK-38989
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `ignore_index` of `DataFrame/Series.sample`
[jira] [Updated] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
[ https://issues.apache.org/jira/browse/SPARK-38793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38793:
    Parent: SPARK-42883 (was: SPARK-39199)

> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
>
> Key: SPARK-38793
> URL: https://issues.apache.org/jira/browse/SPARK-38793
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
[jira] [Updated] (SPARK-38491) Support `ignore_index` of `Series.sort_values`
[ https://issues.apache.org/jira/browse/SPARK-38491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38491:
    Parent: SPARK-42883 (was: SPARK-39199)

> Support `ignore_index` of `Series.sort_values`
>
> Key: SPARK-38491
> URL: https://issues.apache.org/jira/browse/SPARK-38491
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.3.0
>
> Support `ignore_index` of `Series.sort_values`
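The `ignore_index` flag appearing throughout these tickets has one meaning: after the operation, relabel the result index 0, 1, ..., n-1 instead of carrying the original labels through. A pure-Python sketch over `(index_label, value)` pairs (illustrative, not the pyspark.pandas implementation):

```python
def sort_values(pairs, ignore_index=False):
    """Sort (label, value) pairs by value, pandas `ignore_index` style."""
    ranked = sorted(pairs, key=lambda kv: kv[1])
    if ignore_index:
        # Discard the carried-through labels and relabel 0..n-1.
        return [(i, v) for i, (_, v) in enumerate(ranked)]
    return ranked
```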
[jira] [Updated] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`
[ https://issues.apache.org/jira/browse/SPARK-38863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-38863:
    Parent: SPARK-42883 (was: SPARK-39199)

> Implement `skipna` parameter of `DataFrame.all`
>
> Key: SPARK-38863
> URL: https://issues.apache.org/jira/browse/SPARK-38863
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Implement `skipna` parameter of `DataFrame.all`.