[jira] [Created] (SPARK-39581) Better error messages for Pandas API on Spark
Xinrong Meng created SPARK-39581: Summary: Better error messages for Pandas API on Spark Key: SPARK-39581 URL: https://issues.apache.org/jira/browse/SPARK-39581 Project: Spark Issue Type: Improvement Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Xinrong Meng Currently, some error messages for the pandas API on Spark are confusing. Specifically, * when users assume pandas-on-Spark implements pandas scalars * when users pass a ps.Index when creating a DataFrame/Series Better error messages enhance usability and, in turn, user adoption. We should improve the error messages for the Pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39574) Better error message when `ps.Index` is used for DataFrame/Series creation
[ https://issues.apache.org/jira/browse/SPARK-39574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39574: - Parent: SPARK-39581 Issue Type: Sub-task (was: Improvement) > Better error message when `ps.Index` is used for DataFrame/Series creation > -- > > Key: SPARK-39574 > URL: https://issues.apache.org/jira/browse/SPARK-39574 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Better error message when `ps.Index` is used for DataFrame/Series creation.
[jira] [Updated] (SPARK-39574) Better error message when `ps.Index` is used for DataFrame/Series creation
[ https://issues.apache.org/jira/browse/SPARK-39574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39574: - Summary: Better error message when `ps.Index` is used for DataFrame/Series creation (was: Better error message when ps.Index is used for DataFrame/Series creation) > Better error message when `ps.Index` is used for DataFrame/Series creation > -- > > Key: SPARK-39574 > URL: https://issues.apache.org/jira/browse/SPARK-39574 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Better error message when `ps.Index` is used for DataFrame/Series creation.
[jira] [Created] (SPARK-39574) Better error message when ps.Index is used for DataFrame/Series creation
Xinrong Meng created SPARK-39574: Summary: Better error message when ps.Index is used for DataFrame/Series creation Key: SPARK-39574 URL: https://issues.apache.org/jira/browse/SPARK-39574 Project: Spark Issue Type: Improvement Components: Pandas API on Spark, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Better error message when `ps.Index` is used for DataFrame/Series creation.
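A guard along these lines could produce the clearer message. This is an illustrative sketch only, not the actual pandas-on-Spark implementation; the name-based check and the suggested error wording are assumptions:

```python
def check_data_arg(data):
    # Hypothetical guard: reject a pandas-on-Spark Index passed as the `data`
    # argument with an actionable message, instead of failing later with a
    # confusing internal error. The class-name check stands in for a real
    # isinstance(data, ps.Index) test.
    if type(data).__name__ in ("Index", "MultiIndex"):
        raise TypeError(
            "an Index cannot be used directly for DataFrame/Series creation; "
            "convert it with to_pandas() or to_numpy() first."
        )
    return data
```

Regular list/tuple inputs would pass through unchanged.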
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: Currently, DataFrame creation from a list of scalars is unsupported as below: |>>> spark.createDataFrame([1, 2]) Traceback (most recent call last): ... raise TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can not infer schema for type: <class 'int'>| However, cases below are supported. |>>> spark.createDataFrame([(1,), (2,)]).collect() [Row(_1=1), Row(_1=2)]| |>>> schema StructType([StructField('_1', LongType(), True)]) >>> spark.createDataFrame([1, 2], schema=schema).collect() [Row(_1=1), Row(_1=2)]| In addition, Spark DataFrame Scala API supports creating a DataFrame from a list of scalars as below: |scala> Seq(1, 2).toDF().collect() res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more at https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing was: Currently, DataFrame creation from a list of scalars is unsupported as below: |>>> spark.createDataFrame([1, 2]) Traceback (most recent call last): ... raise TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can not infer schema for type: <class 'int'>| However, cases below are supported.
|>>> spark.createDataFrame([(1,), (2,)]).collect() [Row(_1=1), Row(_1=2)]| |>>> schema StructType([StructField('_1', LongType(), True)]) >>> spark.createDataFrame([1, 2], schema=schema).collect() [Row(_1=1), Row(_1=2)]| In addition, Spark DataFrame Scala API supports creating a DataFrame from a list of scalars as below: |scala> Seq(1, 2).toDF().collect() res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more at https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing > Support `createDataFrame` from a list of scalars > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of scalars is unsupported as below: > |>>> spark.createDataFrame([1, 2]) > Traceback (most recent call last): > ... > raise TypeError("Can not infer schema for type: %s" % type(row)) > TypeError: Can not infer schema for type: <class 'int'>| > > However, cases below are supported. > |>>> spark.createDataFrame([(1,), (2,)]).collect() > [Row(_1=1), Row(_1=2)]| > > |>>> schema > StructType([StructField('_1', LongType(), True)]) > >>> spark.createDataFrame([1, 2], schema=schema).collect() > [Row(_1=1), Row(_1=2)]| > > In addition, Spark DataFrame Scala API supports creating a DataFrame from a > list of scalars as below: > |scala> Seq(1, 2).toDF().collect() > res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| > > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars.
See more at > https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing
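One way to support this, sketched below with illustrative names (not Spark's actual internals), is to normalize bare scalars into one-element tuples before schema inference, mirroring the already-supported `[(1,), (2,)]` form:

```python
def normalize_rows(data):
    # Wrap each bare scalar in a one-element tuple so that downstream schema
    # inference sees single-field rows; tuples, lists, and dicts pass through
    # unchanged.
    scalar_types = (bool, int, float, str, bytes)
    return [(x,) if isinstance(x, scalar_types) else x for x in data]

print(normalize_rows([1, 2]))        # [(1,), (2,)]
print(normalize_rows([(1,), (2,)]))  # [(1,), (2,)]
```

With this normalization, `spark.createDataFrame([1, 2])` could behave like the schema-supplied case shown above.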
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: Currently, DataFrame creation from a list of scalars is unsupported as below: |>>> spark.createDataFrame([1, 2]) Traceback (most recent call last): ... raise TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can not infer schema for type: <class 'int'>| However, cases below are supported. |>>> spark.createDataFrame([(1,), (2,)]).collect() [Row(_1=1), Row(_1=2)]| |>>> schema StructType([StructField('_1', LongType(), True)]) >>> spark.createDataFrame([1, 2], schema=schema).collect() [Row(_1=1), Row(_1=2)]| In addition, Spark DataFrame Scala API supports creating a DataFrame from a list of scalars as below: |scala> Seq(1, 2).toDF().collect() res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more at https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing was: - Support `createDataFrame` from a list of scalars. - Standardize error messages when the input list contains any scalars. > Support `createDataFrame` from a list of scalars > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of scalars is unsupported as below: > |>>> spark.createDataFrame([1, 2]) > Traceback (most recent call last): > ... > raise TypeError("Can not infer schema for type: %s" % type(row)) > TypeError: Can not infer schema for type: <class 'int'>| > > However, cases below are supported.
> |>>> spark.createDataFrame([(1,), (2,)]).collect() > [Row(_1=1), Row(_1=2)]| > > |>>> schema > StructType([StructField('_1', LongType(), True)]) > >>> spark.createDataFrame([1, 2], schema=schema).collect() > [Row(_1=1), Row(_1=2)]| > > In addition, Spark DataFrame Scala API supports creating a DataFrame from a > list of scalars as below: > |scala> Seq(1, 2).toDF().collect() > res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| > > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. See more at > https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing
[jira] [Updated] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
[ https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39550: - Description: When Arrow Execution is enabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'true' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 dtype: int64 {code} When Arrow Execution is disabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'false' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() (1, a) 1 (2, b) 1 dtype: int64 {code} Notice that the indexes of the results differ. In particular, `value_counts` returns an Index (rather than a MultiIndex), which is, under the hood, a single Spark column of StructType (rather than multiple Spark columns); so when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where we expect a tuple instead. was: When Arrow Execution is enabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'true' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 dtype: int64 {code} When Arrow Execution is disabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'false' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() (1, a) 1 (2, b) 1 dtype: int64 {code} Notice how indexes of their results are different. Especially, `value_counts` returns a Index (rather than a MultiIndex), under the hood, a Spark column of StructType (rather than multiple Spark columns), so when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where we expect a tuple instad.
> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled > --- > > Key: SPARK-39550 > URL: https://issues.apache.org/jira/browse/SPARK-39550 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > > When Arrow Execution is enabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'true' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 > {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 > dtype: int64 > {code} > When Arrow Execution is disabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'false' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > (1, a) 1 > (2, b) 1 > dtype: int64 {code} > Notice that the indexes of the results differ. > In particular, `value_counts` returns an Index (rather than a MultiIndex), which > is, under the hood, a single Spark column of StructType (rather than multiple > Spark columns); so when Arrow Execution is enabled, Arrow converts the StructType > column to a dictionary, where we expect a tuple instead. >
[jira] [Commented] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
[ https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557145#comment-17557145 ] Xinrong Meng commented on SPARK-39550: -- I am working on that. > Fix `MultiIndex.value_counts()` when Arrow Execution is enabled > --- > > Key: SPARK-39550 > URL: https://issues.apache.org/jira/browse/SPARK-39550 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > > When Arrow Execution is enabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'true' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 > {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 > dtype: int64 > {code} > When Arrow Execution is disabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'false' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > (1, a) 1 > (2, b) 1 > dtype: int64 {code} > Notice that the indexes of the results differ. > In particular, `value_counts` returns an Index (rather than a MultiIndex), which > is, under the hood, a single Spark column of StructType (rather than multiple > Spark columns); so when Arrow Execution is enabled, Arrow converts the StructType > column to a dictionary, where we expect a tuple instead. >
[jira] [Created] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
Xinrong Meng created SPARK-39550: Summary: Fix `MultiIndex.value_counts()` when Arrow Execution is enabled Key: SPARK-39550 URL: https://issues.apache.org/jira/browse/SPARK-39550 Project: Spark Issue Type: Bug Components: Pandas API on Spark, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng When Arrow Execution is enabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'true' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 dtype: int64 {code} When Arrow Execution is disabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'false' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() (1, a) 1 (2, b) 1 dtype: int64 {code} Notice that the indexes of the results differ. In particular, `value_counts` returns an Index (rather than a MultiIndex), which is, under the hood, a single Spark column of StructType (rather than multiple Spark columns); so when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where we expect a tuple instead.
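A minimal sketch of the conversion the fix needs, assuming the Arrow path hands back each StructType row as a dict with field order preserved (as in the output above); the function name is illustrative, not the actual pandas-on-Spark code:

```python
def struct_row_to_tuple(row):
    # Arrow materializes a StructType column as a dict of field -> value;
    # the MultiIndex value_counts path expects a tuple of the values instead.
    # Rows that are already tuples (the non-Arrow path) pass through as-is.
    return tuple(row.values()) if isinstance(row, dict) else row

print(struct_row_to_tuple({'__index_level_0__': 1, '__index_level_1__': 'a'}))
# (1, 'a')
```

Applying this to each index value would make the Arrow-enabled output match the Arrow-disabled one.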
[jira] [Created] (SPARK-39494) Support `createDataFrame` from a list of scalars
Xinrong Meng created SPARK-39494: Summary: Support `createDataFrame` from a list of scalars Key: SPARK-39494 URL: https://issues.apache.org/jira/browse/SPARK-39494 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng - Support `createDataFrame` from a list of scalars. - Standardize error messages when the input list contains any scalars.
[jira] [Created] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array
Xinrong Meng created SPARK-39483: Summary: Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array Key: SPARK-39483 URL: https://issues.apache.org/jira/browse/SPARK-39483 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array.
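The mapping could look roughly like the sketch below. The dtype-to-Spark-type correspondence shown is an assumption for illustration (using type names as strings so the sketch runs without PySpark), not the committed design:

```python
# Illustrative NumPy-dtype-name -> Spark SQL type mapping.
NUMPY_TO_SPARK = {
    "int8": "ByteType", "int16": "ShortType", "int32": "IntegerType",
    "int64": "LongType", "float32": "FloatType", "float64": "DoubleType",
    "bool": "BooleanType",
}

def spark_type_for(dtype_name):
    # Derive one schema field type from the array's dtype name; unsupported
    # dtypes get an explicit error rather than a generic inference failure.
    try:
        return NUMPY_TO_SPARK[dtype_name]
    except KeyError:
        raise TypeError("Can not infer schema for NumPy dtype: %s" % dtype_name)

print(spark_type_for("int64"))  # LongType
```

In the real implementation the lookup would be keyed on `np.dtype` objects and return `pyspark.sql.types` instances.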
[jira] [Created] (SPARK-39443) Improve docstring of pyspark.sql.functions.col/first
Xinrong Meng created SPARK-39443: Summary: Improve docstring of pyspark.sql.functions.col/first Key: SPARK-39443 URL: https://issues.apache.org/jira/browse/SPARK-39443 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Improve the docstrings of pyspark.sql.functions.col and pyspark.sql.functions.first: `col`'s docstring is malformatted, and `first`'s lacks examples.
[jira] [Updated] (SPARK-39405) NumPy support in SQL
[ https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39405: - Description: NumPy is the fundamental package for scientific computing with Python. It is very commonly used, especially in the data science world. For example, Pandas is backed by NumPy, and tensor libraries also support interchangeable conversion to/from NumPy arrays. However, PySpark only supports Python built-in types, with the exception of SparkSession.createDataFrame(pandas.DataFrame) and DataFrame.toPandas. This issue has been raised multiple times internally and externally; see also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. With NumPy support in SQL, we expect broader adoption by data scientists and newcomers, who can leverage their existing NumPy background and codebases. See more at https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit# . was: NumPy is the fundamental package for scientific computing with Python. It is very commonly used, especially in the data science world. For example, Pandas is backed by NumPy, and Tensors also supports interchangeable conversion from/to NumPy arrays. However, PySpark only supports Python built-in types with the exception of “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”. This issue has been raised multiple times internally and externally, see also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. With the NumPy support in SQL, we expect more adaptations from naive data scientists and newcomers leveraging their existing background and codebase with NumPy. See more at [NumPy support in SQL|https://docs.google.com/document/d/1ZC3e-GpvpoQFtEFnwct0me1XPsiwFf_qu4nRdKCpMBg/edit#].
> NumPy support in SQL > > > Key: SPARK-39405 > URL: https://issues.apache.org/jira/browse/SPARK-39405 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > NumPy is the fundamental package for scientific computing with Python. It is > very commonly used, especially in the data science world. For example, Pandas > is backed by NumPy, and tensor libraries also support interchangeable conversion > to/from NumPy arrays. > > However, PySpark only supports Python built-in types, with the exception of > SparkSession.createDataFrame(pandas.DataFrame) and DataFrame.toPandas. > > This issue has been raised multiple times internally and externally; see also > SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. > > With NumPy support in SQL, we expect broader adoption by data scientists and > newcomers, who can leverage their existing NumPy background and codebases. > > See more at > https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#
[jira] [Updated] (SPARK-39405) NumPy support in SQL
[ https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39405: - Description: NumPy is the fundamental package for scientific computing with Python. It is very commonly used, especially in the data science world. For example, Pandas is backed by NumPy, and tensor libraries also support interchangeable conversion to/from NumPy arrays. However, PySpark only supports Python built-in types, with the exception of SparkSession.createDataFrame(pandas.DataFrame) and DataFrame.toPandas. This issue has been raised multiple times internally and externally; see also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. With NumPy support in SQL, we expect broader adoption by data scientists and newcomers, who can leverage their existing NumPy background and codebases. See more at [NumPy support in SQL|https://docs.google.com/document/d/1ZC3e-GpvpoQFtEFnwct0me1XPsiwFf_qu4nRdKCpMBg/edit#]. was: NumPy is the fundamental package for scientific computing with Python. It is very commonly used, especially in the data science world. For example, Pandas is backed by NumPy, and Tensors also supports interchangeable conversion from/to NumPy arrays. However, PySpark only supports Python built-in types with the exception of “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”. This issue has been raised multiple times internally and externally, see also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. With the NumPy support in SQL, we expect more adaptations from naive data scientists and newcomers leveraging their existing background and codebase with NumPy. See more at [].
> NumPy support in SQL > > > Key: SPARK-39405 > URL: https://issues.apache.org/jira/browse/SPARK-39405 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > NumPy is the fundamental package for scientific computing with Python. It is > very commonly used, especially in the data science world. For example, Pandas > is backed by NumPy, and tensor libraries also support interchangeable conversion > to/from NumPy arrays. > > However, PySpark only supports Python built-in types, with the exception of > SparkSession.createDataFrame(pandas.DataFrame) and DataFrame.toPandas. > > This issue has been raised multiple times internally and externally; see also > SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. > > With NumPy support in SQL, we expect broader adoption by data scientists and > newcomers, who can leverage their existing NumPy background and codebases. > > See more at [NumPy support in > SQL|https://docs.google.com/document/d/1ZC3e-GpvpoQFtEFnwct0me1XPsiwFf_qu4nRdKCpMBg/edit#].
[jira] [Updated] (SPARK-39406) Accept NumPy array in createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-39406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39406: - Summary: Accept NumPy array in createDataFrame (was: Accept numpy array in createDataFrame) > Accept NumPy array in createDataFrame > - > > Key: SPARK-39406 > URL: https://issues.apache.org/jira/browse/SPARK-39406 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Accept numpy array in createDataFrame, with existing dtypes support.
[jira] [Created] (SPARK-39406) Accept numpy array in createDataFrame
Xinrong Meng created SPARK-39406: Summary: Accept numpy array in createDataFrame Key: SPARK-39406 URL: https://issues.apache.org/jira/browse/SPARK-39406 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Accept numpy array in createDataFrame, with existing dtypes support.
[jira] [Created] (SPARK-39405) NumPy support in SQL
Xinrong Meng created SPARK-39405: Summary: NumPy support in SQL Key: SPARK-39405 URL: https://issues.apache.org/jira/browse/SPARK-39405 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng NumPy is the fundamental package for scientific computing with Python. It is very commonly used, especially in the data science world. For example, Pandas is backed by NumPy, and tensor libraries also support interchangeable conversion to/from NumPy arrays. However, PySpark only supports Python built-in types, with the exception of SparkSession.createDataFrame(pandas.DataFrame) and DataFrame.toPandas. This issue has been raised multiple times internally and externally; see also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. With NumPy support in SQL, we expect broader adoption by data scientists and newcomers, who can leverage their existing NumPy background and codebases. See more at [].
[jira] [Updated] (SPARK-39262) Correct the behavior of creating DataFrame from an RDD
[ https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39262: - Description: Correct the behavior of creating DataFrame from an RDD **with `0` or an empty list as the first element**. Before: ```py >>> spark.createDataFrame(spark._sc.parallelize([0, 1])) Traceback (most recent call last): ... ValueError: The first row in RDD is empty, can not infer schema >>> spark.createDataFrame(spark._sc.parallelize([[], []])) Traceback (most recent call last): ... ValueError: The first row in RDD is empty, can not infer schema ``` After: ```py >>> spark.createDataFrame(spark._sc.parallelize([0, 1])) Traceback (most recent call last): ... TypeError: Can not infer schema for type: <class 'int'> >>> spark.createDataFrame(spark._sc.parallelize([[], []])) DataFrame[] >>> spark.createDataFrame(spark._sc.parallelize([[], []])).show() ++ || ++ || || ++ ``` was: Correct error messages when creating DataFrame from an RDD with the first element `0`. Previously, we raised a ValueError "The first row in RDD is empty, can not infer schema" in such cases. However, a TypeError "Can not infer schema for type: <class 'int'>" should be raised instead. > Correct the behavior of creating DataFrame from an RDD > -- > > Key: SPARK-39262 > URL: https://issues.apache.org/jira/browse/SPARK-39262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Correct the behavior of creating DataFrame from an RDD **with `0` or an empty > list as the first element**. > > Before: > ```py > >>> spark.createDataFrame(spark._sc.parallelize([0, 1])) > Traceback (most recent call last): > ... > ValueError: The first row in RDD is empty, can not infer schema > >>> spark.createDataFrame(spark._sc.parallelize([[], []])) > Traceback (most recent call last): > ...
> ValueError: The first row in RDD is empty, can not infer schema > ``` > After: > ```py > >>> spark.createDataFrame(spark._sc.parallelize([0, 1])) > Traceback (most recent call last): > > ... > TypeError: Can not infer schema for type: <class 'int'> > >>> spark.createDataFrame(spark._sc.parallelize([[], []])) > DataFrame[] > > >>> spark.createDataFrame(spark._sc.parallelize([[], []])).show() > ++ > || > ++ > || > || > ++ > ```
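The corrected branching on the RDD's first element can be sketched as follows; the function name and return values are illustrative, not PySpark's actual internals:

```python
def classify_first_row(row):
    # Only genuinely empty rows (e.g. [] or ()) take the "empty schema" path;
    # falsy scalars like 0 must instead fall through to type inference, which
    # raises TypeError for non-row types.
    if isinstance(row, (list, tuple, dict)):
        return "empty schema" if len(row) == 0 else "infer fields"
    raise TypeError("Can not infer schema for type: %s" % type(row))

print(classify_first_row([]))    # empty schema
print(classify_first_row((1,)))  # infer fields
```

The key point is that the emptiness check applies only to row-like values, so `0` no longer triggers the misleading "first row is empty" ValueError.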
[jira] [Updated] (SPARK-39262) Correct the behavior of creating DataFrame from an RDD
[ https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39262: - Summary: Correct the behavior of creating DataFrame from an RDD (was: Correct error messages when creating DataFrame from an RDD with the first element `0`) > Correct the behavior of creating DataFrame from an RDD > -- > > Key: SPARK-39262 > URL: https://issues.apache.org/jira/browse/SPARK-39262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Correct error messages when creating DataFrame from an RDD with the first > element `0`. > Previously, we raised a ValueError "The first row in RDD is empty, can not > infer schema" in such cases. > However, a TypeError "Can not infer schema for type: <class 'int'>" should be > raised instead.
[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39048: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Refactor `GroupBy._reduce_for_stat_function` on accepted data types > > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > `GroupBy._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines the parameters `only_numeric` and `bool_as_numeric` to control the > accepted Spark types. > To be consistent with the pandas API, we may also have to introduce > `str_as_numeric`, for `sum` for example. > Instead of introducing a parameter designated for each Spark type, this PR > proposes to introduce a parameter `accepted_spark_types` to specify the accepted > types of Spark columns to be aggregated.
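The proposed parameter can be illustrated with a toy filter. The column representation here (a name-to-type-name dict) is an assumption for illustration; the real helper operates on Spark columns and `pyspark.sql.types` instances:

```python
def filter_columns(columns, accepted_spark_types=None):
    # `columns` maps column name -> Spark type name. Instead of one boolean
    # flag per type (only_numeric, bool_as_numeric, str_as_numeric, ...), a
    # single tuple of accepted types selects which columns get aggregated;
    # None means accept everything.
    if accepted_spark_types is None:
        return list(columns)
    return [name for name, t in columns.items() if t in accepted_spark_types]

cols = {"a": "LongType", "b": "StringType", "c": "BooleanType"}
print(filter_columns(cols, ("LongType", "BooleanType")))  # ['a', 'c']
print(filter_columns(cols))                               # ['a', 'b', 'c']
```

Adding support for a new type then means extending the tuple at the call site, not adding another keyword parameter.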
[jira] [Updated] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
[ https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38880: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Implement `numeric_only` parameter of `GroupBy.max/min` > --- > > Key: SPARK-38880 > URL: https://issues.apache.org/jira/browse/SPARK-38880 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
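The pandas behavior this sub-task brings to pandas-on-Spark can be sketched in plain pandas: `numeric_only=True` drops non-numeric columns before the aggregation runs.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "x": [1, 2, 3], "s": ["p", "q", "r"]})
# Only the numeric column "x" survives the reduction.
out = df.groupby("k").max(numeric_only=True)
```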
[jira] [Updated] (SPARK-39000) Convert bools to ints in basic statistical functions of GroupBy objects
[ https://issues.apache.org/jira/browse/SPARK-39000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39000: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Convert bools to ints in basic statistical functions of GroupBy objects > --- > > Key: SPARK-39000 > URL: https://issues.apache.org/jira/browse/SPARK-39000 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Convert bools to ints in basic statistical functions of GroupBy objects -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
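The target semantics, shown in plain pandas: booleans count as numeric in GroupBy reductions, so summing a bool column yields integer counts rather than an error.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "flag": [True, True, False]})
# True is treated as 1 and False as 0 in the reduction.
summed = df.groupby("k")["flag"].sum()
```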
[jira] [Updated] (SPARK-39227) Reach parity with pandas boolean cast
[ https://issues.apache.org/jira/browse/SPARK-39227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39227: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Reach parity with pandas boolean cast > - > > Key: SPARK-39227 > URL: https://issues.apache.org/jira/browse/SPARK-39227 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > There are pandas APIs that need boolean casts: all, any. > Currently, pandas-on-Spark has different behaviors on special inputs against > these APIs, for example, empty string, list, etc, as mentioned > https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by > [~zero323]. > We shall match pandas behavior on boolean cast. > Meanwhile, Series/Frame that contains empty strings, lists should be > considered as test input to increase test coverage. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
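The pandas truthiness rules the issue refers to, sketched in plain pandas: an empty string is falsy, so it changes the result of `all()` but not `any()`.

```python
import pandas as pd

s = pd.Series(["", "x"], dtype=object)
# bool("") is False, so all() is falsy while any() is truthy.
all_result = bool(s.all())
any_result = bool(s.any())
```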
[jira] [Updated] (SPARK-38952) Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
[ https://issues.apache.org/jira/browse/SPARK-38952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38952: - Parent: SPARK-39076 Issue Type: Sub-task (was: Improvement) > Implement `numeric_only` of `GroupBy.first` and `GroupBy.last` > -- > > Key: SPARK-38952 > URL: https://issues.apache.org/jira/browse/SPARK-38952 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `numeric_only` of `GroupBy.first` and `GroupBy.last` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.
[ https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38763: - Parent: SPARK-39199 Issue Type: Sub-task (was: Bug) > Pandas API on spark Can`t apply lamda to columns. > --- > > Key: SPARK-38763 > URL: https://issues.apache.org/jira/browse/SPARK-38763 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > When I use a spark master build from 08 November 21 I can use this code to > rename columns > {code:java} > pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x)) > pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x)) > pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x)) > {code} > But now after I get this error when I use this code. > --- > ValueErrorTraceback (most recent call last) > Input In [5], in () > > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', > x)) > 2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x)) > 3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x)) > File /opt/spark/python/pyspark/pandas/frame.py:10636, in > DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors) > 10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = > gen_mapper_fn( > 10633 index > 10634 ) > 10635 if columns: > > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns) > 10638 if not index and not columns: > 10639 raise ValueError("Either `index` or `columns` should be > provided.") > File /opt/spark/python/pyspark/pandas/frame.py:10603, in > DataFrame.rename..gen_mapper_fn(mapper) > 10601 elif callable(mapper): > 10602 mapper_callable = cast(Callable, mapper) > > 10603 return_type = cast(ScalarType, infer_return_type(mapper)) > 10604 dtype = return_type.dtype > 10605 spark_return_type = return_type.spark_type > File 
/opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in > infer_return_type(f) > 560 tpe = get_type_hints(f).get("return", None) > 562 if tpe is None: > --> 563 raise ValueError("A return value is required for the input > function") > 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, > SeriesType): > 566 tpe = tpe.__args__[0] > ValueError: A return value is required for the input function -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
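The traceback above pinpoints the check: `infer_return_type` looks for a `return` type hint, and a bare lambda has none. A simplified sketch of that check, together with the workaround of passing an annotated function instead of a lambda (hypothetical helper `strip_prefix`; only the `get_type_hints` logic mirrors the quoted code):

```python
from typing import get_type_hints

def infer_return_type(f):
    # Simplified sketch of the check quoted from
    # pyspark/pandas/typedef/typehints.py in the traceback above.
    tpe = get_type_hints(f).get("return", None)
    if tpe is None:
        raise ValueError("A return value is required for the input function")
    return tpe

# Workaround: an annotated function carries the required return hint.
def strip_prefix(x: str) -> str:
    return x.replace("FORM_SECTION:", "")
```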
[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`
[ https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38766: - Parent: (was: SPARK-39199) Issue Type: Improvement (was: Sub-task) > Support lambda `column` parameter of `DataFrame.rename` > --- > > Key: SPARK-38766 > URL: https://issues.apache.org/jira/browse/SPARK-38766 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Support lambda `column` parameter of `DataFrame.rename`. > The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`
[ https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38387: - Parent: SPARK-39199 Issue Type: Sub-task (was: New Feature) > Support `na_action` and Series input correspondence in `Series.map` > --- > > Key: SPARK-38387 > URL: https://issues.apache.org/jira/browse/SPARK-38387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Support `na_action` and Series input correspondence in `Series.map`, in order > to reach parity to pandas API. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
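Both features named in the summary, shown in plain pandas: `na_action="ignore"` passes missing values through untouched, and a Series mapper looks values up by its index.

```python
import pandas as pd

s = pd.Series([1.0, float("nan"), 3.0])
# NaN is propagated without calling the function.
doubled = s.map(lambda v: v * 2, na_action="ignore")

# Series input: each value of `s` is matched against the mapper's index.
mapper = pd.Series(["one", "three"], index=[1.0, 3.0])
named = s.map(mapper)
```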
[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`
[ https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38766: - Parent: SPARK-39199 Issue Type: Sub-task (was: Bug) > Support lambda `column` parameter of `DataFrame.rename` > --- > > Key: SPARK-38766 > URL: https://issues.apache.org/jira/browse/SPARK-38766 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Support lambda `column` parameter of `DataFrame.rename`. > The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38400) Enable Series.rename to change index labels
[ https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38400: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Enable Series.rename to change index labels > --- > > Key: SPARK-38400 > URL: https://issues.apache.org/jira/browse/SPARK-38400 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Enable Series.rename to change index labels, with function `index` input. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38491) Support `ignore_index` of `Series.sort_values`
[ https://issues.apache.org/jira/browse/SPARK-38491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38491: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support `ignore_index` of `Series.sort_values` > -- > > Key: SPARK-38491 > URL: https://issues.apache.org/jira/browse/SPARK-38491 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Support `ignore_index` of `Series.sort_values` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
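The target pandas semantics: with `ignore_index=True`, the sorted result gets a fresh 0..n-1 index instead of carrying the original labels along.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])
# Original labels are discarded; the result is reindexed 0, 1, 2.
out = s.sort_values(ignore_index=True)
```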
[jira] [Updated] (SPARK-38518) Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
[ https://issues.apache.org/jira/browse/SPARK-38518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38518: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `skipna` of `Series.all/Index.all` to exclude NA/null values > -- > > Key: SPARK-38518 > URL: https://issues.apache.org/jira/browse/SPARK-38518 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Implement `skipna` of `Series.all/Index.all` to exclude NA/null values. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38441: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support string and bool `regex` in `Series.replace` > --- > > Key: SPARK-38441 > URL: https://issues.apache.org/jira/browse/SPARK-38441 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support string and bool `regex` in `Series.replace` in order to reach parity > with pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
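The string form of `regex` in plain pandas, reusing a prefix from the rename example earlier in this digest: with `regex=True`, the matched part of each value is substituted.

```python
import pandas as pd

s = pd.Series(["DOFFIN_ESENDERS:name", "plain"])
# The pattern is applied as a regex; only matching substrings are replaced.
out = s.replace("DOFFIN_ESENDERS:", "", regex=True)
```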
[jira] [Updated] (SPARK-38479) Add `Series.duplicated` to indicate duplicate Series values.
[ https://issues.apache.org/jira/browse/SPARK-38479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38479: - Parent: SPARK-39199 Issue Type: Sub-task (was: New Feature) > Add `Series.duplicated` to indicate duplicate Series values. > > > Key: SPARK-38479 > URL: https://issues.apache.org/jira/browse/SPARK-38479 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Add `Series.duplicated` to indicate duplicate Series values. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
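The pandas semantics being added: `duplicated` flags repeats, and `keep` decides which occurrence is considered the original.

```python
import pandas as pd

s = pd.Series([1, 2, 1, 3])
first = s.duplicated()            # default keep="first": later repeats flagged
last = s.duplicated(keep="last")  # earlier repeats flagged instead
```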
[jira] [Updated] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38576: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any`
[ https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38608: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any` > - > > Key: SPARK-38608 > URL: https://issues.apache.org/jira/browse/SPARK-38608 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any` to > include only boolean columns. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38552) Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties
[ https://issues.apache.org/jira/browse/SPARK-38552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38552: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to > resolve ties > -- > > Key: SPARK-38552 > URL: https://issues.apache.org/jira/browse/SPARK-38552 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to > resolve ties -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
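How `keep` resolves ties in plain pandas: the default breaks ties by position and returns exactly n rows, while `keep="all"` retains every row tied on the boundary, possibly returning more than n.

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 2, 2, 1]})
strict = df.nlargest(2, "x")                 # exactly 2 rows, ties cut off
with_ties = df.nlargest(2, "x", keep="all")  # boundary tie on 2 kept: 3 rows
```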
[jira] [Updated] (SPARK-38686) Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-38686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38686: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates` > -- > > Key: SPARK-38686 > URL: https://issues.apache.org/jira/browse/SPARK-38686 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38704) Support string `inclusive` parameter of `Series.between`
[ https://issues.apache.org/jira/browse/SPARK-38704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38704: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support string `inclusive` parameter of `Series.between` > > > Key: SPARK-38704 > URL: https://issues.apache.org/jira/browse/SPARK-38704 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support string `inclusive` parameter of `Series.between` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
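The string form of `inclusive` in plain pandas (available since pandas 1.3): it selects which boundary is closed.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# "left" means 2 <= x < 3, so only the value 2 qualifies.
mask = s.between(2, 3, inclusive="left")
```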
[jira] [Updated] (SPARK-38726) Support `how` parameter of `MultiIndex.dropna`
[ https://issues.apache.org/jira/browse/SPARK-38726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38726: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support `how` parameter of `MultiIndex.dropna` > -- > > Key: SPARK-38726 > URL: https://issues.apache.org/jira/browse/SPARK-38726 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support `how` parameter of `MultiIndex.dropna` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38765) Implement `inplace` parameter of `Series.clip`
[ https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38765: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `inplace` parameter of `Series.clip` > -- > > Key: SPARK-38765 > URL: https://issues.apache.org/jira/browse/SPARK-38765 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `inplace` parameter of `Series.clip` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38837) Implement `dropna` parameter of `SeriesGroupBy.value_counts`
[ https://issues.apache.org/jira/browse/SPARK-38837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38837: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `dropna` parameter of `SeriesGroupBy.value_counts` > > > Key: SPARK-38837 > URL: https://issues.apache.org/jira/browse/SPARK-38837 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > Implement `dropna` parameter of `SeriesGroupBy.value_counts` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`
[ https://issues.apache.org/jira/browse/SPARK-38863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38863: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `skipna` parameter of `DataFrame.all` > --- > > Key: SPARK-38863 > URL: https://issues.apache.org/jira/browse/SPARK-38863 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `skipna` parameter of `DataFrame.all`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
[ https://issues.apache.org/jira/browse/SPARK-38793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38793: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Support `return_indexer` parameter of `Index/MultiIndex.sort_values` > > > Key: SPARK-38793 > URL: https://issues.apache.org/jira/browse/SPARK-38793 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support `return_indexer` parameter of `Index/MultiIndex.sort_values` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
[ https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38903: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `Series.sort_values` and `Series.sort_index` > > > Key: SPARK-38903 > URL: https://issues.apache.org/jira/browse/SPARK-38903 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `Series.sort_values` and `Series.sort_index` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.
[ https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38890: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `DataFrame.sort_index`. > --- > > Key: SPARK-38890 > URL: https://issues.apache.org/jira/browse/SPARK-38890 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame.sort_index`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`
[ https://issues.apache.org/jira/browse/SPARK-38938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38938: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `inplace` and `columns` parameters of `Series.drop` > - > > Key: SPARK-38938 > URL: https://issues.apache.org/jira/browse/SPARK-38938 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `inplace` and `columns` parameters of `Series.drop` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38989: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `DataFrame/Series.sample` > - > > Key: SPARK-38989 > URL: https://issues.apache.org/jira/browse/SPARK-38989 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame/Series.sample` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39201: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` > --- > > Key: SPARK-39201 > URL: https://issues.apache.org/jira/browse/SPARK-39201 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
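The `explode` half of the target semantics in plain pandas: without `ignore_index`, exploded rows repeat the source label; with it, the result is reindexed from zero.

```python
import pandas as pd

df = pd.DataFrame({"x": [[1, 2], [3]]}, index=[10, 20])
# Without ignore_index the result index would be [10, 10, 20].
out = df.explode("x", ignore_index=True)
```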
[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
[ https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39201: - Issue Type: Improvement (was: Umbrella) > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` > --- > > Key: SPARK-39201 > URL: https://issues.apache.org/jira/browse/SPARK-39201 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `ignore_index` of `DataFrame.explode` and > `DataFrame.drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39262) Correct error messages when creating DataFrame from an RDD with the first element `0`
[ https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39262: - Description: Correct error messages when creating DataFrame from an RDD with the first element `0`. Previously, we raise a ValueError "The first row in RDD is empty, can not infer schema" in such case. However, a TypeError "Can not infer schema for type: " should be raised instead. was: Correct error messages when creating DataFrame from an RDD with the first row is `0`. Previously, we raise a ValueError "The first row in RDD is empty, can not infer schema" in such case. However, a TypeError "Can not infer schema for type: " should be raised instead. > Correct error messages when creating DataFrame from an RDD with the first > element `0` > - > > Key: SPARK-39262 > URL: https://issues.apache.org/jira/browse/SPARK-39262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Correct error messages when creating DataFrame from an RDD with the first > element `0`. > Previously, we raise a ValueError "The first row in RDD is empty, can not > infer schema" in such case. > However, a TypeError "Can not infer schema for type: " should be > raised instead. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39262) Correct error messages when creating DataFrame from an RDD with the first element `0`
[ https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39262: - Summary: Correct error messages when creating DataFrame from an RDD with the first element `0` (was: Correct error messages when creating DataFrame from an RDD with the first row is `0`) > Correct error messages when creating DataFrame from an RDD with the first > element `0` > - > > Key: SPARK-39262 > URL: https://issues.apache.org/jira/browse/SPARK-39262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Correct error messages when creating DataFrame from an RDD with the first row > is `0`. > > Previously, we raise a ValueError "The first row in RDD is empty, can not > infer schema" in such case. > However, a TypeError "Can not infer schema for type: " should be > raised instead. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39262) Correct error messages when creating DataFrame from an RDD with the first row is `0`
Xinrong Meng created SPARK-39262: Summary: Correct error messages when creating DataFrame from an RDD with the first row is `0` Key: SPARK-39262 URL: https://issues.apache.org/jira/browse/SPARK-39262 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Correct error messages when creating DataFrame from an RDD with the first row is `0`. Previously, we raise a ValueError "The first row in RDD is empty, can not infer schema" in such case. However, a TypeError "Can not infer schema for type: " should be raised instead. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters
[ https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39199: - Description: pandas API on Spark aims to make pandas code work on Spark clusters without any changes. So full API coverage has been one of our major goals. Currently, most pandas functions are implemented, whereas some of them have incomplete parameter support. There are some common parameters missing (resolved): * How to do with NAs * Filter data types * Control result length * Reindex result There are remaining missing parameters to implement (see doc below). See the design and the current status at [https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing]. was: pandas API on Spark aims to achieve full pandas API coverage. Currently, most pandas functions are supported in pandas API on Spark with parameters missing. There are some common parameters missing: - how to do with NAs: `skipna`, `dropna` - filter data types: `numeric_only`, `bool_only` - filter result length: `keep` - reindex result: `ignore_index` They support common use cases and should be prioritized. > Implement pandas API missing parameters > --- > > Key: SPARK-39199 > URL: https://issues.apache.org/jira/browse/SPARK-39199 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark, PySpark >Affects Versions: 3.3.0, 3.4.0, 3.3.1 >Reporter: Xinrong Meng >Priority: Major > > pandas API on Spark aims to make pandas code work on Spark clusters without > any changes. So full API coverage has been one of our major goals. Currently, > most pandas functions are implemented, whereas some of them have > incomplete parameter support. > There are some common parameters missing (resolved): > * How to do with NAs > * Filter data types > * Control result length > * Reindex result > There are remaining missing parameters to implement (see doc below). 
> See the design and the current status at > [https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
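The parameter families listed above can be illustrated with plain pandas, whose semantics pandas API on Spark aims to match; a minimal sketch (the data frame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, None, 1.0], "name": ["a", "b", "a"]})

# How to handle NAs: `skipna` decides whether NAs are excluded.
total = df["x"].sum()                 # NA skipped
total_na = df["x"].sum(skipna=False)  # NA propagates as NaN

# Filter data types: `numeric_only` restricts the reduction to numeric columns.
sums = df.sum(numeric_only=True)

# Reindex result: `ignore_index` relabels the result 0..n-1.
deduped = df.drop_duplicates(ignore_index=True)
```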
[jira] [Updated] (SPARK-37525) Timedelta support in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37525: - Summary: Timedelta support in pandas API on Spark (was: Support TimedeltaIndex in pandas API on Spark) > Timedelta support in pandas API on Spark > > > Key: SPARK-37525 > URL: https://issues.apache.org/jira/browse/SPARK-37525 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > Labels: release-notes > > Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex > support in pandas API on Spark accordingly. > We shall approach it in steps below: > - introduce > - properties > - functions and basic operations > - creation (from Series/Index, generic methods) > - type conversion (astype) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37525) Support TimedeltaIndex in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-37525. -- Resolution: Resolved > Support TimedeltaIndex in pandas API on Spark > - > > Key: SPARK-37525 > URL: https://issues.apache.org/jira/browse/SPARK-37525 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > Labels: release-notes > > Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex > support in pandas API on Spark accordingly. > We shall approach it in steps below: > - introduce > - properties > - functions and basic operations > - creation (from Series/Index, generic methods) > - type conversion (astype) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
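The steps listed in the issue (creation, properties, basic operations, type conversion) map onto pandas Timedelta behavior, which the pandas-on-Spark work targets; a small pandas sketch with made-up values:

```python
import pandas as pd

# Creation (from strings / Series)
tdi = pd.TimedeltaIndex(["1 days", "2 days 12:00:00"])
s = pd.Series(pd.to_timedelta(["1 days", "2 days"]))

# Properties: day components of each timedelta
day_components = list(tdi.days)

# Basic operations: arithmetic on a timedelta Series
doubled = s * 2

# Type conversion: total seconds per element
secs = s.dt.total_seconds()
```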
[jira] [Created] (SPARK-39228) Implement `skipna` of `Series.argmax`
Xinrong Meng created SPARK-39228: Summary: Implement `skipna` of `Series.argmax` Key: SPARK-39228 URL: https://issues.apache.org/jira/browse/SPARK-39228 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `skipna` of `Series.argmax` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
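For reference, the pandas behavior being targeted (example data made up): with the default `skipna=True`, `Series.argmax` ignores missing values and returns the position of the largest valid value.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, 2.0])

# Default skipna=True: the NA at position 1 is ignored,
# so argmax returns the position of 3.0.
pos = s.argmax()
```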
[jira] [Updated] (SPARK-39227) Reach parity with pandas boolean cast
[ https://issues.apache.org/jira/browse/SPARK-39227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39227: - Description: There are pandas APIs that need boolean casts: all, any. Currently, pandas-on-Spark behaves differently from pandas on special inputs to these APIs, for example, empty strings and lists, as mentioned in https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by [~zero323]. We shall match pandas behavior on boolean casts. Meanwhile, Series/Frames that contain empty strings and lists should be included as test inputs to increase test coverage. was: There are pandas APIs that need boolean casts: all, any. Currently, pandas-on-Spark behaves differently from pandas on special inputs to these APIs, for example, empty strings and lists, as mentioned in https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by [~zero323]. We shall match pandas behavior on boolean casts. > Reach parity with pandas boolean cast > - > > Key: SPARK-39227 > URL: https://issues.apache.org/jira/browse/SPARK-39227 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > There are pandas APIs that need boolean casts: all, any. > Currently, pandas-on-Spark behaves differently from pandas on special inputs > to these APIs, for example, empty strings and lists, as mentioned in > https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by > [~zero323]. > We shall match pandas behavior on boolean casts. > Meanwhile, Series/Frames that contain empty strings and lists should be > included as test inputs to increase test coverage. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39227) Reach parity with pandas boolean cast
Xinrong Meng created SPARK-39227: Summary: Reach parity with pandas boolean cast Key: SPARK-39227 URL: https://issues.apache.org/jira/browse/SPARK-39227 Project: Spark Issue Type: Improvement Components: Pandas API on Spark, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng There are pandas APIs that need boolean casts: all, any. Currently, pandas-on-Spark behaves differently from pandas on special inputs to these APIs, for example, empty strings and lists, as mentioned in https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by [~zero323]. We shall match pandas behavior on boolean casts. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
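The pandas reference behavior on the special inputs mentioned (empty strings) follows Python truthiness; a small sketch with made-up data:

```python
import pandas as pd

# Python truthiness that pandas applies when casting to boolean:
# empty strings (and empty lists) are falsy.
s = pd.Series(["", "x"], dtype=object)
any_result = bool(s.any())   # True: "x" is truthy
all_result = bool(s.all())   # False: "" is falsy
```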
[jira] [Created] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
Xinrong Meng created SPARK-39201: Summary: Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates` Key: SPARK-39201 URL: https://issues.apache.org/jira/browse/SPARK-39201 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
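The target pandas semantics of `ignore_index` for these two methods, sketched with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2], "v": [[10, 20], [10, 20], [30]]})

# Without ignore_index, explode repeats the original row labels;
# with ignore_index=True the result is relabeled 0..n-1.
exploded = df.explode("v", ignore_index=True)

deduped = df[["k"]].drop_duplicates(ignore_index=True)
```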
[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters
[ https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39199: - Description: pandas API on Spark aims to achieve full pandas API coverage. Currently, most pandas functions are supported in pandas API on Spark with parameters missing. There are some common parameters missing: - how to do with NAs: `skipna`, `dropna` - filter data types: `numeric_only`, `bool_only` - filter result length: `keep` - reindex result: `ignore_index` They support common use cases and should be prioritized. was: pandas API on Spark aims to achieve full pandas API coverage. Currently, most pandas functions are supported in pandas API on Spark with parameters missing. There are some common parameters missing: > Implement pandas API missing parameters > --- > > Key: SPARK-39199 > URL: https://issues.apache.org/jira/browse/SPARK-39199 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0, 3.4.0, 3.3.1 >Reporter: Xinrong Meng >Priority: Major > > pandas API on Spark aims to achieve full pandas API coverage. Currently, most > pandas functions are supported in pandas API on Spark with parameters missing. > There are some common parameters missing: > - how to do with NAs: `skipna`, `dropna` > - filter data types: `numeric_only`, `bool_only` > - filter result length: `keep` > - reindex result: `ignore_index` > They support common use cases and should be prioritized. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters
[ https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39199: - Description: pandas API on Spark aims to achieve full pandas API coverage. Currently, most pandas functions are supported in pandas API on Spark with parameters missing. There are some common parameters missing: was:Implement pandas API missing parameters > Implement pandas API missing parameters > --- > > Key: SPARK-39199 > URL: https://issues.apache.org/jira/browse/SPARK-39199 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0, 3.4.0, 3.3.1 >Reporter: Xinrong Meng >Priority: Major > > pandas API on Spark aims to achieve full pandas API coverage. Currently, most > pandas functions are supported in pandas API on Spark with parameters missing. > There are some common parameters missing: -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`
[ https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38608: - Summary: Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` (was: [SPARK-38608][PYTHON] Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`) > Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` > - > > Key: SPARK-38608 > URL: https://issues.apache.org/jira/browse/SPARK-38608 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` to > include only boolean columns. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
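In pandas, whose behavior this parameter mirrors, `bool_only=True` restricts the reduction to boolean columns; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"flag": [True, False], "n": [1, 2]})

# bool_only=True restricts the reduction to boolean columns,
# so the integer column "n" is excluded from the result.
res_all = df.all(bool_only=True)
res_any = df.any(bool_only=True)
```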
[jira] [Created] (SPARK-39199) Implement pandas API missing parameters
Xinrong Meng created SPARK-39199: Summary: Implement pandas API missing parameters Key: SPARK-39199 URL: https://issues.apache.org/jira/browse/SPARK-39199 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.3.0, 3.4.0, 3.3.1 Reporter: Xinrong Meng Implement pandas API missing parameters -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37525) Support TimedeltaIndex in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37525: - Description: Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex support in pandas API on Spark accordingly. We shall approach it in steps below: - introduce - properties - functions and basic operations - creation (from Series/Index, generic methods) - type conversion (astype) was:Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex support in pandas API on Spark accordingly. > Support TimedeltaIndex in pandas API on Spark > - > > Key: SPARK-37525 > URL: https://issues.apache.org/jira/browse/SPARK-37525 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > Labels: release-notes > > Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex > support in pandas API on Spark accordingly. > We shall approach it in steps below: > - introduce > - properties > - functions and basic operations > - creation (from Series/Index, generic methods) > - type conversion (astype) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39197) Implement `skipna` parameter of `GroupBy.all`
Xinrong Meng created SPARK-39197: Summary: Implement `skipna` parameter of `GroupBy.all` Key: SPARK-39197 URL: https://issues.apache.org/jira/browse/SPARK-39197 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `skipna` parameter of `GroupBy.all` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39155) Access to JVM through passed-in GatewayClient during type conversion
[ https://issues.apache.org/jira/browse/SPARK-39155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39155: - Description: Access the JVM through the passed-in GatewayClient during type conversion. In customized type converters, we may use the `jvm` field of the passed-in GatewayClient directly to access the JVM, rather than relying on `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access the JVM. was:Access to JVM through passed-in GatewayClient during type conversion > Access to JVM through passed-in GatewayClient during type conversion > > > Key: SPARK-39155 > URL: https://issues.apache.org/jira/browse/SPARK-39155 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Access the JVM through the passed-in GatewayClient during type conversion. > In customized type converters, we may use the `jvm` field of the passed-in > GatewayClient directly to access the JVM, rather than relying on > `SparkContext._jvm`. > That's > [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) > Py4J explicit converters access the JVM. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
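The converter pattern described above can be sketched without a running JVM; everything below is a hypothetical stand-in (the `Fake*` classes and `new_array_list` are not real Py4J APIs), showing only the design point that the JVM is reached through the `gateway_client` argument rather than a global such as `SparkContext._jvm`:

```python
class FakeJvmView:
    """Stand-in for the JVM view a Py4J gateway client exposes."""
    def new_array_list(self, items):
        # A real converter would instantiate java.util.ArrayList here.
        return list(items)

class FakeGatewayClient:
    """Stand-in for a Py4J GatewayClient carrying a `jvm` field."""
    def __init__(self):
        self.jvm = FakeJvmView()

class ListConverter:
    def can_convert(self, obj):
        return isinstance(obj, list)

    def convert(self, obj, gateway_client):
        # Self-contained: only the passed-in client is used to reach the JVM.
        return gateway_client.jvm.new_array_list(obj)

converted = ListConverter().convert([1, 2, 3], FakeGatewayClient())
```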
[jira] [Created] (SPARK-39155) Access to JVM through passed-in GatewayClient during type conversion
Xinrong Meng created SPARK-39155: Summary: Access to JVM through passed-in GatewayClient during type conversion Key: SPARK-39155 URL: https://issues.apache.org/jira/browse/SPARK-39155 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Access to JVM through passed-in GatewayClient during type conversion -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39154) Remove outdated statements on distributed-sequence default index
Xinrong Meng created SPARK-39154: Summary: Remove outdated statements on distributed-sequence default index Key: SPARK-39154 URL: https://issues.apache.org/jira/browse/SPARK-39154 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Remove outdated statements on distributed-sequence default index -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38819) Run Pandas on Spark with Pandas 1.4.x
[ https://issues.apache.org/jira/browse/SPARK-38819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534428#comment-17534428 ] Xinrong Meng commented on SPARK-38819: -- Thanks Yikun! > Run Pandas on Spark with Pandas 1.4.x > - > > Key: SPARK-38819 > URL: https://issues.apache.org/jira/browse/SPARK-38819 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > This is an umbrella to track issues when pandas is upgraded to 1.4.x > > I disabled fail-fast in tests; 19 failed: > [https://github.com/Yikun/spark/pull/88/checks?check_run_id=5873627048] > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39133) Mention log level setting in PYSPARK_JVM_STACKTRACE_ENABLED
Xinrong Meng created SPARK-39133: Summary: Mention log level setting in PYSPARK_JVM_STACKTRACE_ENABLED Key: SPARK-39133 URL: https://issues.apache.org/jira/browse/SPARK-39133 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Mention log level setting in PYSPARK_JVM_STACKTRACE_ENABLED -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39109) Adjust `GroupBy.mean/median` to match pandas 1.4
Xinrong Meng created SPARK-39109: Summary: Adjust `GroupBy.mean/median` to match pandas 1.4 Key: SPARK-39109 URL: https://issues.apache.org/jira/browse/SPARK-39109 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Adjust `GroupBy.mean/median` to match pandas 1.4 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39095) Adjust `GroupBy.std` to match pandas 1.4
Xinrong Meng created SPARK-39095: Summary: Adjust `GroupBy.std` to match pandas 1.4 Key: SPARK-39095 URL: https://issues.apache.org/jira/browse/SPARK-39095 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-39076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39076: - Description: Statistical functions are among the most commonly used functions in Data Engineering and Data Analysis. Spark and pandas provide statistical functions in the contexts of SQL and Data Science respectively. pandas API on Spark implements the pandas API on top of Apache Spark. Although there may be semantic differences for certain functions, for example median, due to the high cost of big data calculations, we should still try to reach parity at the API level. However, critical parameters of statistical functions, such as `skipna`, are missing from basic objects: DataFrame, Series, and Index. There is an even larger gap between the statistical functions of pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In addition, test coverage is far from perfect. With statistical functions standardized, pandas API coverage will increase since missing parameters will be implemented. That would further improve user adoption. See details at https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit?usp=sharing. was: Statistical functions are among the most commonly used functions in Data Engineering and Data Analysis. Spark and pandas provide statistical functions in the contexts of SQL and Data Science respectively. pandas API on Spark implements the pandas API on top of Apache Spark. Although there may be semantic differences for certain functions, for example median, due to the high cost of big data calculations, we should still try to reach parity at the API level. However, critical parameters of statistical functions, such as `skipna`, are missing from basic objects: DataFrame, Series, and Index. There is an even larger gap between the statistical functions of pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. 
In addition, test coverage is far from perfect. With statistical functions standardized, pandas API coverage will increase since missing parameters will be implemented. That would further improve user adoption. > Standardize Statistical Functions of pandas API on Spark > > > Key: SPARK-39076 > URL: https://issues.apache.org/jira/browse/SPARK-39076 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Statistical functions are among the most commonly used functions in Data > Engineering and Data Analysis. > Spark and pandas provide statistical functions in the contexts of SQL and > Data Science respectively. > pandas API on Spark implements the pandas API on top of Apache Spark. > Although there may be semantic differences for certain functions, for example > median, due to the high cost of big data calculations, we should still try > to reach parity at the API level. > However, critical parameters of statistical functions, such as `skipna`, are > missing from basic objects: DataFrame, Series, and Index. > There is an even larger gap between the statistical functions of > pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In > addition, test coverage is far from perfect. > With statistical functions standardized, pandas API coverage will increase > since missing parameters will be implemented. That would further improve > user adoption. > See details at > https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit?usp=sharing. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
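The `skipna` parameter called out above has a well-defined meaning in pandas, which the standardization targets; a minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

mean_skip = s.mean()              # NA excluded (pandas default)
mean_keep = s.mean(skipna=False)  # NA propagates as NaN
```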
[jira] [Created] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
Xinrong Meng created SPARK-39077: Summary: Implement `skipna` of basic statistical functions of DataFrame and Series Key: SPARK-39077 URL: https://issues.apache.org/jira/browse/SPARK-39077 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `skipna` of basic statistical functions of DataFrame and Series -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark
Xinrong Meng created SPARK-39076: Summary: Standardize Statistical Functions of pandas API on Spark Key: SPARK-39076 URL: https://issues.apache.org/jira/browse/SPARK-39076 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Statistical functions are among the most commonly used functions in Data Engineering and Data Analysis. Spark and pandas provide statistical functions in the contexts of SQL and Data Science respectively. pandas API on Spark implements the pandas API on top of Apache Spark. Although there may be semantic differences for certain functions, for example median, due to the high cost of big data calculations, we should still try to reach parity at the API level. However, critical parameters of statistical functions, such as `skipna`, are missing from basic objects: DataFrame, Series, and Index. There is an even larger gap between the statistical functions of pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In addition, test coverage is far from perfect. With statistical functions standardized, pandas API coverage will increase since missing parameters will be implemented. That would further improve user adoption. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
Xinrong Meng created SPARK-39051: Summary: Minor refactoring of `python/pyspark/sql/pandas/conversion.py` Key: SPARK-39051 URL: https://issues.apache.org/jira/browse/SPARK-39051 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39048: - Summary: Refactor `GroupBy._reduce_for_stat_function` on accepted data types (was: Refactor GroupBy._reduce_for_stat_function on accepted data types ) > Refactor `GroupBy._reduce_for_stat_function` on accepted data types > > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > `GroupBy._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines the parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with the pandas API, we may also have to introduce > `str_as_numeric` for `sum`, for example. > Instead of introducing a parameter designated for each Spark type, this PR > proposes introducing a parameter `accepted_spark_types` to specify the > accepted types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types
Xinrong Meng created SPARK-39048: Summary: Refactor GroupBy._reduce_for_stat_function on accepted data types Key: SPARK-39048 URL: https://issues.apache.org/jira/browse/SPARK-39048 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng `GroupBy._reduce_for_stat_function` is a common helper function leveraged by multiple statistical functions of GroupBy objects. It defines the parameters `only_numeric` and `bool_as_numeric` to control accepted Spark types. To be consistent with the pandas API, we may also have to introduce `str_as_numeric` for `sum`, for example. Instead of introducing a parameter designated for each Spark type, this PR proposes introducing a parameter `accepted_spark_types` to specify the accepted types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
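The design idea, one dtype-list parameter instead of a boolean flag per type, can be sketched with plain pandas; everything below (the function name, signature, and `accepted_dtypes` parameter) is a hypothetical stand-in for the actual Spark helper:

```python
import pandas as pd

def reduce_for_stat_function(df, func, accepted_dtypes=("int64", "float64")):
    # One parameter lists the accepted dtypes, replacing per-type flags
    # such as `only_numeric` / `bool_as_numeric`.
    cols = [c for c in df.columns if str(df[c].dtype) in accepted_dtypes]
    return df[cols].agg(func)

df = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5], "s": ["a", "b"]})
numeric_sums = reduce_for_stat_function(df, "sum")  # "s" is filtered out
```

Adding support for a new type then means extending the tuple at the call site, not adding another keyword argument.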
[jira] [Commented] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528459#comment-17528459 ] Xinrong Meng commented on SPARK-38988: -- Thank you for raising that! I will try muting the warnings for now. > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I add a file and a notebook with the info msg I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
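One way to mute these warnings on the user side, sketched here as a workaround rather than the eventual fix, is to filter the `PerformanceWarning` category that pandas uses for the fragmentation message:

```python
import warnings
from pandas.errors import PerformanceWarning

# Ignore the PerformanceWarning category; recording lets us verify
# that the warning is indeed suppressed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", PerformanceWarning)
    warnings.warn("DataFrame is highly fragmented.", PerformanceWarning)

suppressed = len(caught) == 0
```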
[jira] [Created] (SPARK-39000) Convert bools to ints in basic statistical functions of GroupBy objects
Xinrong Meng created SPARK-39000: Summary: Convert bools to ints in basic statistical functions of GroupBy objects Key: SPARK-39000 URL: https://issues.apache.org/jira/browse/SPARK-39000 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Convert bools to ints in basic statistical functions of GroupBy objects -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
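The pandas behavior being matched, with made-up data: booleans are treated as numeric in groupby reductions, so summing a boolean column yields integer counts of True values per group.

```python
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2], "b": [True, True, False]})

# Bools are cast to ints for the reduction: group 1 -> 2, group 2 -> 0.
out = df.groupby("k")["b"].sum()
```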
[jira] [Created] (SPARK-38991) Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`
Xinrong Meng created SPARK-38991: Summary: Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum` Key: SPARK-38991 URL: https://issues.apache.org/jira/browse/SPARK-38991 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38991) Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`
[ https://issues.apache.org/jira/browse/SPARK-38991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526121#comment-17526121 ] Xinrong Meng commented on SPARK-38991: -- I am working on that. > Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum` > > > Key: SPARK-38991 > URL: https://issues.apache.org/jira/browse/SPARK-38991 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
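In pandas, whose semantics these GroupBy parameters follow, `numeric_only=True` drops non-numeric columns before aggregating; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2], "n": [10, 20, 30], "s": ["a", "b", "c"]})

# The string column "s" is excluded from both aggregations.
means = df.groupby("k").mean(numeric_only=True)
sums = df.groupby("k").sum(numeric_only=True)
```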
[jira] [Created] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
Xinrong Meng created SPARK-38989: Summary: Implement `ignore_index` of `DataFrame/Series.sample` Key: SPARK-38989 URL: https://issues.apache.org/jira/browse/SPARK-38989 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `ignore_index` of `DataFrame/Series.sample` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
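The pandas behavior of `ignore_index` in `sample`, sketched with made-up data: the sampled rows are relabeled 0..n-1 instead of keeping their original labels.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]})

# Without ignore_index, the result keeps the sampled rows' original labels;
# with ignore_index=True it is relabeled from 0.
sampled = df.sample(n=2, random_state=0, ignore_index=True)
```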
[jira] [Created] (SPARK-38971) Test anchor frame for in-place `Series.rename_axis`
Xinrong Meng created SPARK-38971: Summary: Test anchor frame for in-place `Series.rename_axis` Key: SPARK-38971 URL: https://issues.apache.org/jira/browse/SPARK-38971 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Test anchor frame for in-place `Series.rename_axis` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38953) Document PySpark common exceptions / errors
[ https://issues.apache.org/jira/browse/SPARK-38953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38953: - Component/s: Documentation > Document PySpark common exceptions / errors > --- > > Key: SPARK-38953 > URL: https://issues.apache.org/jira/browse/SPARK-38953 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Document PySpark common exceptions / errors
[jira] [Created] (SPARK-38953) Document PySpark common exceptions / errors
Xinrong Meng created SPARK-38953: Summary: Document PySpark common exceptions / errors Key: SPARK-38953 URL: https://issues.apache.org/jira/browse/SPARK-38953 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Document PySpark common exceptions / errors
[jira] [Created] (SPARK-38952) Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
Xinrong Meng created SPARK-38952: Summary: Implement `numeric_only` of `GroupBy.first` and `GroupBy.last` Key: SPARK-38952 URL: https://issues.apache.org/jira/browse/SPARK-38952 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
[jira] [Created] (SPARK-38940) Test Series' anchor frame for in-place updates on Series
Xinrong Meng created SPARK-38940: Summary: Test Series' anchor frame for in-place updates on Series Key: SPARK-38940 URL: https://issues.apache.org/jira/browse/SPARK-38940 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Test Series' anchor frame for in-place updates on Series
[jira] [Created] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`
Xinrong Meng created SPARK-38938: Summary: Implement `inplace` and `columns` parameters of `Series.drop` Key: SPARK-38938 URL: https://issues.apache.org/jira/browse/SPARK-38938 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `inplace` and `columns` parameters of `Series.drop`
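The `inplace` semantics SPARK-38938 is targeting can be sketched with plain pandas, which the pandas API on Spark mirrors (the series and labels here are made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# With inplace=True, drop mutates the Series and returns None
# instead of producing a new object.
returned = s.drop("b", inplace=True)
```

After this, `s` holds only the labels "a" and "c", and `returned` is `None`.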
[jira] [Created] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
Xinrong Meng created SPARK-38903: Summary: Implement `ignore_index` of `Series.sort_values` and `Series.sort_index` Key: SPARK-38903 URL: https://issues.apache.org/jira/browse/SPARK-38903 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
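What `ignore_index` does, sketched with plain pandas (the behavior pandas-on-Spark is matching; the data is made up):

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["r", "s", "t"])

# Default: the original labels travel with the sorted values.
kept = s.sort_values()

# ignore_index=True: the result is relabeled 0..n-1 instead.
relabeled = s.sort_values(ignore_index=True)
```

`kept` has index `["s", "t", "r"]`, while `relabeled` has index `[0, 1, 2]` with the same sorted values.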
[jira] [Created] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.
Xinrong Meng created SPARK-38890: Summary: Implement `ignore_index` of `DataFrame.sort_index`. Key: SPARK-38890 URL: https://issues.apache.org/jira/browse/SPARK-38890 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `ignore_index` of `DataFrame.sort_index`.
[jira] [Created] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
Xinrong Meng created SPARK-38880: Summary: Implement `numeric_only` parameter of `GroupBy.max/min` Key: SPARK-38880 URL: https://issues.apache.org/jira/browse/SPARK-38880 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `numeric_only` parameter of `GroupBy.max/min`
[jira] [Created] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`
Xinrong Meng created SPARK-38863: Summary: Implement `skipna` parameter of `DataFrame.all` Key: SPARK-38863 URL: https://issues.apache.org/jira/browse/SPARK-38863 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `skipna` parameter of `DataFrame.all`.
[jira] [Created] (SPARK-38837) Implement `dropna` parameter of `SeriesGroupBy.value_counts`
Xinrong Meng created SPARK-38837: Summary: Implement `dropna` parameter of `SeriesGroupBy.value_counts` Key: SPARK-38837 URL: https://issues.apache.org/jira/browse/SPARK-38837 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `dropna` parameter of `SeriesGroupBy.value_counts`
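The `dropna` semantics being added mirror plain `Series.value_counts`; a minimal pandas sketch (made-up data, using the ungrouped variant for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 1.0, None])

# By default, missing values are dropped from the counts.
default = s.value_counts()

# dropna=False keeps NaN as its own bucket, which is the behavior
# SPARK-38837 adds to the GroupBy variant.
with_nan = s.value_counts(dropna=False)
```

`default` has one bucket (1.0 → 2); `with_nan` has two, with NaN counted once.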
[jira] [Created] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
Xinrong Meng created SPARK-38793: Summary: Support `return_indexer` parameter of `Index/MultiIndex.sort_values` Key: SPARK-38793 URL: https://issues.apache.org/jira/browse/SPARK-38793 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.
[ https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517719#comment-17517719 ] Xinrong Meng commented on SPARK-38763: -- [~bjornjorgensen] For sure :) The fix is in Spark 3.3 (latest released version). > Pandas API on Spark can't apply lambda to columns. > --- > > Key: SPARK-38763 > URL: https://issues.apache.org/jira/browse/SPARK-38763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > With a Spark master build from 08 November 2021, I could use this code to rename columns: > {code:java} > pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x)) > pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x)) > pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x)) > {code} > But now the same code raises this error: > {code:java} > ValueError Traceback (most recent call last) > Input In [5], in <cell line: 1>() > ----> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x)) > 2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x)) > 3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x)) > File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors) > 10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn( > 10633 index > 10634 ) > 10635 if columns: > ---> 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns) > 10638 if not index and not columns: > 10639 raise ValueError("Either `index` or `columns` should be provided.") > File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper) > 10601 elif callable(mapper): > 10602 mapper_callable = cast(Callable, mapper) > ---> 10603 return_type = cast(ScalarType, infer_return_type(mapper)) > 10604 dtype = return_type.dtype > 10605 spark_return_type = return_type.spark_type > File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f) > 560 tpe = get_type_hints(f).get("return", None) > 562 if tpe is None: > --> 563 raise ValueError("A return value is required for the input function") > 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType): > 566 tpe = tpe.__args__[0] > ValueError: A return value is required for the input function > {code}
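The workaround mentioned in this thread (a named function with a return type annotation instead of a lambda) can be sketched as follows; plain pandas is shown here, and the column names and DataFrame are illustrative, but the same annotated function satisfies pandas-on-Spark's `infer_return_type`:

```python
import re
import pandas as pd  # plain pandas shown; pyspark.pandas mirrors this API

df = pd.DataFrame({"DOFFIN_ESENDERS:ID": [1], "OTHER": [2]})

# A lambda carries no return type hint, so pandas-on-Spark's
# infer_return_type raises "A return value is required for the
# input function". A named function with an explicit return
# annotation avoids that:
def strip_prefix(name: str) -> str:
    return re.sub("DOFFIN_ESENDERS:", "", name)

renamed = df.rename(columns=strip_prefix)
```

Only the column starting with the prefix is changed, so `renamed` has columns `ID` and `OTHER`.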
[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.
[ https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516177#comment-17516177 ] Xinrong Meng commented on SPARK-38763: -- I will backport the fix once it is approved and merged. > Pandas API on Spark can't apply lambda to columns. > --- > > Key: SPARK-38763 > URL: https://issues.apache.org/jira/browse/SPARK-38763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major
[jira] [Comment Edited] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.
[ https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516172#comment-17516172 ] Xinrong Meng edited comment on SPARK-38763 at 4/1/22 11:56 PM: --- Hi [~bjornjorgensen], thanks for raising that! The workaround is to use a function with a return type rather than a lambda. I am fixing this now. was (Author: xinrongm): Hi [~bjornjorgensen], thanks for raising that! The workaround is to use a function with a return type rather than a lambda. I am fixing this in https://issues.apache.org/jira/browse/SPARK-38766. > Pandas API on Spark can't apply lambda to columns. > --- > > Key: SPARK-38763 > URL: https://issues.apache.org/jira/browse/SPARK-38763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major
[jira] [Resolved] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`
[ https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-38766. -- Resolution: Duplicate > Support lambda `column` parameter of `DataFrame.rename` > --- > > Key: SPARK-38766 > URL: https://issues.apache.org/jira/browse/SPARK-38766 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Support lambda `column` parameter of `DataFrame.rename`. > The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.
[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.
[ https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516172#comment-17516172 ] Xinrong Meng commented on SPARK-38763: -- Hi [~bjornjorgensen], thanks for raising that! The workaround is to use a function with a return type rather than a lambda. I am fixing this in https://issues.apache.org/jira/browse/SPARK-38766. > Pandas API on Spark can't apply lambda to columns. > --- > > Key: SPARK-38763 > URL: https://issues.apache.org/jira/browse/SPARK-38763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major