[jira] [Created] (SPARK-39581) Better error messages for Pandas API on Spark

2022-06-24 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39581:


 Summary: Better error messages for Pandas API on Spark
 Key: SPARK-39581
 URL: https://issues.apache.org/jira/browse/SPARK-39581
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Currently, some error messages of the pandas API on Spark are confusing.

Specifically,
 * when users assume pandas-on-Spark has implemented pandas scalars
 * when users pass a ps.Index when creating a DataFrame/Series (see the sketch below)

Better error messages enhance usability and, in turn, user adoption.

We should improve the error messages for the pandas API on Spark.
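As a minimal sketch of the second case (assuming `ps` is `pyspark.pandas`; the exact error raised today varies by version, and its opacity is the problem):
{code:python}
>>> import pyspark.pandas as ps
>>> idx = ps.Index([1, 2, 3])
>>> ps.Series(idx)  # raises, but the current message never mentions ps.Index
{code}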






[jira] [Updated] (SPARK-39574) Better error message when `ps.Index` is used for DataFrame/Series creation

2022-06-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39574:
-
Parent: SPARK-39581
Issue Type: Sub-task  (was: Improvement)

> Better error message when `ps.Index` is used for DataFrame/Series creation
> --
>
> Key: SPARK-39574
> URL: https://issues.apache.org/jira/browse/SPARK-39574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Better error message when `ps.Index` is used for DataFrame/Series creation.






[jira] [Updated] (SPARK-39574) Better error message when `ps.Index` is used for DataFrame/Series creation

2022-06-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39574:
-
Summary: Better error message when `ps.Index` is used for DataFrame/Series 
creation  (was: Better error message when ps.Index is used for DataFrame/Series 
creation)

> Better error message when `ps.Index` is used for DataFrame/Series creation
> --
>
> Key: SPARK-39574
> URL: https://issues.apache.org/jira/browse/SPARK-39574
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Better error message when `ps.Index` is used for DataFrame/Series creation.






[jira] [Created] (SPARK-39574) Better error message when ps.Index is used for DataFrame/Series creation

2022-06-23 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39574:


 Summary: Better error message when ps.Index is used for 
DataFrame/Series creation
 Key: SPARK-39574
 URL: https://issues.apache.org/jira/browse/SPARK-39574
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Better error message when `ps.Index` is used for DataFrame/Series creation.






[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars

2022-06-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
Currently, DataFrame creation from a list of scalars is unsupported, as below:
{code:python}
>>> spark.createDataFrame([1, 2])
Traceback (most recent call last):
...
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>
{code}

However, the cases below are supported:
{code:python}
>>> spark.createDataFrame([(1,), (2,)]).collect()
[Row(_1=1), Row(_1=2)]

>>> schema
StructType([StructField('_1', LongType(), True)])
>>> spark.createDataFrame([1, 2], schema=schema).collect()
[Row(_1=1), Row(_1=2)]
{code}

In addition, the Spark DataFrame Scala API supports creating a DataFrame from 
a list of scalars, as below:
{code:java}
scala> Seq(1, 2).toDF().collect()
res6: Array[org.apache.spark.sql.Row] = Array([1], [2])
{code}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. See more at 
https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing

 

  was:
Currently, DataFrame creation from a list of scalars is unsupported, as below:
{code:python}
>>> spark.createDataFrame([1, 2])
Traceback (most recent call last):
...
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>
{code}

However, the cases below are supported:
{code:python}
>>> spark.createDataFrame([(1,), (2,)]).collect()
[Row(_1=1), Row(_1=2)]

>>> schema
StructType([StructField('_1', LongType(), True)])
>>> spark.createDataFrame([1, 2], schema=schema).collect()
[Row(_1=1), Row(_1=2)]
{code}

In addition, the Spark DataFrame Scala API supports creating a DataFrame from 
a list of scalars, as below:
{code:java}
scala> Seq(1, 2).toDF().collect()
res6: Array[org.apache.spark.sql.Row] = Array([1], [2])
{code}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. See more at 
https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing

 


> Support `createDataFrame` from a list of scalars
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of scalars is unsupported, as below:
> {code:python}
> >>> spark.createDataFrame([1, 2])
> Traceback (most recent call last):
> ...
>     raise TypeError("Can not infer schema for type: %s" % type(row))
> TypeError: Can not infer schema for type: <class 'int'>
> {code}
> However, the cases below are supported:
> {code:python}
> >>> spark.createDataFrame([(1,), (2,)]).collect()
> [Row(_1=1), Row(_1=2)]
> >>> schema
> StructType([StructField('_1', LongType(), True)])
> >>> spark.createDataFrame([1, 2], schema=schema).collect()
> [Row(_1=1), Row(_1=2)]
> {code}
> In addition, the Spark DataFrame Scala API supports creating a DataFrame 
> from a list of scalars, as below:
> {code:java}
> scala> Seq(1, 2).toDF().collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1], [2])
> {code}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. See more at 
> https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing
>  
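For illustration, under the proposal the scalar form would behave like the tuple form does today (a sketch of the expected behavior, not the final design; the column name is illustrative):
{code:python}
>>> spark.createDataFrame([1, 2]).collect()
[Row(_1=1), Row(_1=2)]
{code}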






[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars

2022-06-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
Currently, DataFrame creation from a list of scalars is unsupported, as below:
{code:python}
>>> spark.createDataFrame([1, 2])
Traceback (most recent call last):
...
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>
{code}

However, the cases below are supported:
{code:python}
>>> spark.createDataFrame([(1,), (2,)]).collect()
[Row(_1=1), Row(_1=2)]

>>> schema
StructType([StructField('_1', LongType(), True)])
>>> spark.createDataFrame([1, 2], schema=schema).collect()
[Row(_1=1), Row(_1=2)]
{code}

In addition, the Spark DataFrame Scala API supports creating a DataFrame from 
a list of scalars, as below:
{code:java}
scala> Seq(1, 2).toDF().collect()
res6: Array[org.apache.spark.sql.Row] = Array([1], [2])
{code}

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. See more at 
https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing

 

  was:
- Support `createDataFrame` from a list of scalars.

- Standardize error messages when the input list contains any scalars.


> Support `createDataFrame` from a list of scalars
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of scalars is unsupported, as below:
> {code:python}
> >>> spark.createDataFrame([1, 2])
> Traceback (most recent call last):
> ...
>     raise TypeError("Can not infer schema for type: %s" % type(row))
> TypeError: Can not infer schema for type: <class 'int'>
> {code}
> However, the cases below are supported:
> {code:python}
> >>> spark.createDataFrame([(1,), (2,)]).collect()
> [Row(_1=1), Row(_1=2)]
> >>> schema
> StructType([StructField('_1', LongType(), True)])
> >>> spark.createDataFrame([1, 2], schema=schema).collect()
> [Row(_1=1), Row(_1=2)]
> {code}
> In addition, the Spark DataFrame Scala API supports creating a DataFrame 
> from a list of scalars, as below:
> {code:java}
> scala> Seq(1, 2).toDF().collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1], [2])
> {code}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. See more at 
> https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing
>  






[jira] [Updated] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

2022-06-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39550:
-
Description: 
When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64
{code}
Notice how the indexes of the two results differ.

Specifically, `value_counts` returns an Index (rather than a MultiIndex), 
backed under the hood by a single Spark column of StructType (rather than 
multiple Spark columns). So when Arrow Execution is enabled, Arrow converts 
the StructType column to a dictionary, whereas a tuple is expected.

 

  was:
When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64
{code}
Notice how the indexes of the two results differ.

Specifically, `value_counts` returns an Index (rather than a MultiIndex), 
backed under the hood by a single Spark column of StructType (rather than 
multiple Spark columns). So when Arrow Execution is enabled, Arrow converts 
the StructType column to a dictionary, whereas a tuple is expected.

 


> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---
>
> Key: SPARK-39550
> URL: https://issues.apache.org/jira/browse/SPARK-39550
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> When Arrow Execution is enabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'true'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> {'__index_level_0__': 1, '__index_level_1__': 'a'}    1
> {'__index_level_0__': 2, '__index_level_1__': 'b'}    1
> dtype: int64
> {code}
> When Arrow Execution is disabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'false'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> (1, a)    1
> (2, b)    1
> dtype: int64
> {code}
> Notice how the indexes of the two results differ.
> Specifically, `value_counts` returns an Index (rather than a MultiIndex), 
> backed under the hood by a single Spark column of StructType (rather than 
> multiple Spark columns). So when Arrow Execution is enabled, Arrow converts 
> the StructType column to a dictionary, whereas a tuple is expected.
>  
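Until the fix lands, a temporary workaround consistent with the repro above is to disable Arrow execution around the call (a sketch; `ps` is `pyspark.pandas`):
{code:python}
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
ps.MultiIndex.from_arrays([[1, 2], ["a", "b"]]).value_counts()  # tuple index, as expected
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
{code}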






[jira] [Commented] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

2022-06-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557145#comment-17557145
 ] 

Xinrong Meng commented on SPARK-39550:
--

I am working on that.

> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---
>
> Key: SPARK-39550
> URL: https://issues.apache.org/jira/browse/SPARK-39550
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> When Arrow Execution is enabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'true'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> {'__index_level_0__': 1, '__index_level_1__': 'a'}    1
> {'__index_level_0__': 2, '__index_level_1__': 'b'}    1
> dtype: int64
> {code}
> When Arrow Execution is disabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'false'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> (1, a)    1
> (2, b)    1
> dtype: int64
> {code}
> Notice how the indexes of the two results differ.
> Specifically, `value_counts` returns an Index (rather than a MultiIndex), 
> backed under the hood by a single Spark column of StructType (rather than 
> multiple Spark columns). So when Arrow Execution is enabled, Arrow converts 
> the StructType column to a dictionary, whereas a tuple is expected.
>  






[jira] [Created] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

2022-06-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39550:


 Summary: Fix `MultiIndex.value_counts()` when Arrow Execution is 
enabled
 Key: SPARK-39550
 URL: https://issues.apache.org/jira/browse/SPARK-39550
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


 

When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64
{code}
Notice how the indexes of the two results differ.

Specifically, `value_counts` returns an Index (rather than a MultiIndex), 
backed under the hood by a single Spark column of StructType (rather than 
multiple Spark columns). So when Arrow Execution is enabled, Arrow converts 
the StructType column to a dictionary, whereas a tuple is expected.

 






[jira] [Created] (SPARK-39494) Support `createDataFrame` from a list of scalars

2022-06-16 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39494:


 Summary: Support `createDataFrame` from a list of scalars
 Key: SPARK-39494
 URL: https://issues.apache.org/jira/browse/SPARK-39494
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


- Support `createDataFrame` from a list of scalars.

- Standardize error messages when the input list contains any scalars.






[jira] [Created] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array

2022-06-15 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39483:


 Summary:  Construct the schema from `np.dtype` when 
`createDataFrame` from a NumPy array
 Key: SPARK-39483
 URL: https://issues.apache.org/jira/browse/SPARK-39483
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Construct the schema from `np.dtype` when `createDataFrame` is called on a NumPy array.
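For illustration, a minimal sketch of the kind of mapping involved (the names here are illustrative, not the actual implementation):
{code:python}
import numpy as np
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

# Illustrative dtype-to-Spark-type table; the real change would cover more dtypes.
NUMPY_TO_SPARK = {np.dtype("int64"): LongType(), np.dtype("float64"): DoubleType()}

def schema_from_array(arr: np.ndarray) -> StructType:
    """Build a schema with one field per column, typed from arr.dtype."""
    n_cols = arr.shape[1] if arr.ndim == 2 else 1
    spark_type = NUMPY_TO_SPARK[arr.dtype]
    return StructType([StructField(f"_{i + 1}", spark_type) for i in range(n_cols)])
{code}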






[jira] [Created] (SPARK-39443) Improve docstring of pyspark.sql.functions.col/first

2022-06-10 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39443:


 Summary: Improve docstring of pyspark.sql.functions.col/first
 Key: SPARK-39443
 URL: https://issues.apache.org/jira/browse/SPARK-39443
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Improve the docstrings of pyspark.sql.functions.col/first.

The docstring of `col` is malformatted.

The docstring of `first` is missing examples.
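For instance, `first` could gain an example along these lines (a sketch; the actual doctest added may differ):
{code:python}
>>> from pyspark.sql.functions import col, first
>>> df = spark.createDataFrame([("a", None), ("a", 2)], ["key", "value"])
>>> df.groupBy(col("key")).agg(first("value", ignorenulls=True).alias("v")).collect()
[Row(key='a', v=2)]
{code}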






[jira] [Updated] (SPARK-39405) NumPy support in SQL

2022-06-07 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39405:
-
Description: 
NumPy is the fundamental package for scientific computing with Python. It is 
very commonly used, especially in the data science world. For example, pandas 
is backed by NumPy, and tensors also support interchangeable conversion 
to/from NumPy arrays.

However, PySpark only supports Python built-in types, with the exception of 
“SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.

This issue has been raised multiple times internally and externally; see also 
SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.

With NumPy support in SQL, we expect broader adoption by data scientists and 
newcomers, who can leverage their existing NumPy background and codebases.

See more at 
https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#

  was:
NumPy is the fundamental package for scientific computing with Python. It is 
very commonly used, especially in the data science world. For example, pandas 
is backed by NumPy, and tensors also support interchangeable conversion 
to/from NumPy arrays.

However, PySpark only supports Python built-in types, with the exception of 
“SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.

This issue has been raised multiple times internally and externally; see also 
SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.

With NumPy support in SQL, we expect broader adoption by data scientists and 
newcomers, who can leverage their existing NumPy background and codebases.

See more at [NumPy support in 
SQL|https://docs.google.com/document/d/1ZC3e-GpvpoQFtEFnwct0me1XPsiwFf_qu4nRdKCpMBg/edit#].


> NumPy support in SQL
> 
>
> Key: SPARK-39405
> URL: https://issues.apache.org/jira/browse/SPARK-39405
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> NumPy is the fundamental package for scientific computing with Python. It is 
> very commonly used, especially in the data science world. For example, pandas 
> is backed by NumPy, and tensors also support interchangeable conversion 
> to/from NumPy arrays.
>  
> However, PySpark only supports Python built-in types, with the exception of 
> “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.
>  
> This issue has been raised multiple times internally and externally; see 
> also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.
>  
> With NumPy support in SQL, we expect broader adoption by data scientists and 
> newcomers, who can leverage their existing NumPy background and codebases.
>  
> See more at 
> https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#






[jira] [Updated] (SPARK-39405) NumPy support in SQL

2022-06-07 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39405:
-
Description: 
NumPy is the fundamental package for scientific computing with Python. It is 
very commonly used, especially in the data science world. For example, pandas 
is backed by NumPy, and tensors also support interchangeable conversion 
to/from NumPy arrays.

However, PySpark only supports Python built-in types, with the exception of 
“SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.

This issue has been raised multiple times internally and externally; see also 
SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.

With NumPy support in SQL, we expect broader adoption by data scientists and 
newcomers, who can leverage their existing NumPy background and codebases.

See more at [NumPy support in 
SQL|https://docs.google.com/document/d/1ZC3e-GpvpoQFtEFnwct0me1XPsiwFf_qu4nRdKCpMBg/edit#].

  was:
NumPy is the fundamental package for scientific computing with Python. It is 
very commonly used, especially in the data science world. For example, pandas 
is backed by NumPy, and tensors also support interchangeable conversion 
to/from NumPy arrays.

However, PySpark only supports Python built-in types, with the exception of 
“SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.

This issue has been raised multiple times internally and externally; see also 
SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.

With NumPy support in SQL, we expect broader adoption by data scientists and 
newcomers, who can leverage their existing NumPy background and codebases.

See more at [].


> NumPy support in SQL
> 
>
> Key: SPARK-39405
> URL: https://issues.apache.org/jira/browse/SPARK-39405
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> NumPy is the fundamental package for scientific computing with Python. It is 
> very commonly used, especially in the data science world. For example, pandas 
> is backed by NumPy, and tensors also support interchangeable conversion 
> to/from NumPy arrays.
>  
> However, PySpark only supports Python built-in types, with the exception of 
> “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.
>  
> This issue has been raised multiple times internally and externally; see 
> also SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.
>  
> With NumPy support in SQL, we expect broader adoption by data scientists and 
> newcomers, who can leverage their existing NumPy background and codebases.
>  
> See more at [NumPy support in 
> SQL|https://docs.google.com/document/d/1ZC3e-GpvpoQFtEFnwct0me1XPsiwFf_qu4nRdKCpMBg/edit#].






[jira] [Updated] (SPARK-39406) Accept NumPy array in createDataFrame

2022-06-07 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39406:
-
Summary: Accept NumPy array in createDataFrame  (was: Accept numpy array in 
createDataFrame)

> Accept NumPy array in createDataFrame
> -
>
> Key: SPARK-39406
> URL: https://issues.apache.org/jira/browse/SPARK-39406
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Accept numpy array in createDataFrame, with existing dtypes support.
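For illustration, the intended call shape (a sketch of the proposed behavior, not a confirmed API):
{code:python}
import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int64)
df = spark.createDataFrame(arr)  # proposed: schema derived from arr.dtype
df.show()
{code}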






[jira] [Created] (SPARK-39406) Accept numpy array in createDataFrame

2022-06-07 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39406:


 Summary: Accept numpy array in createDataFrame
 Key: SPARK-39406
 URL: https://issues.apache.org/jira/browse/SPARK-39406
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Accept numpy array in createDataFrame, with existing dtypes support.






[jira] [Created] (SPARK-39405) NumPy support in SQL

2022-06-07 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39405:


 Summary: NumPy support in SQL
 Key: SPARK-39405
 URL: https://issues.apache.org/jira/browse/SPARK-39405
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


NumPy is the fundamental package for scientific computing with Python. It is 
very commonly used, especially in the data science world. For example, pandas 
is backed by NumPy, and tensors also support interchangeable conversion 
to/from NumPy arrays.

However, PySpark only supports Python built-in types, with the exception of 
“SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”.

This issue has been raised multiple times internally and externally; see also 
SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857.

With NumPy support in SQL, we expect broader adoption by data scientists and 
newcomers, who can leverage their existing NumPy background and codebases.

 

See more at [].
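As a minimal sketch of the current limitation (the exact error text depends on the Spark version):
{code:python}
>>> import numpy as np
>>> spark.createDataFrame([(np.int64(1),)])  # a NumPy scalar inside a row
Traceback (most recent call last):
...
TypeError: not supported type: <class 'numpy.int64'>
{code}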






[jira] [Updated] (SPARK-39262) Correct the behavior of creating DataFrame from an RDD

2022-05-26 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39262:
-
Description: 
Correct the behavior of creating DataFrame from an RDD **with `0` or an empty 
list as the first element**.

 

Before:
```py
>>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
Traceback (most recent call last):
...
ValueError: The first row in RDD is empty, can not infer schema

>>> spark.createDataFrame(spark._sc.parallelize([[], []]))
Traceback (most recent call last):
...
ValueError: The first row in RDD is empty, can not infer schema
```

After:
```py
>>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
Traceback (most recent call last):
...
TypeError: Can not infer schema for type: <class 'int'>

>>> spark.createDataFrame(spark._sc.parallelize([[], []]))
DataFrame[]
>>> spark.createDataFrame(spark._sc.parallelize([[], []])).show()
++
||
++
||
||
++
```

  was:
Correct error messages when creating DataFrame from an RDD with the first 
element `0`.

Previously, we raised a ValueError "The first row in RDD is empty, can not 
infer schema" in such cases.

However, a TypeError "Can not infer schema for type: <class 'int'>" should be 
raised instead.


> Correct the behavior of creating DataFrame from an RDD
> --
>
> Key: SPARK-39262
> URL: https://issues.apache.org/jira/browse/SPARK-39262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Correct the behavior of creating DataFrame from an RDD **with `0` or an empty 
> list as the first element**.
>  
> Before:
> ```py
> >>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
> Traceback (most recent call last):
> ...
> ValueError: The first row in RDD is empty, can not infer schema
> >>> spark.createDataFrame(spark._sc.parallelize([[], []]))
> Traceback (most recent call last):
> ...
> ValueError: The first row in RDD is empty, can not infer schema
> ```
> After:
> ```py
> >>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
> Traceback (most recent call last):
> ...
> TypeError: Can not infer schema for type: <class 'int'>
> >>> spark.createDataFrame(spark._sc.parallelize([[], []]))
> DataFrame[]
> >>> spark.createDataFrame(spark._sc.parallelize([[], []])).show()
> ++
> ||
> ++
> ||
> ||
> ++
> ```






[jira] [Updated] (SPARK-39262) Correct the behavior of creating DataFrame from an RDD

2022-05-26 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39262:
-
Summary: Correct the behavior of creating DataFrame from an RDD  (was: 
Correct error messages when creating DataFrame from an RDD with the first 
element `0`)

> Correct the behavior of creating DataFrame from an RDD
> --
>
> Key: SPARK-39262
> URL: https://issues.apache.org/jira/browse/SPARK-39262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Correct error messages when creating DataFrame from an RDD with the first 
> element `0`.
> Previously, we raised a ValueError "The first row in RDD is empty, can not 
> infer schema" in such cases.
> However, a TypeError "Can not infer schema for type: <class 'int'>" should 
> be raised instead.






[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39048:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Refactor `GroupBy._reduce_for_stat_function` on accepted data types 
> 
>
> Key: SPARK-39048
> URL: https://issues.apache.org/jira/browse/SPARK-39048
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> `Groupby._reduce_for_stat_function` is a common helper function leveraged by 
> multiple statistical functions of GroupBy objects.
> It defines parameters `only_numeric` and `bool_as_numeric` to control 
> accepted Spark types.
> To be consistent with pandas API, we may also have to introduce 
> `str_as_numeric` for `sum` for example.
> Instead of introducing a parameter designated for each Spark type, this PR 
> proposes a parameter `accepted_spark_types` to specify the accepted 
> types of Spark columns to be aggregated.
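A sketch of the idea (illustrative helper and names, not the actual implementation):
{code:python}
from pyspark.sql.types import BooleanType, NumericType

# Illustrative: select the columns to aggregate by their Spark types, instead
# of juggling per-type flags such as `only_numeric` / `bool_as_numeric`.
def columns_to_aggregate(schema, accepted_spark_types=(NumericType, BooleanType)):
    return [f.name for f in schema.fields if isinstance(f.dataType, accepted_spark_types)]
{code}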






[jira] [Updated] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38880:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Implement `numeric_only` parameter of `GroupBy.max/min`
> ---
>
> Key: SPARK-38880
> URL: https://issues.apache.org/jira/browse/SPARK-38880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` parameter of `GroupBy.max/min`
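For reference, the pandas behavior being matched (a sketch):
{code:python}
import pandas as pd

df = pd.DataFrame({"k": ["a", "a"], "n": [1, 2], "s": ["x", "y"]})
df.groupby("k").max(numeric_only=True)  # drops the non-numeric column "s"
{code}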






[jira] [Updated] (SPARK-39000) Convert bools to ints in basic statistical functions of GroupBy objects

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39000:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Convert bools to ints in basic statistical functions of GroupBy objects
> ---
>
> Key: SPARK-39000
> URL: https://issues.apache.org/jira/browse/SPARK-39000
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Convert bools to ints in basic statistical functions of GroupBy objects






[jira] [Updated] (SPARK-39227) Reach parity with pandas boolean cast

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39227:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Reach parity with pandas boolean cast
> -
>
> Key: SPARK-39227
> URL: https://issues.apache.org/jira/browse/SPARK-39227
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> There are pandas APIs that need boolean casts: all, any.
> Currently, pandas-on-Spark behaves differently from pandas on special inputs 
> to these APIs, for example, empty strings and empty lists, as mentioned in 
> https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by 
> [~zero323].
> We shall match pandas behavior on boolean casts.
> Meanwhile, Series/Frames that contain empty strings, lists, etc. should be 
> considered as test input to increase test coverage.
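For reference, the pandas semantics to match follow Python truthiness, under which empty containers are falsy (a sketch):
{code:python}
>>> import pandas as pd
>>> pd.Series(["", "a"]).all()   # "" is falsy
False
>>> pd.Series([[], [1]]).any()   # [] is falsy, [1] is truthy
True
{code}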






[jira] [Updated] (SPARK-38952) Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38952:
-
Parent: SPARK-39076
Issue Type: Sub-task  (was: Improvement)

> Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
> --
>
> Key: SPARK-38952
> URL: https://issues.apache.org/jira/browse/SPARK-38952
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`






[jira] [Updated] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38763:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Bug)

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> With a Spark master build from 08 November 2021, I could use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use that code:
> {code:java}
> ValueError                                Traceback (most recent call last)
> Input In [5], in <cell line: 1>()
> ----> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>       2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>       3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633     index
>   10634 )
>   10635 if columns:
> > 10636     columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639     raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602     mapper_callable = cast(Callable, mapper)
> > 10603     return_type = cast(ScalarType, infer_return_type(mapper))
>   10604     dtype = return_type.dtype
>   10605     spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
>     560 tpe = get_type_hints(f).get("return", None)
>     562 if tpe is None:
> --> 563     raise ValueError("A return value is required for the input function")
>     565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
>     566     tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function
> {code}
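A workaround consistent with this error (a sketch: `infer_return_type` reads the mapper's return-type annotation, which a bare lambda cannot carry, so an annotated function works; `pf05` is the reporter's DataFrame):
{code:python}
import re

def strip_prefixes(x: str) -> str:
    # The return annotation lets infer_return_type determine the result type.
    return re.sub('DOFFIN_ESENDERS:|FORM_SECTION:|F05_2014:', '', x)

pf05 = pf05.rename(columns=strip_prefixes)
{code}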



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38766:
-
Parent: (was: SPARK-39199)
Issue Type: Improvement  (was: Sub-task)

> Support lambda `column` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support lambda `column` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.






[jira] [Updated] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38387:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: New Feature)

> Support `na_action` and Series input correspondence in `Series.map`
> ---
>
> Key: SPARK-38387
> URL: https://issues.apache.org/jira/browse/SPARK-38387
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Support `na_action` and Series input correspondence in `Series.map`, in order 
> to reach parity with the pandas API.
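For reference, the pandas behavior being added (a sketch):
{code:python}
import pandas as pd

s = pd.Series(["cat", None, "dog"])
s.map("I am a {}".format, na_action="ignore")  # NA values bypass the function
s.map(pd.Series(["kitten"], index=["cat"]))    # a Series acts as an index-based mapping
{code}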






[jira] [Updated] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38766:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Bug)

> Support lambda `column` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support lambda `column` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.






[jira] [Updated] (SPARK-38400) Enable Series.rename to change index labels

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38400:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Enable Series.rename to change index labels
> ---
>
> Key: SPARK-38400
> URL: https://issues.apache.org/jira/browse/SPARK-38400
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Enable Series.rename to change index labels when a function is passed as the `index` input.






[jira] [Updated] (SPARK-38491) Support `ignore_index` of `Series.sort_values`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38491:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support `ignore_index` of `Series.sort_values`
> --
>
> Key: SPARK-38491
> URL: https://issues.apache.org/jira/browse/SPARK-38491
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Support `ignore_index` of `Series.sort_values`






[jira] [Updated] (SPARK-38518) Implement `skipna` of `Series.all/Index.all` to exclude NA/null values

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38518:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
> --
>
> Key: SPARK-38518
> URL: https://issues.apache.org/jira/browse/SPARK-38518
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Implement `skipna` of `Series.all/Index.all` to exclude NA/null values.






[jira] [Updated] (SPARK-38441) Support string and bool `regex` in `Series.replace`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38441:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support string and bool `regex` in `Series.replace`
> ---
>
> Key: SPARK-38441
> URL: https://issues.apache.org/jira/browse/SPARK-38441
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support string and bool `regex` in `Series.replace` in order to reach parity 
> with pandas.
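For reference, the pandas behavior being matched (a sketch):
{code:python}
import pandas as pd

s = pd.Series(["bat", "foo", "bait"])
s.replace(to_replace=r"^ba.$", value="new", regex=True)   # regex match: bat -> new
s.replace(to_replace=r"^ba.$", value="new", regex=False)  # literal match only; no change
{code}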






[jira] [Updated] (SPARK-38479) Add `Series.duplicated` to indicate duplicate Series values.

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38479:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: New Feature)

> Add `Series.duplicated` to indicate duplicate Series values.
> 
>
> Key: SPARK-38479
> URL: https://issues.apache.org/jira/browse/SPARK-38479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Add `Series.duplicated` to indicate duplicate Series values.
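For reference, the pandas behavior being added (a sketch):
{code:python}
>>> import pandas as pd
>>> pd.Series([1, 1, 2]).duplicated()
0    False
1     True
2    False
dtype: bool
{code}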






[jira] [Updated] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38576:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank 
> numeric columns only
> ---
>
> Key: SPARK-38576
> URL: https://issues.apache.org/jira/browse/SPARK-38576
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank 
> numeric columns only.






[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38608:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any`
> -
>
> Key: SPARK-38608
> URL: https://issues.apache.org/jira/browse/SPARK-38608
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `bool_only` parameter of `DataFrame.all` and`DataFrame.any` to 
> include only boolean columns.






[jira] [Updated] (SPARK-38552) Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38552:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to 
> resolve ties
> --
>
> Key: SPARK-38552
> URL: https://issues.apache.org/jira/browse/SPARK-38552
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to 
> resolve ties
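For reference, how `keep` resolves ties in pandas (a Series sketch; DataFrame.nlargest/nsmallest behaves analogously):
{code:python}
import pandas as pd

s = pd.Series([3, 3, 2], index=["a", "b", "c"])
s.nlargest(1, keep="first")  # "a": the first occurrence wins the tie
s.nlargest(1, keep="last")   # "b": the last occurrence wins
s.nlargest(1, keep="all")    # both "a" and "b" are kept
{code}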






[jira] [Updated] (SPARK-38686) Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38686:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
> --
>
> Key: SPARK-38686
> URL: https://issues.apache.org/jira/browse/SPARK-38686
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`






[jira] [Updated] (SPARK-38704) Support string `inclusive` parameter of `Series.between`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38704:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support string `inclusive` parameter of `Series.between`
> 
>
> Key: SPARK-38704
> URL: https://issues.apache.org/jira/browse/SPARK-38704
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support string `inclusive` parameter of `Series.between`
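For reference, the pandas behavior being matched (a sketch):
{code:python}
import pandas as pd

s = pd.Series([1, 2, 3, 4])
s.between(2, 4, inclusive="both")     # 2 <= s <= 4
s.between(2, 4, inclusive="neither")  # 2 <  s <  4
{code}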






[jira] [Updated] (SPARK-38726) Support `how` parameter of `MultiIndex.dropna`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38726:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support `how` parameter of `MultiIndex.dropna`
> --
>
> Key: SPARK-38726
> URL: https://issues.apache.org/jira/browse/SPARK-38726
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support `how` parameter of `MultiIndex.dropna`






[jira] [Updated] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38765:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `inplace` parameter of `Series.clip`






[jira] [Updated] (SPARK-38837) Implement `dropna` parameter of `SeriesGroupBy.value_counts`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38837:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `dropna` parameter of `SeriesGroupBy.value_counts`
> 
>
> Key: SPARK-38837
> URL: https://issues.apache.org/jira/browse/SPARK-38837
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> Implement `dropna` parameter of `SeriesGroupBy.value_counts`






[jira] [Updated] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38863:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `skipna` parameter of `DataFrame.all`
> ---
>
> Key: SPARK-38863
> URL: https://issues.apache.org/jira/browse/SPARK-38863
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `skipna` parameter of `DataFrame.all`.
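
For illustration, a minimal sketch of the target behavior (toy data; assumes pandas reduction semantics):

{code:python}
import numpy as np
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [True, True], "b": [True, np.nan]})
psdf.all(skipna=True)   # missing values are ignored in the reduction
psdf.all(skipna=False)  # missing values participate, matching pandas
{code}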



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38793:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
> 
>
> Key: SPARK-38793
> URL: https://issues.apache.org/jira/browse/SPARK-38793
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
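
For illustration, a minimal sketch of the target behavior (toy index; assumes pandas `sort_values` semantics):

{code:python}
import pyspark.pandas as ps

psidx = ps.Index([10, 3, 7])
# With return_indexer=True, pandas returns the sorted index together with
# the positions of the sorted values in the original index.
sorted_idx, indexer = psidx.sort_values(return_indexer=True)
# sorted_idx: [3, 7, 10]; indexer: [1, 2, 0]
{code}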



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38903:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
> 
>
> Key: SPARK-38903
> URL: https://issues.apache.org/jira/browse/SPARK-38903
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
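
For illustration, a minimal sketch of the target behavior (toy data; assumes pandas semantics):

{code:python}
import pyspark.pandas as ps

psser = ps.Series([30, 10, 20], index=[2, 0, 1])
# ignore_index=True relabels the result 0..n-1 instead of keeping the
# original index labels, matching pandas.
psser.sort_values(ignore_index=True)  # index: 0, 1, 2; values: 10, 20, 30
{code}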



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38890:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38938:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `inplace` and `columns` parameters of `Series.drop`
> -
>
> Key: SPARK-38938
> URL: https://issues.apache.org/jira/browse/SPARK-38938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `inplace` and `columns` parameters of `Series.drop`
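
For illustration, a minimal sketch of the target behavior (toy data; assumes pandas `Series.drop` semantics):

{code:python}
import pyspark.pandas as ps

psser = ps.Series([1, 2, 3], index=["a", "b", "c"])
# With inplace=True, drop should mutate the Series in place and return
# None, instead of producing a new object, matching pandas.
psser.drop("a", inplace=True)
psser  # labels "b" and "c" remain
{code}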



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38989:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `DataFrame/Series.sample`
> -
>
> Key: SPARK-38989
> URL: https://issues.apache.org/jira/browse/SPARK-38989
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame/Series.sample`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39201:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`
> ---
>
> Key: SPARK-39201
> URL: https://issues.apache.org/jira/browse/SPARK-39201
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-24 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39201:
-
Issue Type: Improvement  (was: Umbrella)

> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`
> ---
>
> Key: SPARK-39201
> URL: https://issues.apache.org/jira/browse/SPARK-39201
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.explode` and 
> `DataFrame.drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39262) Correct error messages when creating DataFrame from an RDD with the first element `0`

2022-05-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39262:
-
Description: 
Correct error messages when creating DataFrame from an RDD with the first 
element `0`.

Previously, we raised a ValueError "The first row in RDD is empty, can not infer 
schema" in that case.

However, a TypeError "Can not infer schema for type: " should be 
raised instead.

  was:
Correct error messages when creating DataFrame from an RDD with the first row 
is `0`.

 

Previously, we raise a ValueError "The first row in RDD is empty, can not infer 
schema" in such case.

However, a TypeError "Can not infer schema for type: " should be 
raised instead.


> Correct error messages when creating DataFrame from an RDD with the first 
> element `0`
> -
>
> Key: SPARK-39262
> URL: https://issues.apache.org/jira/browse/SPARK-39262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Correct error messages when creating DataFrame from an RDD with the first 
> element `0`.
> Previously, we raised a ValueError "The first row in RDD is empty, can not 
> infer schema" in that case.
> However, a TypeError "Can not infer schema for type: " should be 
> raised instead.
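
For illustration, a minimal repro sketch of the case described (hypothetical session):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 0 is falsy, so a naive `if not first:` emptiness check misfires here;
# schema inference for a plain int should raise the TypeError instead.
rdd = spark.sparkContext.parallelize([0, 1, 2])
spark.createDataFrame(rdd)  # expect: TypeError, not the "empty first row" ValueError
{code}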



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39262) Correct error messages when creating DataFrame from an RDD with the first element `0`

2022-05-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39262:
-
Summary: Correct error messages when creating DataFrame from an RDD with 
the first element `0`  (was: Correct error messages when creating DataFrame 
from an RDD with the first row is `0`)

> Correct error messages when creating DataFrame from an RDD with the first 
> element `0`
> -
>
> Key: SPARK-39262
> URL: https://issues.apache.org/jira/browse/SPARK-39262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Correct error messages when creating DataFrame from an RDD whose first row 
> is `0`.
>  
> Previously, we raised a ValueError "The first row in RDD is empty, can not 
> infer schema" in that case.
> However, a TypeError "Can not infer schema for type: " should be 
> raised instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39262) Correct error messages when creating DataFrame from an RDD with the first row is `0`

2022-05-23 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39262:


 Summary: Correct error messages when creating DataFrame from an 
RDD with the first row is `0`
 Key: SPARK-39262
 URL: https://issues.apache.org/jira/browse/SPARK-39262
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Correct error messages when creating DataFrame from an RDD whose first row 
is `0`.

Previously, we raised a ValueError "The first row in RDD is empty, can not infer 
schema" in that case.

However, a TypeError "Can not infer schema for type: " should be 
raised instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters

2022-05-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39199:
-
Description: 
pandas API on Spark aims to make pandas code work on Spark clusters without any 
changes. So full API coverage has been one of our major goals. Currently, most 
pandas functions are implemented, whereas some of them have incomplete 
parameter support.

There are some common parameters missing (resolved):
 * How to handle NAs   
 * Filter data types    
 * Control result length    
 * Reindex result   

There are remaining missing parameters to implement (see doc below).

See the design and the current status at 
[https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing].

  was:
pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
pandas functions are supported in pandas API on Spark with parameters missing.

There are some common parameters missing:
- how to do with NAs: `skipna`, `dropna`
- filter data types: `numeric_only`, `bool_only`
- filter result length: `keep`
- reindex result: `ignore_index`

They support common use cases and should be prioritized.



> Implement pandas API missing parameters
> ---
>
> Key: SPARK-39199
> URL: https://issues.apache.org/jira/browse/SPARK-39199
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.3.0, 3.4.0, 3.3.1
>Reporter: Xinrong Meng
>Priority: Major
>
> pandas API on Spark aims to make pandas code work on Spark clusters without 
> any changes. So full API coverage has been one of our major goals. Currently, 
> most pandas functions are implemented, whereas some of them have incomplete 
> parameter support.
> There are some common parameters missing (resolved):
>  * How to handle NAs   
>  * Filter data types    
>  * Control result length    
>  * Reindex result   
> There are remaining missing parameters to implement (see doc below).
> See the design and the current status at 
> [https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37525) Timedelta support in pandas API on Spark

2022-05-18 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37525:
-
Summary: Timedelta support in pandas API on Spark  (was: Support 
TimedeltaIndex in pandas API on Spark)

> Timedelta support in pandas API on Spark
> 
>
> Key: SPARK-37525
> URL: https://issues.apache.org/jira/browse/SPARK-37525
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: release-notes
>
> Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex 
> support in pandas API on Spark accordingly.
> We shall approach it in steps below:
> - introduce
> - properties
> - functions and basic operations
> - creation (from Series/Index, generic methods)
> - type conversion (astype)
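
For illustration, a minimal sketch of the resulting entry point (assuming the `ps.TimedeltaIndex` API introduced by the sub-tasks):

{code:python}
import pandas as pd
import pyspark.pandas as ps

# A TimedeltaIndex backed by Spark's DayTimeIntervalType.
psidx = ps.TimedeltaIndex([pd.Timedelta(days=1), pd.Timedelta(hours=12)])
psidx.days  # one of the properties covered by the steps above
{code}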



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37525) Support TimedeltaIndex in pandas API on Spark

2022-05-18 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-37525.
--
Resolution: Resolved

> Support TimedeltaIndex in pandas API on Spark
> -
>
> Key: SPARK-37525
> URL: https://issues.apache.org/jira/browse/SPARK-37525
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: release-notes
>
> Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex 
> support in pandas API on Spark accordingly.
> We shall approach it in steps below:
> - introduce
> - properties
> - functions and basic operations
> - creation (from Series/Index, generic methods)
> - type conversion (astype)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39228) Implement `skipna` of `Series.argmax`

2022-05-18 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39228:


 Summary: Implement `skipna` of `Series.argmax`
 Key: SPARK-39228
 URL: https://issues.apache.org/jira/browse/SPARK-39228
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `skipna` of `Series.argmax`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39227) Reach parity with pandas boolean cast

2022-05-18 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39227:
-
Description: 
There are pandas APIs that need boolean casts: all, any.

Currently, pandas-on-Spark has different behaviors on special inputs against 
these APIs, for example, empty string, list, etc, as mentioned 
https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by 
[~zero323].

We shall match pandas behavior on boolean cast.

Meanwhile, Series/Frame that contains empty strings, lists should be considered 
as test input to increase test coverage.

  was:
There are pandas APIs that need boolean casts: all, any.

Currently, pandas-on-Spark has different behaviors on special inputs against 
these APIs, for example, empty string, list, etc, as mentioned 
https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by 
[~zero323].

We shall match pandas behavior on boolean cast.



> Reach parity with pandas boolean cast
> -
>
> Key: SPARK-39227
> URL: https://issues.apache.org/jira/browse/SPARK-39227
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> There are pandas APIs that need boolean casts: `all`, `any`.
> Currently, pandas-on-Spark behaves differently from pandas on special inputs 
> to these APIs, for example, empty strings and lists, as mentioned in 
> https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by 
> [~zero323].
> We shall match pandas behavior on boolean casts.
> Meanwhile, Series/Frames that contain empty strings or lists should be 
> included as test inputs to increase test coverage.
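
For illustration, a hedged repro sketch of the kind of divergence described (toy data; the pandas side is certain, the pandas-on-Spark side is the parity target):

{code:python}
import pandas as pd
import pyspark.pandas as ps

# In pandas, truthiness follows Python: an empty string is falsy.
pd.Series(["", "x"]).all()  # False in pandas
ps.Series(["", "x"]).all()  # should match pandas once the casts reach parity
{code}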



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39227) Reach parity with pandas boolean cast

2022-05-18 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39227:


 Summary: Reach parity with pandas boolean cast
 Key: SPARK-39227
 URL: https://issues.apache.org/jira/browse/SPARK-39227
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


There are pandas APIs that need boolean casts: `all`, `any`.

Currently, pandas-on-Spark behaves differently from pandas on special inputs to 
these APIs, for example, empty strings and lists, as mentioned in 
https://github.com/apache/spark/pull/36547#issuecomment-1129228724 by 
[~zero323].

We shall match pandas behavior on boolean casts.




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39201) Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-16 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39201:


 Summary: Implement `ignore_index` of `DataFrame.explode` and 
`DataFrame.drop_duplicates`
 Key: SPARK-39201
 URL: https://issues.apache.org/jira/browse/SPARK-39201
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters

2022-05-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39199:
-
Description: 
pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
pandas functions are supported in pandas API on Spark, but with some parameters 
missing.

There are some common parameters missing:
- how to handle NAs: `skipna`, `dropna`
- filter data types: `numeric_only`, `bool_only`
- filter result length: `keep`
- reindex result: `ignore_index`

They support common use cases and should be prioritized.


  was:
pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
pandas functions are supported in pandas API on Spark with parameters missing.

There are some common parameters missing:





> Implement pandas API missing parameters
> ---
>
> Key: SPARK-39199
> URL: https://issues.apache.org/jira/browse/SPARK-39199
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0, 3.3.1
>Reporter: Xinrong Meng
>Priority: Major
>
> pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
> pandas functions are supported in pandas API on Spark, but with some 
> parameters missing.
> There are some common parameters missing:
> - how to handle NAs: `skipna`, `dropna`
> - filter data types: `numeric_only`, `bool_only`
> - filter result length: `keep`
> - reindex result: `ignore_index`
> They support common use cases and should be prioritized.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters

2022-05-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39199:
-
Description: 
pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
pandas functions are supported in pandas API on Spark with parameters missing.

There are some common parameters missing:




  was:Implement pandas API missing parameters


> Implement pandas API missing parameters
> ---
>
> Key: SPARK-39199
> URL: https://issues.apache.org/jira/browse/SPARK-39199
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0, 3.3.1
>Reporter: Xinrong Meng
>Priority: Major
>
> pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
> pandas functions are supported in pandas API on Spark, but with some 
> parameters missing.
> There are some common parameters missing:



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38608) Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`

2022-05-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38608:
-
Summary: Implement `bool_only` parameter of `DataFrame.all` 
and `DataFrame.any`  (was: [SPARK-38608][PYTHON] Implement `bool_only` parameter 
of `DataFrame.all` and`DataFrame.any`)

> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`
> -
>
> Key: SPARK-38608
> URL: https://issues.apache.org/jira/browse/SPARK-38608
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any` to 
> include only boolean columns.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39199) Implement pandas API missing parameters

2022-05-16 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39199:


 Summary: Implement pandas API missing parameters
 Key: SPARK-39199
 URL: https://issues.apache.org/jira/browse/SPARK-39199
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.3.0, 3.4.0, 3.3.1
Reporter: Xinrong Meng


Implement pandas API missing parameters



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37525) Support TimedeltaIndex in pandas API on Spark

2022-05-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37525:
-
Description: 
Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex 
support in pandas API on Spark accordingly.

We shall approach it in steps below:
- introduce
- properties
- functions and basic operations
- creation (from Series/Index, generic methods)
- type conversion (astype)


  was:Since DayTimeIntervalType is supported in PySpark, we may add 
TimedeltaIndex support in pandas API on Spark accordingly.


> Support TimedeltaIndex in pandas API on Spark
> -
>
> Key: SPARK-37525
> URL: https://issues.apache.org/jira/browse/SPARK-37525
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: release-notes
>
> Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex 
> support in pandas API on Spark accordingly.
> We shall approach it in steps below:
> - introduce
> - properties
> - functions and basic operations
> - creation (from Series/Index, generic methods)
> - type conversion (astype)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39197) Implement `skipna` parameter of `GroupBy.all`

2022-05-16 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39197:


 Summary: Implement `skipna` parameter of `GroupBy.all`
 Key: SPARK-39197
 URL: https://issues.apache.org/jira/browse/SPARK-39197
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `skipna` parameter of `GroupBy.all`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39155) Access to JVM through passed-in GatewayClient during type conversion

2022-05-11 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39155:
-
Description: 
Access to JVM through passed-in GatewayClient during type conversion.

In customized type converters, we may utilize the `jvm` field of the passed-in 
GatewayClient directly to access the JVM, rather than relying on 
`SparkContext._jvm`.

That's 
[how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508)
 Py4J explicit converters access the JVM.

  was:Access to JVM through passed-in GatewayClient during type conversion


> Access to JVM through passed-in GatewayClient during type conversion
> 
>
> Key: SPARK-39155
> URL: https://issues.apache.org/jira/browse/SPARK-39155
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Access to JVM through passed-in GatewayClient during type conversion.
> In customized type converters, we may utilize the `jvm` field of the 
> passed-in GatewayClient directly to access the JVM, rather than relying on 
> `SparkContext._jvm`.
> That's 
> [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508)
>  Py4J explicit converters access the JVM.
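
For illustration, a hedged sketch of a custom converter following that Py4J pattern (the class itself is hypothetical; only the `gateway_client` usage is the point):

{code:python}
from py4j.java_gateway import JavaClass

class SetConverterSketch:  # hypothetical converter, for illustration only
    def can_convert(self, obj):
        return isinstance(obj, set)

    def convert(self, obj, gateway_client):
        # The JVM is reached through the gateway_client Py4J passes in,
        # the same pattern Py4J's own collection converters use,
        # rather than through SparkContext._jvm.
        HashSet = JavaClass("java.util.HashSet", gateway_client)
        java_set = HashSet()
        for element in obj:
            java_set.add(element)
        return java_set
{code}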



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39155) Access to JVM through passed-in GatewayClient during type conversion

2022-05-11 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39155:


 Summary: Access to JVM through passed-in GatewayClient during type 
conversion
 Key: SPARK-39155
 URL: https://issues.apache.org/jira/browse/SPARK-39155
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Access to JVM through passed-in GatewayClient during type conversion



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39154) Remove outdated statements on distributed-sequence default index

2022-05-11 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39154:


 Summary: Remove outdated statements on distributed-sequence 
default index 
 Key: SPARK-39154
 URL: https://issues.apache.org/jira/browse/SPARK-39154
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Remove outdated statements on distributed-sequence default index 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38819) Run Pandas on Spark with Pandas 1.4.x

2022-05-10 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534428#comment-17534428
 ] 

Xinrong Meng commented on SPARK-38819:
--

Thanks Yikun!

> Run Pandas on Spark with Pandas 1.4.x
> -
>
> Key: SPARK-38819
> URL: https://issues.apache.org/jira/browse/SPARK-38819
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> This is an umbrella to track issues when pandas upgrades to 1.4.x
>  
> I disabled fast-fail in the tests; 19 failed:
> [https://github.com/Yikun/spark/pull/88/checks?check_run_id=5873627048]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39133) Mention log level setting in PYSPARK_JVM_STACKTRACE_ENABLED

2022-05-09 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39133:


 Summary: Mention log level setting in 
PYSPARK_JVM_STACKTRACE_ENABLED
 Key: SPARK-39133
 URL: https://issues.apache.org/jira/browse/SPARK-39133
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Mention log level setting in PYSPARK_JVM_STACKTRACE_ENABLED
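
For reference, a hedged sketch of toggling the flag (assuming it maps to the SQL conf `spark.sql.pyspark.jvmStacktrace.enabled`; the doc change itself is about mentioning the log-level interaction):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Surface full JVM stack traces in Python-facing exceptions; with it off,
# only the Python-friendly summary is shown.
spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", True)
{code}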



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39109) Adjust `GroupBy.mean/median` to match pandas 1.4

2022-05-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39109:


 Summary: Adjust `GroupBy.mean/median` to match pandas 1.4
 Key: SPARK-39109
 URL: https://issues.apache.org/jira/browse/SPARK-39109
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Adjust `GroupBy.mean/median` to match pandas 1.4



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39095) Adjust `GroupBy.std` to match pandas 1.4

2022-05-03 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39095:


 Summary: Adjust `GroupBy.std` to match pandas 1.4
 Key: SPARK-39095
 URL: https://issues.apache.org/jira/browse/SPARK-39095
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark

2022-05-02 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39076:
-
Description: 
Statistical functions are the most commonly-used functions in Data Engineering 
and Data Analysis.

Spark and pandas provide statistical functions in the context of SQL and Data 
Science separately.

pandas API on Spark implements the pandas API on top of Apache Spark. Although 
there may be semantic differences for certain functions due to the high cost of 
big data calculations, for example, median, we should still try to reach parity 
at the API level.

However, critical parameters of statistical functions, such as `skipna`, are 
missing from basic objects: DataFrame, Series, and Index.

There is an even larger gap between statistical functions of pandas-on-Spark 
GroupBy objects and those of pandas GroupBy objects. In addition, test 
coverage is far from perfect.

With statistical functions standardized, pandas API coverage will be increased 
since missing parameters will be implemented. That would further improve 
user adoption.

See details at 
https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit?usp=sharing.


  was:
Statistical functions are the most commonly-used functions in Data Engineering 
and Data Analysis.

Spark and pandas provide statistical functions in the context of SQL and Data 
Science separately.

pandas API on Spark implements the pandas API on top of Apache Spark. Although 
there may be semantic differences of certain functions due to the high cost of 
big data calculations, for example, median. We should still try to reach the 
parity from the API level.

However, critical parameters, such as `skipna`,  of statistical functions are 
missing of basic objects: DataFrame, Series, and Index are missing. 

There is even a larger gap between statistical functions of pandas-on-Spark 
GroupBy objects and those of pandas GroupBy objects. In addition, tests 
coverage is far from perfect.

With statistical functions standardized, pandas API coverage will be increased 
since missing parameters will be implemented. That would further improve the 
user adoption.



> Standardize Statistical Functions of pandas API on Spark
> 
>
> Key: SPARK-39076
> URL: https://issues.apache.org/jira/browse/SPARK-39076
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Statistical functions are the most commonly-used functions in Data 
> Engineering and Data Analysis.
> Spark and pandas provide statistical functions in the context of SQL and Data 
> Science separately.
> pandas API on Spark implements the pandas API on top of Apache Spark. 
> Although there may be semantic differences for certain functions due to the 
> high cost of big data calculations, for example, median, we should still try 
> to reach parity at the API level.
> However, critical parameters of statistical functions, such as `skipna`, are 
> missing from basic objects: DataFrame, Series, and Index.
> There is an even larger gap between statistical functions of pandas-on-Spark 
> GroupBy objects and those of pandas GroupBy objects. In addition, test 
> coverage is far from perfect.
> With statistical functions standardized, pandas API coverage will be 
> increased since missing parameters will be implemented. That would further 
> improve user adoption.
> See details at 
> https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit?usp=sharing.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series

2022-04-29 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39077:


 Summary: Implement `skipna` of basic statistical functions of 
DataFrame and Series
 Key: SPARK-39077
 URL: https://issues.apache.org/jira/browse/SPARK-39077
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `skipna` of basic statistical functions of DataFrame and Series



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark

2022-04-29 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39076:


 Summary: Standardize Statistical Functions of pandas API on Spark
 Key: SPARK-39076
 URL: https://issues.apache.org/jira/browse/SPARK-39076
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Statistical functions are the most commonly-used functions in Data Engineering 
and Data Analysis.

Spark and pandas provide statistical functions in the context of SQL and Data 
Science separately.

pandas API on Spark implements the pandas API on top of Apache Spark. Although 
there may be semantic differences for certain functions due to the high cost of 
big data calculations, for example, median, we should still try to reach parity 
at the API level.

However, critical parameters of statistical functions, such as `skipna`, are 
missing from basic objects: DataFrame, Series, and Index.

There is an even larger gap between statistical functions of pandas-on-Spark 
GroupBy objects and those of pandas GroupBy objects. In addition, test 
coverage is far from perfect.

With statistical functions standardized, pandas API coverage will be increased 
since missing parameters will be implemented. That would further improve 
user adoption.




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`

2022-04-27 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39051:


 Summary: Minor refactoring of 
`python/pyspark/sql/pandas/conversion.py`
 Key: SPARK-39051
 URL: https://issues.apache.org/jira/browse/SPARK-39051
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Minor refactoring of `python/pyspark/sql/pandas/conversion.py`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types

2022-04-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39048:
-
Summary: Refactor `GroupBy._reduce_for_stat_function` on accepted data 
types   (was: Refactor GroupBy._reduce_for_stat_function on accepted data types 
)

> Refactor `GroupBy._reduce_for_stat_function` on accepted data types 
> 
>
> Key: SPARK-39048
> URL: https://issues.apache.org/jira/browse/SPARK-39048
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> `GroupBy._reduce_for_stat_function` is a common helper function leveraged by 
> multiple statistical functions of GroupBy objects.
> It defines the parameters `only_numeric` and `bool_as_numeric` to control 
> the accepted Spark types.
> To be consistent with the pandas API, we may also have to introduce 
> `str_as_numeric` for `sum`, for example.
> Instead of introducing a parameter designated for each Spark type, the PR 
> proposes to introduce a parameter `accepted_spark_types` to specify the 
> accepted types of Spark columns to be aggregated.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types

2022-04-27 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39048:


 Summary: Refactor GroupBy._reduce_for_stat_function on accepted 
data types 
 Key: SPARK-39048
 URL: https://issues.apache.org/jira/browse/SPARK-39048
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


`GroupBy._reduce_for_stat_function` is a common helper function leveraged by 
multiple statistical functions of GroupBy objects.

It defines the parameters `only_numeric` and `bool_as_numeric` to control the 
accepted Spark types.

To be consistent with the pandas API, we may also have to introduce 
`str_as_numeric` for `sum`, for example.

Instead of introducing a parameter designated for each Spark type, the PR 
proposes to introduce a parameter `accepted_spark_types` to specify the 
accepted types of Spark columns to be aggregated.
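
For illustration, a hypothetical sketch of the proposed signature (names inferred from the description above, not copied from the merged code):

{code:python}
from typing import Callable, Optional, Tuple, Type

from pyspark.sql.types import BooleanType, DataType, NumericType

def _reduce_for_stat_function(
    sfun: Callable,
    accepted_spark_types: Optional[Tuple[Type[DataType], ...]] = None,
):
    """Aggregate only columns whose Spark type matches an accepted type."""
    # e.g. GroupBy.mean might pass accepted_spark_types=(NumericType, BooleanType)
    # instead of only_numeric=True, bool_as_numeric=True.
    ...
{code}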



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." gets printed many times.

2022-04-26 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528459#comment-17528459
 ] 

Xinrong Meng commented on SPARK-38988:
--

Thank you for raising that!

I will try muting the warnings for now.

> Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get 
> printed many times. 
> ---
>
> Key: SPARK-38988
> URL: https://issues.apache.org/jira/browse/SPARK-38988
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: Untitled.html, info.txt, warning printed.txt
>
>
> I added a file and a notebook with the info message I get when I run df.info().
> Spark master build from 13.04.22.
> df.shape
> (763300, 224)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39000) Convert bools to ints in basic statistical functions of GroupBy objects

2022-04-22 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39000:


 Summary: Convert bools to ints in basic statistical functions of 
GroupBy objects
 Key: SPARK-39000
 URL: https://issues.apache.org/jira/browse/SPARK-39000
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Convert bools to ints in basic statistical functions of GroupBy objects
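
For illustration, a minimal sketch of the target behavior (toy data; assumes pandas semantics, where booleans act as ints in numeric reductions):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"k": ["a", "a", "b"], "flag": [True, True, False]})
# sum() over a bool column yields integer counts of True, as in pandas.
psdf.groupby("k")["flag"].sum()  # a -> 2, b -> 0
{code}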



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38991) Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`

2022-04-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38991:


 Summary: Implement `numeric_only` of `GroupBy.mean` and 
`GroupBy.sum`
 Key: SPARK-38991
 URL: https://issues.apache.org/jira/browse/SPARK-38991
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`
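
For illustration, a minimal sketch of the target behavior (toy data; assumes pandas `numeric_only` semantics):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"k": ["a", "a", "b"], "x": [1, 2, 3], "s": ["u", "v", "w"]})
# numeric_only=True restricts the aggregation to numeric columns,
# silently dropping the string column "s", as in pandas.
psdf.groupby("k").mean(numeric_only=True)
{code}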



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38991) Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`

2022-04-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526121#comment-17526121
 ] 

Xinrong Meng commented on SPARK-38991:
--

I am working on that.

> Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`
> 
>
> Key: SPARK-38991
> URL: https://issues.apache.org/jira/browse/SPARK-38991
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `numeric_only` of `GroupBy.mean` and `GroupBy.sum`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`

2022-04-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38989:


 Summary: Implement `ignore_index` of `DataFrame/Series.sample`
 Key: SPARK-38989
 URL: https://issues.apache.org/jira/browse/SPARK-38989
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `ignore_index` of `DataFrame/Series.sample`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38971) Test anchor frame for in-place `Series.rename_axis`

2022-04-20 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38971:


 Summary: Test anchor frame for in-place `Series.rename_axis`
 Key: SPARK-38971
 URL: https://issues.apache.org/jira/browse/SPARK-38971
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Test anchor frame for in-place `Series.rename_axis`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38953) Document PySpark common exceptions / errors

2022-04-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38953:
-
Component/s: Documentation

> Document PySpark common exceptions / errors
> ---
>
> Key: SPARK-38953
> URL: https://issues.apache.org/jira/browse/SPARK-38953
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Document PySpark common exceptions / errors



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38953) Document PySpark common exceptions / errors

2022-04-19 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38953:


 Summary: Document PySpark common exceptions / errors
 Key: SPARK-38953
 URL: https://issues.apache.org/jira/browse/SPARK-38953
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Document PySpark common exceptions / errors



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38952) Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`

2022-04-19 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38952:


 Summary: Implement `numeric_only` of `GroupBy.first` and 
`GroupBy.last`
 Key: SPARK-38952
 URL: https://issues.apache.org/jira/browse/SPARK-38952
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38940) Test Series' anchor frame for in-place updates on Series

2022-04-18 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38940:


 Summary: Test Series' anchor frame for in-place updates on Series
 Key: SPARK-38940
 URL: https://issues.apache.org/jira/browse/SPARK-38940
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Test Series' anchor frame for in-place updates on Series



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38938) Implement `inplace` and `columns` parameters of `Series.drop`

2022-04-18 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38938:


 Summary: Implement `inplace` and `columns` parameters of 
`Series.drop`
 Key: SPARK-38938
 URL: https://issues.apache.org/jira/browse/SPARK-38938
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `inplace` and `columns` parameters of `Series.drop`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38903) Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`

2022-04-14 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38903:


 Summary: Implement `ignore_index` of `Series.sort_values` and 
`Series.sort_index`
 Key: SPARK-38903
 URL: https://issues.apache.org/jira/browse/SPARK-38903
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38890:


 Summary: Implement `ignore_index` of `DataFrame.sort_index`.
 Key: SPARK-38890
 URL: https://issues.apache.org/jira/browse/SPARK-38890
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38880:


 Summary: Implement `numeric_only` parameter of `GroupBy.max/min`
 Key: SPARK-38880
 URL: https://issues.apache.org/jira/browse/SPARK-38880
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `numeric_only` parameter of `GroupBy.max/min`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38863) Implement `skipna` parameter of `DataFrame.all`

2022-04-11 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38863:


 Summary: Implement `skipna` parameter of `DataFrame.all`
 Key: SPARK-38863
 URL: https://issues.apache.org/jira/browse/SPARK-38863
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `skipna` parameter of `DataFrame.all`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38837) Implement `dropna` parameter of `SeriesGroupBy.value_counts`

2022-04-08 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38837:


 Summary: Implement `dropna` parameter of 
`SeriesGroupBy.value_counts`
 Key: SPARK-38837
 URL: https://issues.apache.org/jira/browse/SPARK-38837
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `dropna` parameter of `SeriesGroupBy.value_counts`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38793) Support `return_indexer` parameter of `Index/MultiIndex.sort_values`

2022-04-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38793:


 Summary: Support `return_indexer` parameter of 
`Index/MultiIndex.sort_values`
 Key: SPARK-38793
 URL: https://issues.apache.org/jira/browse/SPARK-38793
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Support `return_indexer` parameter of `Index/MultiIndex.sort_values`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-05 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517719#comment-17517719
 ] 

Xinrong Meng commented on SPARK-38763:
--

[~bjornjorgensen] For sure :) The fix is in Spark 3.3 (latest released version).

> Pandas API on Spark can't apply lambda to columns.  
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> When I use a Spark master build from 08 November 21, I can use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code:
> ---
> ValueError Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', 
> x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in 
> DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
> gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be 
> provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
> DataFrame.rename..gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
> infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input 
> function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
> SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516177#comment-17516177
 ] 

Xinrong Meng commented on SPARK-38763:
--

I will backport the fix once it is approved and merged.

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError                                Traceback (most recent call last)
> Input In [5], in <cell line: 1>()
> ----> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>       2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>       3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633     index
>   10634 )
>   10635 if columns:
> --> 10636     columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639     raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602     mapper_callable = cast(Callable, mapper)
> --> 10603     return_type = cast(ScalarType, infer_return_type(mapper))
>   10604     dtype = return_type.dtype
>   10605     spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
>     560 tpe = get_type_hints(f).get("return", None)
>     562 if tpe is None:
> --> 563     raise ValueError("A return value is required for the input function")
>     565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
>     566     tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516172#comment-17516172
 ] 

Xinrong Meng edited comment on SPARK-38763 at 4/1/22 11:56 PM:
---

Hi [~bjornjorgensen], thanks for raising that!

The workaround is to use a named function with a return type annotation rather than a bare lambda, for example:
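A minimal sketch (`strip_prefix` and the column name are illustrative):

{code:python}
import re
import pyspark.pandas as ps

pdf = ps.DataFrame({"DOFFIN_ESENDERS:NAME": [1, 2]})

# The `-> str` annotation gives infer_return_type the return type it needs:
def strip_prefix(x) -> str:
    return re.sub("DOFFIN_ESENDERS:", "", x)

pdf = pdf.rename(columns=strip_prefix)  # succeeds where the bare lambda raises ValueError
{code}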

I am fixing this now.




was (Author: xinrongm):
Hi [~bjornjorgensen], thanks for raising that!

The workaround is to use a function with a return type rather than a lambda.

I am fixing this in https://issues.apache.org/jira/browse/SPARK-38766.



> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError                                Traceback (most recent call last)
> Input In [5], in <cell line: 1>()
> ----> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>       2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>       3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633     index
>   10634 )
>   10635 if columns:
> --> 10636     columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639     raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602     mapper_callable = cast(Callable, mapper)
> --> 10603     return_type = cast(ScalarType, infer_return_type(mapper))
>   10604     dtype = return_type.dtype
>   10605     spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
>     560 tpe = get_type_hints(f).get("return", None)
>     562 if tpe is None:
> --> 563     raise ValueError("A return value is required for the input function")
>     565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
>     566     tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38766) Support lambda `columns` parameter of `DataFrame.rename`

2022-04-01 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-38766.
--
Resolution: Duplicate

> Support lambda `columns` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support passing a lambda as the `columns` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516172#comment-17516172
 ] 

Xinrong Meng commented on SPARK-38763:
--

Hi [~bjornjorgensen], thanks for raising that!

The workaround is to use a named function with a return type annotation rather than a bare lambda.

I am fixing this in https://issues.apache.org/jira/browse/SPARK-38766.



> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError                                Traceback (most recent call last)
> Input In [5], in <cell line: 1>()
> ----> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>       2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>       3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633     index
>   10634 )
>   10635 if columns:
> --> 10636     columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639     raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602     mapper_callable = cast(Callable, mapper)
> --> 10603     return_type = cast(ScalarType, infer_return_type(mapper))
>   10604     dtype = return_type.dtype
>   10605     spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
>     560 tpe = get_type_hints(f).get("return", None)
>     562 if tpe is None:
> --> 563     raise ValueError("A return value is required for the input function")
>     565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
>     566     tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


