[jira] [Commented] (SPARK-33952) Python-friendly dtypes for pyspark dataframes

2021-01-04 Thread Marc de Lignie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258046#comment-17258046
 ] 

Marc de Lignie commented on SPARK-33952:


@[~hyukjin.kwon] Thanks for asking. When you write a pyspark UDF or work with 
the rows returned by collect() on a pyspark DataFrame, a column datatype is much 
easier to recognize as "[Row(x:[Row(x1:string, x2:string)], y:string, 
z:string)]" than as 
"array<struct<x:array<struct<x1:string,x2:string>>,y:string,z:string>>". Of 
course, this remains a matter of taste. Also, the original dtypes in terms of 
array, struct and map remain useful when applying push-down functions, whose 
documentation and naming use these terms.
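
To make the mapping concrete, here is a minimal sketch of the translation the issue proposes (array<> to [], map<> to {}, struct<> to Row()). It is not the code from the gist: the names friendly_dtype and split_top_level are hypothetical, and the "{key: value}" rendering of maps is an assumption. It works on plain type strings of the kind found in DataFrame.dtypes and does not handle field names that themselves contain commas or colons.

```python
def split_top_level(s, sep=","):
    """Split s on sep, ignoring separators nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return [p.strip() for p in parts]


def friendly_dtype(dtype):
    """Translate a Spark type string into the bracket notation of the proposal."""
    dtype = dtype.strip()
    if dtype.startswith("array<") and dtype.endswith(">"):
        return "[" + friendly_dtype(dtype[len("array<"):-1]) + "]"
    if dtype.startswith("map<") and dtype.endswith(">"):
        # Assumed rendering for maps: {key_type: value_type}
        key, value = split_top_level(dtype[len("map<"):-1])
        return "{" + friendly_dtype(key) + ": " + friendly_dtype(value) + "}"
    if dtype.startswith("struct<") and dtype.endswith(">"):
        inner = dtype[len("struct<"):-1]
        fields = []
        for field in (split_top_level(inner) if inner else []):
            name, _, field_type = field.partition(":")
            fields.append(name + ":" + friendly_dtype(field_type))
        return "Row(" + ", ".join(fields) + ")"
    # Atomic types (string, int, decimal(10,2), ...) are left unchanged.
    return dtype


nested = "array<struct<x:array<struct<x1:string,x2:string>>,y:string,z:string>>"
print(friendly_dtype(nested))
```

Applied to the list of (column name, type string) pairs that DataFrame.dtypes returns, the same function could be mapped over the second element of each pair, leaving the original dtypes untouched as the issue suggests.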

> Python-friendly dtypes for pyspark dataframes
> -
>
> Key: SPARK-33952
> URL: https://issues.apache.org/jira/browse/SPARK-33952
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Marc de Lignie
>Priority: Minor
>
> The pyspark.sql.DataFrame.dtypes attribute contains string representations of 
> the column datatypes in terms of JVM datatypes. However, for a Python user it 
> is a significant mental step to translate these to the corresponding Python 
> types encountered in UDFs and collected dataframes. This holds in particular 
> for nested composite datatypes (array, map and struct). It is proposed to 
> provide Python-friendly dtypes in pyspark (as an addition, not a replacement) 
> in which array<>, map<> and struct<> are translated to [], {} and Row().
> Sample code, including tests, is available as [gist on 
> GitHub|https://gist.github.com/vtslab/81ded1a7af006100e00bf2a4a70a8147]. More 
> explanation is provided at: 
> [https://yaaics.blogspot.com/2020/12/python-friendly-dtypes-for-pyspark.html]
> If this proposal finds sufficient support, I can provide a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33952) Python-friendly dtypes for pyspark dataframes

2020-12-31 Thread Marc de Lignie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc de Lignie updated SPARK-33952:
---
Fix Version/s: 3.2.0

> Python-friendly dtypes for pyspark dataframes
> -
>
> Key: SPARK-33952
> URL: https://issues.apache.org/jira/browse/SPARK-33952
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Marc de Lignie
>Priority: Minor
> Fix For: 3.2.0






[jira] [Updated] (SPARK-33952) Python-friendly dtypes for pyspark dataframes

2020-12-31 Thread Marc de Lignie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc de Lignie updated SPARK-33952:
---
Issue Type: Improvement  (was: Task)

> Python-friendly dtypes for pyspark dataframes
> -
>
> Key: SPARK-33952
> URL: https://issues.apache.org/jira/browse/SPARK-33952
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Marc de Lignie
>Priority: Minor
> Fix For: 3.2.0






[jira] [Updated] (SPARK-33952) Python-friendly dtypes for pyspark dataframes

2020-12-31 Thread Marc de Lignie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc de Lignie updated SPARK-33952:
---
Affects Version/s: (was: 3.0.1)
   3.2.0

> Python-friendly dtypes for pyspark dataframes
> -
>
> Key: SPARK-33952
> URL: https://issues.apache.org/jira/browse/SPARK-33952
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Marc de Lignie
>Priority: Minor






[jira] [Created] (SPARK-33952) Python-friendly dtypes for pyspark dataframes

2020-12-31 Thread Marc de Lignie (Jira)
Marc de Lignie created SPARK-33952:
--

 Summary: Python-friendly dtypes for pyspark dataframes
 Key: SPARK-33952
 URL: https://issues.apache.org/jira/browse/SPARK-33952
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Marc de Lignie


The pyspark.sql.DataFrame.dtypes attribute contains string representations of 
the column datatypes in terms of JVM datatypes. However, for a Python user it 
is a significant mental step to translate these to the corresponding Python 
types encountered in UDFs and collected dataframes. This holds in particular 
for nested composite datatypes (array, map and struct). It is proposed to 
provide Python-friendly dtypes in pyspark (as an addition, not a replacement) 
in which array<>, map<> and struct<> are translated to [], {} and Row().

Sample code, including tests, is available as [gist on 
GitHub|https://gist.github.com/vtslab/81ded1a7af006100e00bf2a4a70a8147]. More 
explanation is provided at: 
[https://yaaics.blogspot.com/2020/12/python-friendly-dtypes-for-pyspark.html]

If this proposal finds sufficient support, I can provide a PR.


