Marc de Lignie created SPARK-33952:
--------------------------------------

             Summary: Python-friendly dtypes for pyspark dataframes
                 Key: SPARK-33952
                 URL: https://issues.apache.org/jira/browse/SPARK-33952
             Project: Spark
          Issue Type: Task
          Components: PySpark
    Affects Versions: 3.0.1
            Reporter: Marc de Lignie


The pyspark.sql.DataFrame.dtypes attribute contains string representations of 
the column datatypes in terms of JVM datatypes. However, for a Python user it 
is a significant mental step to translate these to the corresponding Python 
types encountered in UDFs and collected dataframes. This holds in particular 
for nested composite datatypes (array, map and struct). It is proposed to 
provide Python-friendly dtypes in pyspark (as an addition, not a replacement) 
in which array<>, map<> and struct<> are translated to [], {} and Row().
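
To make the proposed translation concrete, here is a minimal sketch of how 
a dtype string could be rewritten. It is not the code from the gist: the 
function names python_friendly and split_top_level, and the exact output 
format (for example name=type inside Row()), are assumptions made for 
illustration. Leaf types such as int or string are left untouched, and field 
names containing ':', ',' or angle brackets are not handled.

```python
def split_top_level(s, sep=","):
    """Split s on sep, skipping separators nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts


def python_friendly(dtype):
    """Rewrite a dtype string such as 'array<int>' with [], {} and Row().

    Hypothetical helper: the output format is an assumption, not the
    format used by the gist referenced in this issue.
    """
    dtype = dtype.strip()
    if dtype.startswith("array<") and dtype.endswith(">"):
        # array<T> becomes [T]
        return "[" + python_friendly(dtype[len("array<"):-1]) + "]"
    if dtype.startswith("map<") and dtype.endswith(">"):
        # map<K,V> becomes {K: V}; split only on the top-level comma
        key, value = split_top_level(dtype[len("map<"):-1])
        return "{" + python_friendly(key) + ": " + python_friendly(value) + "}"
    if dtype.startswith("struct<") and dtype.endswith(">"):
        # struct<a:T,b:U> becomes Row(a=T, b=U)
        inner = dtype[len("struct<"):-1]
        if not inner:
            return "Row()"
        fields = []
        for field in split_top_level(inner):
            name, _, field_type = field.partition(":")
            fields.append(name + "=" + python_friendly(field_type))
        return "Row(" + ", ".join(fields) + ")"
    # Atomic types (int, string, decimal(10,2), ...) are returned unchanged
    return dtype
```

Since DataFrame.dtypes is a list of (column name, type string) pairs, such a 
helper could be applied as 
[(name, python_friendly(t)) for name, t in df.dtypes].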

Sample code, including tests, is available as a [gist on 
GitHub|https://gist.github.com/vtslab/81ded1a7af006100e00bf2a4a70a8147]. 
Further explanation is provided at: 
[https://yaaics.blogspot.com/2020/12/python-friendly-dtypes-for-pyspark.html]

If this proposal finds sufficient support, I can provide a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
