[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25351:
------------------------------------

    Assignee: Jalpan Randeri

> Handle Pandas category type when converting from Python with Arrow
> ------------------------------------------------------------------
>
>                 Key: SPARK-25351
>                 URL: https://issues.apache.org/jira/browse/SPARK-25351
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Bryan Cutler
>            Assignee: Jalpan Randeri
>            Priority: Major
>              Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>    A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A      object
> B    category
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>    1667         spark_type = ArrayType(from_arrow_type(at.value_type))
>    1668     else:
> -> 1669         raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>    1670     return spark_type
>    1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary<values=string, indices=int8, ordered=0>
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to