[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629572#comment-16629572 ]

Bryan Cutler commented on SPARK-25351:
--------------------------------------

Hi [~pgadige], yes, please go ahead with this issue! When creating a DataFrame 
from Pandas without Arrow, category columns are converted to the type of their 
categories, so in the example in the description the category column "B" ends 
up as a string type. The same should happen when Arrow is enabled, so that we 
get the same Spark DataFrame either way. If you are able to, please also check 
how this affects pandas_udf. Thanks!
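
For anyone picking this up, here is a rough sketch of the intended end result. 
This is only a user-level workaround in pandas, not the eventual fix inside 
PySpark's Arrow path; the column loop and the is_categorical_dtype check are 
just illustrative:

{noformat}
import pandas as pd

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype("category")

# Cast every category column back to the dtype of its categories before
# calling createDataFrame; this mirrors what the non-Arrow path effectively
# does today.
for col in pdf.columns:
    if pd.api.types.is_categorical_dtype(pdf[col]):
        pdf[col] = pdf[col].astype(pdf[col].cat.categories.dtype)

print(pdf.dtypes)
# A    object
# B    object
# dtype: object
{noformat}

With a cast like this in place, the Arrow path should produce the same string 
schema as the non-Arrow example in the description.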

> Handle Pandas category type when converting from Python with Arrow
> ------------------------------------------------------------------
>
>                 Key: SPARK-25351
>                 URL: https://issues.apache.org/jira/browse/SPARK-25351
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Bryan Cutler
>            Priority: Major
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. 
> Without Arrow, Spark casts each element of a category column to the 
> category's value, so the column takes the type of its categories. For example:
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>    A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A      object
> B    category
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)
>    1667         spark_type = ArrayType(from_arrow_type(at.value_type))
>    1668     else:
> -> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
>    1670     return spark_type
>    1671 
> TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
> {noformat}
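
For reference, the dictionary type in the traceback can be reproduced outside 
Spark with pyarrow alone (a minimal sketch, assuming pyarrow is installed; it 
only shows why {{from_arrow_type}} receives a dictionary type):

{noformat}
import pandas as pd
import pyarrow as pa

s = pd.Series([u"a", u"b", u"c", u"a"], dtype="category")

# pyarrow maps a pandas Categorical to a DictionaryArray, which
# from_arrow_type does not know how to convert to a Spark type.
print(pa.Array.from_pandas(s).type)
# dictionary<values=string, indices=int8, ordered=0>

# Casting the column back to the dtype of its categories (object here) gives
# Arrow a plain string array, matching the non-Arrow schema above.
print(pa.Array.from_pandas(s.astype(s.cat.categories.dtype)).type)
# string
{noformat}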


