[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118189#comment-17118189 ]
Apache Spark commented on SPARK-25351: -------------------------------------- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28659 > Handle Pandas category type when converting from Python with Arrow > ------------------------------------------------------------------ > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 2.3.1 > Reporter: Bryan Cutler > Assignee: Jalpan Randeri > Priority: Major > Labels: bulk-closed > Fix For: 3.1.0 > > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: > A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > B category > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) > 1667 spark_type = ArrayType(from_arrow_type(at.value_type)) > 1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) > 1670 return spark_type > 1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary<values=string, indices=int8, ordered=0> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org