[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624323#comment-16624323 ]
Pooja Gadige commented on SPARK-25351: -------------------------------------- Hello! I'm working on this issue. Should we handle the category type of Spark elements without Arrow as `string`? If that is not case, I'll appreciate if you could please guide me in the right direction. Thank you. > Handle Pandas category type when converting from Python with Arrow > ------------------------------------------------------------------ > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 2.3.1 > Reporter: Bryan Cutler > Priority: Major > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: > A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > B category > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) > 1667 spark_type = ArrayType(from_arrow_type(at.value_type)) > 1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) > 1670 return spark_type > 1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary<values=string, indices=int8, ordered=0> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org