[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629572#comment-16629572 ]
Bryan Cutler commented on SPARK-25351: -------------------------------------- Hi [~pgadige], yes please go ahead with this issue! When creating a DataFrame from Pandas without Arrow, category columns are converted into the type of the category. So in the example above, column "A" becomes a string type. The same should be done when Arrow is enabled, so we end up with the same Spark DataFrame. If you are able to, we also need to see how this affects pandas_udfs too. Thanks! > Handle Pandas category type when converting from Python with Arrow > ------------------------------------------------------------------ > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 2.3.1 > Reporter: Bryan Cutler > Priority: Major > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: > A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > B category > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) > 1667 spark_type = ArrayType(from_arrow_type(at.value_type)) > 1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) > 1670 return spark_type > 1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary<values=string, indices=int8, ordered=0> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org