[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27995:
------------------------------------

    Assignee: Apache Spark

> Note the difference between str of Python 2 and 3 at Arrow optimized toPandas
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-27995
>                 URL: https://issues.apache.org/jira/browse/SPARK-27995
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Minor
>
> When Arrow optimization is enabled in Python 2.7,
> {code}
> import pandas
> pdf = pandas.DataFrame(["test1", "test2"])
> df = spark.createDataFrame(pdf)
> df.show()
> {code}
> I got the following output:
> {code}
> +----------------+
> |               0|
> +----------------+
> |[74 65 73 74 31]|
> |[74 65 73 74 32]|
> +----------------+
> {code}
> This appears to be because Python 2's {{str}} and {{bytes}} are the same type,
> so the output is technically correct:
> {code}
> >>> str == bytes
> True
> >>> isinstance("a", bytes)
> True
> {code}
> 1. Python 2 treats `str` as `bytes`.
> 2. PySpark added special-case code and hacks to recognize `str` as a string
> type.
> 3. PyArrow / Pandas follow Python 2's behaviour here.
> We might have to match the behaviour to PySpark's, but Python 2 is deprecated
> anyway. I think it's better to just note it.
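
For comparison, here is a minimal Python 3 sketch of the distinction described above. It uses only the standard library, so it does not exercise the Spark or Arrow code path. `hex_repr` is a hypothetical helper, not a PySpark API; it only mimics how the binary column was printed in the output above.

```python
# Python 3 only: str and bytes are distinct types, unlike in Python 2.


def hex_repr(value: bytes) -> str:
    """Hypothetical helper: format bytes like the binary column shown above."""
    return "[" + " ".join("%02X" % b for b in value) + "]"


# In Python 2 both of these checks were True; in Python 3 both are False.
str_is_bytes = (str == bytes)
text_is_bytes = isinstance("test1", bytes)

# Encoding the text explicitly yields the bytes that appeared in the output.
encoded = "test1".encode("utf-8")

print(str_is_bytes)       # False
print(text_is_bytes)      # False
print(hex_repr(encoded))  # [74 65 73 74 31]
```

The hex values match the first row of the reported output, which is consistent with the strings having been handled as raw bytes rather than as text.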