Tavis Barr created SPARK-21107:
----------------------------------

             Summary: Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8
                 Key: SPARK-21107
                 URL: https://issues.apache.org/jira/browse/SPARK-21107
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
         Environment: Windows 7 standalone
            Reporter: Tavis Barr
            Priority: Minor

When I create a column name with ISO-8859-1 (or possibly, I suspect, other non-UTF-8) characters in it, they are sometimes converted to UTF-8, sometimes not.

Examples:

>>> df = sc.parallelize([[1,2],[1,4],[2,5],[2,6]]).toDF([u"L\xe0",u"Here"])
>>> df.show()
+---+----+
| Là|Here|
+---+----+
|  1|   2|
|  1|   4|
|  2|   5|
|  2|   6|
+---+----+

>>> df.columns
['L\xc3\xa0', 'Here']
>>> df.select(u'L\xc3\xa0').show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`L\xc3\xa0`' given input columns: [L\xe0, Here];;\n'Project ['L\xc3\xa0]\n+- LogicalRDD [L\xe0#14L, Here#15L]\n"
>>> df.select(u'L\xe0').show()
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+

>>> df.select(u'L\xe0').collect()[0].asDict()
{'L\xc3\xa0': 1}

This does not seem to affect the Scala version:

scala> val df = sc.parallelize(Seq((1,2),(1,4),(2,5),(2,6))).toDF("L\u00e0","Here")
df: org.apache.spark.sql.DataFrame = [Lα: int, Here: int]

scala> df.select("L\u00e0").show()
[...output elided..]
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+

scala> df.columns(0).map(c => c.toInt )
res8: scala.collection.immutable.IndexedSeq[Int] = Vector(76, 224)

[Note that 224 is \u00e0, i.e., the original value]
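
For reference, the mismatch in the Python transcript above can be reproduced without Spark. The sketch below assumes, based on the un-prefixed repr, that df.columns and asDict() hand back UTF-8 encoded byte strings under Python 2, while name resolution works on the original unicode name. The helper to_text is hypothetical and is not part of PySpark.

```python
# -*- coding: utf-8 -*-
# Plain-Python sketch of the name mismatch; no Spark required.

original = u"L\xe0"  # the name passed to toDF(): 'L' followed by U+00E0

# Assumption: df.columns reports the name as UTF-8 bytes, i.e. 'L\xc3\xa0'.
utf8_bytes = original.encode("utf-8")

# Writing u'L\xc3\xa0' treats each byte as its own code point, which is
# the same as decoding the UTF-8 bytes as ISO-8859-1. The result is a
# three-character name that differs from the original two-character one.
misread = utf8_bytes.decode("latin-1")


def to_text(name):
    """Hypothetical helper: return a unicode name for bytes or text input."""
    if isinstance(name, bytes):  # bytes is an alias of str on Python 2
        return name.decode("utf-8")
    return name


print(repr(utf8_bytes), repr(misread), repr(to_text(utf8_bytes)))
```

If that assumption holds, passing to_text(df.columns[0]) to df.select() should hand over the same unicode name that select(u'L\xe0') accepted above. I have not verified this against a 2.2.0 build.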