Tavis Barr created SPARK-21107:
----------------------------------

             Summary: Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8
                 Key: SPARK-21107
                 URL: https://issues.apache.org/jira/browse/SPARK-21107
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
         Environment: Windows 7 standalone
            Reporter: Tavis Barr
            Priority: Minor


When I create column names containing ISO-8859-1 (or possibly, I suspect, other non-UTF-8) characters, they are sometimes converted to UTF-8 and sometimes not.

Examples:
>>> df = sc.parallelize([[1,2],[1,4],[2,5],[2,6]]).toDF([u"L\xe0",u"Here"])
>>> df.show()
+---+----+
| Là|Here|
+---+----+
|  1|   2|
|  1|   4|
|  2|   5|
|  2|   6|
+---+----+

>>> df.columns
['L\xc3\xa0', 'Here']
>>> df.select(u'L\xc3\xa0').show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`L\xc3\xa0`' given input columns: [L\xe0, Here];;\n'Project ['L\xc3\xa0]\n+- LogicalRDD [L\xe0#14L, Here#15L]\n"
>>> df.select(u'L\xe0').show()
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+
>>> df.select(u'L\xe0').collect()[0].asDict()
{'L\xc3\xa0': 1}
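For reference, the byte pair 0xC3 0xA0 seen in the returned names is exactly the UTF-8 encoding of U+00E0 (à), which suggests df.columns and Row.asDict() are handing back UTF-8-encoded byte strings under Python 2 even though the DataFrame was built with a unicode name. A minimal sketch of that round trip (plain Python, no Spark needed):

```python
# U+00E0 (LATIN SMALL LETTER A WITH GRAVE) encodes to the two bytes
# 0xC3 0xA0 under UTF-8 -- the same '\xc3\xa0' seen in df.columns above.
name = u"L\xe0"

encoded = name.encode("utf-8")     # the byte form that comes back from df.columns
decoded = encoded.decode("utf-8")  # recovers the original unicode name

assert encoded == b"L\xc3\xa0"
assert decoded == name
```

This is why df.select(u'L\xc3\xa0') fails: the analyzer compares the UTF-8 bytes (reinterpreted as code points) against the original U+00E0 name and finds no match.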

This does not seem to affect the Scala version:

scala> val df = sc.parallelize(Seq((1,2),(1,4),(2,5),(2,6))).toDF("L\u00e0","Here")
df: org.apache.spark.sql.DataFrame = [Là: int, Here: int]

scala> df.select("L\u00e0").show()
[...output elided..]
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+

scala> df.columns(0).map(c => c.toInt)
res8: scala.collection.immutable.IndexedSeq[Int] = Vector(76, 224)

[Note that 224 is \u00e0, i.e., the original value]
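Until this is resolved, one Python-side workaround seems to be to decode whatever comes back from df.columns before feeding the names back into select(). A hypothetical helper (normalize_columns is not part of the PySpark API, just a sketch of the idea):

```python
def normalize_columns(cols):
    """Decode any UTF-8 byte-string column names back to unicode.

    Hypothetical workaround helper, not part of PySpark: under Python 2,
    df.columns may return UTF-8-encoded byte strings even though the
    analyzer expects the original unicode names.
    """
    return [c.decode("utf-8") if isinstance(c, bytes) else c for c in cols]
```

With this in place, something like df.select(normalize_columns(df.columns)[0]) should resolve the column regardless of which form df.columns happened to return.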




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
