Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ff1cfce18 -> c0bb77132


[SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent.

## What changes were proposed in this pull request?

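**createDataFrame** returns column names of inconsistent types depending on how the schema is passed: in Python 2 they are `str` for a **StructType** schema but `unicode` for a list of names.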
```python
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField(u"col", StringType())])
>>> df1 = spark.createDataFrame([("a",)], schema)
>>> df1.columns # "col" is str
['col']
>>> df2 = spark.createDataFrame([("a",)], [u"col"])
>>> df2.columns # "col" is unicode
[u'col']
```

The reason is that only **StructField** contains the following code.
```
if not isinstance(name, str):
    name = name.encode('utf-8')
```
This PR adds the same logic to **createDataFrame**, applied when the schema is a list of column names, for consistency.
```
if isinstance(schema, list):
    schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
```
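
The normalization above can be tried without a Spark session. The following is only a minimal sketch: `normalize_column_names` is a hypothetical helper name, not part of PySpark, and it simply wraps the list branch from this PR. Under Python 2 the `unicode` name is encoded to `str`; under Python 3 every name is already `str`, so the list passes through unchanged.
```python
def normalize_column_names(schema):
    """Hypothetical helper mirroring the list branch added to createDataFrame."""
    if isinstance(schema, list):
        # Encode any name that is not already `str` (i.e. `unicode` on Python 2).
        return [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
    # Non-list schemas (e.g. a StructType) are returned untouched.
    return schema


names = normalize_column_names(['name', u'age'])
# Every column name should now be a plain `str`.
print(all(isinstance(n, str) for n in names))
```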

## How was this patch tested?

Pass the Jenkins tests (with a new Python test case).

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #13097 from dongjoon-hyun/SPARK-15244.

(cherry picked from commit 0f576a5748244f7e874b925f8d841f1ca238f087)
Signed-off-by: Davies Liu <davies....@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c0bb7713
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c0bb7713
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c0bb7713

Branch: refs/heads/branch-2.0
Commit: c0bb77132b9acac951074fd623892abafeb02512
Parents: ff1cfce
Author: Dongjoon Hyun <dongj...@apache.org>
Authored: Tue May 17 13:05:07 2016 -0700
Committer: Davies Liu <davies....@gmail.com>
Committed: Tue May 17 13:05:17 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/session.py | 2 ++
 python/pyspark/sql/tests.py   | 7 +++++++
 2 files changed, 9 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/c0bb7713/python/pyspark/sql/session.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index ae31435..0781b44 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -465,6 +465,8 @@ class SparkSession(object):
                 return (obj, )
             schema = StructType().add("value", datatype)
         else:
+            if isinstance(schema, list):
+                schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
             prepare = lambda obj: obj
 
         if isinstance(data, RDD):

http://git-wip-us.apache.org/repos/asf/spark/blob/c0bb7713/python/pyspark/sql/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 0c73f58..0977c43 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -228,6 +228,13 @@ class SQLTests(ReusedPySparkTestCase):
         self.assertRaises(AnalysisException, lambda: df.select(df.c).first())
         self.assertRaises(AnalysisException, lambda: df.select(df["c"]).first())
 
+    def test_column_name_encoding(self):
+        """Ensure that created columns has `str` type consistently."""
+        columns = self.spark.createDataFrame([('Alice', 1)], ['name', u'age']).columns
+        self.assertEqual(columns, ['name', 'age'])
+        self.assertTrue(isinstance(columns[0], str))
+        self.assertTrue(isinstance(columns[1], str))
+
     def test_explode(self):
         from pyspark.sql.functions import explode
         d = [Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})]

