Repository: spark Updated Branches: refs/heads/branch-2.0 ff1cfce18 -> c0bb77132
[SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent. ## What changes were proposed in this pull request? **createDataFrame** returns inconsistent types for column names. ```python >>> from pyspark.sql.types import StructType, StructField, StringType >>> schema = StructType([StructField(u"col", StringType())]) >>> df1 = spark.createDataFrame([("a",)], schema) >>> df1.columns # "col" is str ['col'] >>> df2 = spark.createDataFrame([("a",)], [u"col"]) >>> df2.columns # "col" is unicode [u'col'] ``` The reason is only **StructField** has the following code. ``` if not isinstance(name, str): name = name.encode('utf-8') ``` This PR adds the same logic into **createDataFrame** for consistency. ``` if isinstance(schema, list): schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema] ``` ## How was this patch tested? Pass the Jenkins test (with new python doctest) Author: Dongjoon Hyun <dongj...@apache.org> Closes #13097 from dongjoon-hyun/SPARK-15244. (cherry picked from commit 0f576a5748244f7e874b925f8d841f1ca238f087) Signed-off-by: Davies Liu <davies....@gmail.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c0bb7713 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c0bb7713 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c0bb7713 Branch: refs/heads/branch-2.0 Commit: c0bb77132b9acac951074fd623892abafeb02512 Parents: ff1cfce Author: Dongjoon Hyun <dongj...@apache.org> Authored: Tue May 17 13:05:07 2016 -0700 Committer: Davies Liu <davies....@gmail.com> Committed: Tue May 17 13:05:17 2016 -0700 ---------------------------------------------------------------------- python/pyspark/sql/session.py | 2 ++ python/pyspark/sql/tests.py | 7 +++++++ 2 files changed, 9 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/c0bb7713/python/pyspark/sql/session.py ---------------------------------------------------------------------- diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py index ae31435..0781b44 100644 --- a/python/pyspark/sql/session.py +++ b/python/pyspark/sql/session.py @@ -465,6 +465,8 @@ class SparkSession(object): return (obj, ) schema = StructType().add("value", datatype) else: + if isinstance(schema, list): + schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema] prepare = lambda obj: obj if isinstance(data, RDD): http://git-wip-us.apache.org/repos/asf/spark/blob/c0bb7713/python/pyspark/sql/tests.py ---------------------------------------------------------------------- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 0c73f58..0977c43 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -228,6 +228,13 @@ class SQLTests(ReusedPySparkTestCase): self.assertRaises(AnalysisException, lambda: df.select(df.c).first()) self.assertRaises(AnalysisException, lambda: df.select(df["c"]).first()) + def test_column_name_encoding(self): + """Ensure that created columns has `str` type consistently.""" + columns = self.spark.createDataFrame([('Alice', 1)], ['name', u'age']).columns + self.assertEqual(columns, ['name', 'age']) + self.assertTrue(isinstance(columns[0], str)) + self.assertTrue(isinstance(columns[1], str)) + def test_explode(self): from pyspark.sql.functions import explode d = [Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})] --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org