[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559532#comment-16559532 ]
Hyukjin Kwon commented on SPARK-24915: -------------------------------------- Ah, yea, I thought it's a duplicate of it but sounds slightly different. I think we should fix this one, yea, as discussed with [~bryanc]. > Calling SparkSession.createDataFrame with schema can throw exception > -------------------------------------------------------------------- > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 > Reporter: Stephen Spencer > Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. > Code to reproduce the error: > {code:java} > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, StringType, Row > conf = SparkConf().setMaster("local").setAppName("repro") > context = SparkContext(conf=conf) > session = SparkSession(context) > # Construct schema (the order of fields is important) > schema = StructType([ > StructField('field2', StructType([StructField('sub_field', StringType(), > False)]), False), > StructField('field1', StringType(), False), > ]) > # Create data to populate data frame > data = [ > Row(field1="Hello", field2=Row(sub_field='world')) > ] > # Attempt to create the data frame supplying the schema > # this will throw a ValueError > df = session.createDataFrame(data, schema=schema) > df.show(){code} > Running this throws a ValueError > {noformat} > Traceback (most recent call last): > File "schema_bug.py", line 18, in <module> > df = session.createDataFrame(data, schema=schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 691, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in _createFromLocal > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in <listcomp> > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in toInternal > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in <genexpr> > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 439, in toInternal > return self.dataType.toInternal(obj) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 619, in toInternal > raise ValueError("Unexpected tuple %r with StructType" % obj) > ValueError: Unexpected tuple 'Hello' with StructType{noformat} > The problem seems to be here: > https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603 > specifically the bit > {code:java} > zip(self.fields, obj, self._needConversion) > {code} > This zip statement seems to assume that obj and self.fields are ordered in > the same way, so that the elements of obj will correspond to the right fields > in the schema. However this is not true, a Row orders its elements > alphabetically but the fields in the schema are in whatever order they are > specified. In this example field2 is being initialised with the field1 > element 'Hello'. If you re-order the fields in the schema to go (field1, > field2), the given example works without error. > The schema in the repro is specifically designed to elicit the problem, the > fields are out of alphabetical order and one field is a StructType, making > chema._needSerializeAnyField==True . However we encountered this in real use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org