[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019770#comment-17019770 ] Bryan Cutler commented on SPARK-24915:

[~jhereth] since there is already a lot of discussion on that PR, I would leave it open until there is a conclusion on whether or not to patch 2.4. If so, you could then rebase or open a new PR against branch-2.4.

> Calling SparkSession.createDataFrame with schema can throw exception
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
> Reporter: Stephen Spencer
> Priority: Major
>
> There seems to be a bug in PySpark when using a PySpark SQL session to
> create a DataFrame with a pre-defined schema.
> Code to reproduce the error:
> {code:python}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
>
> conf = SparkConf().setMaster("local").setAppName("repro")
> context = SparkContext(conf=conf)
> session = SparkSession(context)
>
> # Construct schema (the order of fields is important)
> schema = StructType([
>     StructField('field2', StructType([StructField('sub_field', StringType(), False)]), False),
>     StructField('field1', StringType(), False),
> ])
>
> # Create data to populate the data frame
> data = [
>     Row(field1="Hello", field2=Row(sub_field='world'))
> ]
>
> # Attempt to create the data frame supplying the schema;
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show()
> {code}
> Running this throws a ValueError:
> {noformat}
> Traceback (most recent call last):
>   File "schema_bug.py", line 18, in <module>
>     df = session.createDataFrame(data, schema=schema)
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", line 691, in createDataFrame
>     rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", line 423, in _createFromLocal
>     data = [schema.toInternal(row) for row in data]
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", line 423, in <listcomp>
>     data = [schema.toInternal(row) for row in data]
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", line 601, in toInternal
>     for f, v, c in zip(self.fields, obj, self._needConversion))
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", line 601, in <genexpr>
>     for f, v, c in zip(self.fields, obj, self._needConversion))
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", line 439, in toInternal
>     return self.dataType.toInternal(obj)
>   File "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", line 619, in toInternal
>     raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType
> {noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically this bit:
> {code:python}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered the
> same way, so that the elements of obj correspond to the right fields in the
> schema. However, this is not true: a Row orders its elements alphabetically,
> but the fields in the schema are in whatever order they are specified. In
> this example, field2 is being initialised with the field1 element 'Hello'.
> If you re-order the fields in the schema to (field1, field2), the given
> example works without error.
> The schema in the repro is specifically designed to elicit the problem: the
> fields are out of alphabetical order and one field is a StructType, making
> schema._needSerializeAnyField == True. However, we encountered this in real
> use.
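> To see the ordering mismatch in isolation, here is a minimal sketch
> (PySpark 2.x behaviour, where a Row built from keyword arguments sorts its
> fields alphabetically by name):
> {code:python}
> from pyspark.sql.types import Row
>
> # Keyword arguments are sorted by field name, regardless of the order they
> # are passed in, so element 0 holds the value of field1, not field2.
> r = Row(field1="Hello", field2=Row(sub_field='world'))
> print(r)     # Row(field1='Hello', field2=Row(sub_field='world'))
> print(r[0])  # 'Hello' -- this is what gets zipped against schema field 'field2'
> {code}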
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014870#comment-17014870 ] Joachim Hereth commented on SPARK-24915:

[~bryanc] Thanks! The PR was against master and can probably be closed. If this will now be a bugfix for 2.4, should I rebase the PR or open a new one? I couldn't find information on how this is handled.
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014719#comment-17014719 ] Bryan Cutler commented on SPARK-24915:

[~jhereth] apologies for closing prematurely, I didn't know there was still some ongoing discussion in the PR. I don't think we can backport SPARK-29748, so I'll reopen this for now.
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013591#comment-17013591 ] Joachim Hereth commented on SPARK-24915:

[~bryanc] Is there any chance that SPARK-29748 will be backported to 2.4? If not, why not apply the bugfix from [https://github.com/apache/spark/pull/26118] to 2.4 instead of keeping this bug for 2.4 users?
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951417#comment-16951417 ] Joachim Hereth commented on SPARK-24915:

This is fixed by [https://github.com/apache/spark/pull/26118]. It's strange that Row is considered a tuple (it also causes the tests to look a bit strange). However, changing the hierarchy seemed a bit too adventurous.
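For context, the tuple relationship is easy to check (a quick sketch using only pyspark.sql.types.Row):

{code:python}
from pyspark.sql.types import Row

# Row subclasses tuple, so tuple-handling code paths (such as the
# StructType.toInternal branch in the traceback above) treat a Row
# like any other tuple.
r = Row(field1="Hello")
print(isinstance(r, tuple))  # True
print(r == ("Hello",))       # True
{code}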
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559533#comment-16559533 ] Stephen Spencer commented on SPARK-24915:

[~bryanc] Thanks a lot for the help, we can work around this problem for now. Looking forward to version 3.0!
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559532#comment-16559532 ] Hyukjin Kwon commented on SPARK-24915:

Ah, yea, I thought it was a duplicate of that one, but it sounds slightly different. I think we should fix this one, yea, as discussed with [~bryanc].
[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556126#comment-16556126 ] Bryan Cutler commented on SPARK-24915:

Hi [~stspencer], I've been trying to fix similar issues, but this is a little different since the StructType makes _needSerializeAnyField == True, as you pointed out. I agree that the current behavior is very confusing and should be fixed, but the related issue had to be pushed back to Spark 3.0 because it causes a behavior change. Hopefully we can improve both these issues. Until then, in case you're not aware, the intended way to define Row data when you care about a specific positioning is like this:

{code:python}
In [10]: MyRow = Row("field2", "field1")

In [11]: data = [
    ...:     MyRow(Row(sub_field='world'), "hello")
    ...: ]

In [12]: df = spark.createDataFrame(data, schema=schema)

In [13]: df.show()
+-------+------+
| field2|field1|
+-------+------+
|[world]| hello|
+-------+------+
{code}

Hope that helps.
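For completeness, the other workaround from the issue description is to declare the schema fields in alphabetical order so they line up with the Row's sorted keyword arguments. A sketch reusing the names from the repro above:

{code:python}
# Same data as the repro, but with the schema fields listed alphabetically
# (field1 before field2) to match the Row's sorted kwargs.
schema = StructType([
    StructField('field1', StringType(), False),
    StructField('field2', StructType([
        StructField('sub_field', StringType(), False),
    ]), False),
])
df = session.createDataFrame(data, schema=schema)
df.show()
{code}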