[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327661#comment-16327661 ]
Apache Spark commented on SPARK-22232:
--------------------------------------

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/20280

> Row objects in pyspark created using the `Row(**kwargs)` syntax do not get serialized/deserialized properly
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22232
>                 URL: https://issues.apache.org/jira/browse/SPARK-22232
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> The fields in a Row object created from a dict (i.e. {{Row(**kwargs)}}) should
> be accessed by field name, not by position, because {{Row.__new__}} sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> # Assumes `sc` is an active SparkContext, e.g. in the pyspark shell.
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     # Putting fields in alphabetical order masks the issue
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print(rdd.toDF(schema).take(2))
> # If we introduce a shuffle we have issues
> print(rdd.repartition(3).toDF(schema).take(2))
> {code}
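
The mismatch described in the report can be illustrated without Spark. What follows is a minimal pure-Python sketch, not PySpark's implementation and not the change in the linked pull request. `make_sorted_row` is a hypothetical stand-in that mimics the alphabetical sorting the report attributes to `Row.__new__`, and `reorder_to_schema` is a hypothetical helper showing how a lookup by field name lines values up with a schema whose field order differs.

```python
# Hypothetical stand-ins for illustration only; none of these names are PySpark APIs.

def make_sorted_row(**kwargs):
    """Mimic the behavior described in the report: fields sorted alphabetically by name."""
    names = sorted(kwargs)
    values = tuple(kwargs[name] for name in names)
    return names, values


def reorder_to_schema(names, values, schema_names):
    """Look each value up by field name so it matches the schema's field order."""
    by_field = dict(zip(names, values))
    return tuple(by_field[name] for name in schema_names)


names, values = make_sorted_row(a="a", c=3.0, b=2)
schema_names = ["a", "c", "b"]  # same field order as the StructType in the report

# Positional pairing puts the int under "c" and the float under "b".
positional = dict(zip(schema_names, values))

# Name-based pairing keeps every value with its own field.
by_name = dict(zip(schema_names, reorder_to_schema(names, values, schema_names)))

print(positional)
print(by_name)
```

In PySpark itself, reading fields by name (for example `row.c` or `row["c"]`) rather than by index follows the same idea.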