[ https://issues.apache.org/jira/browse/SPARK-30941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun reassigned SPARK-30941:
-------------------------------------

    Assignee: Hyukjin Kwon

> PySpark Row can be instantiated with duplicate field names
> ----------------------------------------------------------
>
>                 Key: SPARK-30941
>                 URL: https://issues.apache.org/jira/browse/SPARK-30941
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>         Environment: Ubuntu 18.04
> Python 3.6.8
> Spark 2.4.4 (installed via binary from website)
>            Reporter: David Roher
>            Assignee: Hyukjin Kwon
>            Priority: Major
>              Labels: correctness
>
> It is possible to create a Row that has fields with the same name by
> calling `collect()` after a join. Given that the Row constructor itself
> doesn't allow this, this seems to be undesired behavior.
> This can cause correctness issues because different ways of getting
> values produce different results: {{__getitem__}} returns the leftmost
> value, while {{asDict()}} returns the rightmost value (the former uses an
> index search, the latter builds a dictionary, in which a later duplicate
> key overwrites an earlier one).
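The mismatch described above can be illustrated without a Spark session. The following is a minimal pure-Python sketch, not PySpark's actual implementation: `SketchRow`, its `fields` attribute, and `as_dict()` are hypothetical names introduced here only to contrast an index search over field names with a dictionary built from the same names.

```python
class SketchRow(tuple):
    """Hypothetical stand-in for a row with named fields (not pyspark.sql.Row)."""

    def __new__(cls, fields, values):
        row = super().__new__(cls, values)
        row.fields = list(fields)  # field names, duplicates allowed
        return row

    def __getitem__(self, item):
        if isinstance(item, str):
            # list.index() returns the first match, i.e. the leftmost duplicate.
            return super().__getitem__(self.fields.index(item))
        return super().__getitem__(item)

    def as_dict(self):
        # dict() keeps the last value seen for a repeated key,
        # i.e. the rightmost duplicate.
        return dict(zip(self.fields, self))


row = SketchRow(["a", "b", "b"], [1, 1, 2])
print(row["b"])            # 1
print(row.as_dict()["b"])  # 2
```

Under these assumptions, the two access paths disagree as soon as a field name repeats, which is the same symptom as in the report.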
> {{>>> manual_output_row = Row(a=1, b=1, b=2)}}
> {{  File "<stdin>", line 1}}
> {{SyntaxError: keyword argument repeated}}
> {{>>> input_rows = Row(a=1, b=1), Row(a=1, b=2)}}
> {{>>> df1, df2 = (spark.createDataFrame([r]) for r in input_rows)}}
> {{>>> df3 = df1.join(df2, "a")}}
> {{>>> output_row = df3.collect()[0]}}
> {{>>> output_row}}
> {{Row(a=1, b=1, b=2)}}
> {{>>> output_row["b"]}}
> {{1}}
> {{>>> output_row.asDict()["b"]}}
> {{2}}
> **SPARK 1.6.3**
> {code}
> >>> from pyspark.sql.types import Row
> >>> input_rows = Row(a=1, b=1), Row(a=1, b=2)
> >>> df1, df2 = (sqlContext.createDataFrame([r]) for r in input_rows)
> >>> df3 = df1.join(df2, "a")
> >>> output_row = df3.collect()[0]
> >>> output_row
> Row(a=1, b=1, b=2)
> >>> output_row["b"]
> 1
> >>> output_row.asDict()["b"]
> 2
> >>> sc.version
> u'1.6.3'
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)