Dongjoon Hyun created SPARK-15696:
-------------------------------------

             Summary: Improve `crosstab` to have a consistent column order
                 Key: SPARK-15696
                 URL: https://issues.apache.org/jira/browse/SPARK-15696
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Dongjoon Hyun

Currently, `crosstab` returns its columns in **random order**, because they are obtained with a plain `distinct`. The documentation of `crosstab`, however, shows the result in sorted order, which does not match the implementation.

{code}
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
{code}

This issue explicitly constructs the columns in sorted order to improve the user experience. With this change, the implementation also produces the same result as the documentation.

{code}
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
{code}
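
To illustrate the idea of building the columns in sorted order, here is a minimal sketch using plain Scala collections. It is not Spark's implementation: the object name `SortedCrosstab` and its `crosstab` signature are hypothetical and exist only for this example. It shows that sorting the distinct values before laying out the columns makes the column order deterministic. Rows are kept in first-seen order, which is a simplification of this sketch.

```scala
// Sketch only: plain Scala collections, not Spark's DataFrame code.
// `SortedCrosstab` and `crosstab` are hypothetical names for illustration.
object SortedCrosstab {
  /** Counts (key, value) pairs and returns (sorted column labels, one row of counts per key). */
  def crosstab(pairs: Seq[(Int, String)]): (Seq[String], Seq[(Int, Seq[Long])]) = {
    // Sort the distinct values explicitly so the column order is deterministic.
    val columns = pairs.map(_._2).distinct.sorted

    // Count how many times each (key, value) pair occurs.
    val counts: Map[(Int, String), Long] =
      pairs.groupBy(identity).map { case (pair, occurrences) =>
        pair -> occurrences.size.toLong
      }

    // One row per key (first-seen order); missing combinations count as 0.
    val rows = pairs.map(_._1).distinct.map { key =>
      key -> columns.map(col => counts.getOrElse((key, col), 0L))
    }
    (columns, rows)
  }
}
```

Called with the string-valued pairs from the example above, `SortedCrosstab.crosstab(...)` returns the column labels `a`, `b`, `c` in that order regardless of the order in which the values first appear in the input.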