Dongjoon Hyun created SPARK-15696:
-------------------------------------

             Summary: Improve `crosstab` to have a consistent column order 
                 Key: SPARK-15696
                 URL: https://issues.apache.org/jira/browse/SPARK-15696
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Dongjoon Hyun


Currently, `crosstab` has **random-order** columns because they are obtained 
by just calling `distinct`. The documentation of `crosstab`, however, shows the 
result in sorted order, which differs from the implementation.

{code}
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
{code}

This issue explicitly constructs the columns in sorted order to improve the 
user experience. With this change, the output also matches the documentation.

{code}
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
{code}
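
To illustrate the idea of sorting the distinct values before laying out the columns, here is a minimal plain-Scala sketch. It does not use Spark and is not the actual patch; `CrosstabSketch` and its `crosstab` helper are hypothetical names introduced only for this example, and the rows are simply kept in first-appearance order.

```scala
object CrosstabSketch {
  // Hypothetical helper, not Spark's implementation: counts (key, value) pairs
  // and lays the value columns out in sorted order rather than `distinct` order.
  def crosstab[K, V: Ordering](pairs: Seq[(K, V)]): (Seq[V], Seq[(K, Seq[Int])]) = {
    // Sorting the distinct values is what makes the column order deterministic.
    val columns = pairs.map(_._2).distinct.sorted

    // Number of occurrences of each (key, value) pair.
    val counts = pairs.groupBy(identity).map { case (kv, group) => kv -> group.size }

    // One row per key (first-appearance order), one cell per sorted column.
    val rows = pairs.map(_._1).distinct.map { k =>
      k -> columns.map(v => counts.getOrElse((k, v), 0))
    }
    (columns, rows)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))
    val (columns, rows) = crosstab(data)
    println(("key_value" +: columns.map(_.toString)).mkString("\t"))
    rows.foreach { case (k, cells) =>
      println((k.toString +: cells.map(_.toString)).mkString("\t"))
    }
  }
}
```

With the string-valued sample data from this issue, the sketch yields the columns `a`, `b`, `c` regardless of the order in which the values first appear in the input.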


