Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22030#discussion_r208453178

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -403,20 +415,29 @@ class RelationalGroupedDataset protected[sql](
        *
        * {{{
        *   // Compute the sum of earnings for each year by course with each course as a separate column
    -   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    +   *   df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings")
    +   * }}}
    +   *
    +   * For pivoting by multiple columns, use the `struct` function to combine the columns and values:
    +   *
    +   * {{{
    +   *   df
    +   *     .groupBy($"year")
    +   *     .pivot(struct($"course", $"training"), Seq(struct(lit("java"), lit("Experts"))))
    +   *     .agg(sum($"earnings"))
        * }}}
        *
        * @param pivotColumn the column to pivot.
        * @param values List of values that will be translated to columns in the output DataFrame.
        * @since 2.4.0
        */
    -  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
    +  def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
    --- End diff --

    The fundamental interface we should have is `pivot(Column, Seq[Column])`, which allows any form and any type of pivot column, and likewise for the pivot values. This is close to what we support in SQL (in fact, SQL pivot support will be a subset of DataFrame pivot support once we have this interface), and verifying that the pivot values are constant is taken care of in the Analyzer. That said, we still need to keep the old `pivot(String, Seq[Any])` overload for simple usages and for backward compatibility, but I don't think we need to expand its capability. It is pretty clear that `pivot(String, ...)` takes a column name and simple objects, while with `pivot(Column, ...)` you can make any sophisticated use of pivot you like.
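    To make the two-overload design concrete, here is a minimal, self-contained Scala sketch. It is NOT the real Spark implementation: `Column`, `lit`, `struct`, and `GroupedData` below are toy stand-ins that only render call descriptions as strings, purely to show how a simple `pivot(String, Seq[Any])` overload can coexist with the general `pivot(Column, Seq[Column])` one and how callers would pick between them.

    ```scala
    // Toy stand-in for Spark's Column: just carries an expression string.
    case class Column(expr: String)

    // Toy stand-ins for org.apache.spark.sql.functions.lit / struct.
    def lit(v: Any): Column = Column(s"lit($v)")
    def struct(cols: Column*): Column =
      Column(cols.map(_.expr).mkString("struct(", ", ", ")"))

    class GroupedData {
      // Simple overload: a column name plus plain values,
      // kept for backward compatibility and easy use.
      def pivot(pivotColumn: String, values: Seq[Any]): String =
        s"pivot($pivotColumn, ${values.mkString("[", ", ", "]")})"

      // General overload: any Column expression for both the pivot
      // column and the pivot values (e.g. struct(...) for multi-column pivot).
      def pivot(pivotColumn: Column, values: Seq[Column]): String =
        s"pivot(${pivotColumn.expr}, ${values.map(_.expr).mkString("[", ", ", "]")})"
    }
    ```

    Overload resolution is unambiguous because `String` and `Column` are unrelated types, so `g.pivot("course", Seq("dotNET", "Java"))` hits the simple form while `g.pivot(struct(...), Seq(struct(lit(...), lit(...))))` hits the general one, mirroring the split the comment argues for.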