[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/22316

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r221428797

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -330,6 +330,15 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy("year").pivot("course").sum("earnings")
    * }}}
    *
+   * From Spark 2.5.0, values can be literal columns, for instance, struct. For pivoting by
+   * multiple columns, use the `struct` function to combine the columns and values:
+   *
+   * {{{
+   * df.groupBy($"year")

--- End diff --

We can; it's just to match the examples above except for the difference. Really not a big deal at all.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r221427994

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -330,6 +330,15 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy("year").pivot("course").sum("earnings")
    * }}}
    *
+   * From Spark 2.5.0, values can be literal columns, for instance, struct. For pivoting by
+   * multiple columns, use the `struct` function to combine the columns and values:
+   *
+   * {{{
+   * df.groupBy($"year")

--- End diff --

Why can't the grouping be by the `Column` type?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r221427775

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -330,6 +330,15 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy("year").pivot("course").sum("earnings")
    * }}}
    *
+   * From Spark 2.5.0, values can be literal columns, for instance, struct. For pivoting by
+   * multiple columns, use the `struct` function to combine the columns and values:
+   *
+   * {{{
+   * df.groupBy($"year")

--- End diff --

nit: `$"year"` -> `"year"`
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r219686833

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

That's true in general, but specifically: is the decimal precision more correct?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r219368724

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

Now we eventually call `Literal.create` instead of `Literal.apply`. I'm not sure if there is a behavior change, though.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r219368791

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

From a quick look, it seems `Literal.create` is more powerful and should not cause regressions.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r219368334

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -330,6 +331,15 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy("year").pivot("course").sum("earnings")
    * }}}
    *
+   * From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by

--- End diff --

3.0.0 => 2.5.0
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r217476618

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

> Looks at least there's one case of a potential behaviour change about scale and precision.

Could you explain, please, why you expect a behavior change?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r217251795

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

@MaxGekk, just to be doubly, doubly sure: shall we use `Try(...).getOrElse(lit(...).expr)`? It looks like there's at least one case of a potential behaviour change about scale and precision.
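The `Try(...).getOrElse(...)` fallback suggested above can be sketched without Spark on the classpath. The two converters below are hypothetical stand-ins for the strict legacy path (`Literal.apply`, which rejects unknown types) and the permissive new path (`lit(...).expr` / `Literal.create`); they are not Spark's actual API, only an illustration of the pattern.

```scala
import scala.util.Try

// Stand-in for the legacy Literal.apply: strict, throws on unsupported types.
def legacyConvert(v: Any): String = v match {
  case i: Int    => s"int($i)"
  case s: String => s"str($s)"
  case other     => throw new RuntimeException(s"Unsupported literal type: $other")
}

// Stand-in for the permissive lit(...)/Literal.create path.
def newConvert(v: Any): String = s"lit($v)"

// The suggested fallback: keep the legacy behavior for types it already
// handles, and use the new path only where the legacy one throws.
def toExpr(v: Any): String = Try(legacyConvert(v)).getOrElse(newConvert(v))
```

With this shape, previously supported values keep their old conversion (avoiding, e.g., a decimal scale/precision change), while new value shapes such as tuples/structs take the new path.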
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r216122957

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -330,6 +331,15 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy("year").pivot("course").sum("earnings")
    * }}}
    *
+   * From Spark 2.4.0, values can be literal columns, for instance, struct. For pivoting by

--- End diff --

Let's target 3.0.0 @MaxGekk.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214894498

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---

@@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext {
     assert(exception.getMessage.contains("aggregate functions are not allowed"))
   }
+
+  test("pivoting column list with values") {
+    val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) :: Nil
+    val df = trainingSales
+      .groupBy($"sales.year")
+      .pivot(struct(lower($"sales.course"), $"training"), Seq(
+        struct(lit("dotnet"), lit("Experts")),
+        struct(lit("java"), lit("Dummies")))
+      ).agg(sum($"sales.earnings"))
+
+    checkAnswer(df, expected)
+  }
+
+  test("pivoting column list") {
+    val exception = intercept[RuntimeException] {
+      trainingSales
+        .groupBy($"sales.year")
+        .pivot(struct(lower($"sales.course"), $"training"))
+        .agg(sum($"sales.earnings"))
+        .collect()

--- End diff --

I think invalid queries basically throw `AnalysisException`. But, yeah, indeed, we'd better keep the current behaviour. Thanks!
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214842133

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---

@@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext {
     assert(exception.getMessage.contains("aggregate functions are not allowed"))
   }
+
+  test("pivoting column list with values") {
+    val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) :: Nil
+    val df = trainingSales
+      .groupBy($"sales.year")
+      .pivot(struct(lower($"sales.course"), $"training"), Seq(
+        struct(lit("dotnet"), lit("Experts")),
+        struct(lit("java"), lit("Dummies")))
+      ).agg(sum($"sales.earnings"))
+
+    checkAnswer(df, expected)
+  }
+
+  test("pivoting column list") {
+    val exception = intercept[RuntimeException] {
+      trainingSales
+        .groupBy($"sales.year")
+        .pivot(struct(lower($"sales.course"), $"training"))
+        .agg(sum($"sales.earnings"))
+        .collect()

--- End diff --

> I miss something?

No, you don't. The exception is definitely thrown inside `lit`, because `collect()` returns a complex value which cannot be "wrapped" by `lit`. This is exactly what is checked in the test I added to show the existing behavior.

> btw, IMHO AnalysisException is better than RuntimeException in this case?

@maropu Could you explain, please, why you think `AnalysisException` is better for an error that occurs at run time? Just in case: in this PR, I don't aim to change the behavior of the existing method `def pivot(pivotColumn: Column): RelationalGroupedDataset`. I believe that should be discussed separately, since it changes user-visible behavior. The PR aims to improve `def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset` to allow users to specify `struct` literals in particular. Please see the description.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214761811

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---

@@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext {
     assert(exception.getMessage.contains("aggregate functions are not allowed"))
   }
+
+  test("pivoting column list with values") {
+    val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) :: Nil
+    val df = trainingSales
+      .groupBy($"sales.year")
+      .pivot(struct(lower($"sales.course"), $"training"), Seq(
+        struct(lit("dotnet"), lit("Experts")),
+        struct(lit("java"), lit("Dummies")))
+      ).agg(sum($"sales.earnings"))
+
+    checkAnswer(df, expected)
+  }
+
+  test("pivoting column list") {
+    val exception = intercept[RuntimeException] {
+      trainingSales
+        .groupBy($"sales.year")
+        .pivot(struct(lower($"sales.course"), $"training"))
+        .agg(sum($"sales.earnings"))
+        .collect()

--- End diff --

I tried it in your branch:

```
scala> df.show
+--------+--------------------+
|training|               sales|
+--------+--------------------+
| Experts|[dotNET, 2012, 10...|
| Experts|[JAVA, 2012, 2000...|
| Dummies|[dotNet, 2012, 50...|
| Experts|[dotNET, 2013, 48...|
| Dummies|[Java, 2013, 3000...|
+--------+--------------------+

scala> df.groupBy($"sales.year").pivot(struct(lower($"sales.course"), $"training")).agg(sum($"sales.earnings"))
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [dotnet,Dummies]
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
```

Am I missing something?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214754379

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

I don't see any advantage to this. It is longer and slower.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user jaceklaskowski commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214752855

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
     new RelationalGroupedDataset(
       df,
       groupingExprs,
-      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))

--- End diff --

What do you think about `map(lit).map(_.expr)` instead?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214722485

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---

@@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext {
     assert(exception.getMessage.contains("aggregate functions are not allowed"))
   }
+
+  test("pivoting column list with values") {
+    val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) :: Nil
+    val df = trainingSales
+      .groupBy($"sales.year")
+      .pivot(struct(lower($"sales.course"), $"training"), Seq(
+        struct(lit("dotnet"), lit("Experts")),
+        struct(lit("java"), lit("Dummies")))
+      ).agg(sum($"sales.earnings"))
+
+    checkAnswer(df, expected)
+  }
+
+  test("pivoting column list") {
+    val exception = intercept[RuntimeException] {
+      trainingSales
+        .groupBy($"sales.year")
+        .pivot(struct(lower($"sales.course"), $"training"))
+        .agg(sum($"sales.earnings"))
+        .collect()

--- End diff --

My changes don't throw the exception. It is thrown in `collect()`:
https://github.com/apache/spark/blob/41c2227a2318029709553a588e44dee28f106350/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L385

@maropu Do you propose to catch `RuntimeException` and replace it with `AnalysisException`?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214566503

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---

@@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext {
     assert(exception.getMessage.contains("aggregate functions are not allowed"))
   }
+
+  test("pivoting column list with values") {
+    val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) :: Nil
+    val df = trainingSales
+      .groupBy($"sales.year")
+      .pivot(struct(lower($"sales.course"), $"training"), Seq(
+        struct(lit("dotnet"), lit("Experts")),
+        struct(lit("java"), lit("Dummies")))
+      ).agg(sum($"sales.earnings"))
+
+    checkAnswer(df, expected)
+  }
+
+  test("pivoting column list") {
+    val exception = intercept[RuntimeException] {
+      trainingSales
+        .groupBy($"sales.year")
+        .pivot(struct(lower($"sales.course"), $"training"))
+        .agg(sum($"sales.earnings"))
+        .collect()

--- End diff --

Don't we need this `.collect()` to catch the `RuntimeException`? Btw, IMHO, `AnalysisException` is better than `RuntimeException` in this case. Can't we use it?
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214566083

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -406,6 +407,14 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
+   * For pivoting by multiple columns, use the `struct` function to combine the columns and values:

--- End diff --

Since the documentation states it's an overloaded version of "the `pivot` method with `pivotColumn` of the `String` type", shall we move this content to that method? Also, I would document it as, for instance:

From Spark 2.4.0, values can be literal columns, for instance, `struct`. For pivoting by multiple columns, use the `struct` function to combine the columns and values.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22316#discussion_r214544896

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---

@@ -406,6 +407,15 @@ class RelationalGroupedDataset protected[sql](
    * df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
+   * For pivoting by multiple columns, use the `struct` function to combine the columns and values:
+   *
+   * {{{
+   * df
+   *   .groupBy($"year")

--- End diff --

I would make this line up
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22316

[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java

## What changes were proposed in this pull request?

In the PR, I propose to extend the implementation of the existing method:

```
def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
```

to support values of the struct type. This allows pivoting by multiple columns combined by `struct`:

```
trainingSales
  .groupBy($"sales.year")
  .pivot(
    pivotColumn = struct(lower($"sales.course"), $"training"),
    values = Seq(
      struct(lit("dotnet"), lit("Experts")),
      struct(lit("java"), lit("Dummies")))
  ).agg(sum($"sales.earnings"))
```

## How was this patch tested?

Added a test for values specified via `struct` in Java and Scala.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 pivoting-by-multiple-columns2

Alternatively, you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22316.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22316

commit 058072544fdd606392a57615119bb55dff5345c0
Author: Maxim Gekk
Date: 2018-09-02T12:24:20Z

    Support columns as values

commit 1221db39b75a9b9bd4fbc6144150283d9c24e9d5
Author: Maxim Gekk
Date: 2018-09-02T13:14:55Z

    Added a test for the case when values are not specified

commit a097b294854f99ec58ca307d85c19e54cd76d6b8
Author: Maxim Gekk
Date: 2018-09-02T13:19:14Z

    Added a test for Java
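The semantics the PR description sketches can be illustrated without Spark: pivoting by multiple columns is pivoting by the pair (struct) of their values. The plain-Scala sketch below mirrors `groupBy($"sales.year").pivot(struct(lower($"sales.course"), $"training"), values).agg(sum(...))` with ordinary collections; the `Sale` rows and earnings figures here are hypothetical, not the suite's `trainingSales` data.

```scala
// A Spark-free sketch of pivoting by two columns combined into a pair.
case class Sale(year: Int, course: String, training: String, earnings: Double)

val sales = Seq(
  Sale(2012, "dotNET", "Experts", 10000.0),
  Sale(2012, "dotNet", "Dummies", 5000.0),
  Sale(2013, "dotNET", "Experts", 48000.0),
  Sale(2013, "Java", "Dummies", 30000.0))

// The requested pivot values, mirroring
// Seq(struct(lit("dotnet"), lit("Experts")), struct(lit("java"), lit("Dummies"))).
val pivotValues = Seq(("dotnet", "Experts"), ("java", "Dummies"))

// groupBy(year), then one output column per requested (course, training) pair,
// aggregated with sum; None plays the role of Spark's null cell.
val pivoted: Map[Int, Seq[Option[Double]]] =
  sales.groupBy(_.year).map { case (year, rows) =>
    year -> pivotValues.map { case (course, training) =>
      val hits = rows.filter(r => r.course.toLowerCase == course && r.training == training)
      if (hits.isEmpty) None else Some(hits.map(_.earnings).sum)
    }
  }
```

Each `(course, training)` pair becomes one pivoted column, which is exactly why a single `struct` pivot column is enough to express pivoting by multiple columns.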