[ https://issues.apache.org/jira/browse/SPARK-26752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guilherme Beltramini updated SPARK-26752:
-----------------------------------------

Description:

The agg function in [org.apache.spark.sql.RelationalGroupedDataset|https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset] accepts as input:
* Column*
* Map[String, String]
* (String, String)*

I'm proposing to add Map[String, Seq[String]], where each key is a column to aggregate and the corresponding value is the sequence of aggregation functions to apply to that column. The existing Map[String, String] variant cannot express this, because a Map cannot repeat the same column key. Here is a similar question: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-multiple-agg-on-the-same-column-td29541.html

In the example below (running in spark-shell, with Spark 2.4.0), I'm showing a workaround. What I'm proposing is that agg should accept aggMap directly as input:

{code:java}
scala> val df = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 10), ("b", 20), ("c", 100)).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]

scala> df.show
+----+----+
|col1|col2|
+----+----+
|   a|   1|
|   a|   2|
|   a|   3|
|   a|   4|
|   b|  10|
|   b|  20|
|   c| 100|
+----+----+

scala> val aggMap = Map("col1" -> Seq("count"), "col2" -> Seq("min", "max", "mean"))
aggMap: scala.collection.immutable.Map[String,Seq[String]] = Map(col1 -> List(count), col2 -> List(min, max, mean))

scala> val aggSeq = aggMap.toSeq.flatMap { case (c: String, fns: Seq[String]) => Seq(c).zipAll(fns, c, "") }
aggSeq: Seq[(String, String)] = ArrayBuffer((col1,count), (col2,min), (col2,max), (col2,mean))

scala> val dfAgg = df.groupBy("col1").agg(aggSeq.head, aggSeq.tail: _*)
dfAgg: org.apache.spark.sql.DataFrame = [col1: string, count(col1): bigint ... 3 more fields]

scala> dfAgg.orderBy("col1").show
+----+-----------+---------+---------+---------+
|col1|count(col1)|min(col2)|max(col2)|avg(col2)|
+----+-----------+---------+---------+---------+
|   a|          4|        1|        4|      2.5|
|   b|          2|       10|       20|     15.0|
|   c|          1|      100|      100|    100.0|
+----+-----------+---------+---------+---------+
{code}
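For reference, the proposed overload can be approximated in user code today with a small extension method. The sketch below is a minimal illustration, not part of Spark's API: the names MultiAggSyntax, RichGroupedDataset, and aggAll are hypothetical. It flattens the Map[String, Seq[String]] into the (String, String) pairs that agg already accepts:

{code:java}
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset}

// Hypothetical helper (not part of Spark): adds aggAll, which takes
// Map[String, Seq[String]] and delegates to the existing
// agg(aggExpr: (String, String), aggExprs: (String, String)*) overload.
object MultiAggSyntax {
  implicit class RichGroupedDataset(private val rgd: RelationalGroupedDataset) extends AnyVal {
    def aggAll(aggMap: Map[String, Seq[String]]): DataFrame = {
      // Expand each column -> functions entry into (column, function) pairs,
      // e.g. "col2" -> Seq("min", "max") becomes ("col2","min"), ("col2","max").
      val aggSeq: Seq[(String, String)] =
        aggMap.toSeq.flatMap { case (col, fns) => fns.map(fn => col -> fn) }
      require(aggSeq.nonEmpty, "aggMap must contain at least one aggregation")
      rgd.agg(aggSeq.head, aggSeq.tail: _*)
    }
  }
}
{code}

With this in scope, the workaround above collapses to a single call:

{code:java}
import MultiAggSyntax._
val dfAgg = df.groupBy("col1").aggAll(Map("col1" -> Seq("count"), "col2" -> Seq("min", "max", "mean")))
{code}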
> Multiple aggregate methods in the same column in DataFrame
> ----------------------------------------------------------
>
>                 Key: SPARK-26752
>                 URL: https://issues.apache.org/jira/browse/SPARK-26752
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Guilherme Beltramini
>            Priority: Minor