RE: Spark SQL running totals
Thanks, Deenar. This works perfectly. I can't test the solution with window functions because I am still on Spark 1.3.1; hopefully we will move to 1.5 soon.

Stefan Panayotov
Sent from my Windows Phone

From: Deenar Toraskar<mailto:deenar.toras...@gmail.com>
Sent: 10/15/2015 2:35 PM
To: Stefan Panayotov<mailto:spanayo...@msn.com>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark SQL running totals
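Since Stefan is on Spark 1.3.1 (no window functions), the running total can still be computed RDD-style, as in the blog post Deenar links: sum each partition first, then offset each partition's local cumulative sum by the total of all earlier partitions. A minimal pure-Python sketch of that idea (plain lists stand in for ordered RDD partitions; the function name is illustrative, not Spark API):

```python
from itertools import accumulate

def running_total(partitions):
    """Running sum across ordered partitions: collect each partition's
    total, then offset each local cumulative sum by the sum of all
    earlier partitions (one small driver-side pass over the totals)."""
    totals = [sum(p) for p in partitions]
    # offsets[i] = sum of partitions 0..i-1
    offsets = [0] + list(accumulate(totals))[:-1]
    result = []
    for off, part in zip(offsets, partitions):
        result.extend(off + c for c in accumulate(part))
    return result

# col_2 values from the thread, split across two "partitions" ordered by col_1
print(running_total([[10, 30], [15, 20, 25]]))  # [10, 40, 55, 75, 100]
```

In Spark this corresponds to one pass to collect per-partition sums and one mapPartitionsWithIndex pass to add the offsets, avoiding any shuffle.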
RE: Spark SQL running totals
Thanks to all of you for the helpful suggestions. I'll try these first thing tomorrow morning.

Stefan Panayotov
Sent from my Windows Phone

From: java8964<mailto:java8...@hotmail.com>
Sent: 10/15/2015 4:30 PM
To: Michael Armbrust<mailto:mich...@databricks.com>; Deenar Toraskar<mailto:deenar.toras...@gmail.com>
Cc: Stefan Panayotov<mailto:spanayo...@msn.com>; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: Spark SQL running totals
RE: Spark SQL running totals
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported, so a cumulative sum should work then.

Thanks,
Yong

From: java8...@hotmail.com
To: mich...@databricks.com; deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org
Subject: RE: Spark SQL running totals
Date: Thu, 15 Oct 2015 16:24:39 -0400
RE: Spark SQL running totals
Not sure the window function can work for his case. If you do a "sum() over (partition by)", that will return a total sum per partition, instead of the cumulative sum wanted here. I saw there is a "cume_dist", but no "cume_sum". Do we really have a "cume_sum" in Spark window functions, or have I totally misunderstood "sum() over (partition by)"?

Yong

From: mich...@databricks.com
Date: Thu, 15 Oct 2015 11:51:59 -0700
Subject: Re: Spark SQL running totals
To: deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org

Check out: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Re: Spark SQL running totals
You can do it, and many other transformations, very easily with window functions; see this blog post: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

In your case you would do (in Scala):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{functions => func}

val wo = Window.orderBy("col1")
val newdf = df.withColumn("col3", func.sum("col2").over(wo))

The true power of this is when you have to do a running total by a particular, say, user (in a different column)! :)

HTH,
Kristina
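The cumulative-sum semantics of an ordered window can be checked with any SQL engine that supports standard window functions. Here SQLite (3.25+) stands in for Spark SQL, since `sum(col2) over (order by col1)` is the same standard SQL: with only an ORDER BY, the window frame defaults to everything from unbounded preceding up to the current row, which is exactly the running total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tablea (col1 int, col2 int)")
conn.executemany("insert into tablea values (?, ?)",
                 [(1, 10), (2, 30), (3, 15), (4, 20), (5, 25)])

# Cumulative sum via a window ordered by col1 (standard SQL; Spark SQL
# accepts the same query text)
rows = conn.execute(
    "select col1, col2, sum(col2) over (order by col1) as col3 "
    "from tablea order by col1").fetchall()
print(rows)  # [(1, 10, 10), (2, 30, 40), (3, 15, 55), (4, 20, 75), (5, 25, 100)]
```

Note that in Spark, a window with an ORDER BY but no PARTITION BY pulls all rows into a single partition, so this form is best kept for small data or combined with a partitioning column.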
Re: Spark SQL running totals
Check out: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Re: Spark SQL running totals
You can do a self join of the table with itself, with the join clause being a.col1 >= b.col1:

select a.col1, a.col2, sum(b.col2)
from tablea as a left outer join tablea as b on (a.col1 >= b.col1)
group by a.col1, a.col2

I haven't tried it, but I can't see why it wouldn't work. Doing it on the RDD might be more efficient, though; see https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
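The self-join above can be sanity-checked on any SQL engine; a small SQLite sketch (standing in for Spark SQL, which accepts the same query) reproduces the expected running totals on the thread's sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tablea (col1 int, col2 int)")
conn.executemany("insert into tablea values (?, ?)",
                 [(1, 10), (2, 30), (3, 15), (4, 20), (5, 25)])

# Each row a joins to every row b with b.col1 <= a.col1, so summing
# b.col2 per group yields the running total. This materializes O(n^2)
# joined rows, hence the note that an RDD-based pass may be cheaper.
rows = conn.execute(
    "select a.col1, a.col2, sum(b.col2) as col3 "
    "from tablea as a left outer join tablea as b on (a.col1 >= b.col1) "
    "group by a.col1, a.col2 order by a.col1").fetchall()
print(rows)  # [(1, 10, 10), (2, 30, 40), (3, 15, 55), (4, 20, 75), (5, 25, 100)]
```

Note the approach assumes col1 values are distinct; ties in col1 would be summed into each other's totals.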
Spark SQL running totals
Hi,

I need help with Spark SQL. I need to achieve something like the following.
If I have data like:

col_1  col_2
1      10
2      30
3      15
4      20
5      25

I need to get col_3 to be the running total of the sum of the previous rows of col_2, e.g.:

col_1  col_2  col_3
1      10     10
2      30     40
3      15     55
4      20     75
5      25     100

Is there a way to achieve this in Spark SQL or maybe with DataFrame transformations?

Thanks in advance,

Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net