Thanks to all of you guys for the helpful suggestions. I'll try these first thing tomorrow morning.
Stefan Panayotov

Sent from my Windows Phone

________________________________
From: java8964 <java8...@hotmail.com>
Sent: 10/15/2015 4:30 PM
To: Michael Armbrust <mich...@databricks.com>; Deenar Toraskar <deenar.toras...@gmail.com>
Cc: Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
Subject: RE: Spark SQL running totals

My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported, so a cumulative sum should work then.

Thanks

Yong

________________________________
From: java8...@hotmail.com
To: mich...@databricks.com; deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org
Subject: RE: Spark SQL running totals
Date: Thu, 15 Oct 2015 16:24:39 -0400

Not sure the window functions can work for his case. If you do a "sum() over (partition by)", that will return a total sum per partition, instead of the cumulative sum wanted in this case. I saw there is a "cume_dist", but no "cume_sum". Do we really have a "cume_sum" in Spark window functions, or have I totally misunderstood "sum() over (partition by)"?

Yong

________________________________
From: mich...@databricks.com
Date: Thu, 15 Oct 2015 11:51:59 -0700
Subject: Re: Spark SQL running totals
To: deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org

Check out: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar <deenar.toras...@gmail.com> wrote:

You can do a self-join of the table with itself, with the join clause being a.col1 >= b.col1:

select a.col1, a.col2, sum(b.col2)
from tablea as a
left outer join tablea as b on (a.col1 >= b.col1)
group by a.col1, a.col2

I haven't tried it and can't see why it wouldn't work, but doing it on the RDD might be more efficient; see https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/

On 15 October 2015 at 18:48, Stefan Panayotov <spanayo...@msn.com> wrote:

Hi,

I need help with Spark SQL. I need to achieve something like the following.
If I have data like:

col_1  col_2
1      10
2      30
3      15
4      20
5      25

I need to get col_3 to be the running total of the sum of the previous rows of col_2, e.g.:

col_1  col_2  col_3
1      10     10
2      30     40
3      15     55
4      20     75
5      25     100

Is there a way to achieve this in Spark SQL, or maybe with DataFrame transformations?

Thanks in advance,

Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net
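For reference, a minimal sketch of the cumulative sum using a window function, along the lines of the blog post Michael linked. The table name tablea is borrowed from Deenar's query and is an assumption; in Spark 1.4/1.5 window functions require a HiveContext:

-- Running total of col_2, ordered by col_1.
-- Assumes the data is registered as a table named "tablea" (hypothetical name
-- taken from Deenar's self-join example).
SELECT col_1,
       col_2,
       SUM(col_2) OVER (ORDER BY col_1
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col_3
FROM tablea

Note that with no PARTITION BY clause, Spark moves all rows into a single partition to evaluate the frame (and warns about it), so on large data you would normally partition by some grouping key. The self-join above produces the same result but compares O(n^2) row pairs, which the window function avoids.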