Thanks Deenar.
This works perfectly.
I can't test the window-function solution because I am still on Spark 1.3.1.
Hopefully we will move to 1.5 soon.
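Once we are on 1.4+, I believe something like this should work with the
DataFrame API (untested sketch; df stands for a DataFrame holding the
col_1/col_2 data from my example below, and as I understand it window
functions still need a HiveContext in 1.4/1.5):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running total of col_2 over rows ordered by col_1,
// from the first row up to and including the current row.
val w = Window.orderBy("col_1").rowsBetween(Long.MinValue, 0)
val result = df.withColumn("col_3", sum("col_2").over(w))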

Stefan Panayotov
Sent from my Windows Phone
________________________________
From: Deenar Toraskar <deenar.toras...@gmail.com>
Sent: 10/15/2015 2:35 PM
To: Stefan Panayotov <spanayo...@msn.com>
Cc: user@spark.apache.org
Subject: Re: Spark SQL running totals

You can do a self-join of the table, with the join clause being
a.col1 >= b.col1:

select a.col1, a.col2, sum(b.col2) as col_3
from tablea as a left outer join tablea as b on (a.col1 >= b.col1)
group by a.col1, a.col2
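
If you have the data in a DataFrame, you can register it as a temp table and
run the query on 1.3.x, since this needs no window functions (untested sketch;
df and tablea are placeholder names):

// Spark 1.3.x: plain SQL over a registered temp table.
df.registerTempTable("tablea")
val result = sqlContext.sql("""
  select a.col1, a.col2, sum(b.col2) as col_3
  from tablea a left outer join tablea b on (a.col1 >= b.col1)
  group by a.col1, a.col2""")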

I haven't tried it, but I can't see why it wouldn't work. Doing it with RDDs
might be more efficient; see
https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
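
Roughly, an RDD version could look like this (again untested; sc is the
SparkContext and the input is an RDD of (col1, col2) pairs):

// Example data, keyed by col1 so rows are ordered across partitions after the sort.
val data = sc.parallelize(Seq((1, 10.0), (2, 30.0), (3, 15.0), (4, 20.0), (5, 25.0)))
val sorted = data.sortByKey()

// Total of col2 within each partition, collected to the driver (one per partition).
val partSums = sorted
  .mapPartitions(it => Iterator(it.map(_._2).sum), preservesPartitioning = true)
  .collect()

// Offset for partition i = sum of the totals of all earlier partitions.
val offsets = partSums.scanLeft(0.0)(_ + _)

// Walk each partition, carrying the running total forward from its offset.
val runningTotals = sorted.mapPartitionsWithIndex { (i, it) =>
  var acc = offsets(i)
  it.map { case (k, v) => acc += v; (k, v, acc) }
}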

On 15 October 2015 at 18:48, Stefan Panayotov <spanayo...@msn.com> wrote:

> Hi,
>
> I need help with Spark SQL to achieve something like the following.
> If I have data like:
>
> col_1  col_2
> 1      10
> 2      30
> 3      15
> 4      20
> 5      25
>
> I need col_3 to be the running total of col_2, i.e. the sum of the current
> row and all previous rows, e.g.:
>
> col_1  col_2  col_3
> 1      10     10
> 2      30     40
> 3      15     55
> 4      20     75
> 5      25     100
>
> Is there a way to achieve this in Spark SQL, or maybe with DataFrame
> transformations?
>
> Thanks in advance,
>
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>
>
