RE: Spark SQL running totals

2015-10-16 Thread Stefan Panayotov
Thanks Deenar.
This works perfectly.
I can't test the window-function solution because I am still on Spark 1.3.1.
Hopefully I will move to 1.5 soon.

Stefan Panayotov
Sent from my Windows Phone


RE: Spark SQL running totals

2015-10-15 Thread Stefan Panayotov
Thanks to all of you guys for the helpful suggestions. I'll try these first 
thing tomorrow morning.

Stefan Panayotov
Sent from my Windows Phone


RE: Spark SQL running totals

2015-10-15 Thread java8964
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported.
So a cumulative sum should work then.
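Something along these lines (untested; assuming Spark 1.4+ with a HiveContext,
and the data registered as a table named tablea) should give the running total:

SELECT col_1, col_2,
       SUM(col_2) OVER (ORDER BY col_1
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col_3
FROM tablea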
Thanks
Yong


RE: Spark SQL running totals

2015-10-15 Thread java8964
Not sure the window functions can work for his case.
If you do a "sum() over (partition by)", that will return a total sum per
partition, instead of the cumulative sum wanted in this case.
I saw there is a "cume_dist", but no "cume_sum".
Do we really have a "cume_sum" among Spark's window functions, or am I totally
misunderstanding "sum() over (partition by)"?
Yong


Re: Spark SQL running totals

2015-10-15 Thread Deenar Toraskar
You can do a self join of the table, with the join clause being
a.col1 >= b.col1:

select a.col1, a.col2, sum(b.col2)
from tablea as a left outer join tablea as b on (a.col1 >= b.col1)
group by a.col1, a.col2

I haven't tried it, but I can't see why it wouldn't work. Doing it with RDDs
might be more efficient; see
https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
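
Roughly, the idea would be (untested sketch; assuming an RDD[(Int, Int)] of
(col_1, col_2) pairs called rdd): sort, sum each partition, then carry a
per-partition offset:

val sorted = rdd.sortBy(_._1)
// total of col_2 within each partition (sortBy range-partitions the data,
// so partition i holds strictly smaller keys than partition i+1)
val partSums = sorted.mapPartitions(it => Iterator(it.map(_._2).sum)).collect()
// offset for partition i = sum of everything in earlier partitions
val offsets = partSums.scanLeft(0)(_ + _)
// walk each partition, carrying the running total forward
val runningTotals = sorted.mapPartitionsWithIndex { (i, it) =>
  var acc = offsets(i)
  it.map { case (c1, c2) => acc += c2; (c1, c2, acc) }
}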

On 15 October 2015 at 18:48, Stefan Panayotov <spanayo...@msn.com> wrote:

> Hi,
>
> I need help with Spark SQL. I need to achieve something like the following.
> If I have data like:
>
> col_1  col_2
> 1      10
> 2      30
> 3      15
> 4      20
> 5      25
>
> I need col_3 to be the running total of col_2, i.e. the sum of col_2 over
> all rows up to and including the current one, e.g.
>
> col_1  col_2  col_3
> 1      10     10
> 2      30     40
> 3      15     55
> 4      20     75
> 5      25     100
>
> Is there a way to achieve this in Spark SQL, or maybe with DataFrame
> transformations?
>
> Thanks in advance,
>
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>
>


Re: Spark SQL running totals

2015-10-15 Thread Michael Armbrust
Check out:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html


Re: Spark SQL running totals

2015-10-15 Thread Kristina Rogale Plazonic
You can do this and many other transformations very easily with window
functions; see this blog post:

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

In your case you would do (in Scala):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{functions => func}

val wo = Window.orderBy("col1")
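// with orderBy and no explicit frame, the sum runs from the first row up to the current row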
val newdf = df.withColumn("col3", func.sum("col2").over(wo))

The true power of this is when you have to do a running total per group, say
per user (in a different column)! :)
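For example, with a hypothetical "user" column (untested):

val wu = Window.partitionBy("user").orderBy("col1")
val newdf2 = df.withColumn("col3", func.sum("col2").over(wu))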

HTH,
Kristina
