Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Cool, learn something new every day.  Thanks again.



Re: Cumulative Sum function using Dataset API

2016-08-09 Thread ayan guha
Thanks for reporting back. Glad it worked for you. Actually, sum with
partitioning behaves the same way in Oracle too.
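
Why this holds: with an ORDER BY in the OVER clause and no explicit frame, the
SQL-standard default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW, in Spark as in Oracle, so an ordered sum is a running total. A minimal
sketch of the equivalence (the view name "t" and column "b" are illustrative,
not from the thread):

// Assumes an active SparkSession `spark` and a DataFrame `df`
// with a numeric column "b"; the view name "t" is illustrative.
df.createOrReplaceTempView("t")
spark.sql(
  """SELECT b,
            sum(b) OVER (ORDER BY b) AS implicit_frame,
            sum(b) OVER (ORDER BY b RANGE BETWEEN UNBOUNDED PRECEDING
                         AND CURRENT ROW) AS explicit_frame
     FROM t""").show()
// implicit_frame and explicit_frame come out identical.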


Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Hi Santoshakhilesh,

I'd seen that already, but I was trying to avoid using RDDs to perform this
calculation.

@Ayan, it seems I was mistaken, and doing a sum(b) over (order by b) totally
works. I guess I expected the windowing with sum to work more like
Oracle's. Thanks for the suggestion :)

Thank you both for your help,

Jon
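
For reference, a minimal self-contained sketch of the window approach that
worked here, reproducing the (1,2,3,4) -> (1,3,6,10) example from earlier in
the thread (data and names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cumsum").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3, 4).toDF("b")

// orderBy without partitionBy pulls every row into a single partition
// (Spark logs a warning), so this form only suits modest data sizes.
val w = Window.orderBy("b")

df.withColumn("cumsum", sum($"b").over(w)).show()
// b = 1, 2, 3, 4  ->  cumsum = 1, 3, 6, 10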



RE: Cumulative Sum function using Dataset API

2016-08-09 Thread Santoshakhilesh
You could check the following link.
http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
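
One common RDD technique for this (a sketch in the spirit of the linked
discussion, not a copy of it): total each partition, then offset every
partition by the sum of the partitions before it. It assumes the RDD is
already sorted, so partition order matches global order; all names are
illustrative.

import org.apache.spark.rdd.RDD

def cumulativeSum(sorted: RDD[Double]): RDD[Double] = {
  // One (partitionIndex, partitionTotal) pair per partition, collected
  // to the driver -- small, since it is one value per partition.
  val totals = sorted
    .mapPartitionsWithIndex((i, it) => Iterator((i, it.sum)))
    .collect()
    .sortBy(_._1)
    .map(_._2)

  // offsets(i) = sum of all partition totals before partition i.
  val offsets = totals.scanLeft(0.0)(_ + _)

  // Running sum within each partition, started from that partition's offset;
  // drop(1) discards the seed value that scanLeft prepends.
  sorted.mapPartitionsWithIndex { (i, it) =>
    it.scanLeft(offsets(i))(_ + _).drop(1)
  }
}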



Re: Cumulative Sum function using Dataset API

2016-08-08 Thread Jon Barksdale
I don't think that would work properly, and would probably just give me the
sum for each partition. I'll give it a try when I get home just to be
certain.

To maybe explain the intent better: if I have a pre-sorted column of
(1,2,3,4), then the cumulative sum would return (1,3,6,10).

Does that make sense? Naturally, if ordering a sum turns it into a
cumulative sum, I'll gladly use that :)

Jon
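
On the concern above: a plain partitioned sum does give one total per group;
it is the ORDER BY that makes the frame cumulative, as confirmed earlier in
the thread. A short contrast, assuming a DataFrame `df` with illustrative
columns "key" and "b":

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val groupTotal = Window.partitionBy("key")               // same value on every row of a group
val running    = Window.partitionBy("key").orderBy("b")  // running total within the group

df.select(df("key"), df("b"),
  sum(df("b")).over(groupTotal).as("group_total"),
  sum(df("b")).over(running).as("running_total"))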


Re: Cumulative Sum function using Dataset API

2016-08-08 Thread ayan guha
You mean you are not able to use sum(col) over (partition by key order by
some_col)?

On Tue, Aug 9, 2016 at 9:53 AM, jon wrote:

> Hi all,
>
> I'm trying to write a function that calculates a cumulative sum as a column
> using the Dataset API, and I'm a little stuck on the implementation. From
> what I can tell, UserDefinedAggregateFunctions don't seem to support
> windowing clauses, which I think I need for this use case. If I write a
> function that extends AggregateWindowFunction, I end up needing classes
> that are package-private to the sql package, so I need to put my function
> in the org.apache.spark.sql package, which just feels wrong.
>
> I've also considered writing a custom transformer, but haven't spent as much
> time reading through the code, so I don't know how easy or hard that would be.
>
> TL;DR: What's the best way to write a function that returns a value for every
> row, but has mutable state, and gets rows in a specific order?
>
> Does anyone have any ideas, or examples?
>
> Thanks,
>
> Jon
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-Sum-function-using-Dataset-API-tp27496.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


-- 
Best Regards,
Ayan Guha