I don't think that would work properly; it would probably just give me the
sum for each partition. I'll give it a try when I get home just to be
certain.

To explain the intent a bit better: if I have a pre-sorted column of
(1,2,3,4), then the cumulative sum would return (1,3,6,10).

Does that make sense? Naturally, if ordering a sum turns it into a
cumulative sum, I'll gladly use that :)
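
For reference, here's a minimal sketch of what I understand ayan to be
suggesting, using the DataFrame window API (df, key, ts, and value are
placeholders for my actual data):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Once the window spec has an orderBy, the frame is supposed to run
    // from the start of the partition up to the current row, so sum()
    // would give a running total rather than one sum per partition.
    val w = Window.partitionBy("key").orderBy("ts")
    val result = df.withColumn("cum_sum", sum("value").over(w))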

Jon

On Mon, Aug 8, 2016 at 4:55 PM ayan guha <guha.a...@gmail.com> wrote:

> You mean you are not able to use sum(col) over (partition by key order by
> some_col)?
>
> On Tue, Aug 9, 2016 at 9:53 AM, jon <jon.barksd...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm trying to write a function that calculates a cumulative sum as a
>> column using the Dataset API, and I'm a little stuck on the
>> implementation.  From what I can tell, UserDefinedAggregateFunctions
>> don't seem to support windowing clauses, which I think I need for this
>> use case.  If I write a function that extends AggregateWindowFunction, I
>> end up needing classes that are package-private to the sql package, so I
>> have to put my function under the org.apache.spark.sql package, which
>> just feels wrong.
>>
>> I've also considered writing a custom transformer, but I haven't spent as
>> much time reading through the code, so I don't know how easy or hard
>> that would be.
>>
>> TL;DR: What's the best way to write a function that returns a value for
>> every row, but has mutable state, and gets rows in a specific order?
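>>
>> (To make that concrete, one hypothetical shape an answer could take is a
>> sorted fold inside flatMapGroups. The Rec type and its fields are
>> placeholders, spark.implicits._ is assumed to be in scope, and each
>> key's group gets materialized in memory so it can be sorted:)
>>
>>     case class Rec(key: String, ts: Long, value: Double)
>>
>>     // Sort each key's rows by ts, then thread a mutable running total
>>     // through them, emitting one output row per input row.
>>     val out = ds.groupByKey(_.key).flatMapGroups { (key, rows) =>
>>       var running = 0.0
>>       rows.toSeq.sortBy(_.ts).map { r =>
>>         running += r.value
>>         (key, r.ts, running)
>>       }
>>     }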
>>
>> Does anyone have any ideas, or examples?
>>
>> Thanks,
>>
>> Jon
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
