Re: Cumulative Sum function using Dataset API
Cool, learn something new every day. Thanks again.
Re: Cumulative Sum function using Dataset API
Thanks for reporting back. Glad it worked for you. Actually, the behaviour of SUM with partitioning is the same in Oracle too.
Re: Cumulative Sum function using Dataset API
Hi Santoshakhilesh,

I'd seen that already, but I was trying to avoid using RDDs to perform this calculation.

@Ayan, it seems I was mistaken, and doing a sum(b) over (order by b) totally works. I guess I expected the windowing with sum to work more like Oracle. Thanks for the suggestion :)

Thank you both for your help,

Jon
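In the Scala Dataset/DataFrame API, the approach confirmed above comes down to very little code. A minimal sketch, assuming a DataFrame named df with a numeric column b (both names are placeholders, not from the thread):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // With an ORDER BY in the window spec, the default frame is
    // RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so sum
    // becomes a running total: (1, 2, 3, 4) -> (1, 3, 6, 10).
    // Note: a window with no partitionBy pulls all rows into a
    // single partition, which is fine for small data but won't scale.
    val withCumSum = df.withColumn("cum_sum", sum("b").over(Window.orderBy("b")))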
RE: Cumulative Sum function using Dataset API
You could check the following link:

http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
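The linked question is answered at the RDD level. A sketch of the general shape of that technique (an illustration under assumed names and types, not the linked code verbatim): total each partition, turn the totals into per-partition offsets, then run a scan inside each partition.

    import org.apache.spark.rdd.RDD

    // Assumes the RDD is already sorted in the desired order.
    def cumulativeSum(rdd: RDD[Double]): RDD[Double] = {
      // Total of each partition, collected to the driver in partition order.
      val partTotals = rdd
        .mapPartitionsWithIndex((i, it) => Iterator((i, it.sum)))
        .collect()
        .sortBy(_._1)
        .map(_._2)

      // offsets(i) = sum of everything in partitions 0 until i.
      val offsets = partTotals.scanLeft(0.0)(_ + _)

      // Running sum within each partition, seeded with that
      // partition's offset; drop(1) discards the seed itself.
      rdd.mapPartitionsWithIndex((i, it) => it.scanLeft(offsets(i))(_ + _).drop(1))
    }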
Re: Cumulative Sum function using Dataset API
I don't think that would work properly, and would probably just give me the sum for each partition. I'll give it a try when I get home just to be certain.

To maybe explain the intent better, if I have a column (pre-sorted) of (1,2,3,4), then the cumulative sum would return (1,3,6,10).

Does that make sense? Naturally, if ordering a sum turns it into a cumulative sum, I'll gladly use that :)

Jon
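The behaviour in question hinges on the window frame: without an ORDER BY in the window spec, the frame is the whole partition, so every row does get the partition total; with an ORDER BY, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which yields a running sum. A spark-shell sketch with made-up data (the names k, b, and s are illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum
    import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4)).toDF("k", "b")

    // Partition only: every row in the group gets the group total.
    df.withColumn("s", sum("b").over(Window.partitionBy("k"))).show()
    // s = 10, 10, 10, 10

    // Partition + order: each row gets a running sum within the group.
    df.withColumn("s", sum("b").over(Window.partitionBy("k").orderBy("b"))).show()
    // s = 1, 3, 6, 10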
Re: Cumulative Sum function using Dataset API
You mean you are not able to use sum(col) over (partition by key order by some_col)?

On Tue, Aug 9, 2016 at 9:53 AM, jon <jon.barksd...@gmail.com> wrote:

> Hi all,
>
> I'm trying to write a function that calculates a cumulative sum as a
> column using the Dataset API, and I'm a little stuck on the
> implementation. From what I can tell, UserDefinedAggregateFunctions
> don't seem to support windowing clauses, which I think I need for this
> use case. If I write a function that extends AggregateWindowFunction, I
> end up needing classes that are package-private to the sql package, so
> I have to put my function under the org.apache.spark.sql package, which
> just feels wrong.
>
> I've also considered writing a custom transformer, but I haven't spent
> as much time reading through the code, so I don't know how easy or hard
> that would be.
>
> TL;DR: What's the best way to write a function that returns a value for
> every row, but has mutable state, and gets rows in a specific order?
>
> Does anyone have any ideas, or examples?
>
> Thanks,
>
> Jon

--
Best Regards,
Ayan Guha
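Spelled out as SQL, the suggestion above reads as follows. The table and column names (t, key, some_col, col) are placeholders, and the snippet assumes a SparkSession named spark with t registered as a temporary view:

    // Per-key running sum via a SQL window function.
    val result = spark.sql("""
      SELECT key, some_col,
             SUM(col) OVER (PARTITION BY key ORDER BY some_col) AS cum_sum
      FROM t
    """)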