Re: calculate diff of value and median in a group

2017-07-14 Thread roni
I was using this function percentile_approx on 100GB of compressed data and it just hangs there. Any pointers? On Wed, Mar 22, 2017 at 6:09 PM, ayan guha wrote: > For median, use percentile_approx with 0.5 (50th percentile is the median) > > On Thu, Mar 23, 2017 at 11:01

Re: calculate diff of value and median in a group

2017-03-22 Thread ayan guha
For median, use percentile_approx with 0.5 (50th percentile is the median) On Thu, Mar 23, 2017 at 11:01 AM, Yong Zhang wrote: > He is looking for median, not mean/avg. > > > You have to implement the median logic yourself, as there is no > direct implementation from
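
To illustrate why a percentile of 0.5 stands in for the median, here is a minimal plain-Python sketch of a nearest-rank percentile. It is not Spark's percentile_approx, which works approximately on distributed data; the function name percentile_nearest_rank and the sample lists are hypothetical and only show the idea.

```python
import math
from statistics import median_low


def percentile_nearest_rank(values, p):
    """Return the smallest element such that at least a fraction p
    of the data is less than or equal to it (nearest-rank method)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


odd = [7, 1, 100, 5, 3]
even = [1, 3, 5, 7]

odd_p50 = percentile_nearest_rank(odd, 0.5)    # middle element of the sorted list
even_p50 = percentile_nearest_rank(even, 0.5)  # lower of the two middle elements
```

In this sketch an even-sized list yields an actual element (the lower middle value, same as median_low) rather than the average of the two middle values, so a percentile-based median can differ slightly from a textbook median.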

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
He is looking for median, not mean/avg. You have to implement the median logic yourself, as there is no direct implementation in Spark. You can use the RDD API if you are using 1.6.x, or the Dataset API if 2.x. The following example gives you an idea of how to calculate the median using Dataset

Re: calculate diff of value and median in a group

2017-03-22 Thread ayan guha
I would suggest using a window function with partitioning. select group1,group2,name,value, avg(value) over (partition by group1,group2 order by name) m from t On Thu, Mar 23, 2017 at 9:58 AM, Craig Ching wrote: > Is the element count per group large? If not, you can group them

Re: calculate diff of value and median in a group

2017-03-22 Thread Craig Ching
> Is the element count per group large? If not, you can group them and use the > code to calculate the median and diff. > > > They're not big, no. Any pointers on how I might do that? The part I'm having trouble with is the grouping; I can't see how to do the median per group. For

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
Is the element count per group large? If not, you can group them and use the code to calculate the median and diff. Yong From: Craig Ching Sent: Wednesday, March 22, 2017 3:17 PM To: user@spark.apache.org Subject: calculate diff of value
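
For small groups, the "group them, then compute the median and diff" idea can be sketched in plain Python as below. This is only an illustration of the per-group logic, not Spark code: the row layout (group1, group2, name, value) is borrowed from the SQL elsewhere in this thread, and the function name diff_from_group_median and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def diff_from_group_median(rows):
    """rows: iterable of (group1, group2, name, value) tuples.
    Returns each row extended with value minus its group's median."""
    rows = list(rows)

    # Collect the values belonging to each (group1, group2) key.
    values_by_group = defaultdict(list)
    for g1, g2, _name, value in rows:
        values_by_group[(g1, g2)].append(value)

    # One exact median per group; fine when groups fit in memory.
    medians = {key: median(vals) for key, vals in values_by_group.items()}

    return [
        (g1, g2, name, value, value - medians[(g1, g2)])
        for g1, g2, name, value in rows
    ]


sample = [
    ("a", "x", "n1", 1.0),
    ("a", "x", "n2", 3.0),
    ("a", "x", "n3", 10.0),
    ("b", "y", "n4", 5.0),
    ("b", "y", "n5", 7.0),
]
result = diff_from_group_median(sample)
```

The same two steps (collect values per key, then subtract the per-key median from each value) are what a grouped computation in Spark would need to reproduce; how best to express that depends on the Spark version and group sizes.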

calculate diff of value and median in a group

2017-03-22 Thread Craig Ching
Hi, When using pyspark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want except that it calculates the grouped diff from mean. Also, please feel free to comment on how I