Re: computing median and percentiles

2014-03-20 Thread Stephen Sprague
The short answer is that there is no native Hive UDF that handles your case
(values stored with a separate count column). That means you have to solve it
yourself.


I searched for something like what you were looking for and found this
general recipe:

http://www.onlinemathlearning.com/median-frequency-table.html

Off the top of my head I'm not sure how easy this would be in SQL, but I
imagine a clever person could do it using the ROW_NUMBER() function. If not
SQL, then perhaps a custom UDF. Ultimately you have to do the work, but
now you have a potential recipe to follow. :)
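
To make the recipe concrete, here is a minimal sketch in Python (not Hive) of the cumulative-frequency idea: sort the values, keep a running total of the counts, and look up which value covers the target rank. The function name weighted_percentile and the linear interpolation between the two nearest ranks are assumptions chosen for illustration; whether the result matches Hive's percentile() in every case should be verified rather than taken from this sketch.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_percentile(rows, p):
    """Percentile of a (value, count) frequency table, p in [0, 1].

    Hypothetical helper: gives the same answer as expanding the table to
    one row per occurrence and interpolating between the nearest ranks.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    rows = sorted((v, c) for v, c in rows if c > 0)
    if not rows:
        raise ValueError("frequency table is empty")

    values = [v for v, _ in rows]
    cum = list(accumulate(c for _, c in rows))  # running totals of the counts
    total = cum[-1]

    pos = p * (total - 1)        # zero-based fractional rank
    lo = int(pos)                # rank just below (or at) the target
    hi = min(lo + 1, total - 1)  # rank just above, clamped to the last row

    # The value at zero-based rank r is the first value whose running
    # total exceeds r.
    v_lo = values[bisect_right(cum, lo)]
    v_hi = values[bisect_right(cum, hi)]
    return v_lo + (v_hi - v_lo) * (pos - lo)


table = [(100, 2), (200, 4), (300, 1), (400, 6), (500, 3)]
print(weighted_percentile(table, 0.25))  # 200.0
print(weighted_percentile(table, 0.5))   # 400.0
```

The same running-total lookup is what a SQL version would need: a cumulative sum of the count column ordered by value, then a filter for the first row whose cumulative sum passes the target rank.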




On Wed, Mar 19, 2014 at 9:37 PM, Seema Datar sda...@yahoo-inc.com wrote:

  Not really. If it was a single column with no counters, Hive provides an
 option to use percentile. So basically if the data was like -

  100
 100
 200
 200
 200
 200
 300

  But if we have 2 columns, one that maintains the value and the other that
 maintains the count, how can Hive be used to derive the percentile?

  Value Count
 100  2
 200  4
 300  1

  Thanks,
 Seema

   From: Stephen Sprague sprag...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Thursday, March 20, 2014 5:28 AM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Re: computing median and percentiles

   Not a Hive question, is it? It's more like a math question.



 On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:



   I understand the percentile function is supported in Hive in the
 latest versions. However, how does one calculate percentiles when the data
 is spread across two columns? So say -

  Value  Count

  100 2   ( so basically 100 occurred twice)
 200 4
 300 1
 400 6
 500 3


  I want to find out the 0.25 percentile for the value distribution. How
 can I do it using the Hive percentile function?






Re: computing median and percentiles

2014-03-19 Thread Seema Datar


I understand the percentile function is supported in Hive in the latest
versions. However, how does one calculate percentiles when the data is spread
across two columns? So say -

Value  Count

100 2   ( so basically 100 occurred twice)
200 4
300 1
400 6
500 3


I want to find out the 0.25 percentile for the value distribution. How can I do 
it using the Hive percentile function?




Re: computing median and percentiles

2014-03-19 Thread Stephen Sprague
Not a Hive question, is it? It's more like a math question.



On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:



   I understand the percentile function is supported in Hive in the latest
 versions. However, how does one calculate percentiles when the data is
 spread across two columns? So say -

  Value  Count

  100 2   ( so basically 100 occurred twice)
 200 4
 300 1
 400 6
 500 3


  I want to find out the 0.25 percentile for the value distribution. How
 can I do it using the Hive percentile function?





Re: computing median and percentiles

2014-03-19 Thread Seema Datar
Not really. If it was a single column with no counters, Hive provides an option 
to use percentile. So basically if the data was like -

100
100
200
200
200
200
300

But if we have 2 columns, one that maintains the value and the other that
maintains the count, how can Hive be used to derive the percentile?

Value Count
100  2
200  4
300  1
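
The two layouts above describe the same data, which a short Python sketch (not Hive) can show: repeating each value by its count recovers the single-column form. The helper name expand is hypothetical, and this is only practical when the counts are small enough to materialize every row.

```python
from itertools import chain, repeat
from statistics import median


def expand(rows):
    """Turn (value, count) pairs into one entry per occurrence."""
    return list(chain.from_iterable(repeat(v, c) for v, c in rows))


table = [(100, 2), (200, 4), (300, 1)]
flat = expand(table)
print(flat)          # [100, 100, 200, 200, 200, 200, 300]
print(median(flat))  # 200
```

Once the data is in single-column form, any ordinary percentile routine applies; the cost is that the row count grows to the sum of the counts.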

Thanks,
Seema

From: Stephen Sprague sprag...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, March 20, 2014 5:28 AM
To: user@hive.apache.org
Subject: Re: computing median and percentiles

Not a Hive question, is it? It's more like a math question.



On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:


I understand the percentile function is supported in Hive in the latest
versions. However, how does one calculate percentiles when the data is spread
across two columns? So say -

Value  Count

100 2   ( so basically 100 occurred twice)
200 4
300 1
400 6
500 3


I want to find out the 0.25 percentile for the value distribution. How can I do 
it using the Hive percentile function?