[ 
https://issues.apache.org/jira/browse/SPARK-18940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gagan taneja updated SPARK-18940:
---------------------------------
    Shepherd: Herman van Hovell

> Percentile and approximate percentile support for frequency distribution table
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18940
>                 URL: https://issues.apache.org/jira/browse/SPARK-18940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: gagan taneja
>
> I have a frequency distribution table with the following entries:
> {noformat}
> Age, No. of persons
> 21, 10
> 22, 15
> 23, 18 
> ..
> ..
> 30, 14
> {noformat}
> It is common to have data in frequency distribution format and to then
> calculate percentiles and the median from it. With the current implementation,
> finding a percentile of such data is difficult and complex.
> I am therefore proposing an enhancement to the current Percentile and Approx
> Percentile implementations that takes a frequency distribution column into
> account.
> Current Percentile definition:
> {noformat}
> percentile(col, array(percentage1 [, percentage2]...))
> case class Percentile(
>   child: Expression,
>   percentageExpression: Expression,
>   mutableAggBufferOffset: Int = 0,
>   inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, percentageExpression, 0, 0)
>   }
> }
> {noformat}
> Proposed changes:
> {noformat}
> percentile(col, [frequency,] array(percentage1 [, percentage2]...))
> case class Percentile(
>   child: Expression,
>   frequency: Expression,
>   percentageExpression: Expression,
>   mutableAggBufferOffset: Int = 0,
>   inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, Literal(1L), percentageExpression, 0, 0)
>   }
>   def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
>     this(child, frequency, percentageExpression, 0, 0)
>   }
> }
> {noformat}
> Although this definition differs from the Hive implementation, it would be
> useful functionality for many Spark users.
> Moreover, the changes are local to the Percentile and ApproxPercentile
> implementations.
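
For illustration only, here is a minimal sketch of how an exact percentile could take a frequency column into account. It is plain Scala and not Spark's Percentile code: the names WeightedPercentile, percentile and valueAt are hypothetical, and the interpolation rule (position = p * (n - 1), with linear interpolation between neighbouring ranks) is an assumption about the intended semantics. The sample data uses only the three rows spelled out in the table above.

```scala
object WeightedPercentile {

  /** Exact percentile over (value, frequency) pairs, computed as if every
    * value were repeated `frequency` times, without expanding the data.
    * Assumed rule: position = p * (n - 1), linear interpolation between ranks.
    */
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    val rows = table.filter(_._2 > 0).sortBy(_._1)
    require(rows.nonEmpty, "need at least one row with a positive frequency")

    // cumulative(i) = number of expanded elements up to and including row i
    val cumulative = rows.scanLeft(0L)((acc, row) => acc + row._2).tail
    val total = cumulative.last

    // Value at a zero-based index of the virtual expanded, sorted data.
    def valueAt(index: Long): Double =
      rows(cumulative.indexWhere(_ > index))._1

    val position = p * (total - 1)
    val lower = math.floor(position).toLong
    val upper = math.ceil(position).toLong
    if (lower == upper) valueAt(lower)
    else (upper - position) * valueAt(lower) + (position - lower) * valueAt(upper)
  }

  def main(args: Array[String]): Unit = {
    // Only the rows given explicitly in the issue; the elided rows are left out.
    val ages = Seq((21.0, 10L), (22.0, 15L), (23.0, 18L))

    // Expanding each row into `frequency` single-count rows gives the same answer.
    val expanded = ages.flatMap { case (age, n) => Seq.fill(n.toInt)((age, 1L)) }

    println(percentile(ages, 0.5))
    assert(percentile(ages, 0.5) == percentile(expanded, 0.5))
  }
}
```

With these three rows there are 43 persons in total, so the median sits at zero-based index 21 of the expanded data, which falls in the age-22 group. The cumulative counts are what let the lookup avoid materialising one row per person.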



