[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16497


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-06 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r99753046
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -125,10 +139,17 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequencyExpression.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
+  val frqLong = frqValue.asInstanceOf[Number].longValue()
+  // add only when frequency is positive
+  if (frqLong > 0) {
+buffer.changeValue(key, frqLong, _ + frqLong)
+  } else if ( frqLong < 0 ) {
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-05 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r99493733
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -125,10 +139,17 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequencyExpression.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
+  val frqLong = frqValue.asInstanceOf[Number].longValue()
+  // add only when frequency is positive
+  if (frqLong > 0) {
+buffer.changeValue(key, frqLong, _ + frqLong)
+  } else if ( frqLong < 0 ) {
--- End diff --

NIT: no leading and trailing spaces in the `if` clause


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-05 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r99494510
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala
 ---
@@ -232,6 +358,41 @@ class PercentileSuite extends SparkFunSuite {
 assert(agg.eval(buffer) != null)
   }
 
+  test("null and invalid values( 0 and negatives ) handling of frequency 
column") {
--- End diff --

Just make this test about testing the negative frequency case & move null & 
0 testing to the earlier test. This should be enough:
```scala
val perc = new Percentile(Literal(1), Literal(0.5), Literal(-1))
val buffer = new GenericInternalRow(2)
agg.initialize(buffer)
val e = intercept[SparkException](agg.update(buffer, InternalRow.empty))
assert(e.getMessage.contains("Negative values found in "))
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-05 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r99494133
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala
 ---
@@ -84,6 +86,59 @@ class PercentileSuite extends SparkFunSuite {
 }
   }
 
+  test("class Percentile with frequency, high level interface, update, 
merge, eval...") {
--- End diff --

This test is almost a duplicate of the previous test. Lets just merge the 
two, and have a case without a frequency. I do like the parameterization.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-05 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r99493969
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala
 ---
@@ -26,6 +27,7 @@ import org.apache.spark.sql.catalyst.util.ArrayData
 import org.apache.spark.sql.types._
 import org.apache.spark.util.collection.OpenHashMap
 
+
--- End diff --

Remove this line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-02-05 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r99494220
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala
 ---
@@ -119,35 +174,73 @@ class PercentileSuite extends SparkFunSuite {
   test("call from sql query") {
--- End diff --

Can you remove this entire test. We don't have to check the `sql` method. 
This feels like over testing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-10 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95508337
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -81,7 +96,11 @@ case class Percentile(
   case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }
 
-  override def children: Seq[Expression] = child :: percentageExpression 
:: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {
--- End diff --

@hvanhovell 
Can you provide your final thoughts on this and i will make the changes 
accordingly 
i did try to find an expression where middle argument is Optional but could 
not. Seems like only the last argument(s) are optional.
Which argument sequence  would you recommend 
Column [, frequency], percentages 
OR
Column, percentages [, frequency] 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95308691
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -81,7 +96,11 @@ case class Percentile(
   case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }
 
-  override def children: Seq[Expression] = child :: percentageExpression 
:: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {
--- End diff --

Let's make the param sequence to: `child, percentageExpression, 
frequencyExpression`, and give a default value to `frequencyExpression`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95307385
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
+
+  """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
--- End diff --

i think frequenctExpression would be correct name but i have sequence them 
as they would appear in the SQL
select percentile( col, frq, percentage ) from table
where frq is Optional


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95307232
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
+
+  """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
 percentageExpression: Expression,
+withFrqExpr : Boolean,
--- End diff --

Please see my comment below why we need to make a distinction either using 
flag or Option

I am inclined towards using a flag because switching to option would change 
the code in update
from
val frqValue = frequency.eval(input)
to 
val frqValue = frequency.getOrElse( unit).eval(input) 

But i think Option[Expression] would be better logically
Once we have an agreement if we need to have a distinction or not i will 
make the changes accordingly 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95305820
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -81,7 +96,11 @@ case class Percentile(
   case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }
 
-  override def children: Seq[Expression] = child :: percentageExpression 
:: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {
--- End diff --

I have given lot of thought to it and if we need to make a difference here.
Lets take a data set of
age_string, age, count
"20", 20, 1
"15", 15, 1
"10", 10, 1


For Sql  "select percentile( age, count , 0.5 ) from table"
logically correct values should be  
children = age::count ::0.5 :: Nil 
and 
inputType = IntegerType :: IntegerType::DoubleType::Nil

For sql "select pecentile( age, 0.5 ) from table"
logically correct values should be  
children = age::0.5 :: Nil 
and 
inputType = IntegerType ::DoubleType::Nil

Here is one example where keeping it logically correct would help 
For following incorrect SQL "select percentile( age, '10') from table" 

With children = age::'10'::Nil and inputType = IntergerType::StringType:: 
Nil 
Since both children and inputType is used for dataType validation, the 
error message would be correct as below.
"argument 2 requires Double type, however, 10 is of String type." 

However With children = age::Literal(1)::'10'::Nil and inputType = 
IntergerType::IntegerType::StringType:: Nil 
The error message would be NOT correct and confusing as below
"argument 3 requires Double type, however, 10 is of String type." 

Since both children and dataType are public method i was inclined to keep 
them explicitly correct and therefore i decided to make a difference. 
Please let me know your thoughts 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95303304
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +152,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequency.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
--- End diff --

Yes this would be wrong to use the default value of 1 
Let take a data set of 
Age, Count 
20, 1 
15, 1 
10, 0

If we take the default value of 1L when the frq is 0  is then .5 percentile 
would become 15 . This is incorrect. I agree with other suggestion of either 
failing or disregard those values



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95303054
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +152,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequency.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
+  val frqLong = frqValue.asInstanceOf[Number].longValue()
+  // add only when frequency is positive
+  if (frqLong > 0) {
--- End diff --

I  think the option was between either fail or disregard those values. We 
can certainly make this a requirement, document and fail when the values are 
negatives
I think for the cases where values are either null or 0 we should not be 
adding them to Map to unnecessary
 bloat the map.
The logic would look like 
if ( frqLong < 0 ) {
  throw new SomeException
}else if( frqLong > 0 ) {
  // process to add them to map 
}

Let me know if above look good and i will make the changes accordingly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95297133
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
+
+  """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
 percentageExpression: Expression,
+withFrqExpr : Boolean,
--- End diff --

Please remove withFrqExpr. The frequency must be provided, and should 
default to 1L.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95297232
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -81,7 +96,11 @@ case class Percentile(
   case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }
 
-  override def children: Seq[Expression] = child :: percentageExpression 
:: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {
--- End diff --

Why do we need to make a difference here?

Just do:
```scala
override def children: Seq[Expression] = {
  child :: frequency :: percentageExpression :: Nil
}


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95297644
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -91,9 +110,16 @@ case class Percentile(
 case _ => DoubleType
   }
 
-  override def inputTypes: Seq[AbstractDataType] = 
percentageExpression.dataType match {
-case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-case _ => Seq(NumericType, DoubleType)
+  override def inputTypes: Seq[AbstractDataType] = {
+val percentageExpType = percentageExpression.dataType match {
+  case _: ArrayType => ArrayType(DoubleType)
+  case _ => DoubleType
+}
+if (withFrqExpr) {
--- End diff --

Again remove withFrqExpr.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95297633
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +152,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequency.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
+  val frqLong = frqValue.asInstanceOf[Number].longValue()
+  // add only when frequency is positive
+  if (frqLong > 0) {
--- End diff --

Lets make this a requirement and fail when the value < 0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95297609
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +152,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequency.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
--- End diff --

I don't think we should default to 1L as a value. That seems more wrong 
than either failing or skipping.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95295321
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +152,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequency.eval(input)
--- End diff --

We should check the value of the `frequencyExpression` beforehand.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95294516
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
+
+  """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
--- End diff --

Rename this to `frequencyExpression`, and place this after 
`percentageExpression`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95294336
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
+
+  """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
 percentageExpression: Expression,
+withFrqExpr : Boolean,
--- End diff --

Let's make the type of `frequency` to be `Option[Expression]`, and remove 
the `withFrqExpr` flag.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95293559
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
--- End diff --

Please merge the comments with the exsiting one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95294834
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -81,7 +96,11 @@ case class Percentile(
   case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }
 
-  override def children: Seq[Expression] = child :: percentageExpression 
:: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {
+child :: frequency :: percentageExpression :: Nil
+  } else {
+child :: percentageExpression :: Nil
+  }
--- End diff --

How about:
```
override def children: Seq[Expression] = {
val frequency = if (frequencyExpression.isDefined) {
frequencyExpression :: Nil
} else {
Nil
}
child :: percentageExpression :: frequency
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95295298
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -91,9 +110,16 @@ case class Percentile(
 case _ => DoubleType
   }
 
-  override def inputTypes: Seq[AbstractDataType] = 
percentageExpression.dataType match {
-case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-case _ => Seq(NumericType, DoubleType)
+  override def inputTypes: Seq[AbstractDataType] = {
+val percentageExpType = percentageExpression.dataType match {
+  case _: ArrayType => ArrayType(DoubleType)
+  case _ => DoubleType
+}
+if (withFrqExpr) {
+  Seq(NumericType, IntegralType, percentageExpType)
--- End diff --

Please also change this according to the comments above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95293614
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
-""")
+
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric
+  column `col` with frequency column `frequency` at the given 
percentage. The value of
+  percentage must be between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile
+  value array of numeric column `col` with frequency column 
`frequency` at the given
+  percentage(s).Each value of the percentage array must be between 0.0 
and 1.0.
+
+  """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
 percentageExpression: Expression,
+withFrqExpr : Boolean,
--- End diff --

Why do we need the flag `withFrqExpr`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-09 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95295476
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +152,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
+val frqValue = frequency.eval(input)
 
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
--- End diff --

This is dangerous, we should use a default value (e.g. frqValue = 1) when 
frqValue is invalid.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-08 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95092209
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -91,10 +111,19 @@ case class Percentile(
 case _ => DoubleType
   }
 
-  override def inputTypes: Seq[AbstractDataType] = 
percentageExpression.dataType match {
-case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-case _ => Seq(NumericType, DoubleType)
-  }
+  override def inputTypes: Seq[AbstractDataType] =
+if (frequency == unit) {
+  percentageExpression.dataType match {
+case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
+case _=> Seq(NumericType, DoubleType)
+  }
+} else {
+  percentageExpression.dataType match {
--- End diff --

It seems like I would have to do it based on what were the argument in the 
SQL and which constructor was invoked 
for sql without Frequency like percentile( `a`, 0.5 ) children need to be 
child:: percentageExpr::Nil
for sql with Frequency like percentile( `a`, `frq`, 0.5 ) children need to 
be child::frequency::percentageExpr::Nil
As well as input type should be reflected on how many arguments were passed 

Both InputDataType Check and generated sql tests fails 

However i have made changes in the logic to determine which constructor was 
invoked and decide inputType and children based on it



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-08 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95091986
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,26 +51,42 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
+  
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric column `col` 
+  with frequency column `frequency` at the given percentage. The value 
of percentage must be 
+  between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile 
+  value array of numeric column `col` with frequency column 
`frequency` at the given percentage(s).
+  Each value of the percentage array must be between 0.0 and 1.0.
+
 """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
 percentageExpression: Expression,
 mutableAggBufferOffset: Int = 0,
 inputAggBufferOffset: Int = 0)
   extends TypedImperativeAggregate[OpenHashMap[Number, Long]] with 
ImplicitCastInputTypes {
 
   def this(child: Expression, percentageExpression: Expression) = {
-this(child, percentageExpression, 0, 0)
+this(child, Literal(1l), percentageExpression, 0, 0)
+  }
+  
+  def this(child: Expression, frequency: Expression, percentageExpression: 
Expression) = {
+this(child, frequency, percentageExpression, 0, 0)
   }
 
+  private val unit = Literal(1l)
+  
   override def prettyName: String = "percentile"
 
   override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): Percentile =
 copy(mutableAggBufferOffset = newMutableAggBufferOffset)
 
   override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
Percentile =
 copy(inputAggBufferOffset = newInputAggBufferOffset)
-
+  
--- End diff --


Did not realize that I had forgotten to do the men build which does the 
style check as well


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-08 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95074141
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -81,7 +97,11 @@ case class Percentile(
   case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }
 
-  override def children: Seq[Expression] = child :: percentageExpression 
:: Nil
+  override def children: Seq[Expression] = if( frequency != unit){
--- End diff --

See my other comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-08 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95074238
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -126,10 +155,15 @@ case class Percentile(
   buffer: OpenHashMap[Number, Long],
   input: InternalRow): OpenHashMap[Number, Long] = {
 val key = child.eval(input).asInstanceOf[Number]
-
+val frqValue = frequency.eval(input)
+
 // Null values are ignored in counts map.
-if (key != null) {
-  buffer.changeValue(key, 1L, _ + 1L)
+if (key != null && frqValue != null) {
+  val frqLong = frqValue.asInstanceOf[Number].longValue()
+  //add only when frequency is positive
+  if(frqLong > 0 ){
--- End diff --

Nit add a space between if and its clause


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-08 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95074140
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -91,10 +111,19 @@ case class Percentile(
 case _ => DoubleType
   }
 
-  override def inputTypes: Seq[AbstractDataType] = 
percentageExpression.dataType match {
-case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-case _ => Seq(NumericType, DoubleType)
-  }
+  override def inputTypes: Seq[AbstractDataType] =
+if (frequency == unit) {
+  percentageExpression.dataType match {
+case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
+case _=> Seq(NumericType, DoubleType)
+  }
+} else {
+  percentageExpression.dataType match {
--- End diff --

It is not needed to make a difference between freq = 1 and the freq > 1 
cases. Just use the more complete case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-08 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16497#discussion_r95074181
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala
 ---
@@ -51,26 +51,42 @@ import org.apache.spark.util.collection.OpenHashMap
   _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the 
exact percentile value array
   of numeric column `col` at the given percentage(s). Each value of 
the percentage array must
   be between 0.0 and 1.0.
+  
+  _FUNC_(col, frequency, percentage) - Returns the exact percentile 
value of numeric column `col` 
+  with frequency column `frequency` at the given percentage. The value 
of percentage must be 
+  between 0.0 and 1.0.
+
+  _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - 
Returns the exact percentile 
+  value array of numeric column `col` with frequency column 
`frequency` at the given percentage(s).
+  Each value of the percentage array must be between 0.0 and 1.0.
+
 """)
 case class Percentile(
 child: Expression,
+frequency : Expression,
 percentageExpression: Expression,
 mutableAggBufferOffset: Int = 0,
 inputAggBufferOffset: Int = 0)
   extends TypedImperativeAggregate[OpenHashMap[Number, Long]] with 
ImplicitCastInputTypes {
 
   def this(child: Expression, percentageExpression: Expression) = {
-this(child, percentageExpression, 0, 0)
+this(child, Literal(1l), percentageExpression, 0, 0)
+  }
+  
+  def this(child: Expression, frequency: Expression, percentageExpression: 
Expression) = {
+this(child, frequency, percentageExpression, 0, 0)
   }
 
+  private val unit = Literal(1l)
+  
   override def prettyName: String = "percentile"
 
   override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): Percentile =
 copy(mutableAggBufferOffset = newMutableAggBufferOffset)
 
   override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
Percentile =
 copy(inputAggBufferOffset = newInputAggBufferOffset)
-
+  
--- End diff --

Remove this. This will probably fail style checks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...

2017-01-07 Thread tanejagagan
GitHub user tanejagagan opened a pull request:

https://github.com/apache/spark/pull/16497

[SPARK-19118] [SQL] Percentile support for frequency distribution table

## What changes were proposed in this pull request?

I have a frequency distribution table with following entries
Age,No of person 
21, 10
22, 15
23, 18 
..
..
30, 14
Moreover it is common to have data in frequency distribution format to 
further calculate Percentile, Median. With current implementation
It would be very difficult and complex to find the percentile.
Therefore i am proposing enhancement to current Percentile and Approx 
Percentile implementation to take frequency distribution column into 
consideration 

## How was this patch tested?
1) Enhanced 
/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala
 to cover the additional functionality 
2) Run some performance benchmark test with 20 million row in local 
environment and did not see any performance degradation


Please review http://spark.apache.org/contributing.html before opening a 
pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tanejagagan/spark branch-18940

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16497


commit c638fbc45047a250f7e41ad589e3f144d911e441
Author: gagan taneja 
Date:   2017-01-07T18:03:27Z

[SPARK-18940][SQL] Percentile and approximate percentile support for 
frequency distribution table

commit 3740a0c7785e8480b889c8a99f639906d8a0d5a1
Author: gagan taneja 
Date:   2017-01-07T18:03:27Z

[SPARK-18940][SQL] Percentile and approximate percentile support for 
frequency distribution table

commit 28cdb6652fe5c0953a749b9468548edfc9d42067
Author: gagan taneja 
Date:   2017-01-07T18:23:47Z

Merge branch 'branch-18940' of https://github.com/tanejagagan/spark into 
branch-18940




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org