[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76898997
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yields better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yields
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` 
is the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
--- End diff --

And we can use the InputType Check facility to get a more consistent error 
message if check fails. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76898106
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yields better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yields
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` 
is the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
--- End diff --

The sql string is automatically generated, without hacking.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76898056
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yields better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yields
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` 
is the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
+accuracyExpression: Expression,
+override val mutableAggBufferOffset: Int,
+override val inputAggBufferOffset: Int) extends 
TypedImperativeAggregate[PercentileDigest] {
+
+  def this(child: Expression, percentageExpression: Expression, 
accuracyExpression: Expression) = {
+this(child, percentageExpression, accuracyExpression, 0, 0)
+  }
+
+  def this(child: Expression, percentageExpression: Expression) = {
+this(child, percentageExpression, 
Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
+  }
+
+  // Mark as lazy so that accuracyExpression is not evaluated during tree 
transformation.
+  privat

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76803012
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentileSuite.scala
 ---
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.{SimpleAnalyzer, 
UnresolvedAttribute}
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.TypeCheckFailure
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Alias, 
AttributeReference, BoundReference, Cast, CreateArray, DecimalLiteral, 
GenericMutableRow, Literal}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import org.apache.spark.sql.catalyst.util.QuantileSummaries.Stats
+import org.apache.spark.sql.types.{ArrayType, DoubleType, IntegerType}
+import org.apache.spark.util.SizeEstimator
+
+
+class ApproximatePercentileSuite extends SparkFunSuite {
+
+  private val random = new java.util.Random()
+
+  private val data = (0 until 1).map { _ =>
+random.nextInt(1)
+  }
+
+  test("serialize and de-serialize") {
+
--- End diff --

Remove this new line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76782663
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yields better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yields
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` 
is the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
+accuracyExpression: Expression,
+override val mutableAggBufferOffset: Int,
+override val inputAggBufferOffset: Int) extends 
TypedImperativeAggregate[PercentileDigest] {
+
+  def this(child: Expression, percentageExpression: Expression, 
accuracyExpression: Expression) = {
+this(child, percentageExpression, accuracyExpression, 0, 0)
+  }
+
+  def this(child: Expression, percentageExpression: Expression) = {
+this(child, percentageExpression, 
Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
+  }
+
+  // Mark as lazy so that accuracyExpression is not evaluated during tree 
transformation.
+  priva

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76781601
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yields better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yields
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` 
is the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
--- End diff --

why not follow this style? 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala#L56


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, 

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76761692
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yield better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yield
--- End diff --

Nit: yield => yields


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76761771
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yield better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yield
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yield better accuracy, `1.0/accuracy` is 
the relative error of
--- End diff --

Same as above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76761387
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
--- End diff --

Nit: "a array" => "an array"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76761445
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yield better accuracy, the default value is
--- End diff --

Nit: yield => yields


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-29 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76697179
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala
 ---
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.PercentileDigest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ApproximatePercentileQuerySuite extends QueryTest with 
SharedSQLContext {
+  import testImplicits._
+
+  private val table = "percentile_test"
+
+  test("percentile_approx, single percentile value") {
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(
+  s"""
+ |SELECT
+ |  percentile_approx(col, 0.25),
+ |  percentile_approx(col, 0.5),
+ |  percentile_approx(col, 0.75d),
+ |  percentile_approx(col, 0.0),
+ |  percentile_approx(col, 1.0),
+ |  percentile_approx(col, 0),
+ |  percentile_approx(col, 1)
+ |FROM $table
+   """.stripMargin),
+Row(250D, 500D, 750D, 1D, 1000D, 1D, 1000D)
+  )
+}
+  }
+
+  test("percentile_approx, array of percentile value") {
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(
+  s"""SELECT
+ |  percentile_approx(col, array(0.25, 0.5, 0.75D)),
+ |  count(col),
+ |  percentile_approx(col, array(0.0, 1.0)),
+ |  sum(col)
+ |FROM $table
+   """.stripMargin),
+Row(Seq(250D, 500D, 750D), 1000, Seq(1D, 1000D), 500500)
+  )
+}
+  }
+
+  test("percentile_approx, with different accuracies") {
+
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+
+  // With different accuracies
+  val expectedPercentile = 250D
+  val accuracies = Array(1, 10, 100, 1000, 1)
+  val errors = accuracies.map { accuracy =>
+val df = spark.sql(s"SELECT percentile_approx(col, 0.25, 
$accuracy) FROM $table")
+val approximatePercentile = df.collect().head.getDouble(0)
+val error = Math.abs(approximatePercentile - expectedPercentile)
+error
+  }
+
+  // The larger accuracy value we use, the smaller error we get
+  assert(errors.sorted.sameElements(errors.reverse))
+}
+  }
+
+  test("percentile_approx, supports constant folding for parameter 
accuracy and percentages") {
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 800D) FROM $table"),
+Row(Seq(500D))
+  )
+}
+  }
+
+  test("percentile_approx(), aggregation on empty input table, no group 
by") {
+withTempView(table) {
+  Seq.empty[Int].toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(s"SELECT sum(col), percentile_approx(col, 0.5) FROM 
$table"),
+Row(null, null)
+  )
+}
+  }
+
+  test("percentile_approx(), aggregation on empty input table, with group 
by") {
--- End diff --

SortBasedAggregationIterator has different code path for "with group by" 
and "no group by", so it is better that we test both cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-29 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76695715
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,302 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yield better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yield
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yield better accuracy, `1.0/accuracy` is 
the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
+accuracyExpression: Expression,
+override val mutableAggBufferOffset: Int,
+override val inputAggBufferOffset: Int) extends 
TypedImperativeAggregate[PercentileDigest] {
+
+  def this(child: Expression, percentageExpression: Expression, 
accuracyExpression: Expression) = {
+this(child, percentageExpression, accuracyExpression, 0, 0)
+  }
+
+  def this(child: Expression, percentageExpression: Expression) = {
+this(child, percentageExpression, 
Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
+  }
+
+  // Mark as lazy so that accuracyExpression is not evaluated during tree 
transformation.
+  private l

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-29 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76695365
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala
 ---
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.PercentileDigest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ApproximatePercentileQuerySuite extends QueryTest with 
SharedSQLContext {
+  import testImplicits._
+
+  private val table = "percentile_test"
+
+  test("percentile_approx, single percentile value") {
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(
+  s"""
+ |SELECT
+ |  percentile_approx(col, 0.25),
+ |  percentile_approx(col, 0.5),
+ |  percentile_approx(col, 0.75d),
+ |  percentile_approx(col, 0.0),
+ |  percentile_approx(col, 1.0),
+ |  percentile_approx(col, 0),
+ |  percentile_approx(col, 1)
+ |FROM $table
+   """.stripMargin),
+Row(250D, 500D, 750D, 1D, 1000D, 1D, 1000D)
+  )
+}
+  }
+
+  test("percentile_approx, array of percentile value") {
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(
+  s"""SELECT
+ |  percentile_approx(col, array(0.25, 0.5, 0.75D)),
+ |  count(col),
+ |  percentile_approx(col, array(0.0, 1.0)),
+ |  sum(col)
+ |FROM $table
+   """.stripMargin),
+Row(Seq(250D, 500D, 750D), 1000, Seq(1D, 1000D), 500500)
+  )
+}
+  }
+
+  test("percentile_approx, with different accuracies") {
+
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+
+  // With different accuracies
+  val expectedPercentile = 250D
+  val accuracies = Array(1, 10, 100, 1000, 1)
+  val errors = accuracies.map { accuracy =>
+val df = spark.sql(s"SELECT percentile_approx(col, 0.25, 
$accuracy) FROM $table")
+val approximatePercentile = df.collect().head.getDouble(0)
+val error = Math.abs(approximatePercentile - expectedPercentile)
+error
+  }
+
+  // The larger accuracy value we use, the smaller error we get
+  assert(errors.sorted.sameElements(errors.reverse))
+}
+  }
+
+  test("percentile_approx, supports constant folding for parameter 
accuracy and percentages") {
+withTempView(table) {
+  (1 to 1000).toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 800D) FROM $table"),
+Row(Seq(500D))
+  )
+}
+  }
+
+  test("percentile_approx(), aggregation on empty input table, no group 
by") {
+withTempView(table) {
+  Seq.empty[Int].toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(s"SELECT sum(col), percentile_approx(col, 0.5) FROM 
$table"),
+Row(null, null)
+  )
+}
+  }
+
+  test("percentile_approx(), aggregation on empty input table, with group 
by") {
+withTempView(table) {
+  Seq.empty[Int].toDF("col").createOrReplaceTempView(table)
+  checkAnswer(
+spark.sql(s"SELECT sum(col), percentile_approx(col, 0.5) FROM 
$table GROUP BY col"),
+Seq.empty[Row]
+  )
+}
+  }
+
+  test("percentile_approx(null), aggregation with 

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-29 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76695042
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -0,0 +1,302 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflectionLock}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import 
org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, 
TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest,
 PercentileDigestSerializer}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, 
Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate 
percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given 
percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 
50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with 
`child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single 
percentage value or
+ * a array of percentage values. Each 
percentage value must be between
+ * 0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation 
accuracy. Higher value
+ *   yield better accuracy, the default value is
+ *   DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+"""
+  _FUNC_(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
+  column `col` at the given percentage. The value of percentage must 
be between 0.0
+  and 1.0. The `accuracy` parameter (default: 1) is a positive 
integer literal which
+  controls approximation accuracy at the cost of memory. Higher value 
of `accuracy` yield
+  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.
+
+  _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - 
Returns the approximate
+  percentile array of column `col` at the given percentage array. Each 
value of the
+  percentage array must be between 0.0 and 1.0. The `accuracy` 
parameter (default: 1) is
+   a positive integer literal which controls approximation accuracy at 
the cost of memory.
+   Higher value of `accuracy` yield better accuracy, `1.0/accuracy` is 
the relative error of
+   the approximation.
+""")
+case class ApproximatePercentile(
+child: Expression,
+percentageExpression: Expression,
+accuracyExpression: Expression,
+override val mutableAggBufferOffset: Int,
+override val inputAggBufferOffset: Int) extends 
TypedImperativeAggregate[PercentileDigest] {
+
+  def this(child: Expression, percentageExpression: Expression, 
accuracyExpression: Expression) = {
+this(child, percentageExpression, accuracyExpression, 0, 0)
+  }
+
+  def this(child: Expression, percentageExpression: Expression) = {
+this(child, percentageExpression, 
Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
+  }
+
+  // Mark as lazy so that accuracyExpression is not evaluated during tree 
transformation.
+  private l

[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-29 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r76690870
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala
 ---
@@ -41,7 +41,7 @@ import 
org.apache.spark.sql.catalyst.util.QuantileSummaries.Stats
  * @param count the count of all the elements *inserted in the sampled 
buffer*
  *  (excluding the head buffer)
  */
-class QuantileSummaries(
+case class QuantileSummaries(
--- End diff --

Make it as a case class, so that we can use the expression encoder to 
serialize it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

2016-08-29 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/14868

Implements percentile_approx aggregation function which supports partial 
aggregation.

## What changes were proposed in this pull request?

This PR implements aggregation function `percentile_approx`. Function 
`percentile_approx` returns the approximate percentile(s) of a column at the 
given percentage(s). A percentile is a watermark value below which a given 
percentage of the column values fall. For example, the percentile of column 
`col` at percentage 50% is the median value of column `col`. 

### Syntax:
```
# Returns percentile at a given percentage value. The approximation error 
can be reduced by increasing parameter accuracy, at the cost of memory.
percentile_approx(col, percentage [, accuracy])

# Returns percentile value array at given percentage value array 
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
```

### Features:
1. This function supports partial aggregation.
2. The memory consumption is bounded. The larger `accuracy` parameter we 
choose, we smaller error we get. The default accuracy value is 1, to match 
with Hive default setting. Choose a smaller value for smaller memory footprint. 
3.  This function supports window function aggregation.

### Example usages:
```
## Returns the 25th percentile value, with default accuracy
SELECT percentile_approx(col, 0.25) FROM table

## Returns an array of percentile value (25th, 50th, 75th), with default 
accuracy
SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table

## Returns 25th percentile value, with custom accuracy value 100, larger 
accuracy parameter yields smaller approximation error 
SELECT percentile_approx(col, 0.25, 100) FROM table

## Returns the 25th, and 50th percentile values, with custom accuracy value 
100
SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
```

### NOTE:
1. The `percentile_approx` implementation is different from Hive, so the 
result returned on same query maybe slightly different with Hive. This 
implementation uses `QuantileSummaries` as the underlying probabilistic data 
structure, and mainly follows paper `Space-efficient Online Computation of 
Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev. 
(http://dx.doi.org/10.1145/375663.375670)`
2. The current implementation of `QuantileSummaries` doesn't support 
automatic compression. This PR has a rule to do compression automatically at 
the caller side, but it may not be optimal.

## How was this patch tested?

Unit test, and Sql query test.

## Acknowledgement
1. This PR's work in based on @lw-lin's PR 
https://github.com/apache/spark/pull/14298, with improvements like supporting 
partial aggregation, fixing out of memory issue.
2. Thanks to @cloud-fan for the implementation of serialization using 
Expression Encoder.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark appro_percentile_try_2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14868


commit d71397cc283493a946c42124e41309d7e64d0b12
Author: Sean Zhong 
Date:   2016-08-19T16:34:56Z

percentile approximate




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org