[GitHub] [spark] zero323 commented on a change in pull request #27278: [SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions.

GitBox Tue, 21 Jan 2020 04:23:05 -0800

zero323 commented on a change in pull request #27278: 
[SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions.
URL: https://github.com/apache/spark/pull/27278#discussion_r368969058


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -652,6 +652,122 @@ object functions {
    */
   def min(columnName: String): Column = min(Column(columnName))
 
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column = {
+    withAggregateFunction {
+      new ApproximatePercentile(
+        e.expr, typedLit(percentage).expr, lit(accuracy).expr
+      )
+    }
+  }
+
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column = {
+    percentile_approx(Column(columnName), percentage, accuracy)
+  }
+
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column = {
+    percentile_approx(e, percentage.toArray, accuracy)
+  }
+
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column = {
+    percentile_approx(Column(columnName), percentage.toArray, accuracy)
+  }
+
+  /**
+   * Aggregate function: Returns the approximate percentile value of numeric
+   * column col at the given percentage.
+   *
+   * The value of percentage must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(e: Column, percentage: Double, accuracy: Long): Column = {
+    withAggregateFunction {
+      new ApproximatePercentile(
+        e.expr, lit(percentage).expr, lit(accuracy).expr
+      )
+    }
+  }
+
+  /**
+   * Aggregate function: Returns the approximate percentile value of numeric
+   * column col at the given percentage.
+   *
+   * The value of percentage must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column = {
 
 Review comment:
   To be honest, I am not very enthusiastic about it, and I am not even
   convinced that it is consistent with the rest of `functions`.
   
   The closest equivalents we have are
   
   - `approx_count_distinct` with `rsd`
   - `last` with `ignoreNulls`
   
   and both use external types, not columns. Not to mention that this is still
   counter-intuitive and painful to use. That said,
   
   > we don't need to duplicate docs with less maintenance.
   
   is a fair point.
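   
   For reference, a minimal sketch of how the two existing functions named
   above take plain Scala values rather than `Column`s for their extra
   arguments. The column names `user_id` and `event_time` are hypothetical
   and only serve as illustration.
   
   ```scala
   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.{approx_count_distinct, col, last}

   object ExternalTypeArgs {
     // `rsd` (maximum relative standard deviation) is a plain Double, not a Column.
     // "user_id" is a hypothetical column name.
     val distinctUsers: Column = approx_count_distinct(col("user_id"), 0.05)

     // `ignoreNulls` is a plain Boolean, not a Column.
     // "event_time" is a hypothetical column name.
     val lastSeen: Column = last(col("event_time"), ignoreNulls = true)
   }
   ```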
   
   - I can easily remove the `Seq` variants, that's for sure, and cut the
     number of signatures by two, leaving us with four.
   - If not having a non-`Column` variant on the JVM is fine, we can drop the
     `(String, _, _) => Column` variant, which brings us to two variants.
   - It is also not hard to build `Column` objects transparently for Python
     and R users to support `(Column, Column, Column) => Column`. But I am
     still concerned about the confusing semantics.
   
     If two variants are still too much, we could always have
     `(Column, Any, Double) => Column`; `o.a.sql.functions` is already quite
     full of `Any`s. Or, if we're fine with making Java users miserable, we
     could use `(Column, Either[Double, Array[Double]], Double) => Column`,
     but this will require additional supporting code for R and Python.
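   
   A rough sketch of what the single `Any`-typed variant could look like
   from user code. Everything here is an assumption for illustration:
   `PercentileApproxSketch` and `percentileApprox` are hypothetical names,
   `accuracy` is typed as `Long` to match the signatures in the diff, and
   the helper relies on `callUDF` resolving the built-in SQL
   `percentile_approx` function by name rather than on the
   `withAggregateFunction` helper used in the PR.
   
   ```scala
   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.{array, callUDF, lit}

   object PercentileApproxSketch {
     // Hypothetical helper, not part of Spark: one signature that accepts
     // either a single percentage or a collection of percentages.
     def percentileApprox(e: Column, percentage: Any, accuracy: Long): Column = {
       val p: Column = percentage match {
         case d: Double         => lit(d)
         case ds: Array[Double] => array(ds.toSeq.map(d => lit(d)): _*)
         // Element types of a Seq are erased at runtime and are not checked here.
         case ds: Seq[_]        => array(ds.map(d => lit(d)): _*)
         case other =>
           throw new IllegalArgumentException(
             s"Unsupported percentage type: ${other.getClass.getName}")
       }
       // Assumption: callUDF resolves built-in functions such as
       // percentile_approx by name.
       callUDF("percentile_approx", e, p, lit(accuracy))
     }
   }
   ```
   
   The cost of the single signature is visible in the `match`: a wrong
   argument type is only rejected at runtime, which is the usual price of
   `Any` in this kind of API.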
   
   

