Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8362#discussion_r40610006
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala ---
    @@ -302,3 +307,393 @@ case class Sum(child: Expression) extends AlgebraicAggregate {
     
       override val evaluateExpression = Cast(currentSum, resultType)
     }
    +
    +/**
    + * HyperLogLog++ is a state of the art cardinality estimation algorithm.
    + *
    + *
    + * This implementation has been based on the following papers:
    + * Papers:
    + * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
    + *
    + * HyperLogLog in Practice: Algorithmic Engineering of a State of
    + * The Art Cardinality Estimation Algorithm
    + *
    + * https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen#
    + *
    + * This implementation has been based on the following implementations:
    + * Note on provenance
    + * - Clearspring:
    + * - Aggregate Knowledge:
    + * - Algebird:
    + *
    + * Note on naming: Tried to match the paper.
    + *
    + *
    + * @param child
    + * @param relativeSD
    + */
    +case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)
    +    extends AggregateFunction2 {
    +  import HyperLogLogPlusPlus._
    +
    +  /**
    +   * HLL++ uses 'b' bits for addressing. The more addressing bits we use, the more accurate the
    +   * algorithm will be, and the more memory it will require. The 'b' value is based on the accuracy
    +   * requested.
    +   *
    +   * HLL++ requires that we use at least 4 bits of addressing space (a minimum accuracy of 27%).
    +   *
    +   * TODO we currently round down to the nearest integer. This means accuracy is typically worse
    --- End diff --
    
    Should we round it up?
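
    To make the trade-off concrete, here is a minimal sketch of the two
    rounding choices. It assumes the expected relative standard deviation is
    roughly 1.106 / sqrt(2^b), which is an assumption on my part; it is
    consistent with the 27% quoted for b = 4, but the constant and every name
    below are illustrative and not taken from the patch.

```scala
object HllPrecisionSketch {
  // Assumed error model (not from the patch): relativeSD ~= 1.106 / sqrt(m), m = 2^b.
  val ErrorConstant = 1.106d

  // Minimum number of addressing bits, as stated in the Scaladoc under review.
  val MinBits = 4

  // Real-valued number of bits that would hit the requested accuracy exactly.
  def exactBits(relativeSD: Double): Double =
    2.0d * math.log(ErrorConstant / relativeSD) / math.log(2.0d)

  // Rounding down: fewer registers, so the error can exceed the request.
  def bitsRoundedDown(relativeSD: Double): Int =
    math.max(MinBits, math.floor(exactBits(relativeSD)).toInt)

  // Rounding up: more registers, so the error stays at or below the request.
  def bitsRoundedUp(relativeSD: Double): Int =
    math.max(MinBits, math.ceil(exactBits(relativeSD)).toInt)

  // Expected relative standard deviation for a given number of bits.
  def expectedRelativeSD(b: Int): Double =
    ErrorConstant / math.sqrt(math.pow(2.0d, b))
}
```

    Under that assumption, the default relativeSD = 0.05 needs about 8.93
    bits. Rounding down picks b = 8 (expected error about 6.9%, worse than
    requested), while rounding up picks b = 9 (about 4.9%) at the cost of
    twice as many registers.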

