Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8362#discussion_r40610794
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
    @@ -302,3 +307,393 @@ case class Sum(child: Expression) extends 
AlgebraicAggregate {
     
       override val evaluateExpression = Cast(currentSum, resultType)
     }
    +
    +/**
    + * HyperLogLog++ is a state of the art cardinality estimation algorithm.
    + *
    + *
    + * This implementation has been based on the following papers:
    + * Papers:
    + * http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
    + *
    + * HyperLogLog in Practice: Algorithmic Engineering of a State of
    + * The Art Cardinality Estimation Algorithm
    + *
    + * 
https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen#
    + *
    + * This implementation has been based on the following implementations:
    + * Note on provenance
    + * - Clearspring:
    + * - Aggregage Knowledge:
    + * - Algebird:
    + *
    + * Note on naming: Tried to match the paper.
    + *
    + *
    + * @param child
    + * @param relativeSD
    + */
    +case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 
0.05)
    +    extends AggregateFunction2 {
    +  import HyperLogLogPlusPlus._
    +
    +  /**
    +   * HLL++ uses 'b' bits for addressing. The more addressing bits we use, 
the more accurate the
    +   * algorithm will be, and the more memory it will require. The 'b' value 
is based on the accuracy
    +   * requested.
    +   *
    +   * HLL++ requires that we use at least 4 bits of addressing space (a 
minimum accuracy of 27%).
    +   *
    +   * TODO we currently round down to the nearest integer. This means 
accuracy is typically worse
    +   * than the user expects.
    +   */
    +  private[this] val b = {
    +    (2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)).toInt
    +  }
    +  require(b >= 4, "HLL++ requires at least 4 bits for addressing. " +
    +    "Use a higher accuracy, at least 27%.")
    +
    +  /**
    +   * Shift used to extract the 'j' (the register) value from the hashed 
value.
    +   *
    +   * This assumes the use of 64-bit hashcodes.
    +   */
    +  private[this] val jShift = JLong.SIZE - b
    +
    +  /**
    +   * Value to pad the 'w' value with before the number of leading zeros is 
determined.
    +   */
    +  private[this] val wPadding = 1L << (b - 1)
    +
    +  /**
    +   * The number of registers used.
    +   */
    +  private[this] val m = 1 << b
    +
    +  /**
    +   * The pre-calculated combination of: alpha * m * m
    +   *
    +   * 'alpha' corrects the raw cardinality estimate 'Z'. See the FlFuGaMe07 
paper for its
    +   * derivation.
    +   */
    +  private[this] val alphaM2 = b match {
    +    case 4 => 0.673d * m * m
    +    case 5 => 0.697d * m * m
    +    case 6 => 0.709d * m * m
    +    case _ => (0.7213d / (1.0d + 1.079d / m)) * m * m
    +  }
    +
    +  /**
    +   * The number of words used to store the registers. We use Longs for 
storage because this is the
    +   * most compact way of storage; Spark aligns to 8-byte words or uses 
Long wrappers.
    +   *
    +   * We only store whole registers per word in order to prevent overly 
complex bitwise operations.
    +   * In practice this means we only use 60 out of 64 bits.
    +   */
    +  private[this] val numWords = m / REGISTERS_PER_WORD match {
    +    case x if m % REGISTERS_PER_WORD == 0 => x
    --- End diff --
    
    This will not happen here, since m is 1<<b, but REGISTERS_PER_WORD is ten. 
We could remove this case (just waste one word in worst case if something 
changed in future).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to