Yongjia Wang created SPARK-16484:
------------------------------------

             Summary: Incremental cardinality estimation operations with HyperLogLog
                 Key: SPARK-16484
                 URL: https://issues.apache.org/jira/browse/SPARK-16484
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Yongjia Wang


Efficient cardinality estimation is very important, and Spark SQL has had 
approxCountDistinct based on HyperLogLog for quite some time. However, there 
is no way to do incremental estimation. For example, to keep an up-to-date 
distinct count over the last 90 days, the aggregation currently has to be 
re-run over the entire window every time. A more efficient approach is to 
serialize the counters for smaller time windows (such as hourly buckets) so 
the counts can be updated incrementally for any time window.
With support for custom UDAFs, the Binary DataType and the HyperLogLogPlusPlus 
implementation in the current Spark version, it is easy enough to extend the 
functionality to include incremental counting, and even other general set 
operations such as intersection and set difference. The Spark API is already 
as elegant as it can be, but it still takes quite some effort to write a 
custom implementation of the aforementioned operations, which should be in 
high demand. I have searched but failed to find a usable existing solution or 
any ongoing effort in this direction. The closest I found is the following, 
but it does not work with Spark 1.6 due to API changes:
https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
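
To make the idea concrete, below is a minimal sketch of what such a merge UDAF 
could look like, assuming the per-hour sketches are stored in a BinaryType 
column. It uses StreamLib's HyperLogLogPlus for the serialized representation 
only for illustration (Spark's internal HyperLogLogPlusPlus is not a public 
API); the class name HLLMergeCount and the precision parameter p are made up 
for the example.

{code:scala}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF: merges pre-serialized HLL sketches (one per small time
// bucket) and returns the estimated distinct count for the merged window.
class HLLMergeCount(p: Int = 14) extends UserDefinedAggregateFunction {
  // Input: one serialized HLL sketch per row (e.g. one per hourly bucket).
  override def inputSchema: StructType = StructType(StructField("sketch", BinaryType) :: Nil)
  // Buffer: the running merged sketch, also kept in serialized form.
  override def bufferSchema: StructType = StructType(StructField("merged", BinaryType) :: Nil)
  // Output: the estimated distinct count over all merged sketches.
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  private def deserialize(bytes: Array[Byte]): HyperLogLogPlus =
    HyperLogLogPlus.Builder.build(bytes).asInstanceOf[HyperLogLogPlus]

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = new HyperLogLogPlus(p).getBytes

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val merged = deserialize(buffer.getAs[Array[Byte]](0))
      merged.addAll(deserialize(input.getAs[Array[Byte]](0)))
      buffer(0) = merged.getBytes
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val merged = deserialize(buffer1.getAs[Array[Byte]](0))
    merged.addAll(deserialize(buffer2.getAs[Array[Byte]](0)))
    buffer1(0) = merged.getBytes
  }

  override def evaluate(buffer: Row): Any =
    deserialize(buffer.getAs[Array[Byte]](0)).cardinality()
}
{code}

The hourly sketches themselves would come from a companion UDAF that offers 
raw values into a HyperLogLogPlus and returns its serialized bytes; once 
merged counts are available, an intersection can be approximated via 
inclusion-exclusion on the union counts. This is only a sketch of the 
approach; a built-in version would presumably reuse Spark's own 
HyperLogLogPlusPlus registers instead of an external library.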

I wonder whether it is worth integrating such operations into Spark SQL. The 
only problem I see is that it would depend on the serialization format of a 
specific HLL implementation, which could introduce compatibility issues. But 
as long as the user is aware of this, it should be fine.


