Yongjia Wang created SPARK-16484:
------------------------------------

             Summary: Incremental cardinality estimation operations with HyperLogLog
                 Key: SPARK-16484
                 URL: https://issues.apache.org/jira/browse/SPARK-16484
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Yongjia Wang


Efficient cardinality estimation is very important, and Spark SQL has had 
approxCountDistinct based on HyperLogLog for quite some time. However, there 
is no way to do incremental estimation. For example, to keep an up-to-date 
distinct count over the last 90 days, the aggregation currently has to be 
re-run over the entire window every time. A more efficient approach is to 
serialize the counters for smaller time windows (such as hourly buckets) so 
the counts can be updated incrementally for any time window.
With support for custom UDAFs, the Binary DataType and the HyperLogLogPlusPlus 
implementation in the current Spark version, it is easy enough to extend the 
functionality to include incremental counting, and even other general set 
operations such as intersection and set difference. The Spark API is already 
as elegant as it can be, but it still takes quite some effort to write a 
custom implementation of the aforementioned operations, which should be in 
high demand. I have searched but failed to find a usable existing solution or 
any ongoing effort in this direction. The closest I found is the following, 
but it does not work with Spark 1.6 due to API changes:
https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
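
To make the idea concrete, below is a minimal sketch of what such a merge UDAF 
could look like, assuming the per-hour sketches are stored in a BinaryType 
column. It uses StreamLib's HyperLogLogPlus for the serialized representation 
only for illustration (Spark's internal HyperLogLogPlusPlus is not a public 
API); the class name HLLMergeCount and the precision parameter p are made up 
for the example.

{code:scala}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF: merges pre-serialized HLL sketches (one per small time
// bucket) and returns the estimated distinct count for the merged window.
class HLLMergeCount(p: Int = 14) extends UserDefinedAggregateFunction {
  // Input: one serialized HLL sketch per row (e.g. one per hourly bucket).
  override def inputSchema: StructType = StructType(StructField("sketch", BinaryType) :: Nil)
  // Buffer: the running merged sketch, also kept in serialized form.
  override def bufferSchema: StructType = StructType(StructField("merged", BinaryType) :: Nil)
  // Output: the estimated distinct count over all merged sketches.
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  private def deserialize(bytes: Array[Byte]): HyperLogLogPlus =
    HyperLogLogPlus.Builder.build(bytes).asInstanceOf[HyperLogLogPlus]

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = new HyperLogLogPlus(p).getBytes

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val merged = deserialize(buffer.getAs[Array[Byte]](0))
      merged.addAll(deserialize(input.getAs[Array[Byte]](0)))
      buffer(0) = merged.getBytes
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val merged = deserialize(buffer1.getAs[Array[Byte]](0))
    merged.addAll(deserialize(buffer2.getAs[Array[Byte]](0)))
    buffer1(0) = merged.getBytes
  }

  override def evaluate(buffer: Row): Any =
    deserialize(buffer.getAs[Array[Byte]](0)).cardinality()
}
{code}

The hourly sketches themselves would come from a companion UDAF that offers 
raw values into a HyperLogLogPlus and returns its serialized bytes; once 
merged counts are available, an intersection can be approximated via 
inclusion-exclusion on the union counts. This is only a sketch of the 
approach; a built-in version would presumably reuse Spark's own 
HyperLogLogPlusPlus registers instead of an external library.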

I wonder whether it is worth integrating such operations into Spark SQL. The 
only problem I see is that it would depend on the serialization format of a 
specific HLL implementation, which could introduce compatibility issues. But 
as long as the user is aware of this, it should be fine.


