[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371709#comment-15371709 ]

Yongjia Wang commented on SPARK-16484:
--------------------------------------

Yes, I agree all the building blocks are there, and it is easy enough to put 
together a solution now. What I did is essentially the second approach you 
mentioned: saving the HLL++ "buffer" as a byte-array column, with a custom 
UDAF to merge the sketches in a SQL expression (a rough sketch of that UDAF 
follows below). 
My point was to ask whether it is worth extending Spark SQL to include those 
extra UDAFs, making them more accessible to regular Spark users. Also, 
intersecting multiple sets can be tricky to get right; wouldn't it be nice to 
have that as part of Spark SQL's standard set of functions?
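
As a minimal sketch of what I mean, here is roughly what such a merge UDAF 
looks like. This assumes stream-lib's HyperLogLogPlus as the serialized 
sketch format (the issue mentions Spark's internal HyperLogLogPlusPlus, but 
its buffer layout is not a public API, so stream-lib is a stand-in here):

{code}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Merges serialized HLL++ sketches stored in a BinaryType column.
// All sketches are assumed to have been built with the same precision p.
class HLLMergeUDAF(p: Int = 14) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType  = StructType(StructField("sketch", BinaryType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("merged", BinaryType) :: Nil)
  override def dataType: DataType       = BinaryType
  override def deterministic: Boolean   = true

  private def fromBytes(bytes: Array[Byte]): HyperLogLogPlus =
    HyperLogLogPlus.Builder.build(bytes)

  // Start from an empty sketch so empty groups still evaluate cleanly.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = new HyperLogLogPlus(p).getBytes

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val acc = fromBytes(buffer.getAs[Array[Byte]](0))
      acc.addAll(fromBytes(input.getAs[Array[Byte]](0)))
      buffer(0) = acc.getBytes
    }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val acc = fromBytes(buffer1.getAs[Array[Byte]](0))
    acc.addAll(fromBytes(buffer2.getAs[Array[Byte]](0)))
    buffer1(0) = acc.getBytes
  }

  // Return the merged sketch itself so it can be stored and merged again later.
  override def evaluate(buffer: Row): Any = buffer.getAs[Array[Byte]](0)
}
{code}

Once registered through sqlContext.udf.register, the merge can be invoked 
directly from a SQL expression. Intersection, by contrast, has to go through 
inclusion-exclusion (|A ∩ B| ≈ |A| + |B| - |A ∪ B|), which is where the 
trickiness I mentioned comes from.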

> Incremental Cardinality estimation operations with Hyperloglog
> --------------------------------------------------------------
>
>                 Key: SPARK-16484
>                 URL: https://issues.apache.org/jira/browse/SPARK-16484
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Yongjia Wang
>
> Efficient cardinality estimation is very important, and Spark SQL has had 
> approxCountDistinct based on HyperLogLog for quite some time. However, there 
> isn't a way to do incremental estimation. For example, to get updated 
> distinct counts over the last 90 days, we have to redo the aggregation over 
> the entire window every time. A more efficient approach is to serialize the 
> counters for smaller time windows (such as hourly) so the counts can be 
> updated incrementally for any time window.
> With support for custom UDAFs, the Binary DataType, and the 
> HyperLogLogPlusPlus implementation in the current Spark version, it is easy 
> enough to extend the functionality to include incremental counting, and even 
> other general set operations such as intersection and set difference. The 
> Spark API is already about as elegant as it can be, but it still takes quite 
> some effort to do a custom implementation of the aforementioned operations, 
> which should be in high demand. I have been searching but have failed to 
> find a usable existing solution or any ongoing effort in this direction. The 
> closest I found is the following, but it does not work with Spark 1.6 due to 
> API changes. 
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into Spark SQL. The 
> only problem I see is that it depends on the serialization of a specific HLL 
> implementation and so introduces compatibility concerns. But as long as the 
> user is aware of that, it should be fine.
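
For the incremental-window pattern described in the quoted issue, a 
hypothetical usage of the UDAF sketched above could look roughly like the 
following (the events DataFrame, its hour and user_id columns, the 
hourly_sketches path, and the hll_create / hll_cardinality helpers are all 
illustrative names, not existing APIs):

{code}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.functions.expr

// Illustrative helper UDFs; the single-element hll_create is inefficient
// and only here to keep the example short.
sqlContext.udf.register("hll_create", (id: String) => {
  val hll = new HyperLogLogPlus(14)
  hll.offer(id)
  hll.getBytes
})
sqlContext.udf.register("hll_merge", new HLLMergeUDAF())
sqlContext.udf.register("hll_cardinality",
  (bytes: Array[Byte]) => HyperLogLogPlus.Builder.build(bytes).cardinality())

// One merged sketch per hour, computed once and stored.
// events: a DataFrame of raw events with columns hour and user_id (illustrative).
events.selectExpr("hour", "hll_create(user_id) as sketch")
  .groupBy("hour")
  .agg(expr("hll_merge(sketch) as sketch"))
  .write.parquet("hourly_sketches")

// Any later window (e.g. the last 90 days) only merges the stored sketches
// instead of re-aggregating the raw events.
sqlContext.read.parquet("hourly_sketches")
  .where("hour >= date_add(current_date(), -90)")
  .selectExpr("hll_cardinality(hll_merge(sketch)) as distinct_users")
{code}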


