[jira] [Comment Edited] (CALCITE-1787) thetaSketch Support for Druid Adapter

slim bouguerra (JIRA) Thu, 01 Jun 2017 08:39:25 -0700

    [ 
https://issues.apache.org/jira/browse/CALCITE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032985#comment-16032985
 ]


slim bouguerra edited comment on CALCITE-1787 at 6/1/17 3:38 PM:
-----------------------------------------------------------------

[~zhumayun] please read the sketch docs.
1 - I don't agree with the claim that queries like {code} SELECT COUNT(DISTINCT 
"user_unique") FROM "foodmart" WHERE "store_city" = 'Chicago' AND "store_city" 
= 'Seattle'; {code} work fine.
Pushing only the filters to Druid will produce wrong results. You need a 
post-aggregation over filtered aggregators to compute the intersection of the 
sketches. Without the intersection, the result you get is the union, which 
means duplicates are counted and you are not getting unique counts.
2 - For filters on metrics, or the more general case where we cannot push the 
filter/query to Druid, Calcite cannot do much: a sketch is a binary blob that 
needs a serialization/deserialization library. I am not sure what the best 
path is; I don't know Calcite well enough to answer that question.
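
To make point 1 concrete, here is a minimal sketch of the kind of Druid query the comment seems to be describing: one filtered thetaSketch aggregator per predicate, combined by an INTERSECT post-aggregation. The aggregator and post-aggregator type names come from the Druid DataSketches extension; the helper functions, output names, interval, and sample user ids are illustrative assumptions, not anything the Calcite adapter generates today.

```python
import json

# Toy illustration with exact sets (hypothetical user ids): the union
# over-counts relative to the intersection.
chicago_users = {"u1", "u2", "u3"}
seattle_users = {"u2", "u3", "u4"}
union_count = len(chicago_users | seattle_users)         # 4
intersection_count = len(chicago_users & seattle_users)  # 2


def selector(dimension, value):
    """Druid 'selector' filter: dimension equals value."""
    return {"type": "selector", "dimension": dimension, "value": value}


def filtered_theta(name, field_name, flt):
    """A thetaSketch aggregator wrapped in a 'filtered' aggregator."""
    return {
        "type": "filtered",
        "filter": flt,
        "aggregator": {
            "type": "thetaSketch",
            "name": name,
            "fieldName": field_name,
        },
    }


def intersect_estimate(name, sketch_names):
    """Estimate the size of the intersection of the named sketches."""
    return {
        "type": "thetaSketchEstimate",
        "name": name,
        "field": {
            "type": "thetaSketchSetOp",
            "name": name + "_sketch",
            "func": "INTERSECT",
            "fields": [
                {"type": "fieldAccess", "fieldName": n} for n in sketch_names
            ],
        },
    }


query = {
    "queryType": "timeseries",
    "dataSource": "foodmart",
    "granularity": "all",
    "intervals": ["1900-01-01/2100-01-01"],  # placeholder interval (assumption)
    "aggregations": [
        filtered_theta("chicago", "user_unique",
                       selector("store_city", "Chicago")),
        filtered_theta("seattle", "user_unique",
                       selector("store_city", "Seattle")),
    ],
    "postAggregations": [
        intersect_estimate("unique_users", ["chicago", "seattle"]),
    ],
}

print(json.dumps(query, indent=2))
```

The two predicates are not pushed down as a single query-level filter here; each one scopes its own sketch, and only the post-aggregation combines them.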


was (Author: bslim):
[~zhumayun] please read the sketch docs.
1 - I don't agree with the claim that queries like {code} SELECT COUNT(DISTINCT 
"user_unique") FROM "foodmart" WHERE "the_month" = 'April' AND "store_city" = 
'Seattle'; {code} work fine.
Pushing only the filters to Druid will produce wrong results. You need a 
post-aggregation over filtered aggregators to compute the intersection of the 
sketches. Without the intersection, the result you get is the union, which 
means duplicates are counted and you are not getting unique counts.
2 - For filters on metrics, or the more general case where we cannot push the 
filter/query to Druid, Calcite cannot do much: a sketch is a binary blob that 
needs a serialization/deserialization library. I am not sure what the best 
path is; I don't know Calcite well enough to answer that question.

> thetaSketch Support for Druid Adapter
> -------------------------------------
>
>                 Key: CALCITE-1787
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1787
>             Project: Calcite
>          Issue Type: New Feature
>          Components: druid
>    Affects Versions: 1.12.0
>            Reporter: Zain Humayun
>            Assignee: Zain Humayun
>            Priority: Minor
>
> Currently, the Druid adapter does not support the 
> [thetaSketch|http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html]
>  aggregate type, which is used to measure the cardinality of a column 
> quickly. Many Druid instances support theta sketches, so I think it would be 
> a nice feature to have.
> I've been looking at the Druid adapter, and propose we add a new DruidType 
> called {{thetaSketch}} and then add logic in the {{getJsonAggregation}} 
> method in class {{DruidQuery}} to generate the {{thetaSketch}} aggregate. 
> This will require accessing information about the columns (what data type 
> they are) so that the thetaSketch aggregate is only produced if the column's 
> type is {{thetaSketch}}. 
> Also, I've noticed that a {{hyperUnique}} DruidType is currently defined, but 
> a {{hyperUnique}} aggregate is never produced. Since both are approximate 
> aggregators, I could also couple in the logic for {{hyperUnique}}.
> I'd love to hear your thoughts on my approach, and any suggestions you have 
> for this feature.
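
As a rough illustration of the proposal in the description above, the type-driven choice of aggregator could look something like the following. This is a language-neutral sketch in Python, not the adapter's Java code: the function name `json_aggregation` and its parameters are hypothetical stand-ins for the logic proposed for {{getJsonAggregation}}, and only the JSON shapes of the {{thetaSketch}} and {{hyperUnique}} aggregators are taken from Druid.

```python
def json_aggregation(column_type, name, field_name):
    """Pick an approximate-count aggregator from the column's Druid type.

    Hypothetical helper: returns None when the column is not a sketch
    column, so the caller can fall back to its existing handling.
    """
    if column_type in ("thetaSketch", "hyperUnique"):
        # Both aggregators share the same JSON shape and differ only in type.
        return {"type": column_type, "name": name, "fieldName": field_name}
    return None


theta = json_aggregation("thetaSketch", "EXPR$0", "user_unique")
hyper = json_aggregation("hyperUnique", "EXPR$0", "user_unique")
other = json_aggregation("STRING", "EXPR$0", "store_city")
```

Returning None for every other type keeps the sketch from guessing how the adapter handles ordinary columns.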



