[jira] [Commented] (CALCITE-1803) Add post aggregation support in Druid to optimize druid queries.

2017-05-24 Thread Julian Hyde (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023282#comment-16023282
 ] 

Julian Hyde commented on CALCITE-1803:
--

We already have a JIRA case for HAVING (what I guess you Druid folks would call 
a post-aggregation filter): CALCITE-1206.

I am going to re-title this case to just focus on post-aggregation project.

> Add post aggregation support in Druid to optimize druid queries.
>
> Key: CALCITE-1803
> URL: https://issues.apache.org/jira/browse/CALCITE-1803
> Project: Calcite
> Issue Type: New Feature
> Components: druid
> Affects Versions: 1.11.0
> Reporter: Junxian Wu
> Assignee: Julian Hyde
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Druid post-aggregations are not supported when parsing SQL queries. By 
> implementing post-aggregations, we can offload some computation to the Druid 
> cluster rather than computing it on the client side.
> Example usage:
> {{SELECT SUM("column1") - SUM("column2") FROM "table";}}
> Under the current rules, this query is translated into two separate Druid 
> aggregations, and the results are then subtracted in Calcite. By using the 
> {{postAggregations}} field in the Druid query, the subtraction could be done 
> in the Druid cluster. Although this example is simple, the difference 
> becomes significant when the number of result rows is large. (Multiple 
> result rows occur when GROUP BY is used.)
> Questions:
> After I push the post-aggregation into the Druid query, what should I change 
> in the Project relational expression? In the example above, the 
> {{BindableProject}} holds the expression that represents the 
> subtraction. If I push the post-aggregation into the Druid query, that 
> subtraction expression should be replaced by a reference to the 
> post-aggregation result. For now, it seems the project expression can only 
> point to aggregation results. Since post-aggregations also have to point to 
> aggregation results, they cannot be placed at the same level as the 
> aggregations. Where should I put post-aggregations?
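
For readers unfamiliar with the {{postAggregations}} field, here is a rough sketch of what a Druid native query for the example above might look like, built as a Python dict. The query type (timeseries), the interval, and the output names sum1, sum2 and diff are assumptions chosen for illustration; they do not come from the issue or from what Calcite generates.

```python
import json


def build_timeseries_query(data_source, interval):
    """Build a Druid timeseries query that subtracts two sums inside Druid.

    The output names sum1, sum2 and diff are hypothetical.
    """
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        # Two ordinary aggregations, one per SUM(...) in the SQL.
        "aggregations": [
            {"type": "doubleSum", "name": "sum1", "fieldName": "column1"},
            {"type": "doubleSum", "name": "sum2", "fieldName": "column2"},
        ],
        # The subtraction, expressed as an arithmetic post-aggregator that
        # refers to the aggregation outputs by name.
        "postAggregations": [
            {
                "type": "arithmetic",
                "name": "diff",
                "fn": "-",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "sum1"},
                    {"type": "fieldAccess", "fieldName": "sum2"},
                ],
            }
        ],
    }


# Interval chosen arbitrarily for the example.
query = build_timeseries_query("table", "1900-01-01/2100-01-01")
print(json.dumps(query, indent=2))
```

Note that the post-aggregator refers to the aggregations by their output names, which is the layering the question above is about: the post-aggregation sits one level above the aggregations rather than beside them.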



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CALCITE-1803) Add post aggregation support in Druid to optimize druid queries.

2017-05-24 Thread slim bouguerra (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023070#comment-16023070
 ] 

slim bouguerra commented on CALCITE-1803:
-

[~axeisghost] Druid supports the equivalent of SQL HAVING, as per the 
[doc|http://druid.io/docs/latest/querying/having.html].
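
As an illustration of the having spec mentioned above, here is a minimal sketch of a Druid groupBy query carrying a numeric "greaterThan" having clause, roughly corresponding to {{HAVING SUM("column1") > 100}}. The dimension name dim1, the output name sum1, the interval and the threshold are all hypothetical.

```python
import json


def build_groupby_with_having(data_source, interval, dimension, threshold):
    """Build a Druid groupBy query that filters groups after aggregation.

    The output name sum1 is hypothetical; the having spec refers to it.
    """
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "dimensions": [dimension],
        "aggregations": [
            {"type": "doubleSum", "name": "sum1", "fieldName": "column1"},
        ],
        # Keep only groups whose aggregated value exceeds the threshold.
        "having": {
            "type": "greaterThan",
            "aggregation": "sum1",
            "value": threshold,
        },
    }


having_query = build_groupby_with_having(
    "table", "1900-01-01/2100-01-01", "dim1", 100)
print(json.dumps(having_query, indent=2))
```

The having clause is part of the groupBy query type, so this sketch only applies when the query has a GROUP BY.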



[jira] [Commented] (CALCITE-1803) Add post aggregation support in Druid to optimize druid queries.

2017-05-23 Thread Julian Hyde (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021935#comment-16021935
 ] 

Julian Hyde commented on CALCITE-1803:
--

Somewhat off-topic, but I'll note that we're being held back by Druid's query 
language. Druid can (I assume) implement a pipeline of (project, filter, 
aggregate, sort, limit) relational operators, and Calcite can represent that, 
but Druid's query language can't express it.

MongoDB added an [aggregation 
pipeline|https://docs.mongodb.com/manual/core/aggregation-pipeline/] capability 
a while back, and it was really useful.



[jira] [Commented] (CALCITE-1803) Add post aggregation support in Druid to optimize druid queries.

2017-05-23 Thread Junxian Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021547#comment-16021547
 ] 

Junxian Wu commented on CALCITE-1803:
-

Yes, "Post Aggregation" means doing something after aggregation. Here, I am 
referring to the "post-aggregator" in a Druid query 
(http://druid.io/docs/0.10.0/querying/post-aggregations.html). By 
implementing this, Calcite can push more work into the Druid query. 
The post-aggregation in Druid is similar to the idea of "Projects after 
aggregation" you mentioned in Calcite, but it also has consequences when I try 
to implement it.

If I change the DruidProjectRule and allow the project to be pushed into the 
Druid query as a post-aggregator, the result of the Druid query will contain 
the result of the project (the minus operation). Then the project should point 
to the result of the post-aggregator in the Druid query, but for now, it seems 
the project can only point to the result of an aggregator in a Druid query.

For your questions:
1. Filtering cannot be done after aggregation, because the "fields" in a Druid 
filter cannot refer to the aggregated columns.
2. The "ordering" field should be how sort is implemented in Druid, and both 
aggregators and post-aggregators have an "ordering" field. I think the sort 
could be done in post-aggregation, because that is what the "ordering" field 
in a post-aggregator does.



[jira] [Commented] (CALCITE-1803) Add post aggregation support in Druid to optimize druid queries.

2017-05-22 Thread Julian Hyde (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020492#comment-16020492
 ] 

Julian Hyde commented on CALCITE-1803:
--

Do you mind if we change the terminology? "Post aggregation" suggests 
aggregation that happens after something. But I think you mean 
"Post-aggregation projects". Or in simpler English, "Projects after 
aggregation".

To answer your question: You will need to have a DruidQuery that contains a 
Scan followed by an Aggregate followed by a Project.

Currently DruidProjectRule will not allow the Project to be pushed in, because 
"sap" (scan, aggregate, project) is not a valid signature according to 
DruidQuery.VALID_SIG. But you should make it valid.
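
To make the signature idea concrete, here is a small sketch of how a signature string could be checked against a pattern, and how extending the pattern makes "sap" valid. Both patterns below are hypothetical and are not copied from DruidQuery.VALID_SIG; the letter f for filter is also an assumption, since only s, a and p are named above.

```python
import re

# Hypothetical patterns for illustration only; not taken from Calcite.
# s = scan, f = filter (assumed), p = project, a = aggregate.
CURRENT_SIG = re.compile(r"sf?p?a?")        # project allowed only before aggregate
EXTENDED_SIG = re.compile(r"sf?p?(ap?)?")   # also allows a project after aggregate


def is_valid(pattern, signature):
    """Return True if the whole signature string matches the pattern."""
    return pattern.fullmatch(signature) is not None


for sig in ["s", "spa", "sap", "sfpap"]:
    print(sig, is_valid(CURRENT_SIG, sig), is_valid(EXTENDED_SIG, sig))
```

Under the first pattern "sap" is rejected because nothing may follow the aggregate; under the second it is accepted, which is the kind of change suggested above.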

I'm curious:
* Does Druid allow filters after aggregation? (I.e. HAVING)
* I know that Druid allows sort after aggregation. But is this before or after 
the post-aggregation projects?
