[Question] Obtaining aliases for group columns in aggregate (especially after AGGREGATE_PROJECT_MERGE)
Hi folks, There exists an API to get the column names for the aggregate calls (*getNamedAggCalls*), but I can't seem to find a similar API for group columns. Is there a way to obtain the same from *LogicalAggregate*? Currently, I am optimizing the logical plan into a physical plan (in bindable convention) and then converting it back to a plan with logical nodes via *ToLogicalConverter*. As a consequence of the *AGGREGATE_PROJECT_MERGE* rule, the project gets removed, and hence the information about the column aliases for the group columns in the aggregate does not seem to be available. I guess I am missing something minor. Thanks, Ishan
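For what it's worth, Calcite's convention is that the first groupCount fields of an Aggregate's output row type are the group keys, in ordinal order, named after the corresponding input fields, so those names can be read back from the node's row type even without a Project on top (although any aliases that the merged Project had introduced are lost with it). A self-contained sketch of that ordinal-to-name mapping in plain Java, with the RelNode APIs themselves elided:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupKeyNames {
    // Mirrors Calcite's convention: an Aggregate's output row type starts
    // with the group keys, in ordinal order, named after the corresponding
    // input fields; the aggregate calls follow.
    static List<String> groupFieldNames(List<String> inputFieldNames,
                                        List<Integer> groupKeyOrdinals) {
        List<String> names = new ArrayList<>();
        for (int ordinal : groupKeyOrdinals) {
            names.add(inputFieldNames.get(ordinal));
        }
        return names;
    }

    public static void main(String[] args) {
        // Input row (a, b, c, d), GROUP BY b, c.
        System.out.println(groupFieldNames(
            List.of("a", "b", "c", "d"), List.of(1, 2))); // prints [b, c]
    }
}
```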
Re: Trait propagation guidelines
> The observation is that parent operators sometimes do not know the exact traits they will have for the given child traits. Several examples:
> 1. PhysicalAggregate1 may request both HASH[b,c] or HASH[c,b]. Contrary to the default Apache Calcite implementation, in many systems, these are two different distributions - which one should I request? To make things worse, some operators may have strict requirements to the order (Join, Union), whilst others do not care about the order (Aggregate, Window).

Consider "select ... group by b,c,a,g,z,d": if you have a StreamAgg in a non-distributed system, what collation(s) do you request? Just request either one. I already stated this in the email [1], but it seems you missed the 5th paragraph.

> 2. In some systems, the distribution may also define the distribution > function, e.g., a number of shards. A UNION DISTINCT of two tables with the > same sharding key, but the different numbers of shards must yield an > exchange. The parent operator cannot know the number of shards of the input > in advance and cannot define the proper trait set in the "passThrough" > method.

The parent operator doesn't need to know what number of shards to request; just request a hash distribution with shard number 0 or -1 or whatever, to indicate that the shard number is not decided yet. Later, the child operator will tell the parent operator the exact distribution through "derive". In Alibaba MaxCompute, we have a customized hash distribution, which contains the number of buckets, the hash function, and the null collation; we also support range distribution, which contains the range bucket boundaries. All of these work under the current framework. With all that being said, distribution is no different from collation; it all depends on whether you design the operator's "passThrough" and "derive" strategies correctly.
[1] https://lists.apache.org/thread.html/r36b25cbe4ca05fb1262c432ad9103f4126b654698481fca0d2a01fe7%40%3Cdev.calcite.apache.org%3E Thanks, Haisheng Yuan On 2021/06/14 08:26:31, Vladimir Ozerov wrote: > Hi Haisheng, > > The observation is that parent operators sometimes do not know the exact > traits they will have for the given child traits. Several examples: > 1. PhysicalAggregate1 may request both HASH[b,c] or HASH[c,b]. Contrary to > the default Apache Calcite implementation, in many systems, these are two > different distributions - which one should I request? To make things worse, > some operators may have strict requirements to the order (Join, Union), > whilst others do not care about the order (Aggregate, Window). > 2. In some systems, the distribution may also define the distribution > function, e.g., a number of shards. A UNION DISTINCT of two tables with the > same sharding key, but the different numbers of shards must yield an > exchange. The parent operator cannot know the number of shards of the input > in advance and cannot define the proper trait set in the "passThrough" > method. > > We will miss the optimization opportunity in all these cases unless we can > clarify the real traits in the "derive" phase. But to do this, we need to > know the original optimization request. > > Regards, > Vladimir. > > > On Sun, Jun 13, 2021 at 22:17, Haisheng Yuan wrote: > > > How does it relate with "derive" to get the desired plan? > > > > Initially PhysicalAggregate1 requests HASH[b,c], PhysicalAggregate2 > > requests HASH[a,b,c]. PhysicalAggregate2 is called on "passThrough" by > > passing HASH[b,c], then generates another PhysicalAggregate2 with trait > > HASH[b,c]. You don't need the involvement of "derive". > > > > Haisheng Yuan > > > > On 2021/06/13 16:58:53, Vladimir Ozerov wrote: > > > Hi, > > > > > > I tried to apply different approaches, but eventually, I failed to > > achieve > > > my goals.
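Haisheng's suggestion above (request a hash distribution with a placeholder shard count and let the child report the concrete one back through "derive") can be sketched with a self-contained analogue. The class, the field names, and the -1 convention are illustrative, not actual Calcite or MaxCompute API:

```java
import java.util.List;

public class ShardedDistribution {
    final List<Integer> keys;
    final int shardCount; // -1 means "not decided yet"; resolved during derive()

    ShardedDistribution(List<Integer> keys, int shardCount) {
        this.keys = keys;
        this.shardCount = shardCount;
    }

    // A child's concrete distribution satisfies a parent's request when the
    // keys match and the request left the shard count open (or it matches).
    boolean satisfies(ShardedDistribution request) {
        return keys.equals(request.keys)
            && (request.shardCount == -1 || request.shardCount == shardCount);
    }

    public static void main(String[] args) {
        // Parent requests HASH[1,2] with an undecided shard count ...
        ShardedDistribution request = new ShardedDistribution(List.of(1, 2), -1);
        // ... and the child later reports a concrete 64-shard distribution.
        ShardedDistribution child = new ShardedDistribution(List.of(1, 2), 64);
        System.out.println(child.satisfies(request)); // prints true
    }
}
```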
It seems that the current implementation cannot handle the > > > required scenario, as explained below. > > > > > > Consider the following tree: > > > LogicalAggregate1[group=[b,c]] > > > LogicalAggregate2[group=[a,b,c]] > > > LogicalInput > > > > > > I want to find the plan to do these two aggregations without an exchange > > in > > > between because they may have compatible distributions. Example: > > > PhysicalAggregate1[group=[b,c]] // SHARDED[b,c] > > > PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c] > > > Exchange // SHARDED[b,c] > > > PhysicalInput // SHARDED[?] > > > > > > The fundamental problem is that it is impossible to save the optimization > > > request and resolve traits in the "derive" phase afterward. What we need > > is > > > to send the optimization request "SHARDED by [b,c] in any order" to > > > PhysicalAggregate2, and use it in the derive phase so that the new > > > PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without > > > [a]. Unfortunately, this doesn't work because the nodes emitted from the > > > pass-through do not participate in the "derive" phase. > > > > > > This could be fixed with a trivial change - to allow certain nodes emitted from the "passThrough" to participate in "derive".
Equals method of AggregateCall
Hi All, I'm working on a query failure in Apache Drill that is related to Calcite. Does anyone know why "RelDataType type" is not included in the equals method of org.apache.calcite.rel.core.AggregateCall? Is this a bug, or is it on purpose? I'd appreciate your response.
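To illustrate the concern, here is a stripped-down analogue (not the real AggregateCall class) showing what it means for equals() to omit the return type: two calls that differ only in type compare equal, which matters anywhere a planner deduplicates calls by equality.

```java
import java.util.Objects;

// A toy aggregate-call key whose equals()/hashCode() deliberately ignore
// the return type, mimicking the behavior asked about in the question.
public class AggCallKey {
    final String function;   // e.g. "SUM"
    final int argOrdinal;    // input column the call aggregates
    final String returnType; // NOT part of equals/hashCode

    AggCallKey(String function, int argOrdinal, String returnType) {
        this.function = function;
        this.argOrdinal = argOrdinal;
        this.returnType = returnType;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof AggCallKey)) return false;
        AggCallKey k = (AggCallKey) o;
        return function.equals(k.function) && argOrdinal == k.argOrdinal;
    }

    @Override public int hashCode() {
        return Objects.hash(function, argOrdinal);
    }

    public static void main(String[] args) {
        AggCallKey a = new AggCallKey("SUM", 0, "INTEGER");
        AggCallKey b = new AggCallKey("SUM", 0, "BIGINT");
        System.out.println(a.equals(b)); // prints true: the type difference is invisible
    }
}
```

Whether that is a bug or intentional in Calcite itself is exactly the question here; the sketch only makes the observable consequence concrete.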
Cassandra with Calcite integration problem
Hello Team, We have integrated Cassandra with Calcite following your example, and we are seeing the following problem: compared to running in the Spark Thrift Server, it is pretty slow. Can you suggest a good example Git repository, or tuning parameters for the driver configuration? We are planning to integrate it in production but are not sure about the performance, and we have not found many resources to experiment with further. Thanks, Santanu
Re: Optimizer (Hep) NOT scanning RelNodes of Subqueries
Hi Krishnakant, Why do you need to maintain the structure of RexSubquery? Most of the systems relying on Calcite do use the SubQueryRemoveRule, so I am curious why you cannot use it in your system. Best, Stamatis On Mon, Jun 14, 2021 at 10:44 AM Krishnakant Agrawal wrote: > Haisheng, No I included just 4-5 rules which I don't think they tinker with > subqueries. > I found SubQueryRemoveRule with three configs. > Are there any others too? Please let us know. > I don't want to include this rule because I want to maintain these > structures post-optimization too. > > Julian, Yes I was referring to scalar subqueries. > The subqueries if they are in a Rex Form are not picked up for optimization > by default. > If someone from the core team wants to expand on this please do, I might be > wrong. > I support the above statement because the Rel Inside the RexSubQuery was > NOT optimized. > But with the same set of rules, it got optimized when sent to HepPlanner as > the main rel. > > As of now, I will be writing a new Rule which basically does exactly what > Julian said of scanning Rexes for RexSubquery > and applying rules on them. > > This rule will need to be added to the Rule List so it's not enabled by > default. > Please let me know a more graceful/optimized way to do this if any. > > Sample RexSubquery(IN operator):- > > LogicalFilter(condition=[IN($8, { > LogicalProject(EXPR$0=[$8]) > LogicalFilter(condition=[=($7, 1)]) > LogicalTableScan(table=[[foodmart, employee]]) > })]) > LogicalTableScan(table=[[foodmart, employee]]) > > > On Thu, Jun 10, 2021 at 12:43 AM Julian Hyde > wrote: > > > I presume we’re talking about scalar sub-queries (e.g. sub-query in > > SELECT) and sub-queries with IN and EXISTS, as opposed to sub-queries in > > the FROM clause. The former are represented using RexSubQuery because the > > queries appear in scalar expressions. > > > > There are rules to expand RexSubQuery into relational operators.
After > > those rules have been applied, then I’m sure that the Hep engine will be > > able to find them. I think that’s what Haisheng is referring to. > > > > But while those sub-queries are still wrapped in RexSubQuery, it’s > > possible that the Hep engine won’t find them. Because it will have to > walk > > over every RexNode inside a Project, Filter or Join and if that RexNode > is > > a RexSubQuery, traverse into it to apply the rules to the RelNodes > inside. > > I don’t know whether that happens today. Someone should check. It’s a > > reasonable ask (but there might be unforeseen consequences if we enable > it > > by default). > > > > Julian > > > > > > > > > On Jun 9, 2021, at 12:02 PM, Haisheng Yuan wrote: > > > > > > Did you include the subquery related rules in the HepPlanner? > > > > > > Haisheng > > > > > > On 2021/06/09 17:59:44, Krishnakant Agrawal > > wrote: > > >> Hi All, > > >> > > >> When running a HepOptimizer on top of a RelNode which has a subquery > > >> embedded in it, The Optimizer does not take the RelNode representing > the > > >> subquery up for optimization. > > >> > > >> Is this by design, what's the correct way for the Subquery RelNode to > > be > > >> picked up for Optimization? > > >> > > >> If this is a bug, what is the conceptually correct way to go about > > fixing > > >> this? > > >> > > >> Thanks, > > >> KK > > >> > > > > >
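The traversal Julian describes, walking every RexNode inside a Project, Filter, or Join and recursing into any RexSubQuery to optimize the RelNode inside, can be sketched with a toy, self-contained model. The Expr/Call/SubQuery classes below are stand-ins, not Calcite types, and the "optimizer" is just a function on strings:

```java
import java.util.List;
import java.util.function.UnaryOperator;

// A toy expression tree standing in for RexNode: ordinary calls plus a
// "sub-query" node that wraps a nested plan (here, just a String).
abstract class Expr {}

class Call extends Expr {
    final List<Expr> operands;
    Call(List<Expr> operands) { this.operands = operands; }
}

class SubQuery extends Expr {
    String plan; // stands in for the RelNode inside a RexSubQuery
    SubQuery(String plan) { this.plan = plan; }
}

public class SubQueryWalker {
    // Walk the expression tree recursively and apply an "optimizer" to
    // every nested plan found inside a sub-query node.
    static void optimizeSubQueries(Expr e, UnaryOperator<String> optimizer) {
        if (e instanceof SubQuery) {
            SubQuery s = (SubQuery) e;
            s.plan = optimizer.apply(s.plan);
        } else if (e instanceof Call) {
            for (Expr operand : ((Call) e).operands) {
                optimizeSubQueries(operand, optimizer);
            }
        }
    }

    public static void main(String[] args) {
        SubQuery sq = new SubQuery("LogicalFilter->LogicalTableScan");
        Expr root = new Call(List.of(sq, new Call(List.of())));
        optimizeSubQueries(root, plan -> "Optimized(" + plan + ")");
        System.out.println(sq.plan); // prints Optimized(LogicalFilter->LogicalTableScan)
    }
}
```

In real Calcite code, the analogous walk would be done with a RexShuttle over the node's expressions; the sketch only shows the recursion shape.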
Re: Optimizer (Hep) NOT scanning RelNodes of Subqueries
Haisheng, No I included just 4-5 rules which I don't think they tinker with subqueries. I found SubQueryRemoveRule with three configs. Are there any others too? Please let us know. I don't want to include this rule because I want to maintain these structures post-optimization too. Julian, Yes I was referring to scalar subqueries. The subqueries if they are in a Rex Form are not picked up for optimization by default. If someone from the core team wants to expand on this please do, I might be wrong. I support the above statement because the Rel Inside the RexSubQuery was NOT optimized. But with the same set of rules, it got optimized when sent to HepPlanner as the main rel. As of now, I will be writing a new Rule which basically does exactly what Julian said of scanning Rexes for RexSubquery and applying rules on them. This rule will need to be added to the Rule List so it's not enabled by default. Please let me know a more graceful/optimized way to do this if any. Sample RexSubquery(IN operator):- LogicalFilter(condition=[IN($8, { LogicalProject(EXPR$0=[$8]) LogicalFilter(condition=[=($7, 1)]) LogicalTableScan(table=[[foodmart, employee]]) })]) LogicalTableScan(table=[[foodmart, employee]]) On Thu, Jun 10, 2021 at 12:43 AM Julian Hyde wrote: > I presume we’re talking about scalar sub-queries (e.g. sub-query in > SELECT) and sub-queries with IN and EXISTS, as opposed to sub-queries in > the FROM clause. The former are represented using RexSubQuery because the > queries appear in scalar expressions. > > There are rules to expand RexSubQuery into relational operators. After > those rules have been applied, then I’m sure that the Hep engine will be > able to find them. I think that’s what Haisheng is referring to. > > But while those sub-queries are still wrapped in RexSubQuery, it’s > possible that the Hep engine won’t find them. 
Because it will have to walk > over every RexNode inside a Project, Filter or Join and if that RexNode is > a RexSubQuery, traverse into it to apply the rules to the RelNodes inside. > I don’t know whether that happens today. Someone should check. It’s a > reasonable ask (but there might be unforeseen consequences if we enable it > by default). > > Julian > > > > > On Jun 9, 2021, at 12:02 PM, Haisheng Yuan wrote: > > > > Did you include the subquery related rules in the HepPlanner? > > > > Haisheng > > > > On 2021/06/09 17:59:44, Krishnakant Agrawal > wrote: > >> Hi All, > >> > >> When running a HepOptimizer on top of a RelNode which has a subquery > >> embedded in it, The Optimizer does not take the RelNode representing the > >> subquery up for optimization. > >> > >> Is this by design, what's the correct way for the Subquery RelNode to > be > >> picked up for Optimization? > >> > >> If this is a bug, what is the conceptually correct way to go about > fixing > >> this? > >> > >> Thanks, > >> KK > >> > >
Re: Trait propagation guidelines
Hi Haisheng, The observation is that parent operators sometimes do not know the exact traits they will have for the given child traits. Several examples: 1. PhysicalAggregate1 may request both HASH[b,c] or HASH[c,b]. Contrary to the default Apache Calcite implementation, in many systems, these are two different distributions - which one should I request? To make things worse, some operators may have strict requirements to the order (Join, Union), whilst others do not care about the order (Aggregate, Window). 2. In some systems, the distribution may also define the distribution function, e.g., a number of shards. A UNION DISTINCT of two tables with the same sharding key, but the different numbers of shards must yield an exchange. The parent operator cannot know the number of shards of the input in advance and cannot define the proper trait set in the "passThrough" method. We will miss the optimization opportunity in all these cases unless we can clarify the real traits in the "derive" phase. But to do this, we need to know the original optimization request. Regards, Vladimir. On Sun, Jun 13, 2021 at 22:17, Haisheng Yuan wrote: > How does it relate with "derive" to get the desired plan? > > Initially PhysicalAggregate1 requests HASH[b,c], PhysicalAggregate2 > requests HASH[a,b,c]. PhysicalAggregate2 is called on "passThrough" by > passing HASH[b,c], then generates another PhysicalAggregate2 with trait > HASH[b,c]. You don't need the involvement of "derive". > > Haisheng Yuan > > On 2021/06/13 16:58:53, Vladimir Ozerov wrote: > > Hi, > > > > I tried to apply different approaches, but eventually, I failed to > achieve > > my goals. It seems that the current implementation cannot handle the > > required scenario, as explained below.
> > > > Consider the following tree: > > LogicalAggregate1[group=[b,c]] > > LogicalAggregate2[group=[a,b,c]] > > LogicalInput > > > > I want to find the plan to do these two aggregations without an exchange > in > > between because they may have compatible distributions. Example: > > PhysicalAggregate1[group=[b,c]] // SHARDED[b,c] > > PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c] > > Exchange // SHARDED[b,c] > > PhysicalInput // SHARDED[?] > > > > The fundamental problem is that it is impossible to save the optimization > > request and resolve traits in the "derive" phase afterward. What we need > is > > to send the optimization request "SHARDED by [b,c] in any order" to > > PhysicalAggregate2, and use it in the derive phase so that the new > > PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without > > [a]. Unfortunately, this doesn't work because the nodes emitted from the > > pass-through do not participate in the "derive" phase. > > > > This could be fixed with a trivial change - to allow certain nodes > emitted > > from the "passThrough" to participate in "derive". We can do that using a > > marker interface or an extension to a PhysicalRel interface. For example: > > interface PhysicalRel { > > boolean enforceDerive(); > > } > > > > When set to "true", the node would not be added to the pass-through > cache. > > This way, we may use this node as *storage* for the optimization request. > > When the "derive" is called later, we know both the parent requirements > and > > the child traits. This would be sufficient to solve my problem. I already > > tried to do this by disabling the pass-through cache completely and > > confirmed that the required plan is found. > > > > Do you have any objections to such a change? > > > > Regards, > > Vladimir. > > > > On Sat, May 29, 2021 at 11:59, Vladimir Ozerov wrote: > > > > > Hi Haisheng, Jinpeng > > > > > > I think we are more or less on the same page: > > > > > >1.
The current implementation of Apache Calcite may generate > wasteful > > >alternatives because rules lack the optimization context. > > >2. But the actual impact on efficiency is not clear. > > > > > > The (2) is essential to understand whether my efforts make any > practical > > > sense. And so far, I have only a vague common sense and some simple > > > examples in mind, which is not sufficient to make any claims. > > > > > > Nevertheless, I've checked the source code of the original Columbia > > > optimizer. I was wrong in my original claim that Columbia doesn't pass > > > optimization context to rules. It does [1]. The context consists of > > > required traits and cost budget. In Apache Calcite terms, the context > is > > > passed to both "RelRule.matches" and "RelRule.onMatch", so that the > rule > > > may decide on the optimization strategy based on parent request. This > is > > > exactly what I was trying to achieve in my system with some hacks > around > > > derive/passThrough. > > > > > > Regarding the example with join, my proposal is not likely to make any > > > difference because the tables are not co-located on the join key, and > hence > > > join may emit several distributions.
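Vladimir's proposed enforceDerive() marker can be sketched with a toy pass-through cache: flagged nodes are kept out of the cache, so they remain visible as storage for the optimization request when "derive" runs later. The names follow the proposal in the thread; nothing here is existing Calcite API:

```java
import java.util.HashMap;
import java.util.Map;

// The marker interface proposed in the thread: a physical node may opt
// out of the pass-through cache so it still participates in "derive".
interface PhysicalRel {
    boolean enforceDerive();
}

public class PassThroughCache {
    private final Map<String, PhysicalRel> cache = new HashMap<>();

    // Returns true if the node was cached (and would therefore be skipped
    // by the later "derive" phase); false if it was kept out of the cache.
    boolean cacheIfAllowed(String traitRequest, PhysicalRel node) {
        if (node.enforceDerive()) {
            return false; // keep the node as storage for the original request
        }
        cache.put(traitRequest, node);
        return true;
    }

    public static void main(String[] args) {
        PassThroughCache c = new PassThroughCache();
        PhysicalRel normal = () -> false; // ordinary node: cached as usual
        PhysicalRel pinned = () -> true;  // flagged node: stays visible to derive
        System.out.println(c.cacheIfAllowed("SHARDED[b,c]", normal)); // prints true
        System.out.println(c.cacheIfAllowed("SHARDED[b,c]", pinned)); // prints false
    }
}
```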