[Question] Obtaining aliases for group columns in aggregate (especially after AGGREGATE_PROJECT_MERGE)

2021-06-14 Thread Sharma, Ishan
Hi folks,
There is an API to get the column names for the aggregate columns
(*getNamedAggCalls*), but I can't seem to find a similar API for group columns.
Is there a way to obtain these from *LogicalAggregate*?

Currently, I am optimizing the logical plan into a physical plan (in
bindable convention) and then converting it back to a plan with logical
nodes via *ToLogicalConverter*. As a consequence of the
*AGGREGATE_PROJECT_MERGE* rule, the Project gets removed, and hence the
information about the column aliases for the group columns in the aggregate
no longer seems to be available. I guess I am missing something minor.
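A minimal sketch of what such a lookup could look like, assuming the group-key ordinals point into the input row type the way Aggregate#getGroupSet() does in Calcite. Plain-Java stand-ins are used here; the real lookup would presumably index agg.getInput().getRowType().getFieldNames() by agg.getGroupSet().asList():

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupKeyNames {
    /**
     * Maps group-key ordinals to field names, mirroring what indexing
     * agg.getInput().getRowType().getFieldNames() by
     * agg.getGroupSet().asList() would return in Calcite.
     */
    static List<String> groupFieldNames(List<String> inputFieldNames,
                                        List<Integer> groupKeyOrdinals) {
        List<String> names = new ArrayList<>();
        for (int ordinal : groupKeyOrdinals) {
            names.add(inputFieldNames.get(ordinal));
        }
        return names;
    }

    public static void main(String[] args) {
        // Input row type of the aggregate: (empid, deptno, name, salary)
        List<String> input = Arrays.asList("empid", "deptno", "name", "salary");
        // GROUP BY deptno, name -> groupSet = {1, 2}
        System.out.println(groupFieldNames(input, Arrays.asList(1, 2)));
    }
}
```

Note the caveat: once AGGREGATE_PROJECT_MERGE has fired, any aliases introduced by the removed Project are gone, so the names recovered this way are those of the merged input, not of the original projection.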

Thanks,
Ishan


Re: Trait propagation guidelines

2021-06-14 Thread Haisheng Yuan
> The observation is that parent operators sometimes do not know the exact
> traits they will have for the given child traits. Several examples:
> 1. PhysicalAggregate1 may request either HASH[b,c] or HASH[c,b]. Contrary to
> the default Apache Calcite implementation, in many systems these are two
> different distributions - which one should I request? To make things worse,
> some operators have strict requirements on the order (Join, Union), whilst
> others do not care about the order (Aggregate, Window).
Take select ... group by b,c,a,g,z,d; if you have a StreamAgg in a
non-distributed system, what collation(s) do you request?
Just request either one. I already stated this in the email [1], but it seems
you missed the fifth paragraph.

> 2. In some systems, the distribution may also define the distribution
> function, e.g., the number of shards. A UNION DISTINCT of two tables with
> the same sharding key but different numbers of shards must yield an
> exchange. The parent operator cannot know the number of shards of the input
> in advance, and cannot define the proper trait set in the "passThrough"
> method.
The parent operator doesn't need to know what number of shards to request; just
request a hash distribution with shard number 0, or -1, or whatever value
indicates that the shard number is not decided yet. Later, the child operator
will tell the parent operator the exact distribution through "derive".

In Alibaba MaxCompute, we have a customized hash distribution, which contains
the number of buckets, the hash function, and the null collation; we also
support range distribution, which contains the range bucket boundaries. All of
these work under the current framework. With all that being said, distribution
is nothing more special than collation; it all depends on whether you design
the operator's "passThrough" and "derive" strategies correctly.
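The sentinel idea above can be sketched as follows. The Distribution class here is a hypothetical stand-in (in Calcite this would be a RelDistribution-style trait flowing through the "passThrough" and "derive" calls); the point is only that the request carries an "undecided" shard count that the child later resolves:

```java
import java.util.Arrays;
import java.util.List;

public class ShardSentinel {
    /** Hypothetical hash-distribution trait: hash keys plus a shard count. */
    static final class Distribution {
        static final int SHARDS_UNDECIDED = -1; // sentinel: shard number not decided yet
        final List<String> keys;
        final int shards;
        Distribution(List<String> keys, int shards) {
            this.keys = keys;
            this.shards = shards;
        }
    }

    /** The parent's "passThrough" request: hash keys known, shard count left open. */
    static Distribution request(List<String> keys) {
        return new Distribution(keys, Distribution.SHARDS_UNDECIDED);
    }

    /** The child's "derive" answer: replace the sentinel with the actual shard count. */
    static Distribution derive(Distribution requested, int actualChildShards) {
        if (requested.shards == Distribution.SHARDS_UNDECIDED) {
            return new Distribution(requested.keys, actualChildShards);
        }
        return requested; // already concrete: keep as requested
    }

    public static void main(String[] args) {
        Distribution req = request(Arrays.asList("b", "c"));
        Distribution actual = derive(req, 32);
        System.out.println(actual.keys + " shards=" + actual.shards);
    }
}
```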

[1] 
https://lists.apache.org/thread.html/r36b25cbe4ca05fb1262c432ad9103f4126b654698481fca0d2a01fe7%40%3Cdev.calcite.apache.org%3E

Thanks,
Haisheng Yuan

On 2021/06/14 08:26:31, Vladimir Ozerov  wrote: 
> Hi Haisheng,
> 
> The observation is that parent operators sometimes do not know the exact
> traits they will have for the given child traits. Several examples:
> 1. PhysicalAggregate1 may request both HASH[b,c] or HASH[c,b]. Contrary to
> the default Apache Calcite implementation, in many systems, these are two
> different distributions - which one should I request? To make things worse,
> some operators may have strict requirements to the order (Join, Union),
> whilst others do not care about the order (Aggregate, Window).
> 2. In some systems, the distribution may also define the distribution
> function, e.g., a number of shards. A UNION DISTINCT of two tables with the
> same sharding key, but the different numbers of shards must yield an
> exchange. The parent operator cannot know the number of shards of the input
> in advance and cannot define the proper trait set in the "passThrough"
> method.
> 
> We will miss the optimization opportunity in all these cases unless we can
> clarify the real traits in the "derive" phase. But to do this, we need to
> know the original optimization request.
> 
> Regards,
> Vladimir.
> 
> 
> вс, 13 июн. 2021 г. в 22:17, Haisheng Yuan :
> 
> > How does it relate with "derive" to get the desired plan?
> >
> > Initially PhysicalAggregate1 requests HASH[b,c], PhysicalAggregate2
> > requests HASH[a,b,c]. PhysicalAggregate2 is called on "passthrough" by
> > passing HASH[b,c], then generate another PhysicalAggregate2 with trait
> > HASH[b,c]. You don't need the involvement of "derive".
> >
> > Haisheng Yuan
> >
> > On 2021/06/13 16:58:53, Vladimir Ozerov  wrote:
> > > Hi,
> > >
> > > I tried to apply different approaches, but eventually, I failed to
> > achieve
> > > my goals. It seems that the current implementation cannot handle the
> > > required scenario, as explained below.
> > >
> > > Consider the following tree:
> > > LogicalAggregate1[group=[b,c]]
> > >   LogicalAggregate2[group=[a,b,c]]
> > > LogicalInput
> > >
> > > I want to find the plan to do these two aggregations without an exchange
> > in
> > > between because they may have compatible distributions. Example:
> > > PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
> > >   PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
> > >     Exchange // SHARDED[b,c]
> > >   PhysicalInput // SHARDED[?]
> > >
> > > The fundamental problem is that it is impossible to save the optimization
> > > request and resolve traits in the "derive" phase afterward. What we need
> > is
> > > to send the optimization request "SHARDED by [b,c] in any order" to
> > > PhysicalAggregate2, and use it in the derive phase so that the new
> > > PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without
> > > [a]. Unfortunately, this doesn't work because the nodes emitted from the
> > > pass-through do not participate in the "derive" phase.
> > >
> > > This could be fixed with a trivial 

Equals method of AggregateCall

2021-06-14 Thread 王腾飞(飞腾)

Hi All,

I'm working on a query failure in Apache Drill that is related to Calcite.
Does anyone know why the "RelDataType type" field is not included in the
equals method of org.apache.calcite.rel.core.AggregateCall? Is this a bug, or
is it on purpose?
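For illustration only, a toy stand-in (not the actual AggregateCall source) showing the consequence of leaving a field out of equals: two calls that differ only in that field compare equal, which can make a planner treat them as interchangeable:

```java
import java.util.List;
import java.util.Objects;

public class CallEquality {
    /** Toy model of an aggregate call whose equals() ignores the return type. */
    static final class Call {
        final String aggFunction;
        final List<Integer> argOrdinals;
        final String returnType; // not part of equals, like RelDataType in the question

        Call(String aggFunction, List<Integer> argOrdinals, String returnType) {
            this.aggFunction = aggFunction;
            this.argOrdinals = argOrdinals;
            this.returnType = returnType;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Call)) {
                return false;
            }
            Call that = (Call) o;
            // returnType deliberately omitted, mirroring the reported behavior
            return aggFunction.equals(that.aggFunction)
                && argOrdinals.equals(that.argOrdinals);
        }

        @Override public int hashCode() {
            return Objects.hash(aggFunction, argOrdinals);
        }
    }

    public static void main(String[] args) {
        Call a = new Call("SUM", List.of(3), "BIGINT");
        Call b = new Call("SUM", List.of(3), "DECIMAL(38,2)");
        // true: the differing return type is invisible to equals
        System.out.println(a.equals(b));
    }
}
```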


I appreciate your response.

Cassandra with Calcite integration problem

2021-06-14 Thread santanu mohanty
Hello Team,
We have integrated Cassandra with Calcite following your given example, and we
are seeing the problem below:

Compared to running in the Spark Thrift Server, it is pretty slow.

Can you suggest a good example Git repository, or tuning parameters for the
driver configuration? We are planning to integrate this in production but are
not sure about the performance.

We have not found many resources to experiment with further.
Thanks,
santanu


Re: Optimizer (Hep) NOT scanning RelNodes of Subqueries

2021-06-14 Thread Stamatis Zampetakis
Hi Krishnakant,

Why do you need to maintain the structure of RexSubQuery?
Most of the systems relying on Calcite use SubQueryRemoveRule, so I am curious
why you cannot use it in your system.

Best,
Stamatis

On Mon, Jun 14, 2021 at 10:44 AM Krishnakant Agrawal 
wrote:

> Haisheng, no, I included just 4-5 rules, which I don't think tinker with
> subqueries.
> I found SubQueryRemoveRule with three configs.
> Are there any others too? Please let us know.
> I don't want to include this rule because I want to maintain these
> structures post-optimization too.
>
> Julian, Yes I was referring to scalar subqueries.
> The subqueries if they are in a Rex Form are not picked up for optimization
> by default.
> If someone from the core team wants to expand on this please do, I might be
> wrong.
> I support the above statement because the Rel Inside the RexSubQuery was
> NOT optimized.
> But with the same set of rules, it got optimized when sent to HepPlanner as
> the main rel.
>
> As of now, I will be writing a new Rule which basically does exactly what
> Julian said of scanning Rexes for RexSubquery
> and applying rules on them.
>
> This rule will need to be added to the Rule List so it's not enabled by
> default.
> Please let me know a more graceful/optimized way to do this if any.
>
> Sample RexSubquery(IN operator):-
>
> LogicalFilter(condition=[IN($8, {
> LogicalProject(EXPR$0=[$8])
>   LogicalFilter(condition=[=($7, 1)])
> LogicalTableScan(table=[[foodmart, employee]])
> })])
>   LogicalTableScan(table=[[foodmart, employee]])
>
>
> On Thu, Jun 10, 2021 at 12:43 AM Julian Hyde 
> wrote:
>
> > I presume we’re talking about scalar sub-queries (e.g. sub-query in
> > SELECT) and sub-queries with IN and EXISTS, as opposed to sub-queries in
> > the FROM clause. The former are represented using RexSubQuery because the
> > queries appear in scalar expressions.
> >
> > There are rules to expand RexSubQuery into relational operators. After
> > those rules have been applied, then I’m sure that the Hep engine will be
> > able to find them. I think that’s what Haisheng is referring to.
> >
> > But while those sub-queries are still wrapped in RexSubQuery, it’s
> > possible that the Hep engine won’t find them. Because it will have to
> walk
> > over every RexNode inside a Project, Filter or Join and if that RexNode
> is
> > a RexSubQuery, traverse into it to apply the rules to the RelNodes
> inside.
> > I don’t know whether that happens today. Someone should check. It’s a
> > reasonable ask (but there might be unforeseen consequences if we enable
> it
> > by default).
> >
> > Julian
> >
> >
> >
> > > On Jun 9, 2021, at 12:02 PM, Haisheng Yuan  wrote:
> > >
> > > Did you include the subquery related rules in the HepPlanner?
> > >
> > > Haisheng
> > >
> > > On 2021/06/09 17:59:44, Krishnakant Agrawal 
> > wrote:
> > >> Hi All,
> > >>
> > >> When running a HepOptimizer on top of a RelNode which has a subquery
> > >> embedded in it, The Optimizer does not take the RelNode representing
> the
> > >> subquery up for optimization.
> > >>
> > >> Is this by design,  what's the correct way for the Subquery RelNode to
> > be
> > >> picked up for Optimization?
> > >>
> > >> If this is a bug, what is the conceptually correct way to go about
> > fixing
> > >> this?
> > >>
> > >> Thanks,
> > >> KK
> > >>
> >
> >
>


Re: Optimizer (Hep) NOT scanning RelNodes of Subqueries

2021-06-14 Thread Krishnakant Agrawal
Haisheng, no, I included just 4-5 rules, which I don't think tinker with
subqueries.
I found SubQueryRemoveRule with three configs.
Are there any others too? Please let us know.
I don't want to include this rule because I want to maintain these
structures post-optimization too.

Julian, yes, I was referring to scalar subqueries.
Subqueries in Rex form are not picked up for optimization by default.
If someone from the core team wants to expand on this, please do; I might be
wrong. I base this statement on the fact that the Rel inside the RexSubQuery
was NOT optimized, but with the same set of rules it got optimized when sent
to HepPlanner as the main rel.

As of now, I will be writing a new rule which does exactly what Julian
described: scanning RexNodes for RexSubQuery and applying rules to the
RelNodes inside.

This rule will need to be added to the rule list, so it is not enabled by
default. Please let me know a more graceful/optimized way to do this, if any.

Sample RexSubQuery (IN operator):

LogicalFilter(condition=[IN($8, {
LogicalProject(EXPR$0=[$8])
  LogicalFilter(condition=[=($7, 1)])
LogicalTableScan(table=[[foodmart, employee]])
})])
  LogicalTableScan(table=[[foodmart, employee]])
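The recursion sketched above could look roughly like this, with Plan as a hypothetical stand-in for RelNode and its embedded sub-plans standing in for the rel inside a RexSubQuery. A real implementation would presumably be a RexShuttle whose visitSubQuery runs the inner RelNode through the HepPlanner and rebuilds the RexSubQuery around the optimized rel:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class SubqueryRecursion {
    /** Hypothetical stand-in for a RelNode whose expressions embed sub-plans. */
    static final class Plan {
        final String name;
        final List<Plan> embeddedSubPlans; // like RexSubQuery.rel in a Filter condition
        Plan(String name, List<Plan> embeddedSubPlans) {
            this.name = name;
            this.embeddedSubPlans = embeddedSubPlans;
        }
    }

    /**
     * Applies an optimizer to every embedded sub-plan, depth-first, and
     * rebuilds the node; this mirrors visiting each RexSubQuery, optimizing
     * its inner rel, and cloning the expression with the optimized rel.
     */
    static Plan optimizeSubPlans(Plan plan, UnaryOperator<Plan> optimizer) {
        List<Plan> rewritten = new ArrayList<>();
        for (Plan sub : plan.embeddedSubPlans) {
            rewritten.add(optimizer.apply(optimizeSubPlans(sub, optimizer)));
        }
        return new Plan(plan.name, rewritten);
    }

    public static void main(String[] args) {
        Plan inner = new Plan("LogicalProject", new ArrayList<>());
        Plan outer = new Plan("LogicalFilter", Arrays.asList(inner));
        // The "optimizer" here just tags each sub-plan to show it was visited.
        Plan result = optimizeSubPlans(outer,
            p -> new Plan("Optimized:" + p.name, p.embeddedSubPlans));
        System.out.println(result.embeddedSubPlans.get(0).name);
    }
}
```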


On Thu, Jun 10, 2021 at 12:43 AM Julian Hyde  wrote:

> I presume we’re talking about scalar sub-queries (e.g. sub-query in
> SELECT) and sub-queries with IN and EXISTS, as opposed to sub-queries in
> the FROM clause. The former are represented using RexSubQuery because the
> queries appear in scalar expressions.
>
> There are rules to expand RexSubQuery into relational operators. After
> those rules have been applied, then I’m sure that the Hep engine will be
> able to find them. I think that’s what Haisheng is referring to.
>
> But while those sub-queries are still wrapped in RexSubQuery, it’s
> possible that the Hep engine won’t find them. Because it will have to walk
> over every RexNode inside a Project, Filter or Join and if that RexNode is
> a RexSubQuery, traverse into it to apply the rules to the RelNodes inside.
> I don’t know whether that happens today. Someone should check. It’s a
> reasonable ask (but there might be unforeseen consequences if we enable it
> by default).
>
> Julian
>
>
>
> > On Jun 9, 2021, at 12:02 PM, Haisheng Yuan  wrote:
> >
> > Did you include the subquery related rules in the HepPlanner?
> >
> > Haisheng
> >
> > On 2021/06/09 17:59:44, Krishnakant Agrawal 
> wrote:
> >> Hi All,
> >>
> >> When running a HepOptimizer on top of a RelNode which has a subquery
> >> embedded in it, The Optimizer does not take the RelNode representing the
> >> subquery up for optimization.
> >>
> >> Is this by design,  what's the correct way for the Subquery RelNode to
> be
> >> picked up for Optimization?
> >>
> >> If this is a bug, what is the conceptually correct way to go about
> fixing
> >> this?
> >>
> >> Thanks,
> >> KK
> >>
>
>


Re: Trait propagation guidelines

2021-06-14 Thread Vladimir Ozerov
Hi Haisheng,

The observation is that parent operators sometimes do not know the exact
traits they will have for the given child traits. Several examples:
1. PhysicalAggregate1 may request either HASH[b,c] or HASH[c,b]. Contrary to
the default Apache Calcite implementation, in many systems these are two
different distributions - which one should I request? To make things worse,
some operators have strict requirements on the order (Join, Union), whilst
others do not care about the order (Aggregate, Window).
2. In some systems, the distribution may also define the distribution
function, e.g., the number of shards. A UNION DISTINCT of two tables with the
same sharding key but different numbers of shards must yield an exchange.
The parent operator cannot know the number of shards of the input in advance,
and cannot define the proper trait set in the "passThrough" method.

We will miss the optimization opportunity in all these cases unless we can
clarify the real traits in the "derive" phase. But to do this, we need to
know the original optimization request.

Regards,
Vladimir.


вс, 13 июн. 2021 г. в 22:17, Haisheng Yuan :

> How does it relate with "derive" to get the desired plan?
>
> Initially, PhysicalAggregate1 requests HASH[b,c] and PhysicalAggregate2
> requests HASH[a,b,c]. PhysicalAggregate2 is called on "passThrough" with
> HASH[b,c], which generates another PhysicalAggregate2 with trait
> HASH[b,c]. You don't need the involvement of "derive".
>
> Haisheng Yuan
>
> On 2021/06/13 16:58:53, Vladimir Ozerov  wrote:
> > Hi,
> >
> > I tried to apply different approaches, but eventually, I failed to
> achieve
> > my goals. It seems that the current implementation cannot handle the
> > required scenario, as explained below.
> >
> > Consider the following tree:
> > LogicalAggregate1[group=[b,c]]
> >   LogicalAggregate2[group=[a,b,c]]
> > LogicalInput
> >
> > I want to find the plan to do these two aggregations without an exchange
> in
> > between because they may have compatible distributions. Example:
> > PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
> >   PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
> >     Exchange // SHARDED[b,c]
> >   PhysicalInput // SHARDED[?]
> >
> > The fundamental problem is that it is impossible to save the optimization
> > request and resolve traits in the "derive" phase afterward. What we need
> is
> > to send the optimization request "SHARDED by [b,c] in any order" to
> > PhysicalAggregate2, and use it in the derive phase so that the new
> > PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without
> > [a]. Unfortunately, this doesn't work because the nodes emitted from the
> > pass-through do not participate in the "derive" phase.
> >
> > This could be fixed with a trivial change - to allow certain nodes
> emitted
> > from the "passThrough" to participate in "derive". We can do that using a
> > marker interface or an extension to a PhysicalRel interface. For example:
> > interface PhysicalRel {
> > boolean enforceDerive();
> > }
> >
> > When set to "true", the node would not be added to the pass-through
> cache.
> > This way, we may use this node as *storage* for the optimization request.
> > When the "derive" is called later, we know both the parent requirements
> and
> > the child traits. This would be sufficient to solve my problem. I already
> > tried to do this by disabling the pass-through cache completely and
> > confirmed that the required plan is found.
> >
> > Do you have any objections to such a change?
> >
> > Regards,
> > Vladimir.
> >
> > сб, 29 мая 2021 г. в 11:59, Vladimir Ozerov :
> >
> > > Hi Haisheng, Jinpeng
> > >
> > > I think we are more or less on the same page:
> > >
> > >1. The current implementation of Apache Calcite may generate
> wasteful
> > >alternatives because rules lack the optimization context.
> > >2. But the actual impact on efficiency is not clear.
> > >
> > > The (2) is essential to understand whether my efforts make any
> practical
> > > sense. And so far, I have only a vague common sense and some simple
> > > examples in mind, which is not sufficient to make any claims.
> > >
> > > Nevertheless, I've checked the source code of the original Columbia
> > > optimizer. I was wrong in my original claim that Columbia doesn't pass
> > > optimization context to rules. It does [1]. The context consists of
> > > required traits and cost budget. In Apache Calcite terms, the context
> is
> > > passed to both "RelRule.matches" and "RelRule.onMatch", so that the
> rule
> > > may decide on the optimization strategy based on parent request. This
> is
> > > exactly what I was trying to achieve in my system with some hacks
> around
> > > derive/passThrough.
> > >
> > > Regarding the example with join, my proposal is not likely to make any
> > > difference because the tables are not co-located on the join key, and
> hence
> > > join may emit several distributions.