Re: [ANNOUNCE] Danny Chan joins Calcite PMC

2019-11-01 Thread Muhammad Gelbana
Congratulations!

Thanks,
Gelbana


On Fri, Nov 1, 2019 at 9:07 AM Stamatis Zampetakis wrote:

> Congratulations Danny!
>
> You are doing an amazing job. The project and the community are becoming
> better every day and your help is much appreciated.
>
> Keep up the momentum!
>
> Best,
> Stamatis
>
> On Thu, Oct 31, 2019 at 4:41 AM Kurt Young  wrote:
>
> > Congratulations Danny!
> >
> > Best,
> > Kurt
> >
> >
> > > On Thu, Oct 31, 2019 at 11:18 AM Danny Chan wrote:
> >
> > > Thank you so much, colleagues; it’s my honor to work with you!
> > >
> > > I have always felt the respect and harmony of the community. I hope to
> > > contribute more, and I will help as best as I can. Thanks!
> > >
> > > Best,
> > > Danny Chan
> > > On 2019-10-31 at 5:22 AM +0800, Francis Chuang wrote:
> > > > I'm pleased to announce that Danny has accepted an invitation to
> > > > join the Calcite PMC. Danny has been a consistent and helpful
> > > > figure in the Calcite community for which we are very grateful. We
> > > > look forward to the continued contributions and support.
> > > >
> > > > Please join me in congratulating Danny!
> > > >
> > > > - Francis (on behalf of the Calcite PMC)
> > >
> >
>


Re: [DISCUSS] Proposal to add API to force rules matching specific rels

2019-11-01 Thread Xiening Dai
I think Convention as a trait is something special to Calcite (not a
Volcano concept). Calcite uses it for federated queries over heterogeneous
data sources. See Section 4 (Traits) of the Calcite paper [1]; it explains
the idea quite well.

[1] https://arxiv.org/pdf/1802.10233.pdf
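
As a rough illustration (Convention.Impl is real Calcite API; the "MY_ENGINE"
name and the MyRel interface are made up for this sketch), an adapter for one
backend typically pins its own convention like this:

```java
import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;

// Illustrative only: each backend in a federated query gets its own calling
// convention. Converter rules then translate between conventions, so the
// planner can stitch together plan fragments that run on different engines.
public interface MyRel extends RelNode {
  Convention CONVENTION = new Convention.Impl("MY_ENGINE", MyRel.class);
}
```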


> On Nov 1, 2019, at 1:42 AM, Stamatis Zampetakis  wrote:
> 
> @Vladimir: Given that you want to make the optimizer work in a bottom-up
> fashion, maybe it suffices to invert the comparator in the RuleQueue [1].
> For sure there are rules which would not be compatible with this change, but
> maybe for your project it makes sense.
> I never tried it myself, but I would be curious to see if it works.
> 
> @Jinfeng: I guess the reason we have the convention as a trait is
> that results from different systems need to be combined together.
> Given that these results arrive in different forms, we need a way to know
> how to transform one to the other. In some sense the convention trait can
> be thought of as similar to the "assembledness" property mentioned in the
> paper.
> "For query optimization in object-oriented systems, we plan on defining
> "assembledness" of complex objects in memory as a physical property and
> using the assembly operator described in ... as the enforcer for this
> property." [2]
> For sure I am not the best person to talk about this choice, so don't take
> anything that I say as 100% accurate. I am sure other people can provide a
> much better explanation.
> 
> Best,
> Stamatis
> 
> [1]
> https://github.com/apache/calcite/blob/65043f290a7be45a668cf941ab48ee3c30efbe4e/core/src/main/java/org/apache/calcite/plan/volcano/RuleQueue.java#L94
> [2]
> https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Volcano-graefe.pdf
> 
> On Fri, Nov 1, 2019 at 7:53 AM Jinfeng Ni  wrote:
> 
>> I think "pull-up traits" is necessary, since the physical traits are
>> propagated  upwards. However, I'm not fully convinced "On Demand
>> Trait" is the best solution, as I feel it may restrict the choices the
>> planner may consider.  Maybe after the proposed design doc is ready,
>> we may look into the detail and reevaluate.
>> 
>> One question that I have always had ever since I started using Calcite
>> couple of years ago,  is the concept of Convention being a trait, and
>> introduction of AbstractConverter in Calcite's VolcanoPlanner.
>> Quickly looking through the original paper of Volcano [1], or the new
>> one [2], I did not see mentioning of such concepts.  The original
>> paper has a classification of "transformation rules" which operates on
>> logical relation expression  and "implementation rules" which provides
>> the mapping to physical operators.
>> 
>> 
>> 1. https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
>> 2.
>> https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/volcano-prasan.pdf
>> 
>> On Thu, Oct 31, 2019 at 2:40 PM Xiening Dai  wrote:
>>> 
>>> Actually we solved this problem in our setup using a mechanism called
>> “Pull-Up Traits”, which explores the possible trait set of children’s inputs
>> to decide the parent’s physical properties. In order to determine a child’s
>> input trait, you would have to look at the child’s children, and all the way
>> to the leaf nodes or a barrier. A barrier is a rel node which cannot derive
>> any traits regardless of the input. A good example would be a user-defined
>> function which would throw off any distribution or collation. Then we
>> realized just pulling up is not enough; sometimes we would need to look at
>> the parent’s requirement as well. So we try to solve this in a unified
>> framework, which we call “On Demand Trait”, and implement it as part of the
>> framework so anyone can benefit. I hope Haisheng can share a design doc once
>> we have more concrete ideas.
>>> 
>>> 
 On Oct 31, 2019, at 11:37 AM, Jinfeng Ni  wrote:
 
 Hi Vladimir,
 
 The SubsetTransformer interface and the iterating over the RelNodes
 within a RelSubset in Drill is exactly implemented to do the trait
 propagation. We also had to rely on AbstractConverter to fire the
 necessary rules to avoid the CanNotPlan issue. At some point, the Calcite
 community chose to remove AbstractConverter and Drill had to add it
 back, which is probably one of the main reasons for us to continue
 using a Calcite fork. I still remember we constantly had to deal with
 the dilemma between "CanNotPlan" and long planning time due to the
 explored search space.
 
 Glad to see more people are joining the effort to solve this long-overdue
 issue, something missing in Calcite's core optimizer framework
 "since before Calcite was Calcite" (Jacques's words).
 
 Jinfeng
 
 
 On Thu, Oct 31, 2019 at 3:38 AM Vladimir Ozerov wrote:
> 
> Hi Danny,
> 
> Thank you very much for the links. What is described here is pretty much
> similar to the problem I describe. Especially the discussion about trait

Re: [DISCUSS] On-demand traitset request

2019-11-01 Thread Xiening Dai
Yes, my major concern is the expansion of the search space. If we request the
permutation of (a, b, c), then we increase the search space 6 times. In a lot
of cases where the query is complex, we might not be able to finish the search
at all. I think there are going to be some heuristics built in to guide the
search, which, in some rare cases, would mean reducing the chance of finding
the best plan. But essentially this is a trade-off to make. An optimizer cannot
guarantee to always find the best plan within a time/space bound. That's why we
need hinting and other tools.
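
(The factor of 6 is just 3! orderings of the three keys. A throwaway sketch,
using Guava's Collections2, which Calcite already bundles, to make the blow-up
concrete; the class name is made up:)

```java
import com.google.common.collect.Collections2;
import java.util.Arrays;
import java.util.List;

public class PermutationBlowUp {
  public static void main(String[] args) {
    List<String> joinKeys = Arrays.asList("a", "b", "c");
    // Every permutation is a distinct candidate collation the planner must
    // register and cost; for n keys that is n! alternatives per input.
    for (List<String> collation : Collections2.permutations(joinKeys)) {
      System.out.println(collation); // [a, b, c], [a, c, b], ... 6 in total
    }
  }
}
```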

Regarding your comment - "In the current implementation of VolcanoPlanner, I
feel the root cause of the long planning time is not the exploration of all
possible satisfying traits." - I am not sure I understand. If the planner
explores more possible traits, there could be a better plan, but how would
that reduce planning time? Can you please elaborate? Thanks.


> On Oct 31, 2019, at 11:10 PM, Jinfeng Ni  wrote:
> 
> Hi Xiening,
> 
> "Let say if R and S doesn’t have sorting properties at all. In your
> case, we would end up adding enforcers for LHS and RHS to get
> collation (a, b, c). Then we would need another enforcer to get
> collation (b, c). This is a sub optimal plan as we could have use (b,
> c, a) for join."
> 
> In such a case, for step 2 when MergeJoin requests a permutation match of
> (a, b, c) on both its inputs, it is not necessary to end up with
> collation (a, b, c) only. Since it requests a "permutation", MJ could ask for
> all possible satisfying collations, which include (b, c, a). In other
> words, the steps I described did not exclude such a plan.
> 
> You may argue it would increase the search space. However, by
> limiting the search space, without exploring all possible choices, we may
> lose the chance of getting the 'optimal' plan we want. For instance, in the
> above example, the idea of passing the "on demand" trait request (b, c)
> from Agg to MJ is to avoid an unnecessary sort on (b, c). In cases where
> the join condition has good filtering, such a sort of the join output could
> be quite cheap. Yet in the plan enumeration, since we use the "on demand"
> trait request from the parent to guide the actions of MJ, I'm not sure if
> we may restrict the choices we consider in the legs of the join, whose
> cardinality could be larger and play a bigger role in the overall
> cost.
> 
> In other words, by using the "on demand" trait request, we may restrict
> the plan choices, possibly in some operators with larger data
> sizes.
> 
> In the current implementation of VolcanoPlanner, I feel the root cause
> of the long planning time is not the exploration of all possible satisfying
> traits. It is actually the unnecessary AbstractConverters added to the
> equivalence classes.
> 
> 
> On Fri, Oct 18, 2019 at 10:39 PM Xiening Dai  wrote:
>> 
>> Thanks for sharing. I like the way you model this problem, Jinfeng.
>> 
>> There’s one minor issue with your example. Let's say R and S don't have
>> sorting properties at all. In your case, we would end up adding enforcers
>> for LHS and RHS to get collation (a, b, c). Then we would need another
>> enforcer to get collation (b, c). This is a suboptimal plan, as we could
>> have used (b, c, a) for the join.
>> 
>> I think in step #2, the join operator would need to take the agg trait
>> requirement into account. Then it would have two options -
>>
>> 1) require an *exact/super* match of (b, c, a) or (c, b, a); this is to
>> guarantee the join output would deliver the collation the agg needs.
>> 2) require a permutation match of (a, b, c); in such a case, an enforcer
>> might be needed for the aggregation.
>> 
>> Eventually the cost model decides who is the winner.
>> 
>> There’s a fundamental difference between your model and Haisheng’s proposal.
>> In Haisheng’s case, a rel node not only looks at its parent’s requirement,
>> but also tries to get the potential traits its inputs could deliver. It would
>> try to align them to eliminate unnecessary alternatives.
>> 
>> In the above example, assuming R is (b, c, a) and S is (a, b, c), to
>> implement option 1), we would generate two alternatives -
>> 
>> MergeJoin (b, c, a)
>>   TableScan R
>>   Sort(b, c, a)
>>     TableScan S
>>
>> MergeJoin(c, b, a)
>>   Sort(c, b, a)
>>     TableScan R
>>   Sort(c, b, a)
>>     TableScan S
>> 
>> But if we look at the input traits and have the insight that R already
>> delivers (b, c, a), we could decide to require (b, c, a) only and avoid
>> generating the 2nd plan, which is definitely worse, and thus reduce the
>> search space.
>> 
>> 
>>> On Oct 18, 2019, at 4:57 PM, Jinfeng Ni  wrote:
>>> 
>>> A little bit of history. In Drill, when we first implemented the
>>> Distribution trait's definition, we allowed both exact match and
>>> partial match in the satisfy() method. This works fine for single-input
>>> operators such as aggregation; however, it leads to incorrect plans for
>>> join queries, i.e., LHS shuffled with (a, b), RHS shuffled with

Re: Problem with converters and possibly rule matching

2019-11-01 Thread Vladimir Ozerov
Hi Stamatis,

Thank you for your reply. I also thought that we may set the distribution
trait during logical planning because it is known in advance. And the
example I gave will work! :-) But unfortunately, it will work only because
the tree is very simple, and the Project is adjacent to the Scan. This is
how my reproducer will work in that case:
1) Root: enforce "SINGLETON" on Project
2) Project: check the logical Scan, infer the already resolved distribution,
then convert to [PhysicalProject <- PhysicalScan]
3) Resolve the Root enforcer, adding an Exchange if needed.

But this stops working as soon as the plan becomes more complex, so that it is
impossible to infer the distribution from the child immediately. E.g.:
LogicalRoot [distribution=SINGLETON]
-> LogicalProject // We are here now and cannot produce the physical project
 -> LogicalFilter[distribution=?]
  -> LogicalScan[distribution=REPLICATED]

This is where your suggestion with cascading enforcement may kick in. But
now consider that instead of having a REPLICATED distribution, which
satisfies SINGLETON, we have a PARTITIONED distribution. It doesn't satisfy
SINGLETON, so we have to insert an exchange.

Before:
PhysicalRoot   [SINGLETON]
-> PhysicalProject [SINGLETON]
 -> PhysicalFilter [SINGLETON]
  -> LogicalScan   [PARTITIONED] // SINGLETON is enforced here

After:
PhysicalRoot   [SINGLETON]
-> PhysicalProject [SINGLETON]
 -> PhysicalFilter [SINGLETON]
  -> SingletonExchange [SINGLETON]
   -> PhysicalScan [PARTITIONED]

But unfortunately, this plan is not optimal. Since we perform the Project and
the Filter after the exchange, we now have to reshuffle the whole table. The
optimal plan would be to push down the Project and Filter past the exchange:
PhysicalRoot [SINGLETON]
-> SingletonExchange [SINGLETON]
 -> PhysicalProject  [PARTITIONED]
  -> PhysicalFilter  [PARTITIONED]
   -> PhysicalScan   [PARTITIONED]

But in order to achieve that, I now need to introduce new rules which will
attempt to transpose dozens of different operators past the exchange, so most
likely the planning will never finish :-)
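
(To make the flavor of such a rule concrete, a rough sketch of one of those
transpositions, written against Calcite's rule API. The trait bookkeeping is
deliberately elided, and a real rule would also have to check that the
exchange's distribution keys survive the projection:)

```java
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.core.Exchange;
import org.apache.calcite.rel.core.Project;

// Hypothetical rule pushing a Project below an Exchange so the projection
// runs before rows cross the network.
public class ProjectExchangeTransposeRule extends RelOptRule {
  public ProjectExchangeTransposeRule() {
    super(operand(Project.class, operand(Exchange.class, any())),
        "ProjectExchangeTransposeRule");
  }

  @Override public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final Exchange exchange = call.rel(1);
    // New project directly on the exchange's input, i.e. before the shuffle.
    final Project pushed = project.copy(project.getTraitSet(),
        exchange.getInput(), project.getProjects(), project.getRowType());
    // Re-create the exchange on top of the pushed-down project.
    call.transformTo(exchange.copy(exchange.getTraitSet(), pushed,
        exchange.getDistribution()));
  }
}
```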

This is why the bottom-up approach seems more natural for such cases.
Another example is index support. A table may have a number of access
methods in addition to a plain scan, and every such method may produce a
physical scan with different collations. It is not practical to try to
enforce something from the top. Instead, we'd better produce
alternatives from the bottom.

Basically, Drill tries to mix the two approaches. First, it tries to derive
traits from the child. If that fails and no transformations were created, it
just stops trait propagation. As a result, an exchange is created on top of
the operator where we stopped, which again leads to non-optimal plans. You
may observe this in action if you run my reproducer: "testPass" produces an
optimal plan with the help of abstract converters; "testDrill" produces a
non-optimal plan with an unnecessary exchange.

All in all, I cannot see how we can employ distribution metadata and index
metadata for optimal planning without using a pure bottom-up approach.

Regards,
Vladimir.

On Fri, Nov 1, 2019 at 11:42, Stamatis Zampetakis wrote:

> The discussion is interesting; thanks for the nice examples.
> I am replying in this thread since it has all the examples and the history
> of the conversation.
>
> *Part I: Distribution physical or logical property*
>
> The Volcano paper writes the following regarding properties:
>
> "Logical properties can be derived from the logical algebra expression and
> include schema, expected size, etc., while physical properties depend on
> algorithms, e.g., sort order, partitioning, etc."
>
> We could say that partitioning is not only a property which depends on an
> algorithm but also a property which depends on the schema. When you have a
> table, you know in advance that the particular table is replicated.
>
> Basically, this means that the starting "logical" plan in your case could
> be the following:
> RootLogicalRel[convention=LOGICAL, distribution=ANY]
> -> ProjectLogicalRel[convention=LOGICAL, distribution=ANY]
>   -> MapScanLogicalRel[convention=LOGICAL, distribution=REPLICATED]
>
> When you are about to create your physical projection, it is not necessary
> to create three equivalent paths but only the most promising, by asking the
> input what it can offer as traits. Again this partially overlaps with the
> discussion about "On demand traitset" [1] where, as I wrote, there is already
> a mechanism to derive traits from children (collation, distribution, etc.).
>
> Note that I am not saying that there is nothing more to be done to improve
> the Calcite optimizer; I am just trying to provide a viable alternative
> given the current situation of the project.
>
> *Part II: Example with the original Volcano optimizer*
>
> Now let me try to show how the original (from the paper) Volcano optimizer
> could handle your query.
>
> The optimization starts and we request two 

Re: Problem with converters and possibly rule matching

2019-11-01 Thread Seliverstov Igor
Stamatis,

Small comment from me: for distributed systems it is more efficient if the
SINGLETON distribution appears only at the root node, in other words as late
as possible; this allows filtering or cutting rows (project) on the source
nodes, which significantly reduces network costs.

To do that (in case we use the original Volcano approach) we need to request
all possible distribution traits on the project node (some of them we cannot
request, because we don't know whether repartitioning and rehashing happen
in the middle) to provide a place for possible converted inputs (possibly
leaving empty subsets, which causes exceptions in the current
implementation). Such an approach significantly slows down the planning phase.

Regards,
Igor

On Fri, Nov 1, 2019 at 11:42, Stamatis Zampetakis wrote:

> The discussion is interesting; thanks for the nice examples.
> I am replying in this thread since it has all the examples and the history
> of the conversation.
>
> *Part I: Distribution physical or logical property*
>
> The Volcano paper writes the following regarding properties:
>
> "Logical properties can be derived from the logical algebra expression and
> include schema, expected size, etc., while physical properties depend on
> algorithms, e.g., sort order, partitioning, etc."
>
> We could say that partitioning is not only a property which depends on an
> algorithm but also a property which depends on the schema. When you have a
> table, you know in advance that the particular table is replicated.
>
> Basically, this means that the starting "logical" plan in your case could
> be the following:
> RootLogicalRel[convention=LOGICAL, distribution=ANY]
> -> ProjectLogicalRel[convention=LOGICAL, distribution=ANY]
>   -> MapScanLogicalRel[convention=LOGICAL, distribution=REPLICATED]
>
> When you are about to create your physical projection, it is not necessary
> to create three equivalent paths but only the most promising, by asking the
> input what it can offer as traits. Again this partially overlaps with the
> discussion about "On demand traitset" [1] where, as I wrote, there is already
> a mechanism to derive traits from children (collation, distribution, etc.).
>
> Note that I am not saying that there is nothing more to be done to improve
> the Calcite optimizer; I am just trying to provide a viable alternative
> given the current situation of the project.
>
> *Part II: Example with the original Volcano optimizer*
>
> Now let me try to show how the original (from the paper) Volcano optimizer
> could handle your query.
>
> The optimization starts and we request two properties convention and
> distribution.
>
> RootLogicalRel[convention=PHYSICAL, distribution=SINGLETON]
> -> ProjectLogicalRel
>   -> MapScanLogicalRel
>
> Rule 1: RootLogicalRel -> RootPhysicalRel applies the algorithm but cannot
> satisfy the SINGLETON property so it propagates the demand
>
> RootPhysicalRel
> -> ProjectLogicalRel [convention=PHYSICAL, distribution=SINGLETON]
>   -> MapScanLogicalRel
>
> Rule 2: ProjectLogicalRel -> ProjectPhysicalRel applies the algorithm but
> cannot satisfy the SINGLETON property so it propagates the demand
>
> RootPhysicalRel
> -> ProjectPhysicalRel
>   -> MapScanLogicalRel [convention=PHYSICAL, distribution=SINGLETON]
>
> Rule 3: MapScanLogicalRel -> MapScanPhysicalRel applies the algorithm and
> given that the table is replicated it satisfies the distribution property.
>
> RootPhysicalRel
> -> ProjectPhysicalRel
>   -> MapScanPhysicalRel
>
> So we end up with a complete plan that satisfies all properties and does
> not have redundant exchange operators.
> Moreover, we didn't require all possible distributions when we applied Rule
> 2, since we just wanted to satisfy the SINGLETON property requested by the
> parent.
> The example above does not demonstrate the additional optimization paths,
> which would be to apply an enforcer (an Exchange operator to satisfy the
> SINGLETON property) at a higher level than the scan.
>
> Would this be an acceptable approach? Does it still suffer from a big
> search space?
>
> To make the analogy with the actual Volcano optimizer in Calcite, the thing
> that may be missing is the ability to know what trait was requested by the
> parent during the application of Rule 2.
>
> Best,
> Stamatis
>
> [1]
>
> https://lists.apache.org/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E
>
> On Wed, Oct 30, 2019 at 7:24 PM Vladimir Ozerov wrote:
>
> > Hi Stamatis,
> >
> > The problem is that the presented reproducer is a very simplified version of
> > what is actually needed. In reality, there are several distribution
> types -
> > PARTITIONED, REPLICATED, SINGLETON. To make things worse, PARTITIONED may
> > or may not have concrete distribution fields. In theory, I can create one
> > transformation per distribution type, but that would increase the plan space
> > significantly. In my sample "Root <- Project <- Scan" plan, there is no
> > room for optimization at all; we only need to convert the nodes and
> > 

Re: [DISCUSS] Proposal to add API to force rules matching specific rels

2019-11-01 Thread Stamatis Zampetakis
@Vladimir: Given that you want to make the optimizer work in a bottom-up
fashion, maybe it suffices to invert the comparator in the RuleQueue [1].
For sure there are rules which would not be compatible with this change, but
maybe for your project it makes sense.
I never tried it myself, but I would be curious to see if it works.

@Jinfeng: I guess the reason we have the convention as a trait is
that results from different systems need to be combined together.
Given that these results arrive in different forms, we need a way to know
how to transform one to the other. In some sense the convention trait can
be thought of as similar to the "assembledness" property mentioned in the
paper.
"For query optimization in object-oriented systems, we plan on defining
"assembledness" of complex objects in memory as a physical property and
using the assembly operator described in ... as the enforcer for this
property." [2]
For sure I am not the best person to talk about this choice, so don't take
anything that I say as 100% accurate. I am sure other people can provide a
much better explanation.

Best,
Stamatis

[1]
https://github.com/apache/calcite/blob/65043f290a7be45a668cf941ab48ee3c30efbe4e/core/src/main/java/org/apache/calcite/plan/volcano/RuleQueue.java#L94
[2]
https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Volcano-graefe.pdf
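
(A toy sketch of the comparator inversion mentioned above. "RuleMatch" and
its importance field are placeholders for whatever ordering the RuleQueue
really uses, so treat this as the idea rather than Calcite code:)

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Placeholder for a queued rule invocation; higher importance roughly
// means "closer to the root" in the default ordering.
class RuleMatch {
  final String description;
  final double importance;

  RuleMatch(String description, double importance) {
    this.description = description;
    this.importance = importance;
  }
}

class MatchQueues {
  static final Comparator<RuleMatch> MOST_IMPORTANT_FIRST =
      Comparator.comparingDouble((RuleMatch m) -> m.importance).reversed();

  // Default behaviour: pop matches near the root first (top-down-ish).
  final PriorityQueue<RuleMatch> topDown =
      new PriorityQueue<>(MOST_IMPORTANT_FIRST);

  // Inverted comparator: pop the deepest matches first (bottom-up-ish).
  final PriorityQueue<RuleMatch> bottomUp =
      new PriorityQueue<>(MOST_IMPORTANT_FIRST.reversed());
}
```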

On Fri, Nov 1, 2019 at 7:53 AM Jinfeng Ni  wrote:

> I think "pull-up traits" is necessary, since the physical traits are
> propagated  upwards. However, I'm not fully convinced "On Demand
> Trait" is the best solution, as I feel it may restrict the choices the
> planner may consider.  Maybe after the proposed design doc is ready,
> we may look into the detail and reevaluate.
>
> One question that I have always had ever since I started using Calcite
> couple of years ago,  is the concept of Convention being a trait, and
> introduction of AbstractConverter in Calcite's VolcanoPlanner.
> Quickly looking through the original paper of Volcano [1], or the new
> one [2], I did not see mentioning of such concepts.  The original
> paper has a classification of "transformation rules" which operates on
> logical relation expression  and "implementation rules" which provides
> the mapping to physical operators.
>
>
> 1. https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
> 2.
> https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/volcano-prasan.pdf
>
> On Thu, Oct 31, 2019 at 2:40 PM Xiening Dai  wrote:
> >
> > Actually we solved this problem in our setup using a mechanism called
> “Pull-Up Traits”, which explores the possible trait set of children’s inputs
> to decide the parent’s physical properties. In order to determine a child’s
> input trait, you would have to look at the child’s children, and all the way
> to the leaf nodes or a barrier. A barrier is a rel node which cannot derive
> any traits regardless of the input. A good example would be a user-defined
> function which would throw off any distribution or collation. Then we
> realized just pulling up is not enough; sometimes we would need to look at
> the parent’s requirement as well. So we try to solve this in a unified
> framework, which we call “On Demand Trait”, and implement it as part of the
> framework so anyone can benefit. I hope Haisheng can share a design doc once
> we have more concrete ideas.
> >
> >
> > > On Oct 31, 2019, at 11:37 AM, Jinfeng Ni  wrote:
> > >
> > > Hi Vladimir,
> > >
> > > The SubsetTransformer interface and the iterating over the RelNodes
> > > within a RelSubset in Drill is exactly implemented to do the trait
> > > propagation. We also had to rely on AbstractConverter to fire the
> > > necessary rules to avoid the CanNotPlan issue. At some point, the Calcite
> > > community chose to remove AbstractConverter and Drill had to add it
> > > back, which is probably one of the main reasons for us to continue
> > > using a Calcite fork. I still remember we constantly had to deal with
> > > the dilemma between "CanNotPlan" and long planning time due to the
> > > explored search space.
> > >
> > > Glad to see more people are joining the effort to solve this long-overdue
> > > issue, something missing in Calcite's core optimizer framework
> > > "since before Calcite was Calcite" (Jacques's words).
> > >
> > > Jinfeng
> > >
> > >
> > > On Thu, Oct 31, 2019 at 3:38 AM Vladimir Ozerov wrote:
> > >>
> > >> Hi Danny,
> > >>
> > >> Thank you very much for the links. What is described here is pretty much
> > >> similar to the problem I describe. Especially the discussion about trait
> > >> propagation, as this is basically what I need - to explore potential
> > >> traits of children before optimizing parents. And this is basically what
> > >> Drill already does with its SubsetTransformer:
> > >> 1) There is a SubsetTransformer interface, which iterates over physical
> > >> relations of the given subset [1]
> > >> 2) If you want to make a physical project, you iterate over physical
> > >> 

How to enable cache schema through JDBC properties

2019-11-01 Thread zhaojun
Hi, all

I have found a cache property in JsonSchema; its default value is true
(Calcite version 1.20.0).
But when I create a Calcite connection through DriverManager.getConnection,
the cache property value is null after Jackson deserialization.
My code is like this:

```
Properties properties = new Properties();
// All connection settings go on the same Properties object.
properties.setProperty("schema", schemaName);
properties.setProperty("schemaFactory",
    MemorySchemaFactory.class.getCanonicalName());
properties.setProperty("lex", "MYSQL");

DriverManager.getConnection(CONNECTION_URL, properties);

```
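
For reference, a sketch of what passing the flag through an inline JSON model
might look like instead (assuming Calcite's "model=inline:..." connection
property and the schema-level "cache" attribute; schemaName,
MemorySchemaFactory and CONNECTION_URL are the same as in the snippet above):

```java
Properties modelProps = new Properties();
modelProps.setProperty("lex", "MYSQL");
// The cache flag is set explicitly at the schema level of the model,
// bypassing the property-deserialization path.
modelProps.setProperty("model", "inline:"
    + "{ \"version\": \"1.0\","
    + "  \"defaultSchema\": \"" + schemaName + "\","
    + "  \"schemas\": [ {"
    + "    \"name\": \"" + schemaName + "\","
    + "    \"type\": \"custom\","
    + "    \"factory\": \"" + MemorySchemaFactory.class.getCanonicalName() + "\","
    + "    \"cache\": true"
    + "  } ]"
    + "}");
DriverManager.getConnection(CONNECTION_URL, modelProps);
```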
How can I solve this issue?

Regards

--
Zhao Jun
Apache Sharding-Sphere & ServiceComb



Re: Problem with converters and possibly rule matching

2019-11-01 Thread Stamatis Zampetakis
The discussion is interesting; thanks for the nice examples.
I am replying in this thread since it has all the examples and the history
of the conversation.

*Part I: Distribution physical or logical property*

The Volcano paper writes the following regarding properties:

"Logical properties can be derived from the logical algebra expression and
include schema, expected size, etc., while physical properties depend on
algorithms, e.g., sort order, partitioning, etc."

We could say that partitioning is not only a property which depends on an
algorithm but also a property which depends on the schema. When you have a
table, you know in advance that the particular table is replicated.

Basically, this means that the starting "logical" plan in your case could
be the following:
RootLogicalRel[convention=LOGICAL, distribution=ANY]
-> ProjectLogicalRel[convention=LOGICAL, distribution=ANY]
  -> MapScanLogicalRel[convention=LOGICAL, distribution=REPLICATED]

When you are about to create your physical projection, it is not necessary
to create three equivalent paths but only the most promising, by asking the
input what it can offer as traits. Again this partially overlaps with the
discussion about "On demand traitset" [1] where, as I wrote, there is already
a mechanism to derive traits from children (collation, distribution, etc.).
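
(A small sketch of that mechanism; the two metadata calls are existing Calcite
API, while the wrapper class and method are just for illustration:)

```java
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.metadata.RelMetadataQuery;

class TraitDerivation {
  // Ask the metadata system what the input can already deliver, and only
  // request those traits when building the physical projection on top.
  static void printDerivableTraits(RelNode input) {
    final RelMetadataQuery mq = input.getCluster().getMetadataQuery();
    System.out.println(mq.collations(input));   // e.g. [[b, c, a]] for an index scan
    System.out.println(mq.distribution(input)); // e.g. REPLICATED for the MapScan above
  }
}
```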

Note that I am not saying that there is nothing more to be done to improve
the Calcite optimizer; I am just trying to provide a viable alternative
given the current situation of the project.

*Part II: Example with the original Volcano optimizer*

Now let me try to show how the original (from the paper) Volcano optimizer
could handle your query.

The optimization starts and we request two properties convention and
distribution.

RootLogicalRel[convention=PHYSICAL, distribution=SINGLETON]
-> ProjectLogicalRel
  -> MapScanLogicalRel

Rule 1: RootLogicalRel -> RootPhysicalRel applies the algorithm but cannot
satisfy the SINGLETON property so it propagates the demand

RootPhysicalRel
-> ProjectLogicalRel [convention=PHYSICAL, distribution=SINGLETON]
  -> MapScanLogicalRel

Rule 2: ProjectLogicalRel -> ProjectPhysicalRel applies the algorithm but
cannot satisfy the SINGLETON property so it propagates the demand

RootPhysicalRel
-> ProjectPhysicalRel
  -> MapScanLogicalRel [convention=PHYSICAL, distribution=SINGLETON]

Rule 3: MapScanLogicalRel -> MapScanPhysicalRel applies the algorithm and
given that the table is replicated it satisfies the distribution property.

RootPhysicalRel
-> ProjectPhysicalRel
  -> MapScanPhysicalRel

So we end up with a complete plan that satisfies all properties and does
not have redundant exchange operators.
Moreover, we didn't require all possible distributions when we applied Rule
2, since we just wanted to satisfy the SINGLETON property requested by the
parent.
The example above does not demonstrate the additional optimization paths,
which would be to apply an enforcer (an Exchange operator to satisfy the
SINGLETON property) at a higher level than the scan.

Would this be an acceptable approach? Does it still suffer from a big
search space?

To make the analogy with the actual Volcano optimizer in Calcite, the thing
that may be missing is the ability to know what trait was requested by the
parent during the application of Rule 2.
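
(To connect this to Calcite code, a loose sketch of what Rule 2 could look
like as a ConverterRule. PhysicalProject and the PHYSICAL convention are
inventions for the sketch; only ConverterRule, Convention.Impl and
RelDistributions.SINGLETON are existing API, and the SINGLETON demand is
hardcoded just to mirror the walkthrough:)

```java
import java.util.List;
import org.apache.calcite.plan.Convention;
import org.apache.calcite.plan.RelOptCluster;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelDistributions;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.logical.LogicalProject;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rex.RexNode;

// Hypothetical physical operator and convention (not part of Calcite).
class PhysicalProject extends Project {
  static final Convention PHYSICAL =
      new Convention.Impl("PHYSICAL", RelNode.class);

  PhysicalProject(RelOptCluster cluster, RelTraitSet traits, RelNode input,
      List<? extends RexNode> projects, RelDataType rowType) {
    super(cluster, traits, input, projects, rowType);
  }

  @Override public Project copy(RelTraitSet traitSet, RelNode input,
      List<RexNode> projects, RelDataType rowType) {
    return new PhysicalProject(getCluster(), traitSet, input, projects, rowType);
  }
}

// "Rule 2": implement the project physically and hand the distribution
// demand down to the input.
class ProjectPhysicalRule extends ConverterRule {
  ProjectPhysicalRule() {
    super(LogicalProject.class, Convention.NONE, PhysicalProject.PHYSICAL,
        "ProjectPhysicalRule");
  }

  @Override public RelNode convert(RelNode rel) {
    final LogicalProject project = (LogicalProject) rel;
    // Sketch: demand = physical convention + the SINGLETON distribution.
    final RelTraitSet demanded = project.getTraitSet()
        .replace(PhysicalProject.PHYSICAL)
        .plus(RelDistributions.SINGLETON);
    // convert(input, demanded) is what propagates the demand to the child.
    return new PhysicalProject(project.getCluster(), demanded,
        convert(project.getInput(), demanded),
        project.getProjects(), project.getRowType());
  }
}
```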

Best,
Stamatis

[1]
https://lists.apache.org/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E

On Wed, Oct 30, 2019 at 7:24 PM Vladimir Ozerov  wrote:

> Hi Stamatis,
>
> The problem is that the presented reproducer is a very simplified version of
> what is actually needed. In reality, there are several distribution types -
> PARTITIONED, REPLICATED, SINGLETON. To make things worse, PARTITIONED may
> or may not have concrete distribution fields. In theory, I can create one
> transformation per distribution type, but that would increase the plan space
> significantly. In my sample "Root <- Project <- Scan" plan, there is no
> room for optimization at all; we only need to convert the nodes and
> propagate traits. But if I follow the proposed approach, the planner would
> create three equivalent paths. For complex queries, this may increase
> optimization time significantly.
>
> What I need instead is to gradually convert and optimize nodes *bottom-up*,
> instead of top-down. That is, create a physical scan first, then create a
> physical project on top of it, etc. But at the same time, some rules still
> require the top-down approach. So essentially I need the optimizer to do
> both. Abstract converters help me establish bottom-up preparation but do
> this at the cost of considering too many trait pairs, and as a result, they
> also pollute the search space.
>
> By contrast, a precise command "I transformed node A to node B,
> please re-trigger the rules for A's parents" would allow us to re-trigger
> only the required rules, without adding more nodes.
>
> Does it make sense?
>
> Regards,
> Vladimir.
>
> Wed, Oct 30, 2019

[jira] [Created] (CALCITE-3469) Fix typo in SubstitutionVisitor#rowTypesAreEquivalent

2019-11-01 Thread daimin (Jira)
daimin created CALCITE-3469:
---

 Summary: Fix typo in SubstitutionVisitor#rowTypesAreEquivalent
 Key: CALCITE-3469
 URL: https://issues.apache.org/jira/browse/CALCITE-3469
 Project: Calcite
  Issue Type: Bug
  Components: core
Reporter: daimin


{code:java}
  private boolean rowTypesAreEquivalent(
      MutableRel rel0, MutableRel rel1, Litmus litmus) {
    if (rel0.rowType.getFieldCount() != rel1.rowType.getFieldCount()) {
      return litmus.fail("Mismatch for column count: [{}]", Pair.of(rel0, rel1));
    }
    for (Pair<RelDataTypeField, RelDataTypeField> pair
        : Pair.zip(rel0.rowType.getFieldList(), rel0.rowType.getFieldList())) {
      if (!pair.left.getType().equals(pair.right.getType())) {
        return litmus.fail("Mismatch for column type: [{}]", Pair.of(rel0, rel1));
      }
    }
    return litmus.succeed();
  }
{code}
Where 
{code:java}
Pair.zip(rel0.rowType.getFieldList(), rel0.rowType.getFieldList())
{code}
should be 

{code:java}
Pair.zip(rel0.rowType.getFieldList(), rel1.rowType.getFieldList())
{code}
As written, the zip pairs rel0's field list with itself, so the per-column
type check always passes and mismatched row types go undetected.






Re: [ANNOUNCE] Danny Chan joins Calcite PMC

2019-11-01 Thread Stamatis Zampetakis
Congratulations Danny!

You are doing an amazing job. The project and the community are becoming
better every day and your help is much appreciated.

Keep up the momentum!

Best,
Stamatis

On Thu, Oct 31, 2019 at 4:41 AM Kurt Young  wrote:

> Congratulations Danny!
>
> Best,
> Kurt
>
>
> On Thu, Oct 31, 2019 at 11:18 AM Danny Chan  wrote:
>
> > Thank you so much, colleagues; it’s my honor to work with you!
> >
> > I have always felt the respect and harmony of the community. I hope to
> > contribute more, and I will help as best as I can. Thanks!
> >
> > Best,
> > Danny Chan
> > On 2019-10-31 at 5:22 AM +0800, Francis Chuang wrote:
> > > I'm pleased to announce that Danny has accepted an invitation to
> > > join the Calcite PMC. Danny has been a consistent and helpful
> > > figure in the Calcite community for which we are very grateful. We
> > > look forward to the continued contributions and support.
> > >
> > > Please join me in congratulating Danny!
> > >
> > > - Francis (on behalf of the Calcite PMC)
> >
>


Re: [DISCUSS] Proposal to add API to force rules matching specific rels

2019-11-01 Thread Jinfeng Ni
I think "pull-up traits" is necessary, since the physical traits are
propagated  upwards. However, I'm not fully convinced "On Demand
Trait" is the best solution, as I feel it may restrict the choices the
planner may consider.  Maybe after the proposed design doc is ready,
we may look into the detail and reevaluate.

One question that I have always had ever since I started using Calcite
couple of years ago,  is the concept of Convention being a trait, and
introduction of AbstractConverter in Calcite's VolcanoPlanner.
Quickly looking through the original paper of Volcano [1], or the new
one [2], I did not see mentioning of such concepts.  The original
paper has a classification of "transformation rules" which operates on
logical relation expression  and "implementation rules" which provides
the mapping to physical operators.


1. https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
2. 
https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/volcano-prasan.pdf
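
(For readers mapping the paper's terminology onto Calcite's code base, a
rough comment-sketch; both rule classes named below exist in Calcite:)

```java
// "Transformation rule": logical -> logical, convention unchanged,
//   e.g. org.apache.calcite.rel.rules.FilterProjectTransposeRule.
// "Implementation rule": logical -> physical, changes the convention,
//   e.g. org.apache.calcite.adapter.enumerable.EnumerableProjectRule,
//   a ConverterRule from Convention.NONE to EnumerableConvention.INSTANCE.
```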

On Thu, Oct 31, 2019 at 2:40 PM Xiening Dai  wrote:
>
> Actually we solved this problem in our setup using a mechanism called
> “Pull-Up Traits”, which explores the possible trait set of children’s inputs
> to decide the parent’s physical properties. In order to determine a child’s
> input trait, you would have to look at the child’s children, and all the way
> to the leaf nodes or a barrier. A barrier is a rel node which cannot derive
> any traits regardless of the input. A good example would be a user-defined
> function which would throw off any distribution or collation. Then we
> realized just pulling up is not enough; sometimes we would need to look at
> the parent’s requirement as well. So we try to solve this in a unified
> framework, which we call “On Demand Trait”, and implement it as part of the
> framework so anyone can benefit. I hope Haisheng can share a design doc once
> we have more concrete ideas.
>
>
> > On Oct 31, 2019, at 11:37 AM, Jinfeng Ni  wrote:
> >
> > Hi Vladimir,
> >
> > The SubsetTransformer interface and the iterating over the RelNodes
> > within a RelSubset in Drill is exactly implemented to do the trait
> > propagation. We also had to rely on AbstractConverter to fire the
> > necessary rules to avoid the CanNotPlan issue. At some point, the Calcite
> > community chose to remove AbstractConverter and Drill had to add it
> > back, which is probably one of the main reasons for us to continue
> > using a Calcite fork. I still remember we constantly had to deal with
> > the dilemma between "CanNotPlan" and long planning time due to the
> > explored search space.
> >
> > Glad to see more people are joining the effort to solve this long-overdue
> > issue, something missing in Calcite's core optimizer framework
> > "since before Calcite was Calcite" (Jacques's words).
> >
> > Jinfeng
> >
> >
> > On Thu, Oct 31, 2019 at 3:38 AM Vladimir Ozerov  wrote:
> >>
> >> Hi Danny,
> >>
> >> Thank you very much for the links. What is described here is pretty much
> >> similar to the problem I describe. Especially the discussion about trait
> >> propagation, as this is basically what I need - to explore potential traits
> >> of children before optimizing parents. And this is basically what Drill
> >> already does with its SubsetTransformer:
> >> 1) There is a SubsetTransformer interface, which iterates over physical
> >> relations of the given subset [1]
> >> 2) If you want to make a physical project, you iterate over physical
> >> relations of the input subset and create possible physical projects [2]
> >> 3) But if you cannot find any physical input, then you trigger creation of
> >> a "bad" physical project, which is very likely to have poor cost because it
> >> cannot take advantage of input's distribution information [3]
> >> So, step 2 is the trait-set propagation which is needed by many
> >> distributed engines. Step 3 is an attempt to work around the current
> >> VolcanoPlanner behavior, where a parent rule is fired only if the produced
> >> child node has a compatible trait set.
> >>
> >> I do not know Calcite's architecture that well but at first glance, the
> >> proposed ability to re-fire rules of a specific Rel seems good enough. It
> >> doesn't expand the search space, because no new nodes are created, and it seems
> >> to be relatively simple to implement.
> >>
> >> [1]
> >> https://github.com/apache/drill/blob/1.16.0/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/SubsetTransformer.java
> >> [2]
> >> https://github.com/apache/drill/blob/1.16.0/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ProjectPrule.java#L66
> >> [3]
> >> https://github.com/apache/drill/blob/1.16.0/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ProjectPrule.java#L69
> >>
> >> On Thu, Oct 31, 2019 at 12:21, Danny Chan wrote:
> >>
> >>> Thanks Vladimir for bringing up this discussion!
> >>>
> >>> Did you notice that there is a JIRA issue about this problem? [1] Also a
> >>> discussion about how to propagate

Re: [DISCUSS] On-demand traitset request

2019-11-01 Thread Jinfeng Ni
Hi Xiening,

"Let say if R and S doesn’t have sorting properties at all. In your
case, we would end up adding enforcers for LHS and RHS to get
collation (a, b, c). Then we would need another enforcer to get
collation (b, c). This is a sub optimal plan as we could have use (b,
c, a) for join."

In such a case, for step 2 when MergeJoin requests a permutation match of
(a, b, c) on both its inputs, it is not necessary to end up with
collation (a, b, c) only. Since it requests a "permutation", MJ could ask for
all possible satisfying collations, which include (b, c, a). In other
words, the steps I described did not exclude such a plan.

You may argue it would increase the search space. However, by
limiting the search space, without exploring all possible choices, we may
lose the chance of getting the 'optimal' plan we want. For instance, in the
above example, the idea of passing the "on demand" trait request (b, c)
from Agg to MJ is to avoid an unnecessary sort on (b, c). In cases where
the join condition has good filtering, such a sort of the join output could
be quite cheap. Yet in the plan enumeration, since we use the "on demand"
trait request from the parent to guide the actions of MJ, I'm not sure if
we may restrict the choices we consider in the legs of the join, whose
cardinality could be larger and play a bigger role in the overall
cost.

In other words, by using the "on demand" trait request, we may restrict
the plan choices, possibly in some operators with larger data
sizes.

In the current implementation of VolcanoPlanner, I feel the root cause
of the long planning time is not the exploration of all possible satisfying
traits. It is actually the unnecessary AbstractConverters added to the
equivalence classes.


On Fri, Oct 18, 2019 at 10:39 PM Xiening Dai  wrote:
>
> Thanks for sharing. I like the way you model this problem, Jinfeng.
>
> There’s one minor issue with your example. Let's say R and S don't have
> sorting properties at all. In your case, we would end up adding enforcers
> for LHS and RHS to get collation (a, b, c). Then we would need another
> enforcer to get collation (b, c). This is a suboptimal plan, as we could
> have used (b, c, a) for the join.
>
> I think in step #2, the join operator would need to take the agg trait
> requirement into account. Then it would have two options -
>
> 1) require an *exact/super* match of (b, c, a) or (c, b, a); this is to
> guarantee the join output would deliver the collation the agg needs.
> 2) require a permutation match of (a, b, c); in such a case, an enforcer
> might be needed for the aggregation.
>
> Eventually the cost model decides who is the winner.
>
> There’s a fundamental difference between your model and Haisheng’s proposal.
> In Haisheng’s case, a rel node not only looks at its parent’s requirement,
> but also tries to get the potential traits its inputs could deliver. It would
> try to align them to eliminate unnecessary alternatives.
>
> In the above example, assuming R is (b, c, a) and S is (a, b, c), to
> implement option 1), we would generate two alternatives -
>
> MergeJoin (b, c, a)
>   TableScan R
>   Sort(b, c, a)
>     TableScan S
>
> MergeJoin(c, b, a)
>   Sort(c, b, a)
>     TableScan R
>   Sort(c, b, a)
>     TableScan S
>
> But if we look at the input traits and have the insight that R already
> delivers (b, c, a), we could decide to require (b, c, a) only and avoid
> generating the 2nd plan, which is definitely worse, and thus reduce the
> search space.
>
>
> > On Oct 18, 2019, at 4:57 PM, Jinfeng Ni  wrote:
> >
> > A little bit of history. In Drill, when we first implemented the
> > Distribution trait's definition, we allowed both exact match and
> > partial match in the satisfy() method. This works fine for single-input
> > operators such as aggregation; however, it leads to incorrect plans for
> > join queries, i.e., LHS shuffled with (a, b), RHS shuffled with (a). At
> > that time, we removed partial match and used exact match only. Yet this
> > change led to unnecessary additional exchanges. To mitigate this
> > problem, in the join physical operator, for a join key (a, b, c), we
> > enumerate different distribution requests, yet this led to more space
> > to explore and significantly increased planning time (which is probably
> > what Haisheng also experienced). When I look back, I feel probably
> > what we missed is the "coordination" step in the join operator, because
> > if we relax the requirement of satisfy(), for multi-input operators,
> > we have to enforce some "coordination" to make sure multiple inputs'
> > traits could work together properly.
> >
> >
> >
> > On Fri, Oct 18, 2019 at 4:38 PM Jinfeng Ni  wrote:
> >>
> >> This is an interesting topic. Thanks for bringing up this issue.
> >>
> >> My understanding of the Volcano planner is that it works in a top-down
> >> search mode (the parent asks for a certain trait of its child), while the
> >> traits propagate in a bottom-up way, as Stamatis explained.
> >>
> >> IMHO, the issue