[jira] [Created] (CALCITE-3575) JOIN condition pushdown introduces a bug

2019-12-05 Thread Jira
小黑一点都不黑 created CALCITE-3575:


 Summary: JOIN condition pushdown introduces a bug
 Key: CALCITE-3575
 URL: https://issues.apache.org/jira/browse/CALCITE-3575
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.21.0, 1.20.0, 1.19.0, 1.18.0
Reporter: 小黑一点都不黑
 Attachments: image-2019-12-06-14-24-56-599.png

In this case, a join statement whose equality condition has an expression 
(rather than a plain column) on the left side will trigger the bug. The reason is that 
the pushDownJoinConditions method does not add the left project node to 
SqlToRelConverter.leaves.
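
A minimal sketch of the failing shape, assuming the standard Calcite test schema (the 
query, table, and column names are illustrative, not the reporter's actual case; `tester` 
is the SqlToRelConverter test harness used elsewhere in this digest):
{code:java}
// Hypothetical reproduction: the left side of the equi-join condition is an
// expression (UPPER(...)) rather than a plain column, so
// RelOptUtil#pushDownJoinConditions wraps the left input in a new
// LogicalProject that never gets added to SqlToRelConverter.leaves.
String sql = "select *\n"
    + "from emp e\n"
    + "join dept d on upper(e.ename) = d.name";
final RelNode rel = tester.convertSqlToRel(sql).rel;
{code}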

 

!image-2019-12-06-14-25-09-548.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CALCITE-3574) Add rlike syntax

2019-12-05 Thread Aitozi (Jira)
Aitozi created CALCITE-3574:
---

 Summary: Add rlike syntax 
 Key: CALCITE-3574
 URL: https://issues.apache.org/jira/browse/CALCITE-3574
 Project: Calcite
  Issue Type: Improvement
  Components: core
Reporter: Aitozi


RLIKE is currently not supported in Calcite. I think it would be useful to support it 
so that it can be used by other engines that depend on Calcite.
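
For reference, a hypothetical example of the syntax as it exists in engines such as Hive 
and Spark (this is not yet valid Calcite SQL; the table and column names are illustrative):
{code:java}
// Hypothetical usage once RLIKE is supported: match a column against a regex.
String sql = "select ename from emp where ename rlike '^A.*'";
{code}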



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Some issues with Gradle release for Avatica

2019-12-05 Thread Francis Chuang
I was attempting to make Avatica 1.16.0-rc0 available for voting, but 
ran into a few show stopper bugs.


Vladimir would probably be the best person to fix this as he did most of 
the migration to Gradle.


A few things I noticed:
- I am using -Pasf.git.pushRepositoryProvider=GITBOX to push the tag via 
Gitbox in the docker.sh script. This is so that we only ask the user for 
their ASF credentials and it can push the tag to git, upload to Nexus and 
so on. I noticed that the tag was being pushed to the calcite repository 
and not the avatica repository. The link in the email points to the 
calcite repo on Gitbox as well.


- There are some links in the vote email to the calcite-site-preview 
repository, but it doesn't seem to exist.


Other than those issues, I am a big fan of the automation and a lot of 
manual steps have been removed from the release process, which is a 
great improvement.


Attached below is the generated vote email:
```
Hi all,

I have created a build for Apache Calcite Avatica 1.16.0, release
candidate 0.

Thanks to everyone who has contributed to this release.

You can read the release notes here:
https://gitbox.apache.org/repos/asf/calcite-site-preview.git/avatica/docs/history.html

The commit to be voted upon:
https://gitbox.apache.org/repos/asf?p=calcite-avatica.git;a=commit;h=8dd4b7281a35b20eb4b141f11b89330766e811cf

Its hash is 8dd4b7281a35b20eb4b141f11b89330766e811cf

Tag:
https://gitbox.apache.org/repos/asf?p=calcite.git;a=tag;h=refs/tags/avatica-1.16.0-rc0

The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/calcite/apache-calcite-avatica-1.16.0-rc0
(revision 37097)

RAT report:
https://gitbox.apache.org/repos/asf/calcite-site-preview.git/avatica/rat/rat-report.txt

Site preview is here:
https://gitbox.apache.org/repos/asf/calcite-site-preview.git/avatica/

JavaDoc API preview is here:
https://gitbox.apache.org/repos/asf/calcite-site-preview.git/avatica/api

The hashes of the artifacts are as follows:
09e6b64de6a882b1e7e7cd48001e394c80959445bc9b5fcda94ac555be67238c4af532aef7826a7ecf5ad786294685565db998097576cc3ce092a056ba37bfc5
*apache-calcite-avatica-1.16.0-src.tar.gz
673fcddc9408b6e51c4feee0caa12b74b71fa6d94674001f207131864c05b7b67d40bc66ece99ad63a65a8c6346f4e4f1f487be08323cc410a2a0fea76f3e12b
*apache-calcite-avatica-1.16.0-src.zip

A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecalcite-1069/org/apache/calcite/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/francischuang.asc
https://www.apache.org/dist/calcite/KEYS

N.B.
To create the jars and test Apache Calcite Avatica: "./gradlew build 
-Prelease -PskipSigning".


If you do not have a Java environment available, you can run the tests
using docker. To do so, install docker and docker-compose, then run
"docker-compose run test" from the root of the directory.

Please vote on releasing this package as Apache Calcite Avatica 1.16.0.

The vote is open for the next 72 hours and passes if a majority of at
least three +1 PMC votes are cast.

[ ] +1 Release this package as Apache Calcite 1.16.0
[ ]  0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...


Here is my vote:

+1 (binding)
```

Let's try to fix those issues and I'll attempt to make rc0 available for 
voting again.


Francis


[jira] [Created] (CALCITE-3573) Upgrade to Gradle 6.0 container to build releases and javadoc

2019-12-05 Thread Francis Chuang (Jira)
Francis Chuang created CALCITE-3573:
---

 Summary: Upgrade to Gradle 6.0 container to build releases and 
javadoc
 Key: CALCITE-3573
 URL: https://issues.apache.org/jira/browse/CALCITE-3573
 Project: Calcite
  Issue Type: Task
  Components: avatica
Reporter: Francis Chuang
Assignee: Francis Chuang
 Fix For: avatica-1.16.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Re: Re: Volcano's problem with trait propagation: current state and future

2019-12-05 Thread Haisheng Yuan
Oh, I forgot to mention that the join planning/reordering is not a big problem. 
Calcite just uses the rule to generate a single alternative plan, which is not 
ideal, but we can't say Calcite is doing it wrong.

Ideally, we expect it to generate multiple (neither all, nor just a single one) bipartite 
graphs. The join reordering rule would cut each part into bipartite graphs recursively 
and apply the JoinCommutativity rule to generate more alternatives for each RelSet. 
It is just a different strategy. We can modify the rule, or create a new join 
reordering rule, to generate multiple plan alternatives.
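
For reference, a sketch of the rules involved (Calcite 1.21-era rule names; this is 
illustrative wiring only, not any engine's actual planner setup). The MultiJoin-based 
pair is the "single alternative" strategy mentioned above, while commute/associate 
explore many orders per RelSet:
```
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.rules.JoinAssociateRule;
import org.apache.calcite.rel.rules.JoinCommuteRule;
import org.apache.calcite.rel.rules.JoinToMultiJoinRule;
import org.apache.calcite.rel.rules.LoptOptimizeJoinRule;

/** Sketch only: contrasts the two join-order strategies. */
public class JoinReorderingRules {
  public static void main(String[] args) {
    RelOptPlanner planner = new VolcanoPlanner();

    // Heuristic reordering: collapse the join tree into a MultiJoin,
    // then pick a single order heuristically.
    planner.addRule(JoinToMultiJoinRule.INSTANCE);
    planner.addRule(LoptOptimizeJoinRule.INSTANCE);

    // Exhaustive exploration: commute/associate generate many alternatives.
    planner.addRule(JoinCommuteRule.INSTANCE);
    planner.addRule(JoinAssociateRule.INSTANCE);
  }
}
```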

- Haisheng

--
From: Haisheng Yuan
Date: 2019-12-06 09:07:43
To: Vladimir Ozerov; dev@calcite.apache.org (dev@calcite.apache.org)
Subject: Re: Re: Volcano's problem with trait propagation: current state and future

I generally agree with what Vladimir said. I think what Calcite has is logical 
optimization or exploration; there is almost no physical optimization, and Calcite 
leaves it to third-party implementors. One of my friends in the University of 
Wisconsin–Madison database research group told me that they gave up the idea of 
using Calcite in their project for this reason.

Currently physical properties are requested in implementation rules, or even 
logical exploration rules. But each rule is independent, and the pattern-matched 
expression is not aware of what the parent operator wants. Using 
AbstractConverter doesn't help, and is not promising.

>> You shouldn't register all logical rules in the planner simultaneously,... 
>> as Drill does.
That is because Calcite does too much redundant or duplicate rule matching, 
like all kinds of transposes (which can't be avoided due to the current design) and 
matching physical operators.

>> decoupling the logical planning from the physical one looks a bit weird to 
>> me because it violates the idea of Cascades framework.
The Orca optimizer fully adopted the design principles of the Cascades framework, except 
that it separates optimization into 3 phases: logical exploration, physical 
implementation, and optimization (property enforcing). And it might be easier to 
enable parallel optimization by separating it into 3 phases. Orca does 
branch-and-bound in the optimization phase, before actual property derivation and 
enforcement, IIRC. It is highly efficient, works pretty well, and is 
battle-tested by many large financial and insurance companies.

In my last thread about on-demand trait requests, I gave a high-level general 
API for physical operators to derive and require physical properties, which is 
similar to Orca's design. But it seems like the proposed API change gets no 
love.

- Haisheng

--
From: Vladimir Ozerov
Date: 2019-12-05 22:22:43
To: dev@calcite.apache.org (dev@calcite.apache.org)
Subject: Re: Volcano's problem with trait propagation: current state and future

AbstractConverter-s are attractive because they effectively emulate
straightforward recursive top-down optimization (Volcano/Cascades). But
instead of doing it with a recursive method call, which preserves the
context, we do this in Calcite as a sequence of unrelated rule calls, thus
losing the context. So with my current understanding, it could be thought
of not as a search space explosion, but rather than the inefficient
implementation of an otherwise straightforward parent->child->parent
navigation, since we achieve this navigation indirectly through the rule
queue, rather than through a normal method call. In any case, the net
result is wasted CPU. Perhaps this is not exponential waste, but some
multiplication of otherwise optimal planning. As I mentioned, in our
experiments with TPC-H, we observed a constant factor of 6-9x between
the number of joins and the number of join implementation rule invocations.
It doesn't grow past 9 even for complex queries, so I hope that this is
not exponential :-)

Speaking of logical vs physical optimization, IMO it makes sense in some
cases. E.g. when doing predicate pushdown, you do not want to consider
intermediate logical tree states for implementation, until the predicate
reaches its final position. So running a separate logical planning phase with
the Volcano optimizer makes total sense to me, because it effectively prunes a
lot of non-optimal logical plans before they reach the physical planning
stage. The real problem to me is that we are forced to remove join planning
from the physical optimization stage. Because the goal of join planning is not
to generate a single optimal plan, like with predicate pushdown, but rather
to generate a set of logical plans, all of which should be implemented and
estimated. And with AbstractConverter-s this is not possible because
their multiplier increases the rate of search space growth, making join
planning inapplicable even for a small number of relations. So we have to
move them to the logical planning stage and pick only one permutation for
physical planning.

[jira] [Created] (CALCITE-3572) Lack of Hints information when serialization and deserialization

2019-12-05 Thread xzh_dz (Jira)
xzh_dz created CALCITE-3572:
---

 Summary: Lack of Hints information when serialization and 
deserialization
 Key: CALCITE-3572
 URL: https://issues.apache.org/jira/browse/CALCITE-3572
 Project: Calcite
  Issue Type: Wish
Reporter: xzh_dz


When I try to serialize and deserialize hints as below:
{code:java}
@Test public void testExplainTerms() {
  String sql = "select /*+ resource(mem='20Mb') */ empno, ename from emp";
  final RelNode rel = tester.convertSqlToRel(sql).rel;
  RelJsonWriter jsonWriter = new RelJsonWriter();
  if (rel instanceof Project) {
    rel.explain(jsonWriter);
    final String relJson = jsonWriter.asString();
    String s = deserializeAndDumpToTextFormat(getSchema(rel), relJson);
    System.out.println(s);
  }
}
{code}
 

 
{code:java}
{
  "rels": [
    {
      "id": "0",
      "relOp": "LogicalTableScan",
      "table": [
        "CATALOG",
        "SALES",
        "EMP"
      ],
      "inputs": []
    },
    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": [
        "EMPNO",
        "ENAME"
      ],
      "exprs": [
        {
          "input": 0,
          "name": "$0"
        },
        {
          "input": 1,
          "name": "$1"
        }
      ]
    }
  ]
}
{code}
I find that the result lacks the hints information; maybe we should support hints 
serialization and deserialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Re: Volcano's problem with trait propagation: current state and future

2019-12-05 Thread Haisheng Yuan
I generally agree with what Vladimir said. I think what Calcite has is logical 
optimization or exploration; there is almost no physical optimization, and Calcite 
leaves it to third-party implementors. One of my friends in the University of 
Wisconsin–Madison database research group told me that they gave up the idea of 
using Calcite in their project for this reason.

Currently physical properties are requested in implementation rules, or even 
logical exploration rules. But each rule is independent, and the pattern-matched 
expression is not aware of what the parent operator wants. Using 
AbstractConverter doesn't help, and is not promising.

>> You shouldn't register all logical rules in the planner simultaneously,... 
>> as Drill does.
That is because Calcite does too much redundant or duplicate rule matching, 
like all kinds of transposes (which can't be avoided due to the current design) and 
matching physical operators.

>> decoupling the logical planning from the physical one looks a bit weird to 
>> me because it violates the idea of Cascades framework.
The Orca optimizer fully adopted the design principles of the Cascades framework, except 
that it separates optimization into 3 phases: logical exploration, physical 
implementation, and optimization (property enforcing). And it might be easier to 
enable parallel optimization by separating it into 3 phases. Orca does 
branch-and-bound in the optimization phase, before actual property derivation and 
enforcement, IIRC. It is highly efficient, works pretty well, and is 
battle-tested by many large financial and insurance companies.

In my last thread about on-demand trait requests, I gave a high-level general 
API for physical operators to derive and require physical properties, which is 
similar to Orca's design. But it seems like the proposed API change gets no 
love.

- Haisheng

--
From: Vladimir Ozerov
Date: 2019-12-05 22:22:43
To: dev@calcite.apache.org (dev@calcite.apache.org)
Subject: Re: Volcano's problem with trait propagation: current state and future

AbstractConverter-s are attractive because they effectively emulate
straightforward recursive top-down optimization (Volcano/Cascades). But
instead of doing it with a recursive method call, which preserves the
context, we do this in Calcite as a sequence of unrelated rule calls, thus
losing the context. So with my current understanding, it could be thought
of not as a search space explosion, but rather than the inefficient
implementation of an otherwise straightforward parent->child->parent
navigation, since we achieve this navigation indirectly through the rule
queue, rather than through a normal method call. In any case, the net
result is wasted CPU. Perhaps this is not exponential waste, but some
multiplication of otherwise optimal planning. As I mentioned, in our
experiments with TPC-H, we observed a constant factor of 6-9x between
the number of joins and the number of join implementation rule invocations.
It doesn't grow past 9 even for complex queries, so I hope that this is
not exponential :-)

Speaking of logical vs physical optimization, IMO it makes sense in some
cases. E.g. when doing predicate pushdown, you do not want to consider
intermediate logical tree states for implementation, until the predicate
reaches its final position. So running a separate logical planning phase with
the Volcano optimizer makes total sense to me, because it effectively prunes a
lot of non-optimal logical plans before they reach the physical planning
stage. The real problem to me is that we are forced to remove join planning
from the physical optimization stage. Because the goal of join planning is not
to generate a single optimal plan, like with predicate pushdown, but rather
to generate a set of logical plans, all of which should be implemented and
estimated. And with AbstractConverter-s this is not possible because
their multiplier increases the rate of search space growth, making join
planning inapplicable even for a small number of relations. So we have to
move them to the logical planning stage and pick only one permutation for
physical planning.


Thu, 5 Dec 2019 at 15:35, Roman Kondakov :

> Vladimir,
>
> thank you for bringing it up. We are facing the same problems in the Apache
> Ignite project
> and it would be great if the Apache Calcite community would propose a
> solution for this
> issue.
>
> From my point of view an approach with abstract converters looks more
> promising, but as
> you mentioned it suffers from polluting the search space. The latter can
> be mitigated by
> splitting the planning stage into several phases: you shouldn't
> register all logical rules in the planner simultaneously - it looks like
> it is better to have several iterations of the planning stage with different
> sets of rules, as Drill does.
>
> Also I'd like to mention that decoupling the logical planning from the
> physical one looks
> a bit weird to me because it violates the idea of Cascades framework.

[jira] [Created] (CALCITE-3571) RelBuilder#shouldMergeProject throws an exception for JOIN with complex conditions

2019-12-05 Thread Kirill Kozlov (Jira)
Kirill Kozlov created CALCITE-3571:
--

 Summary: RelBuilder#shouldMergeProject throws an exception for 
JOIN with complex conditions
 Key: CALCITE-3571
 URL: https://issues.apache.org/jira/browse/CALCITE-3571
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.21.0
Reporter: Kirill Kozlov


Before joins are created, the left and right sides are prepared first. For the 1st 
query above they are as follows:
{code:java}
Left:
LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2)
  LogicalTableScan with RecordType(VARBINARY id, VARCHAR fA1)

Right:
LogicalTableScan with RecordType(VARCHAR id, VARCHAR fB1){code}
As they are processed, they are registered as leaves (added to the array).

 

When the Join node is being created, it knows what the condition expression is:
{code:java}
=(TO_HEX($0), $3)
{code}
Since TO_HEX is not computed anywhere, it modifies the left input to be as 
follows (via RelOptUtil#pushDownJoinConditions), because 
RelBuilder#shouldMergeProject always returns true.

 
{code:java}
LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, 
VARCHAR $f3)
{code}
where `VARCHAR $f3` is the result of TO_HEX. Note that the list of leaves is not 
updated.

[https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L2571]

 

Finally, when the identifier "query.fA1_2" is being converted (via 
SqlToRelConverter#convertIdentifier) for the top-most node:
{code:java}
top-most node:
LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, 
VARCHAR id0, VARCHAR fB1)
  LogicalJoin with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, VARCHAR 
$f3, VARCHAR id0, VARCHAR fB1)
LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, 
VARCHAR $f3)
  LogicalTableScan with RecordType(VARBINARY id, VARCHAR fA1)
LogicalTableScan with RecordType(VARCHAR id, VARCHAR fB1){code}
Blackboard performs a lookup (via SqlToRelConverter#lookupExp), in the process of 
which a LookupContext is created.

In its constructor, LookupContext performs a flatten, which recursively traverses the 
tree of nodes (from the code block above) and checks the leaves to see if they 
contain such an expression. When it gets to the modified left input of the join, 
it does not get a match and continues further down to the TableScan.

When it finally flattens the result, the TableScan's RecordType knows nothing about 
the duplicated field `fA1_2`, causing the error above.

 

I think a viable solution would be to modify Join creation to register the 
resulting join inputs as leaves (when they are modified). An alternative approach 
would be to not merge Projects when the join needs to modify an input.
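
The queries referenced in the description are not included in this digest; a hypothetical 
query of the same shape, inferred from the record types above (table, column, and function 
names are illustrative only), might be:
{code:java}
// Hypothetical query of the shape described above: the left side of the
// equi-join condition is an expression (TO_HEX) rather than a plain column,
// and a field from the left input is projected twice (fA1 and fA1_2).
String sql = "select a.id, a.fA1, a.fA1 as fA1_2, b.id, b.fB1\n"
    + "from A a\n"
    + "join B b on to_hex(a.id) = b.id";
{code}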



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: CsvAdapter (Json content) from String / InputStream

2019-12-05 Thread Andrei Sereda
Please see CALCITE-3560 [1], which was merged recently. You can now instantiate a
Source from a generic CharSource [2]:

Source source = Sources.fromCharSource(CharSource.wrap("hello"));

This should be released with 1.22

[1] https://issues.apache.org/jira/browse/CALCITE-3560

[2]
https://guava.dev/releases/19.0/api/docs/com/google/common/io/CharSource.html
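
A slightly fuller sketch of the same idea (assumes Calcite 1.22+ where
Sources.fromCharSource exists per CALCITE-3560; the JSON content and class name are
illustrative):
```
import com.google.common.io.CharSource;
import com.google.common.io.CharStreams;
import java.io.Reader;
import org.apache.calcite.util.Source;
import org.apache.calcite.util.Sources;

public class InMemorySourceExample {
  public static void main(String[] args) throws Exception {
    // In-memory JSON content; no temporary file is needed.
    String json = "[{\"field\": \"test\", \"properties\": [\"a\", \"b\", \"c\"]}]";
    Source source = Sources.fromCharSource(CharSource.wrap(json));
    // The Source exposes a Reader, which adapters such as JsonEnumerator can consume.
    try (Reader reader = source.reader()) {
      System.out.println(CharStreams.toString(reader));
    }
  }
}
```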

On Tue, Nov 19, 2019 at 1:07 PM Yanna elina 
wrote:

> Hi guys ,
>
> I have another question about this adapter.
> I have a JSON sample like this:
> [
>   {
> "field": "test",
> "properties": [
>   "a",
>   "b",
>   "c"
> ]
>   }
> ]
>
> JsonTable seems to convert an Array to the OTHER column type.
> The column "properties" is converted to JavaType(JAVA.UTIL.ARRAYLIST).
> When I check JsonEnumerator I can see this code: RelDataType
> type = typeFactory.createJavaType(jsonFieldMap.get(key).getClass());
>
> About WHERE / IN: is it supposed to work on JavaType(ArrayList)?
> If I try to run the query "SELECT * FROM TABLE_A WHERE
> properties = any('a')" I always get an exception:
> org.apache.calcite.sql.validate.SqlValidatorException: Values passed to
> IN operator must have compatible type
>
> If it's not supposed to work, I guess I need to re-implement this to
> convert JAVA.UTIL.ARRAYLIST to the SQL ARRAY type, right?
>
> Thanks
>
>
>
> On Mon, 18 Nov 2019 at 19:15, Yanna elina
> wrote:
>
> > Big thanks, guys!!!
> >
> > Even if this CsvAdapter is an "example", it's still really useful :) and
> > with this update it will be a good "out of the box" tool usable in many
> > scenarios/workflows.
> >
> >
> > Yanna
> >
> > On Thu, 14 Nov 2019 at 21:20, Andrei Sereda  wrote:
> >
> >> > I see that this feature request relates to Source.java and
> Sources.java,
> >> which are in org.apache.calcite.util in core.
> >> I'm not planning to change Source.java; it already exposes Reader /
> >> InputStream.
> >>
> >> > If you add some capability, it is fine to add it to the CSV adapter
> >> example, but it is much more important that that capability exists in
> the
> >> file adapter.
> >> I will add to both. The general idea behind this change is that
> currently
> >> CSV / File Adapter(s) require inputs to be File(s) or URL(s) which
> forces
> >> users to create temporary resources manually (when their content is
> already
> >> in-memory). If the input to CSV / File adapter(s) is a generic Readable [1] /
> >> Reader [2] or InputStream it gives more flexibility to users.
> >>
> >> [1] https://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
> >> [2] https://docs.oracle.com/javase/7/docs/api/java/io/Reader.html
> >>
> >>
> >> On Thu, Nov 14, 2019 at 2:45 PM Julian Hyde  wrote:
> >>
> >> > I see that this feature request relates to Source.java and
> Sources.java,
> >> > which are in org.apache.calcite.util in core.
> >> >
> >> > If you add some capability, it is fine to add it to the CSV adapter
> >> > example, but it is much more important that that capability exists in
> >> the
> >> > file adapter.
> >> >
> >> > Julian
> >> >
> >> >
> >> > > On Nov 14, 2019, at 11:36 AM, Andrei Sereda 
> wrote:
> >> > >
> >> > > I think the change is straightforward (will not add complexity).
> >> > >
> >> > > On Thu, Nov 14, 2019 at 1:24 PM Julian Hyde 
> wrote:
> >> > >
> >> > >> Remember that CsvAdapter is in the “example” module. Keep it
> simple.
> >> > >>
> >> > >> The file adapter can also parse CSV files.
> >> > >>
> >> > >> Julian
> >> > >>
> >> > >>
> >> > >>
> >> > >>> On Nov 14, 2019, at 9:40 AM, Andrei Sereda 
> >> wrote:
> >> > >>>
> >> > >>> Hello,
> >> > >>>
> >> > >>> Source object already exposes Reader / InputStream API. Probably
> >> > >>> JsonEnumerator can be changed to use those methods.
> >> > >>>
> >> > >>> Do you mind creating a JIRA ticket ? I'll take a look.
> >> > >>>
> >> > >>> Thanks,
> >> > >>> Andrei.
> >> > >>>
> >> > >>> On Thu, Nov 14, 2019 at 7:45 AM Yanna elina <
> >> > yannaelinasul...@gmail.com>
> >> > >>> wrote:
> >> > >>>
> >> >  Hi guys ,
> >> >  I saw in the code that this nice adapter makes it possible to make
> >> >  SQL queries on JSON data
> >> > 
> >> > 
> >> > >>
> >> >
> >>
> https://github.com/apache/calcite/tree/ab71c4cae5a5c3c7d979337a2d38ddaf271aa206/example/csv/src/main/java/org/apache/calcite/adapter/csv
> >> > 
> >> >  But it's limited to File / URL.
> >> >  The JsonTable constructor accepts only a Source object, and this Source
> >> >  object can be constructed only from a File / URL.
> >> > 
> >> >  It would be nice to have the possibility to create this Source from
> >> >  an InputStream too.
> >> > 
> >> >  Creating a temp file from an InputStream or String can be
> excessive.
> >> > 
> >> >  Thanks
> >> > 
> >> > >>
> >> > >>
> >> >
> >> >
> >>
> >
>


Re: Volcano's problem with trait propagation: current state and future

2019-12-05 Thread Vladimir Ozerov
AbstractConverter-s are attractive because they effectively emulate
straightforward recursive top-down optimization (Volcano/Cascades). But
instead of doing it with a recursive method call, which preserves the
context, we do this in Calcite as a sequence of unrelated rule calls, thus
losing the context. So with my current understanding, it could be thought
of not as a search space explosion, but rather than the inefficient
implementation of an otherwise straightforward parent->child->parent
navigation, since we achieve this navigation indirectly through the rule
queue, rather than through a normal method call. In any case, the net
result is wasted CPU. Perhaps this is not exponential waste, but some
multiplication of otherwise optimal planning. As I mentioned, in our
experiments with TPC-H, we observed a constant factor of 6-9x between
the number of joins and the number of join implementation rule invocations.
It doesn't grow past 9 even for complex queries, so I hope that this is
not exponential :-)

Speaking of logical vs physical optimization, IMO it makes sense in some
cases. E.g. when doing predicate pushdown, you do not want to consider
intermediate logical tree states for implementation, until the predicate
reaches its final position. So running a separate logical planning phase with
the Volcano optimizer makes total sense to me, because it effectively prunes a
lot of non-optimal logical plans before they reach the physical planning
stage. The real problem to me is that we are forced to remove join planning
from the physical optimization stage. Because the goal of join planning is not
to generate a single optimal plan, like with predicate pushdown, but rather
to generate a set of logical plans, all of which should be implemented and
estimated. And with AbstractConverter-s this is not possible because
their multiplier increases the rate of search space growth, making join
planning inapplicable even for a small number of relations. So we have to
move them to the logical planning stage and pick only one permutation for
physical planning.


Thu, 5 Dec 2019 at 15:35, Roman Kondakov :

> Vladimir,
>
> thank you for bringing it up. We are facing the same problems in the Apache
> Ignite project
> and it would be great if the Apache Calcite community would propose a
> solution for this
> issue.
>
>  From my point of view an approach with abstract converters looks more
> promising, but as
> you mentioned it suffers from polluting the search space. The latter can
> be mitigated by
> splitting a planning stage into the several phases: you shouldn't
> register all logical rules in the planner simultaneously - it looks like
> it is better to have several iterations of planning stage with different
> sets of rules, as Drill does.
>
> Also I'd like to mention that decoupling the logical planning from the
> physical one looks
> a bit weird to me because it violates the idea of Cascades framework.
> Possibly this decoupling is the consequence of some performance issues.
>
>
> --
> Kind Regards
> Roman Kondakov
>
> On 05.12.2019 14:24, Vladimir Ozerov wrote:
> > Hi,
> >
> > As I mentioned before, we are building a distributed SQL engine that uses
> > Apache Calcite for query optimization. The key problem we faced is the
> > inability to pull the physical traits of child relations efficiently. I'd
> > like to outline my understanding of the problem (I guess it was already
> > discussed multiple times) and ask the community to prove or disprove the
> > existence of that problem and its severity for the products which use
> > Apache Calcite and ask for ideas on how it could be improved in the
> future.
> >
> > I'll start with the simplified problem description and mention more
> > complex use cases then. Consider that we have a logical tree and a set of
> > implementation rules. Our goal is to find the optimal physical tree by
> > applying these rules. The classical Cascades-based approach directs the
> > optimization process from the top to the bottom (hence "top-down").
> > However, the actual implementation of tree nodes still happens bottom-up.
> > For the tree L1 <- L2, we enter "optimize(L1)", which recursively
> delegates
> > to "optimize(L2)". We then implement children nodes L1 <- [P2', P2''],
> and
> > return back to the parent, which is now able to pick promising
> > implementations of the children nodes and reject bad ones with the
> > branch-and-bound approach. AFAIK Pivotal's Orca works this way.
> >
> > The Apache Calcite is very different because it doesn't allow the
> recursion
> > so that we lose the context on which node requested the child
> > transformation. This loss of context leads to the following problems:
> > 1) The parent node cannot deduce its physical properties during the
> > execution of the implementation rule, because Calcite expects the
> > transformation to be applied before children nodes are implemented. That
> is
> > if we are optimizing LogicalProject <- LogicalScan, we cannot set proper
> > distribution and collation for the to-be-created PhysicalProject, because
> > it depends on the distribution and collation of the children, which is yet
> > to be resolved.

Re: Volcano's problem with trait propagation: current state and future

2019-12-05 Thread Roman Kondakov

Vladimir,

thank you for bringing it up. We are facing the same problems in the Apache 
Ignite project
and it would be great if the Apache Calcite community would propose a 
solution for this issue.

From my point of view an approach with abstract converters looks more 
promising, but as
you mentioned it suffers from polluting the search space. The latter can 
be mitigated by
splitting the planning stage into several phases: you shouldn't 
register all logical rules in the planner simultaneously - it looks like 
it is better to have several iterations of the planning stage with different 
sets of rules, as Drill does.


Also I'd like to mention that decoupling the logical planning from the 
physical one looks
a bit weird to me because it violates the idea of Cascades framework. 
Possibly this decoupling is the consequence of some performance issues.



--
Kind Regards
Roman Kondakov

On 05.12.2019 14:24, Vladimir Ozerov wrote:

Hi,

As I mentioned before, we are building a distributed SQL engine that uses
Apache Calcite for query optimization. The key problem we faced is the
inability to pull the physical traits of child relations efficiently. I'd
like to outline my understanding of the problem (I guess it was already
discussed multiple times) and ask the community to prove or disprove the
existence of that problem and its severity for the products which use
Apache Calcite and ask for ideas on how it could be improved in the future.

I'll start with the simplified problem description and mention more
complex use cases then. Consider that we have a logical tree and a set of
implementation rules. Our goal is to find the optimal physical tree by
applying these rules. The classical Cascades-based approach directs the
optimization process from the top to the bottom (hence "top-down").
However, the actual implementation of tree nodes still happens bottom-up.
For the tree L1 <- L2, we enter "optimize(L1)", which recursively delegates
to "optimize(L2)". We then implement children nodes L1 <- [P2', P2''], and
return back to the parent, which is now able to pick promising
implementations of the children nodes and reject bad ones with the
branch-and-bound approach. AFAIK Pivotal's Orca works this way.

The Apache Calcite is very different because it doesn't allow the recursion
so that we lose the context on which node requested the child
transformation. This loss of context leads to the following problems:
1) The parent node cannot deduce its physical properties during the
execution of the implementation rule, because Calcite expects the
transformation to be applied before children nodes are implemented. That is
if we are optimizing LogicalProject <- LogicalScan, we cannot set proper
distribution and collation for the to be created PhysicalProject, because
it depends on the distribution and collation of the children which is yet
to be resolved.
2) The branch-and-bound cannot be used because it requires at least one
fully-built physical subtree.

As a result of this limitation, products which rely on Apache Calcite for
query optimization, use one or several workarounds:

*1) Guess the physical properties of parent nodes before logical children
are implemented*
*Apache Flink* uses this strategy. The strategy is bad because the
number of combinations of traits grows exponentially with the number of
attributes in the given RelNode, so you either explode the search space or
give up optimization opportunities. Consider the following tree:
LogicalSort[a ASC] <- LogicalFilter <- LogicalScan
The optimal implementation of the LogicalFilter is PhysicalFilter[collation=a
ASC] because it may eliminate the parent sort. But such optimization should
happen only if we know that there is a physical implementation of scan
allowing for this sort order, e.g. PhysicalIndexScan[collation=a ASC]. I.e.
we need to know the child physical properties first. Otherwise we fall back
to speculative approaches. With the *optimistic* approach, we emit all
possible combinations of physical properties, with the hope that the child
will satisfy some of them, thus expanding the search space exponentially.
With the *pessimistic* approach, we just miss this optimization opportunity
even if the index exists. Apache Flink uses the pessimistic approach.

*2) Use AbstractConverters*
*Apache Drill* uses this strategy. The idea is to "glue" logical and
physical operators, so that implementation of a physical child re-triggers
implementation rule of a logical parent. The flow is as follows:
- Invoke parent implementation rule - it either doesn't produce new
physical nodes or produces non-optimized physical nodes (like in the Apache
Flink case)
- Invoke children implementation rules and create physical children
- Then converters kick in and re-trigger the parent implementation rule through
the creation of an abstract converter
- Finally, the parent implementation rule is fired again and now it
produces optimized node(s) since at least some of the physical
distributions of children nodes are implemented.

Volcano's problem with trait propagation: current state and future

2019-12-05 Thread Vladimir Ozerov
Hi,

As I mentioned before, we are building a distributed SQL engine that uses
Apache Calcite for query optimization. The key problem we faced is the
inability to pull the physical traits of child relations efficiently. I'd
like to outline my understanding of the problem (I guess it was already
discussed multiple times) and ask the community to prove or disprove the
existence of that problem and its severity for the products which use
Apache Calcite and ask for ideas on how it could be improved in the future.

I'll start with the simplified problem description and mention more
complex use cases then. Consider that we have a logical tree and a set of
implementation rules. Our goal is to find the optimal physical tree by
applying these rules. The classical Cascades-based approach directs the
optimization process from the top to the bottom (hence "top-down").
However, the actual implementation of tree nodes still happens bottom-up.
For the tree L1 <- L2, we enter "optimize(L1)", which recursively delegates
to "optimize(L2)". We then implement children nodes L1 <- [P2', P2''], and
return back to the parent, which is now able to pick promising
implementations of the children nodes and reject bad ones with the
branch-and-bound approach. AFAIK Pivotal's Orca works this way.

The Apache Calcite is very different because it doesn't allow the recursion
so that we lose the context on which node requested the child
transformation. This loss of context leads to the following problems:
1) The parent node cannot deduce its physical properties during the
execution of the implementation rule, because Calcite expects the
transformation to be applied before children nodes are implemented. That is
if we are optimizing LogicalProject <- LogicalScan, we cannot set proper
distribution and collation for the to-be-created PhysicalProject, because
it depends on the distribution and collation of the children which is yet
to be resolved.
2) The branch-and-bound cannot be used because it requires at least one
fully-built physical subtree.

As a result of this limitation, products which rely on Apache Calcite for
query optimization, use one or several workarounds:

*1) Guess the physical properties of parent nodes before logical children
are implemented*
*Apache Flink* uses this strategy. The strategy is bad because the
number of combinations of traits grows exponentially with the number of
attributes in the given RelNode, so you either explode the search space or
give up optimization opportunities. Consider the following tree:
LogicalSort[a ASC] <- LogicalFilter <- LogicalScan
The optimal implementation of the LogicalFilter is PhysicalFilter[collation=a
ASC] because it may eliminate the parent sort. But such optimization should
happen only if we know that there is a physical implementation of scan
allowing for this sort order, e.g. PhysicalIndexScan[collation=a ASC]. I.e.
we need to know the child physical properties first. Otherwise we fall back
to speculative approaches. With the *optimistic* approach, we emit all
possible combinations of physical properties, with the hope that the child
will satisfy some of them, thus expanding the search space exponentially.
With the *pessimistic* approach, we just miss this optimization opportunity
even if the index exists. Apache Flink uses the pessimistic approach.

*2) Use AbstractConverters*
*Apache Drill* uses this strategy. The idea is to "glue" logical and
physical operators, so that implementation of a physical child re-triggers
implementation rule of a logical parent. The flow is as follows:
- Invoke parent implementation rule - it either doesn't produce new
physical nodes or produces non-optimized physical nodes (like in the Apache
Flink case)
- Invoke children implementation rules and create physical children
- Then converters kick in and re-trigger the parent implementation rule through
the creation of an abstract converter
- Finally, the parent implementation rule is fired again and now it
produces optimized node(s) since at least some of the physical
distributions of children nodes are implemented.

Note that this is essentially a hack to simulate the Cascades flow! The
problem is that AbstractConverters increase the complexity of planning
because they do not have any context, so parent rules are just re-triggered
blindly. Consider the optimization of the following tree:
LogicalJoin <- [LogicalScan1, LogicalScan2]
With the converter approach, the join implementation rule will be fired at
least 3 times, while in reality, one call should be sufficient. In our
experiments with TPC-H queries, the join rule implemented that way is
typically called 6-9 times more often than expected.
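
For readers unfamiliar with the mechanism, here is a rough sketch of how the
AbstractConverter workaround is typically wired up (illustrative only, not Drill's
actual code; the class names are from Calcite's volcano package):
```
import org.apache.calcite.plan.ConventionTraitDef;
import org.apache.calcite.plan.volcano.AbstractConverter;
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.RelCollationTraitDef;

/** Sketch only: enabling abstract-converter expansion in the VolcanoPlanner. */
public class AbstractConverterSetup {
  public static void main(String[] args) {
    VolcanoPlanner planner = new VolcanoPlanner();
    // Register the trait definitions whose conversions should be considered.
    planner.addRelTraitDef(ConventionTraitDef.INSTANCE);
    planner.addRelTraitDef(RelCollationTraitDef.INSTANCE);
    // ExpandConversionRule turns AbstractConverters into concrete conversions;
    // registering it (plus the trait defs above) enables the "glue" behaviour
    // described in this thread.
    planner.addRule(AbstractConverter.ExpandConversionRule.INSTANCE);
  }
}
```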

*3) Transformations (i.e. logical optimization) are decoupled from
implementation (i.e. physical optimization)*
Normally, you would like to mix both logical and physical rules in a single
optimization program, because it is required for proper planning. That is,
you should consider both (Ax(BxC)) and ((AxB

Re: [VOTE] [CALCITE-3559] Drop HydromaticFileSetCheck, upgrade Checkstyle

2019-12-05 Thread Vladimir Sitnikov
Michael> -0 because I haven't found checkstyle violations
Michael> and I don't like the idea of losing checks which are currently in
Michael> place

Would you please re-evaluate? It seems like you did not know which checks
are there and which are not.

To the best of my knowledge, the only missing check is "Javadoc line too long ({0}
chars)".
However, Checkstyle already has a "Line is longer than 100 characters"
warning, so I see no reason why an extra javadoc-related check is needed.

All the rest are there, and most of the checks are strictly better now.

Can we just commit it and move forward?

Vladimir