Re: Summary of experience using Calcite

2022-08-04 Thread Vladimir Ozerov
Hi Sandeep,

Thank you for sharing your experience. Your feedback matches what I
usually see in practice. IMO there are two main problems with Apache
Calcite:

   1. Documentation that focuses on a single edge case - the end-to-end
   execution of federated queries with Enumerable. In practice, this
   approach can rarely be used for anything but simple projects. Instead,
   major Apache Calcite users (Apache Hive, Apache Flink, etc.) do what you
   mentioned as "unconventional use": they use separate Calcite components for
   query optimization, translation, DSLs, etc. And there is virtually no
   documentation on how to use them. So your use case is not unusual; there
   are just no docs for it.
   2. Another big problem is stability and quality. There are many bugs in
   optimization rules, rel-to-sql conversion, etc. When doing a sophisticated
   Apache Calcite integration, you should be ready to find creative
   workarounds for various bugs: method overrides, unusual class hierarchies,
   copy-pasting of the Apache Calcite codebase, etc.

Unfortunately, there is no major commercial vendor behind the community
that systematically invests in documentation and quality, so it is what it
is.

However, if you manage to get through these difficulties, you will get an
extremely powerful tool, and I am personally not aware of any other project
that comes even close to Apache Calcite in query optimization and translation
capabilities. This is why there are more and more database and data
management projects that choose Apache Calcite as the backbone of their
query processing.

If your use case is simple, then perhaps Apache Calcite might be too
heavyweight. However, if you plan to make your system more complicated, I
would still recommend keeping Apache Calcite in mind.

Regards,
Vladimir.

Thu, Aug 4, 2022 at 04:42, Sandeep N :

> Hi all,
>
> I wanted to share my experience from one of my current projects in which I
> tried to use Calcite, and through this experience I hope to shed light on
> some of the challenges I ran into and share my thoughts on how they could
> be addressed.
>
> Apologies for the long-drawn explanation, but I think the context will help
> you understand the challenges I hope to surface.
>
> For the tl;dr version, jump to the paragraph with the heading "In Summary".
>
> Let me first note that my use of Calcite is unconventional in the sense
> that I was looking to convert queries written in a homegrown DSL to SQL
> that could be run on Snowflake. There were two main challenges: (a)
> mapping the DSL to SQL and (b) generating SQL compliant with Snowflake,
> with support for custom data types like VARIANT.
>
> The approach I landed on, based on feedback from this forum, was to take the
> AST from the DSL and use a tree walker to generate a relational algebra
> tree, which can then generate SQL using one of the dialects (Snowflake to be
> added).
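>
> A rough sketch of that rel-to-SQL step (illustrative, not our exact code;
> it assumes a RelNode and the target SqlDialect are already in hand):
>
> // Convert the relational algebra tree to a dialect-specific SQL string.
> RelToSqlConverter converter = new RelToSqlConverter(dialect);
> SqlNode sqlNode = converter.visitRoot(relNode).asStatement();
> String sql = sqlNode.toSqlString(dialect).getSql();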
>
> The first task involved asking questions on the dev forum and reading
> through previously asked questions. Here I have to extend thanks to
> Stamatis for the pointers and links he provided, which helped immensely. In
> the end, it took those pointers plus step-through debugging of the code via
> unit tests to figure out how to use the relational algebra tree (RelNode and
> RexNode).
>
> The second task was to add Snowflake-specific constructs, and here again,
> thanks to Stamatis and Julian, I had pointers to start looking at the
> pieces. But before I went down the path of extending relational algebra or
> figuring out how to add Snowflake functions, I shared my prototype work with
> my team and talked through the approach and the Calcite integration. My goal
> was to get a general sense of whether my team thought this was a good path
> forward (some of the reasons for this will be obvious in the In Summary
> section).
>
> A few of the team members looked at extending my prototype with Calcite,
> and after a couple of weeks of working with it, we as a team arrived at the
> decision to hold off on using Calcite and instead build something simpler
> and homegrown. "Simpler" because the types of queries we needed to
> translate to SQL were very limited and did not exercise all the possible
> types of SQL queries. We may revisit this in the future and come back to
> Calcite.
>
> In Summary: The feedback I have is as follows.
> (a) Calcite does not have enough examples, docs, and how-to guides, so it is
> difficult to get started. For this one, I think our unconventional usage of
> Calcite added to the complexity, as the samples might not directly map to
> what we were looking for - one needs to extrapolate from what is done for
> SQL and apply the same to another DSL.
>
> (b) The learning curve for Calcite is steep. It takes a lot of time and
> patience to debug through the code and understand Calcite before using it.
> It takes a certain mindset of "don't give up" and "must figure this out" to
> make headway. My experience with some of the technologies (having written

Re: Default operator override in Apache Calcite

2022-07-14 Thread Vladimir Ozerov
> "As the ’Std’ indicates, it is intended to contain the operators that are
defined by the SQL standard. And which presumably would be enabled in any
configuration."

Makes sense from the operator availability perspective. However, the SQL
standard mostly silent on how types should be inferred and which
operand combinations are valid or not. This is the key problem: MySQL and
Postgres both have PLUS operator, but their behavior is pretty different.
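
For example, as far as I remember, MySQL evaluates "intCol + varcharCol" by
coercing the string to a number, while Postgres rejects such an expression
outright because no matching operator exists.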

Thu, Jul 14, 2022 at 10:14, Vladimir Ozerov :

> Hi Julian,
>
> The motivation for the proposal is a requirement to change the way some
> operators, like PLUS or SUM, do type inference and validation. I
> understand your concern about the complexity. How would you suggest
> overriding the behavior of PLUS/SUM in a downstream project, provided that
> the static operators from the SqlStdOperatorTable are injected
> uncontrollably at any optimization stage?
>
> Regards,
> Vladimir.
>
Mon, Jul 11, 2022 at 22:42, Julian Hyde :
>
>> Remember the rule of Chesterton’s fence: before you change something you
>> must know why it exists the way it is. Since you don’t cover it in your
>> email, let’s review the purpose of SqlStdOperatorTable.
>>
>> As the ’Std’ indicates, it is intended to contain the operators that are
>> defined by the SQL standard. And which presumably would be enabled in any
>> configuration.
>>
>> Because these operators are standard, they are resolved statically (by
>> referencing static field names such as SqlStdOperatorTable.AS) and can
>> therefore be used in locations without an operator table, such as the
>> parser.
>>
>> For many years SqlStdOperatorTable was the default location for any new
>> operator, and a lot of crap ended up there. I have been trying to restore
>> its old purpose by encouraging people to use SqlInternalOperatorTable (for
>> operators that are not visible in SQL but are used internally) and
>> SqlLibraryOperators (which are enabled for one or more libraries, e.g. for
>> Oracle or BigQuery compatibility).
>>
>> The alternative to static resolution is lookup by name/syntax and
>> overload resolution. This is performed by the validator and is complex and
>> expensive.
>>
>> You are proposing to add a third means of lookup, perhaps a middle way
>> that is more flexible than static lookup but not too complex or expensive.
>> However, it is a third way, doesn’t remove the existing two ways, and
>> therefore will probably make things more complex, albeit more customizable.
>>
>> I think you should start the discussion with the problem(s) you seek to
>> solve. There may be other solutions. For example, we wanted to customize
>> how the result type of standard aggregate functions (e.g. AVG) is derived
>> [1], and one solution would have been to replace the AVG operator, but we
>> instead added the policy to TypeSystem.
>>
>> There is merit to your suggestion of building on, and aligning with,
>> SqlKind. I added it because I wanted to make simple things simple (i.e.
>> prioritize the 80% case). However, its granularity is sometimes too
>> coarse. An instance of that arrived just today [2].
>>
>> Julian
>>
>> [1] https://issues.apache.org/jira/browse/CALCITE-1945 <
>> https://issues.apache.org/jira/browse/CALCITE-1945>
>>
>> [2] https://issues.apache.org/jira/browse/CALCITE-5207 <
>> https://issues.apache.org/jira/browse/CALCITE-5207>
>>
>>
>>
>> > On Jul 10, 2022, at 9:41 PM, Yanjing Wang 
>> wrote:
>> >
>> > +1. Thanks, Vladimir, for pointing out the pain points; the solution
>> > looks cleaner and more extensible.
>> >
>> > Vladimir Ozerov  wrote on Sat, Jul 9, 2022 at 15:01:
>> >
>> >> Hi,
>> >>
>> >> Apache Calcite has a powerful but complicated and undocumented function
>> >> library. Some projects may require overriding some of the existing
>> >> operators to introduce custom type deduction, custom validation, etc.
>> >> This includes the base arithmetic functions (e.g., disallow INT +
>> >> VARCHAR), aggregate functions (e.g., custom precision extension), etc.
>> >>
>> >> One convenient way of doing this is to re-define the function in your
>> >> custom operator table. However, this doesn't work because the Apache
>> >> Calcite core uses direct references to SqlStdOperatorTable. It starts
>> >> with the parser [1] and validator [2]. If you manage to inject your
>> >> functions at this stage (e.

Re: Default operator override in Apache Calcite

2022-07-14 Thread Vladimir Ozerov
Hi Julian,

The motivation for the proposal is a requirement to change the way some
operators, like PLUS or SUM, do type inference and validation. I
understand your concern about the complexity. How would you suggest
overriding the behavior of PLUS/SUM in a downstream project, provided that
the static operators from the SqlStdOperatorTable are injected
uncontrollably at any optimization stage?

Regards,
Vladimir.

Mon, Jul 11, 2022 at 22:42, Julian Hyde :

> Remember the rule of Chesterton’s fence: before you change something you
> must know why it exists the way it is. Since you don’t cover it in your
> email, let’s review the purpose of SqlStdOperatorTable.
>
> As the ’Std’ indicates, it is intended to contain the operators that are
> defined by the SQL standard. And which presumably would be enabled in any
> configuration.
>
> Because these operators are standard, they are resolved statically (by
> referencing static field names such as SqlStdOperatorTable.AS) and can
> therefore be used in locations without an operator table, such as the
> parser.
>
> For many years SqlStdOperatorTable was the default location for any new
> operator, and a lot of crap ended up there. I have been trying to restore
> its old purpose by encouraging people to use SqlInternalOperatorTable (for
> operators that are not visible in SQL but are used internally) and
> SqlLibraryOperators (which are enabled for one or more libraries, e.g. for
> Oracle or BigQuery compatibility).
>
> The alternative to static resolution is lookup by name/syntax and overload
> resolution. This is performed by the validator and is complex and expensive.
>
> You are proposing to add a third means of lookup, perhaps a middle way
> that is more flexible than static lookup but not too complex or expensive.
> However, it is a third way, doesn’t remove the existing two ways, and
> therefore will probably make things more complex, albeit more customizable.
>
> I think you should start the discussion with the problem(s) you seek to
> solve. There may be other solutions. For example, we wanted to customize
> how the result type of standard aggregate functions (e.g. AVG) is derived
> [1], and one solution would have been to replace the AVG operator, but we
> instead added the policy to TypeSystem.
>
> There is merit to your suggestion of building on, and aligning with,
> SqlKind. I added it because I wanted to make simple things simple (i.e.
> prioritize the 80% case). However, its granularity is sometimes too
> coarse. An instance of that arrived just today [2].
>
> Julian
>
> [1] https://issues.apache.org/jira/browse/CALCITE-1945 <
> https://issues.apache.org/jira/browse/CALCITE-1945>
>
> [2] https://issues.apache.org/jira/browse/CALCITE-5207 <
> https://issues.apache.org/jira/browse/CALCITE-5207>
>
>
>
> > On Jul 10, 2022, at 9:41 PM, Yanjing Wang 
> wrote:
> >
> > +1. Thanks, Vladimir, for pointing out the pain points; the solution
> > looks cleaner and more extensible.
> >
> > Vladimir Ozerov  wrote on Sat, Jul 9, 2022 at 15:01:
> >
> >> Hi,
> >>
> >> Apache Calcite has a powerful but complicated and undocumented function
> >> library. Some projects may require overriding some of the existing
> >> operators to introduce custom type deduction, custom validation, etc.
> >> This includes the base arithmetic functions (e.g., disallow INT +
> >> VARCHAR), aggregate functions (e.g., custom precision extension), etc.
> >>
> >> One convenient way of doing this is to re-define the function in your
> >> custom operator table. However, this doesn't work because the Apache
> >> Calcite core uses direct references to SqlStdOperatorTable. It starts
> >> with the parser [1] and validator [2]. If you manage to inject your
> >> functions at this stage (e.g., using a custom validator implementation
> >> or a custom SqlVisitor), the sql-to-rel converter will still overwrite
> >> your functions [3]. And even when you get the RelNode, optimization
> >> rules would silently replace your custom functions with the default
> >> ones [4].
> >>
> >> Alternatively, you may try extending some base interface, such as the
> >> TypeCoercion, but this doesn't give fine-grained control over the
> >> function behavior, because you have to retain the existing function
> >> definitions to make coercion work.
> >>
> >> A better solution might be to abstract out the function references
> >> through some sort of "factory"/"resolver", somewhat similar to the one
> >> used to resolve user-provided operators. For instance, th

Default operator override in Apache Calcite

2022-07-09 Thread Vladimir Ozerov
Hi,

Apache Calcite has a powerful but complicated and undocumented function
library. Some projects may require overriding some of the existing
operators to introduce custom type deduction, custom validation, etc. This
includes the base arithmetic functions (e.g., disallow INT + VARCHAR),
aggregate functions (e.g., custom precision extension), etc.

One convenient way of doing this is to re-define the function in your
custom operator table. However, this doesn't work because the Apache Calcite
core uses direct references to SqlStdOperatorTable. It starts with the
parser [1] and validator [2]. If you manage to inject your functions at
this stage (e.g., using a custom validator implementation or a custom
SqlVisitor), the sql-to-rel converter will still overwrite your functions
[3]. And even when you get the RelNode, optimization rules would
silently replace your custom functions with the default ones [4].

Alternatively, you may try extending some base interface, such as the
TypeCoercion, but this doesn't give fine-grained control over the function
behavior, because you have to retain the existing function definitions to
make coercion work.

A better solution might be to abstract out the function references
through some sort of "factory"/"resolver", somewhat similar to the one used
to resolve user-provided operators. For instance, the user may pass an
optional desired operator table to the parser/validator/converter configs and
RelOptCluster. Then the "core" codebase could be refactored to dereference
functions by SqlKind instead of SqlStdOperatorTable. If some required
function types are missing from the SqlKind enum, we can add them. The
default behavior would delegate to SqlStdOperatorTable, so the existing
apps would not be affected.
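
To make the idea concrete, the resolver could be as small as the following
sketch (the interface and method names here are hypothetical, not an
existing Calcite API):

interface SqlOperatorResolver {
  // Returns the operator implementation to use for the given kind,
  // e.g. SqlKind.PLUS; the default would delegate to SqlStdOperatorTable.
  SqlOperator resolve(SqlKind kind);
}

The "core" would then call resolver.resolve(SqlKind.PLUS) instead of
referencing SqlStdOperatorTable.PLUS directly, and downstream projects would
supply their own resolver through the configs mentioned above.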

A "small" problem is that there are ~1500 usages of the SqlStdOperatorTable
in the "core" module, but most of the usages are very straightforward to
replace.

This way, we would ensure consistent function resolution throughout all
query optimization phases. WDYT?

Regards,
Vladimir.

[1]
https://github.com/apache/calcite/blob/calcite-1.30.0/core/src/main/codegen/templates/Parser.jj#L7164
[2]
https://github.com/apache/calcite/blob/calcite-1.30.0/core/src/main/java/org/apache/calcite/sql/validate/SqlValidatorImpl.java#L6788
[3]
https://github.com/apache/calcite/blob/calcite-1.30.0/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1438
[4]
https://github.com/apache/calcite/blob/calcite-1.30.0/core/src/main/java/org/apache/calcite/rel/rules/AggregateReduceFunctionsRule.java#L357


Re: [ANNOUNCE] Vladimir Ozerov joins Calcite PMC

2022-06-06 Thread Vladimir Ozerov
Hi everybody,

Thank you very much!

Fri, May 27, 2022 at 05:27, Yanjing Wang :

> Congrats Vladimir!
>
> Forward Xu  wrote on Wed, May 25, 2022 at 15:18:
>
> > Congratulations Vladimir!
> >
> >
> > Best,
> >
> > ForwardXu
> >
> > Stamatis Zampetakis  wrote on Wed, May 25, 2022 at 15:05:
> >
> > > Congratulations Vladimir. You have shown that you care for the project
> in
> > > many different ways and it's only natural to see you in the PMC.
> > >
> > > Best,
> > > Stamatis
> > >
> > > On Wed, May 25, 2022 at 8:33 AM Haisheng Yuan 
> wrote:
> > >
> > > > Congratulations Vladimir!
> > > >
> > > > On 2022/05/25 06:27:16 Michael Mior wrote:
> > > > > Congratulations Vladimir!
> > > > >
> > > > > --
> > > > > Michael Mior
> > > > > mm...@apache.org
> > > > >
> > > > >
> > > > > On Tue, May 24, 2022 at 16:47, Ruben Q L  wrote:
> > > > >
> > > > > > I am pleased to announce that Vladimir has accepted an invitation
> > to
> > > > join
> > > > > > the Calcite PMC. Vladimir has been a consistent and helpful
> figure
> > in
> > > > the
> > > > > > Calcite community for which we are very grateful. We look forward
> > to
> > > > the
> > > > > > continued contributions and support.
> > > > > >
> > > > > > Please join me in congratulating Vladimir!
> > > > > >
> > > > > > - Ruben (on behalf of the Calcite PMC)
> > > > > >
> > > > >
> > > >
> > >
> >
>


[jira] [Created] (CALCITE-5175) IndexOutOfBoundsException when query has IN clause on top of the view column

2022-06-01 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-5175:


 Summary: IndexOutOfBoundsException when query has IN clause on top 
of the view column
 Key: CALCITE-5175
 URL: https://issues.apache.org/jira/browse/CALCITE-5175
 Project: Calcite
  Issue Type: Improvement
  Components: core
Reporter: Vladimir Ozerov


A query may contain views. Currently, views are processed using separate 
validator/converter instances. Both the validator and the sql-to-rel converter 
are stateful, and it appears that their state might be important for the 
"parent" validator/converter pair.

To reproduce the problem, please add the code below to the {{PlannerTest}}. The 
query with {{IN}} fails with {{IndexOutOfBoundsException}} because we 
treat the {{IN}} clause as a subquery that requires access to the context of 
the child view, which is not propagated when separate validator/converter 
instances are used for views.

{code}
  @Test void testViewnViewWithIn() throws Exception {
    final String sql = "select (dname in ('a', 'b')), count(deptno) "
        + "FROM dept30 group by (dname in ('a', 'b'))";
    final String expected = "LogicalProject(DEPTNO=[$0], DNAME=[$1])\n"
        + "  LogicalValues("
        + "tuples=[[{ 10, 'Sales      ' },"
        + " { 20, 'Marketing  ' },"
        + " { 30, 'Engineering' },"
        + " { 40, 'Empty      ' }]])\n";
    checkView(sql, is(expected));
  }

  @Test void testViewnViewWithOr() throws Exception {
    final String sql = "select (dname = 'a' or dname = 'b'), count(deptno) "
        + "FROM dept30 group by (dname = 'a' or dname = 'b')";
    final String expected = "LogicalProject(DEPTNO=[$0], DNAME=[$1])\n"
        + "  LogicalValues("
        + "tuples=[[{ 10, 'Sales      ' },"
        + " { 20, 'Marketing  ' },"
        + " { 30, 'Engineering' },"
        + " { 40, 'Empty      ' }]])\n";
    checkView(sql, is(expected));
  }
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (CALCITE-5174) EnumerableHashJoin filters out NULL values for "IS NOT DISTINCT FROM"

2022-06-01 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-5174:


 Summary: EnumerableHashJoin filters out NULL values for "IS NOT 
DISTINCT FROM"
 Key: CALCITE-5174
 URL: https://issues.apache.org/jira/browse/CALCITE-5174
 Project: Calcite
  Issue Type: Improvement
Reporter: Vladimir Ozerov


The {{Join}} operator constructs the join info from the condition using 
{{JoinInfo.of}}. This method collects information about whether {{NULL}} values 
should be filtered out for a given pair of left/right keys. However, this 
information is not exposed from the {{Join}} operator and is also not used by 
the {{EnumerableHashJoin}}, yielding incorrect results for queries like {{ON a 
IS NOT DISTINCT FROM b}}.

The problem is pretty important, because {{IS NOT DISTINCT FROM}} may 
silently appear in the relational tree due to various simplifications and 
optimizations.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: Changes to the rule pattern interface

2022-04-15 Thread Vladimir Ozerov
;   exposing the concrete builders that Immutables generates but for
> at
> > > > least
> > > > >   the initial changes, I wanted to avoid exposing these as new
> public
> > > > >   interfaces (and thus all the immutables classes are marked
> package
> > > > private).
> > > > >   4. Merging construction and use feels like an anti-pattern (my
> > > > >   subjective opinion). The most common patterns I've seen treat a
> > > builder
> > > > >   separate from an immutable constructed object (and have something
> > > akin
> > > > to a
> > > > >   toBuilder() method to take an immutable config back to a
> builder).
> > In
> > > > the
> > > > >   Calcite config, these two concepts are merged. In some ways it
> > makes
> > > > things
> > > > >   simpler for trivial cases. However,  it is less formal and causes
> > > pain
> > > > when
> > > > >   you have required properties that have no defaults since there is
> > no
> > > > formal
> > > > >   model for "I'm done building, now check everything is complete".
> > This
> > > > means
> > > > >   in several places we have to put default values that are actually
> > > > invalid
> > > > >   rather than just rely on a builder's build validation step.
> > > > >
> > > > > One other note, I think if someone is working in Java 14+ (records)
> > or
> > > > > Kotlin, there are also several easy ways to produce config impls
> that
> > > are
> > > > > easy to use (without proxies and/or immutables).
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Apr 12, 2022 at 9:29 AM Thomas Rebele
> > >  > > > >
> > > > > wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> The reasons for the planner rules configuration can be found here:
> > > > >> CALCITE-3923 <https://issues.apache.org/jira/browse/CALCITE-3923
> >.
> > > See
> > > > >> also
> > > > >> the email thread [DISCUSS] Refactor how planner rules are
> > > parameterized
> > > > >> <
> > > > >>
> > > >
> > >
> >
> https://lists.apache.org/thread.html/rfdf6f9b7821988bdd92b0377e3d293443a6376f4773c4c658c891cf9%40%3Cdev.calcite.apache.org%3E
> > > > >>>
> > > > >> .
> > > > >>
> > > > >> Cordialement / Best Regards,
> > > > >> *Thomas Rebele, PhD* | R Developer | Germany | www.tibco.com
> > > > >>
> > > > >>
> > > > >> On Tue, Apr 12, 2022 at 6:10 PM Gavin Ray 
> > > > wrote:
> > > > >>
> > > > >>> I don't have any weight behind my opinion or experience,
> > > > >>> but anything that lowers the barrier to entry to Calcite for
> > > newcomers
> > > > >> is a
> > > > >>> huge win in my mind.
> > > > >>>
> > > > >>> I assume the reason for the changes was because codegen improved
> > > > >>> performance?
> > > > >>>
> > > > >>> Could it make sense to allow both options, the
> easy/less-performant
> > > way
> > > > >> for
> > > > >>> people who want to experiment and learn the ropes,
> > > > >>> and the codegen path for productionizing the final rules you come
> > up
> > > > >> with?
> > > > >>>
> > > > >>> Or does this make matters worse, trying to support two API's
> > > > >>>
> > > > >>> On Tue, Apr 12, 2022 at 6:25 AM Vladimir Ozerov <
> > ppoze...@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Hi folks,
> > > > >>>>
> > > > >>>> Rules are an essential part of the Calcite-based query
> > optimizers. A
> > > > >>>> typical optimizer may require dozens of custom rules that are
> > > created
> > > > >> by
> > > > >>>> extending some Apache Calcite interfaces.
> > > > >>>>
> > > >

Changes to the rule pattern interface

2022-04-12 Thread Vladimir Ozerov
Hi folks,

Rules are an essential part of Calcite-based query optimizers. A
typical optimizer may require dozens of custom rules that are created by
extending some Apache Calcite interfaces.

During the last two years, there were two major revisions of how rules are
created:

   1. In early 1.2x versions, the typical approach was to use
   RelOptRuleOperand with a set of helper methods in a builder-like
   pattern.
   2. Then, we switched to runtime code generation.
   3. Finally, we switched to compile-time code generation with the
   Immutables framework.

Every such change requires the downstream projects to rewrite all their
rules. Not only does this require time to understand the new approach, but
it may also compromise the correctness of the downstream optimizer, because
regression tracking in query optimizers is not trivial.

I had the privilege to try all three approaches, and I cannot get rid of
the feeling that every new approach is more complicated than the previous
one. I understand that this is a highly subjective statement, but when I
just started using Apache Calcite and knew very little about it, I was able
to write rule patterns by simply looking at the IDE JavaDoc pop-ups and
code completion. When the RuleConfig was introduced, every new rule always
required me to look at some other rule as an example, yet it was doable.
Now we also need to configure the project build system to write a single
custom rule.

At the same time, a significant fraction of the rules are pretty simple.
E.g., "operator A on top of operator B". If some additional configuration
is required, it could be added via plain rule fields, because at the end
of the day the rule instance is no more than a plain Java object.

A good example is the FilterProjectTransposeRule. What now takes tens of
lines of code in the Config subclass [1] (which you could hardly write
without a reference example), and ~500 LOC of generated code that you
get through additional plugin configuration [2] in your build system, could
have been expressed in a dozen lines of code [3] in Apache Calcite 1.22.0.
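
For reference, a simple rule in that older style needed little more than the
following (a sketch from memory of the pre-Immutables operand API; details
may differ from the actual 1.22 code):

public class MyFilterProjectTransposeRule extends RelOptRule {
  public MyFilterProjectTransposeRule() {
    // Match a Filter whose input is a Project.
    super(operand(Filter.class, operand(Project.class, any())));
  }

  @Override public void onMatch(RelOptRuleCall call) {
    final Filter filter = call.rel(0);
    final Project project = call.rel(1);
    // ... transformation logic ...
  }
}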

My question is: are we sure we are going in the right direction in terms
of complexity and the entry bar for newcomers? Wouldn't it be better to
follow the 80/20 rule, where simple rules could be easily created
programmatically with no external dependencies, while more advanced
facilities like Immutables are used only for the complex rules?

Regards,
Vladimir.

[1]
https://github.com/apache/calcite/blob/calcite-1.30.0/core/src/main/java/org/apache/calcite/rel/rules/FilterProjectTransposeRule.java#L208-L260
[2]
https://github.com/apache/calcite/blob/calcite-1.30.0/core/build.gradle.kts#L215-L224
[3]
https://github.com/apache/calcite/blob/calcite-1.22.0/core/src/main/java/org/apache/calcite/rel/rules/FilterProjectTransposeRule.java#L99-L110


Re: Why RelBuilder.project unwraps SARGs?

2022-04-04 Thread Vladimir Ozerov
Hi Julian,

Thank you for the comments. I've created the PR:
https://github.com/apache/calcite/pull/2762

Regards,
Vladimir.

Sat, Apr 2, 2022 at 19:27, Julian Hyde :

> Good catch. The unwrapping was not intentional on my part. Removing the
> unwrapping seems like a benefit (less work for RelBuilder, therefore better
> planning performance; SARGs in more places, therefore better optimization;
> also the avoidance of cycles as you mention).
>
> There’s a small risk that downstream projects are relying on this
> unwrapping but they can control using config.simplify(). We don’t promise
> not to improve our rewrites from one release to the next.
>
> I think you should fix it.
>
> Julian
>
> > On Apr 2, 2022, at 8:30 AM, Vladimir Ozerov  wrote:
> >
> > Hi,
> >
> > When simplification is enabled in the RelBuilder, a call to the
> > RelBuilder.project unwraps SARGs into their flat counterparts [1]. This
> is
> > not the case for other nodes, like Filter or Join. This behavior may lead
> > to an infinite loop when the ProjectReduceExpressionsRule is used with
> the
> > HepPlanner (which is a common pattern):
> >
> >   1. The original Project(e) has the expression e.
> >   2. ProjectReduceExpressionsRule simplifies it to SARG(e).
> >   3. ProjectReduceExpressionsRule calls RelBuilder.project which returns
> >   back Project(e) instead of Project(SARG(e)).
> >   4. ProjectReduceExpressionsRule calls
> Project(e).transformTo(Project(e))
> >   which schedules invocation of the rule again, leading to a hang.
> >
> > Shall we remove this unwrapping? No tests are affected except for two
> > trivial failures in MaterializedViewRelOptRulesTest.
> >
> > Regards,
> > Vladimir.
> >
> > [1]
> >
> https://github.com/apache/calcite/blob/a81cfb2ad001589929e190939cf4db928ebac386/core/src/main/java/org/apache/calcite/tools/RelBuilder.java#L1948
>


[jira] [Created] (CALCITE-5083) In RelBuilder.project_, do not unwrap SARGs

2022-04-04 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-5083:


 Summary: In RelBuilder.project_, do not unwrap SARGs
 Key: CALCITE-5083
 URL: https://issues.apache.org/jira/browse/CALCITE-5083
 Project: Calcite
  Issue Type: Improvement
  Components: core
Affects Versions: 1.30.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov


The {{RelBuilder.project_}} method forcefully unwraps SARG expressions. This 
way, we may lose SARG-related optimization opportunities and even get an 
infinite loop when some rules are applied heuristically. For example, the 
{{ProjectReduceExpressionsRule}} may simplify an expression to a SARG which 
would be unwrapped back to the original form in the RelBuilder, leading to an 
infinite execution of the rule.

The aim of this ticket is to remove the unconditional SARG unwrap from the 
{{RelBuilder}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Why RelBuilder.project unwraps SARGs?

2022-04-02 Thread Vladimir Ozerov
Hi,

When simplification is enabled in the RelBuilder, a call to the
RelBuilder.project unwraps SARGs into their flat counterparts [1]. This is
not the case for other nodes, like Filter or Join. This behavior may lead
to an infinite loop when the ProjectReduceExpressionsRule is used with the
HepPlanner (which is a common pattern):

   1. The original Project(e) has the expression e.
   2. ProjectReduceExpressionsRule simplifies it to SARG(e).
   3. ProjectReduceExpressionsRule calls RelBuilder.project which returns
   back Project(e) instead of Project(SARG(e)).
   4. ProjectReduceExpressionsRule calls Project(e).transformTo(Project(e))
   which schedules invocation of the rule again, leading to a hang.
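
(For context: a SARG is a "search argument" such as SEARCH($0, Sarg[(10..20]]),
which the simplifier produces from range predicates like "$0 > 10 AND
$0 <= 20"; "unwrapping" expands it back into the flat AND/OR form. The exact
rendering here is from memory and may differ slightly between versions.)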

Shall we remove this unwrapping? No tests are affected except for two
trivial failures in MaterializedViewRelOptRulesTest.

Regards,
Vladimir.

[1]
https://github.com/apache/calcite/blob/a81cfb2ad001589929e190939cf4db928ebac386/core/src/main/java/org/apache/calcite/tools/RelBuilder.java#L1948


Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

2022-02-10 Thread Vladimir Ozerov
Hi,

In the Cascades driver, it is possible to propagate requests top-down
using the "passThrough" method, and then notify parents bottom-up about the
concrete physical implementations of inputs using the "derive" method.

In some optimizers, a valid parent node cannot be created before the
trait sets of its inputs are known. An example is a custom distribution trait
that includes the number of shards in the system. The parent operator alone
may guess the distribution keys, but cannot know the number of input
shards. To mitigate this, you may create a "template" node with an infinite
cost from within the optimization rule that will propagate the
passThrough/derive calls but would never participate in the final plan.

Currently, the top-down driver is designed in such a way that the nodes
created from the "passThrough" method are not notified at the "derive" stage.
This leads to incomplete exploration of the search space. For example, a rule
may produce the node "A1.template" that will be converted into a normal
"A1" node in the derive phase. However, if the parent operator produced
"A2.template" from "A1.template" using pass-through mechanics, the
"A2.template" will never be notified about the concrete input traits,
possibly losing the optimal plan. This is especially painful in distributed
engines, where the number of shards is important for the placement of
Shuffle operators.

It seems that the problem could be solved with relatively low effort.
"derive" is not invoked on the nodes created from the "passThrough" method
because such nodes are placed in the "passThroughCache" collection. Instead
of doing this unconditionally, we may introduce an additional predicate
that would selectively enforce "derive" on such nodes. For example, this
could be a default method in the PhysicalNode interface, like:

interface PhysicalNode {
  default boolean enforceDerive() { return false; }
}
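
A downstream "template" node would then opt in, for example (the class and
base names here are hypothetical):

class ShuffleTemplateNode extends MyPhysicalNodeBase implements PhysicalNode {
  // Ask the driver to invoke "derive" on this node even though it
  // was produced by "passThrough". Other PhysicalNode methods omitted.
  @Override public boolean enforceDerive() { return true; }
}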

If there are no objections, I'll proceed with this change.

Alternatively, we may make the TopDownRuleDriver more "public", so that the
user can extend it and decide within the driver whether to cache a
particular node or not.

I would appreciate your feedback on the matter.

Regards,
Vladimir.


Re: Sort getting removed during optimization

2022-01-13 Thread Vladimir Ozerov
Hi Julian,

Could you please check if your RelRoot.collation has the expected value? I
would expect it to be [$0 DESC]. If the value is correct, you may add it to
the "desired" trait set: "desired.replace(root.collation)". Hopefully, this
should help.

Regards,
Vladimir.

Tue, Jan 11, 2022 at 16:47, Julian Feinauer :

> Hey Vladimir,
>
> when this issue appeared it was
>
> RelTraitSet desired = cluster.traitSet()
> .replace(BindableConvention.INSTANCE);
>
> RelNode expectedRoot = planner.changeTraits(root, desired);
> planner.setRoot(expectedRoot);
>
> And then
>
>
> RelNode exp = planner.findBestExp();
>
> So the root node had no sorting "requirement".
> But from my understanding of the SortRemoveRule it does remove the Sort
> and at the same time adds the respective CollationTrait to the input node.
> In my case this was a LogicalProject.
> I have no idea how the Project itself then assures that the CollationTrait
> is fulfilled.
>
> By the way, is there a way to show the Traits from a RelNode tree? This
> could help to analyze these kinds of situations.
>
> Thanks!
> Julian
>
>
>
> From: Vladimir Ozerov 
> Date: Tuesday, 11. January 2022 at 14:09
> To: dev@calcite.apache.org (dev@calcite.apache.org) <
> dev@calcite.apache.org>
> Subject: Re: Sort getting removed during optimization
> Hi Julian,
>
> When invoking the optimizer, you may provide the desired trait set of the
> top-level node. It might happen that the specific collation is not
> requested from the optimizer, and hence the plan with a top-level Sort
> operator is not chosen. Could you please show how you invoke the planner?
>
> Regards,
> Vladimir.
>
> Tue, Jan 11, 2022 at 12:44, Julian Feinauer :
>
> > Hey Stamatis,
> >
> > yes, that's why I looked it up at first… the results are wrong : )
> > So both tables by themselves are sorted, but the Full Join is finally two
> > blocks: the Left Join (sorted like the left rel) and then the remaining
> > entries from the right side (also ordered). But overall it is not ordered.
> >
> > Best
> > Julian
> >
> > From: Stamatis Zampetakis 
> > Date: Tuesday, 11. January 2022 at 09:43
> > To: dev@calcite.apache.org 
> > Subject: Re: Sort getting removed during optimization
> > Hi Julian F,
> >
> > Quite a naive question but did you get wrong results from the given
> plan? A
> > missing sort is not necessarily problematic.
> >
> > I hope I am not saying something stupid but I think there are cases
> where a
> > full join algorithm can retain the order of some of its inputs.
> >
> > Best,
> > Stamatis
> >
> > On Tue, Jan 11, 2022 at 8:30 AM Julian Feinauer <
> > j.feina...@pragmaticminds.de> wrote:
> >
> > > Hi Julian, Xiong,
> > >
> > > thanks for your fast replies!
> > >
> > > So first, the default Rules were registered:
> > >
> > > planner = new VolcanoPlanner();
> > > RelOptUtil.registerDefaultRules(planner, false, true);
> > >
> > > And as traits I used:
> > >
> > > planner.addRelTraitDef(ConventionTraitDef.INSTANCE);
> > > planner.addRelTraitDef(RelCollationTraitDef.INSTANCE);
> > >
> > > I dug a bit deeper, and what was triggered was the `SortRemoveRule`.
> > > If I disabled the Collation Trait, this no longer happened and all
> > > worked.
> > >
> > > I will later try to get an MWE done to reproduce this, if this is a bug.
> > > Because the bug would then either be the Full Join producing a wrong
> > > Collation or the SortRemoveRule investigating the input Collation wrong,
> > > or?
> > >
> > > But nonetheless, thank you very much!
> > > Julian
> > >
> > > From: Julian Hyde 
> > > Date: Tuesday, 11. January 2022 at 00:38
> > > To: dev@calcite.apache.org 
> > > Subject: Re: Sort getting removed during optimization
> > > Is it possible that the Sort is being removed because some component
> > > knows that the input is already sorted?
> > >
> > > In particular, if a relation has at most one row, it is always sorted.
> > > Maybe the planner is deducing this via some row-count metadata or a
> > > uniqueness constraint.
> > >
> > >
> > > > On Jan 10, 2022, at 3:35 PM, xiong duan  wrote:
> > > >
> > > > If I understand correctly, if we remove the BINDABLE_SORT_RULE, the
> > > > result will throw an exception about the plan transformation. So it
> > > > looks like a wrong rule result. If you don't customize the rule, it is
> > > > a bug; please test this using Calcite's newest version.
> > >
> >
>


Re: [DISCUSS] Writing good summaries for Jira cases

2022-01-12 Thread Vladimir Ozerov
Hi Julian,

In my opinion, both ways work well. People tend to think differently. Some
prefer symptoms, others the root cause. I personally prefer the latter
for the following reason. If I face a problem, I first try to debug it on
my own. The result of the analysis is usually some questionable behavior in
a specific part of the code. Once you find the problematic place, you can
run a search in JIRA or the Git log (class name, feature name, etc.) and check
whether somebody else faced a similar issue. The description "Incorrect
plan ..." is less likely to help me than the more concrete "In
SubstitutionVisitor ...". Especially given that a single root cause may
manifest in several ways. But I would like to stress that it is a matter
of personal habits and previous experience, not something that I
expect others to follow.

In the past, I worked on the Apache Ignite project. We had a number of
contribution rules, such as "put a comma here", "set the proper component
there", "write the comment in that way", etc. I was the one who actively
enforced this for many years, because it gave the feeling that everything
was "put in order". Eventually, I came to the conclusion that this does more
harm than good, because I regularly observed confusion and dissatisfaction
among new contributors (and the Apache Ignite community is far less diverse
and active than Apache Calcite's), as they were forced to change their
natural way of thinking or past habits to engage with the community.

Regards,
Vladimir.

Wed, Jan 12, 2022 at 21:42, Julian Hyde :

> Hi all,
>
> Can we discuss how we write summaries for Jira cases? In my opinion it’s
> really important, because summaries become commit messages, and commit
> messages become release notes, which is how most people figure out what is
> in Calcite. I spend a lot of my time working with people to write good
> summaries.
>
> I’d like some feedback on whether this approach is useful. And to try to
> teach people how to do it for themselves.
>
> Consider this case https://issues.apache.org/jira/browse/CALCITE-4983 <
> https://issues.apache.org/jira/browse/CALCITE-4983>. (I chose a still
> current case because it doesn’t yet have an ‘answer’.)
>
> The current summary is
>
> >  In SubstitutionVisitor's unifyAggregates, if Aggregate has
> >  grouping sets, we need to handle the condition needs to pull up.
>
> It describes the cause but it doesn’t describe the problem (or the
> symptoms the user sees).
>
> If you take your car into your mechanic the cause is ‘Leaky gasket results
> in oil dripping onto hot manifold’ but the problem is ‘Smoke comes from
> hood when engine gets hot’. Do you agree that the second description is
> much more useful?
>
> In this case, the author came up with an example:
>
> > Here is an example:
> >
> > sql: select empid, deptno from emps group by grouping sets ((empid,
> deptno),(empid))
> > mv: select empid, count(distinct deptno) from emps where empid>100
> >   group by grouping sets ((empid, deptno), (empid))
> >
> > the result plan is:
> >
> >   LogicalCalc(expr#0..2=[{inputs}], deptno=[$t1], EXPR$1=[$t2])
> > LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}]],
> EXPR$1=[COUNT(DISTINCT $1)])
> >   EnumerableTableScan(table=[[hr, MV0]])
> >
> > We can see that this plan doesn't handle the condition empid>100
>
> I think it’s a great example. I especially like the last line, where the
> author pointed out what was wrong. I suggest the following summary:
>
> > Incorrect plan for query that has GROUPING SETS and WHERE
>
> Do you think the summary is more useful? Can it be improved?
>
> Julian
>
>
>
>


Re: Different behavior between '>' and '='

2022-01-11 Thread Vladimir Ozerov
Hi,

If I recall correctly, the SQL standard is mostly silent on how one should
coerce operands. Therefore, different systems implement coercion at
their discretion. Moreover, the type inference might be influenced not only
by operand types but by their nature as well. For example, a target system
may be ok with "intCol = '1'", but fail for "intCol = strCol".

If you are not satisfied with the default Apache Calcite behavior, you may
want to provide a custom function definition in your own SqlOperatorTable
that overrides functions from the SqlStdOperatorTable. The interfaces that
govern type inference are relatively straightforward to implement
(SqlOperandTypeChecker, SqlOperandTypeInference, SqlReturnTypeInference).
You may face surprising behavior in some cases. E.g., if you
override a base function (e.g., EQUALS), the parser might ignore your custom
definition and use the one from the SqlStdOperatorTable, as it is
hard-coded into the parser's code. In this case, you may need to implement
a custom visitor that forcefully rewrites Calcite functions to your
custom ones. In more complicated cases, you may need to override parts of
the validator/converter/coercion, but hopefully, your problem is not that
complex.
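
As a starting point, the override could look roughly like the sketch below
(the table class is hypothetical; the three inference strategies mirror how
the standard EQUALS is defined, and you would swap in your own):

public class MyOperatorTable extends ReflectiveSqlOperatorTable {
  // A custom "=" operator; replace the return-type, operand-type
  // inference, and operand checker to change validation behavior.
  public static final SqlBinaryOperator EQUALS =
      new SqlBinaryOperator("=", SqlKind.EQUALS, 30, true,
          ReturnTypes.BOOLEAN_NULLABLE, InferTypes.FIRST_KNOWN,
          OperandTypes.COMPARABLE_UNORDERED_COMPARABLE_UNORDERED);
}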

Regards,
Vladimir.

Tue, Jan 11, 2022 at 07:43, Julian Hyde :

> Yes, this is by design. I believe that the SQL standard set the rules.
>
> It’s not that surprising that ‘=‘ has different behavior than
> ordering-based comparisons such as ‘>’. Consider: given a DATE value d, and
> a TIMESTAMP value t, it is reasonable to ask ‘is t > d?’ but less
> reasonable to ask ‘does t = d?'
>
> > On Jan 10, 2022, at 6:35 PM, Zou Dan  wrote:
> >
> > Hi community,
> > I recently ran into a problem: when we disable type coercion by
> > SqlValidator#setEnableTypeCoercion(false),
> > there will be two different behaviors between '>' and '=':
> > 1. '>' between character and numeric (e.g. '1' > 1): the character will
> > be implicitly converted to numeric
> > 2. '=' between character and numeric (e.g. '1' = 1): the character will
> > `not` be implicitly converted to numeric
> > I found that the reason is that the SqlOperandTypeChecker.Consistency for
> > SqlStdOperatorTable.GREATER_THAN is `COMPARE` while for
> > SqlStdOperatorTable.EQUALS it is `LEAST_RESTRICTIVE`.
> > Is this by design?
>
>


Re: Sort getting removed during optimization

2022-01-11 Thread Vladimir Ozerov
Hi Julian,

When invoking the optimizer, you may provide the desired trait set of the
top-level node. It might happen that the specific collation is not
requested from the optimizer, and hence the plan with a top-level Sort
operator is not chosen. Could you please show how you invoke the planner?

Regards,
Vladimir.

Tue, Jan 11, 2022 at 12:44, Julian Feinauer :

> Hey Stamatis,
>
> yes, that's why I looked it up at first… the results are wrong : )
> So both tables by themselves are sorted, but the Full Join is finally two
> blocks: the Left Join (sorted like the left rel) and then the remaining
> entries from the right side (also ordered). But overall it is not ordered.
>
> Best
> Julian
>
> From: Stamatis Zampetakis 
> Date: Tuesday, 11. January 2022 at 09:43
> To: dev@calcite.apache.org 
> Subject: Re: Sort getting removed during optimization
> Hi Julian F,
>
> Quite a naive question but did you get wrong results from the given plan? A
> missing sort is not necessarily problematic.
>
> I hope I am not saying something stupid but I think there are cases where a
> full join algorithm can retain the order of some of its inputs.
>
> Best,
> Stamatis
>
> On Tue, Jan 11, 2022 at 8:30 AM Julian Feinauer <
> j.feina...@pragmaticminds.de> wrote:
>
> > Hi Julian, Xiong,
> >
> > thanks for your fast replies!
> >
> > So first, the default Rules were registered:
> >
> > planner = new VolcanoPlanner();
> > RelOptUtil.registerDefaultRules(planner, false, true);
> >
> > And as traits I used:
> >
> > planner.addRelTraitDef(ConventionTraitDef.INSTANCE);
> > planner.addRelTraitDef(RelCollationTraitDef.INSTANCE);
> >
> > I dug a bit deeper, and what was triggered was the `SortRemoveRule`.
> > If I disabled the Collation Trait, this no longer happened and all
> > worked.
> >
> > I will later try to get an MWE done to reproduce this, if this is a bug.
> > Because the bug would then either be the Full Join producing a wrong
> > Collation or the SortRemoveRule investigating the input Collation wrong,
> > or?
> >
> > But nonetheless, thank you very much!
> > Julian
> >
> > From: Julian Hyde 
> > Date: Tuesday, 11. January 2022 at 00:38
> > To: dev@calcite.apache.org 
> > Subject: Re: Sort getting removed during optimization
> > Is it possible that the Sort is being removed because some component
> > knows that the input is already sorted?
> >
> > In particular, if a relation has at most one row, it is always sorted.
> > Maybe the planner is deducing this via some row-count metadata or a
> > uniqueness constraint.
> >
> >
> > > On Jan 10, 2022, at 3:35 PM, xiong duan  wrote:
> > >
> > > If I understand correctly, if we remove the BINDABLE_SORT_RULE, the
> > > result will throw an exception about the plan transformation. So it
> > > looks like a wrong rule result. If you don't customize the rule, it is
> > > a bug; please test this using Calcite's newest version.
> >
>


[jira] [Created] (CALCITE-4978) Allow custom constant map in RelOptPredicateList

2022-01-09 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4978:


 Summary: Allow custom constant map in RelOptPredicateList
 Key: CALCITE-4978
 URL: https://issues.apache.org/jira/browse/CALCITE-4978
 Project: Calcite
  Issue Type: Improvement
  Components: core
Affects Versions: 1.29.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov


{{RelOptPredicateList}} tries to infer constants from the predicates passed to 
the {{of(...)}} factory methods. However, the underlying mechanic is limited 
and handles only a small subset of possible constant derivation strategies. We 
may want to implement such strategies gradually (inference from AND/OR, ranges, 
etc.), but it might be a big effort. 

This ticket proposes a new factory method that allows users to pass custom 
constant maps.
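
A sketch of the proposed addition (the exact signature is up for discussion; 
the parameter list below is illustrative):

{code}
// Like the existing of(...) factories, but the caller supplies the
// constant map instead of having it inferred from the predicates.
public static RelOptPredicateList of(RexBuilder rexBuilder,
    Iterable<RexNode> pulledUpPredicates,
    ImmutableMap<RexNode, RexNode> constantMap) {
  ...
}
{code}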



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Acquiring original SQL identifier from RexInputRef

2021-12-21 Thread Vladimir Ozerov
eved by
> using "predicate pushdown" in the IO readers and for example only reading
> certain columns from a Parquet or ORC file. The format needed to achieve
> this is DNF and requires the original column names so those predicates can
> be passed down into the implementation libraries. The problem is those
> libraries already exist as CUDA C/C++ implementations and cannot be
> modified.
>
> Does that make sense? If there is a more intelligent way to extract
> conditional predicates from the SQL query, even if it isn't at the Rex
> level, I would love to hear suggestions.
>
> [1] - https://github.com/dask-contrib/dask-sql
>
> On Tue, Dec 21, 2021 at 1:05 PM Vladimir Ozerov 
> wrote:
>
> > Hi Jeremy,
> >
> > Could you please share the use case behind this requirement? In the
> general
> > case, it is not possible to link RelNode's attributes to specific
> > identifiers. For this reason, an attempt to extract such an identifier from
> > any "rel" except for the RelRoot might indicate a design issue.
> >
> > Regards,
> > Vladimir.
> >
> > вт, 21 дек. 2021 г. в 20:34, Jeremy Dyer :
> >
> > > Hello,
> > >
> > > Is it possible to get the original SQL identifier from an instance of
> > > RexInputRef? For example given a simple query like
> > >
> > > SELECT id FROM employees WHERE fname = 'adam'
> > >
> > > Instead of the ordinal name generated by RexInputRef ($11, for
> example).
> > I
> > > would like to find the original SQL identifier (fname, for example)
> > >
> > > Thanks,
> > > Jeremy Dyer
> > >
> >
>


Re: Acquiring original SQL identifier from RexInputRef

2021-12-21 Thread Vladimir Ozerov
Hi Jeremy,

Could you please share the use case behind this requirement? In the general
case, it is not possible to link RelNode's attributes to specific
identifiers. For this reason, an attempt to extract such an identifier from
any "rel" except for the RelRoot might indicate a design issue.

Regards,
Vladimir.

Tue, Dec 21, 2021 at 20:34, Jeremy Dyer :

> Hello,
>
> Is it possible to get the original SQL identifier from an instance of
> RexInputRef? For example given a simple query like
>
> SELECT id FROM employees WHERE fname = 'adam'
>
> Instead of the ordinal name generated by RexInputRef ($11, for example). I
> would like to find the original SQL identifier (fname, for example)
>
> Thanks,
> Jeremy Dyer
>


Re: Committer permissions

2021-12-21 Thread Vladimir Ozerov
Thank you for your suggestions, it works now.

Sat, Dec 18, 2021 at 15:24, Stamatis Zampetakis :

> Hi Vladimir,
>
> You already have committer rights. As Duan mentioned, you probably need to
> link your ASF account with your GitHub account.
>
> Check the instructions here:
>
> https://calcite.apache.org/docs/howto.html#managing-calcite-repositories-through-github
>
> Best,
> Stamatis
>
> On Sat, Dec 18, 2021 at 1:00 PM xiong duan  wrote:
>
> > Hi Vladimir,
> >
> > I guess you need to set up the committer authority. Maybe you can try
> > https://gitbox.apache.org/setup/ to check your account. It includes three
> > procedures:
> >
> > 1) authority for your Apache account
> > 2) authority for your GitHub account
> > 3) authority for the Apache organization
> >
> >
> > Vladimir Ozerov  wrote on Sat, Dec 18, 2021 at 16:57:
> >
> > > Hi,
> > >
> > > It seems that I do not have the committer permissions to the repo, as I
> > can
> > > merge PR neither from GitHub, nor manually. Could you please grant me
> the
> > > required access rights? Please let me know if any additional
> information
> > is
> > > required.
> > >
> > > Regards,
> > > Vladimir.
> > >
> >
>


Re: RelNode semantic evolution vs adapter implementations

2021-12-21 Thread Vladimir Ozerov
The answer depends on the definition of "wrong". Rules may produce
non-equivalent operators, the validator may accidentally change how a type is
deduced for a specific expression, and metadata may cause incorrect cost
calculation and pruning of promising plans, leading to a hang in a
downstream application.

Correctness and plan quality of real optimizers depend on all these
components together. Versioning of only one type of component (e.g.,
RelNode) is likely to be insufficient for an advanced optimizer, as there
are many other sources of regressions. At the same time, some projects may
definitely benefit from even limited versioning. With this in mind, IMO,
if we decide to add such versioning, it should be non-intrusive (i.e.,
require little to no effort to disable it completely), so that the
downstream projects retain enough flexibility in how they manage the
optimizer's quality.
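
For instance, a non-intrusive variant could be an optional marker interface
along these lines (purely a hypothetical sketch, not an existing API):

interface VersionedComponent {
  // The newest core contract version this component was written against;
  // the planner could warn (or do nothing at all) when the core is newer.
  int supportedApiVersion();
}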

Regards,
Vladimir.

Tue, Dec 14, 2021 at 12:01, Vladimir Sitnikov :

> >some direct or indirect changes (metadata, rules, validator, etc)
> >may cause changes to plans around the given RelNode in the downstream
> >project
>
> Do you think changes like "metadata, rules, validator" can cause "wrong
> results" issues?
> What could be the issues caused by "metadata, rules, validator" changes?
>
> I know the ObjectWeb ASM bytecode manipulation library has used the concept
> of versions for quite some time, and it seems to work for them.
>
> As you subclass, say, ClassVisitor, you have to pass the API version which
> your visitor supports.
> Then ASM fails at runtime if the visitor happens to be applied to
> bytecode that is too recent.
>
> >and gets the source code of the test that he includes into his project and
> CI
>
> Source code and bytecode parsing could be an alternative option (for other
> cases), however, it is not clear how to tell
> if the rule in question has been updated to support the new Project/Filter
> contract.
>
> Vladimir
>


Re: Behavior of SINGLE_VALUE aggregate

2021-12-21 Thread Vladimir Ozerov
Hi Chathura,

While the SINGLE_VALUE function could be used explicitly, usually it is the
result of scalar subquery unnesting, and the goal of the function is to
ensure that the underlying operator returns no more than one row
(otherwise, the subquery is not scalar).
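
For example, in "SELECT (SELECT deptno FROM emps WHERE empid = 1) FROM dept",
the planner unnests the scalar subquery into a join and wraps its output in
SINGLE_VALUE, which raises a runtime error if the subquery produces more than
one row. So the "value" in the description is a single cell, not a table.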

Regards,
Vladimir.

Tue, Dec 21, 2021 at 13:52, Chathura Widanage :

> Hi,
>
> Could you please explain the behaviour of SINGLE_VALUE aggregate with the
> equivalent SQL query?
>
> SqlSingleValueAggFunction (Apache Calcite calcite API)
> <
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/fun/SqlSingleValueAggFunction.html
> >
>
> It's not clear whether the "value" in the description refers to an entire
> table or a cell. Also, there are no samples publicly available for this
> aggregate function.
>
> Regards,
> Chathura
>


Re: [Calcite Release] Preparing for 1.29.0 Release

2021-12-18 Thread Vladimir Ozerov
Hi Xiong,

Thank you very much.

Vladimir.

Sat, Dec 18, 2021 at 14:32, xiong duan :

> Hi Vladimir,
>
> I have merged.
>
> > Vladimir Ozerov  wrote on Sat, Dec 18, 2021 at 16:58:
>
> > Hi Xiong,
> >
> > I've just resolved the conflicts.
> >
> > Regards,
> > Vladimir.
> >
> > > Sat, Dec 18, 2021 at 10:48, xiong duan :
> >
> > > Hi, Vladimir.
> > >
> > > > The PR [https://github.com/apache/calcite/pull/2464] looks good. If
> > > > you have free time, please resolve the conflicts. Then we can merge it.
> > >
> > > > Vladimir Ozerov  wrote on Sat, Dec 18, 2021 at 15:20:
> > >
> > > > Hi,
> > > >
> > > > Could you please also merge
> > > > https://github.com/apache/calcite/pull/2464 ? I guess it got stuck
> > > > somehow,
> > > > as all the comments are addressed. This issue forced us to create a
> > copy
> > > of
> > > > the AggregateUnionTransposeRule in our project.
> > > >
> > > > Regards,
> > > > Vladimir
> > > >
> > > > Sat, Dec 18, 2021 at 09:51, Rui Wang :
> > > >
> > > > > Thanks Julian!
> > > > >
> > > > > Now we will just decide whether to merge CALCITE-4907 or not, then
> I
> > > will
> > > > > lock the main branch and start the release process.
> > > > >
> > > > >
> > > > > -Rui
> > > > >
> > > > > On Fri, Dec 17, 2021 at 10:29 PM Julian Hyde <
> jhyde.apa...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Rui, I have merged CALCITE-4946.
> > > > > >
> > > > > > > On Dec 17, 2021, at 10:21 PM, Rui Wang 
> > > wrote:
> > > > > > >
> > > > > > > Thanks Zhe.
> > > > > > >
> > > > > > > Hi Francesco,
> > > > > > >
> > > > > > > I left a comment in your PR. Can you take a look?
> > > > > > >
> > > > > > > -Rui
> > > > > > >
> > > > > > > On Fri, Dec 17, 2021 at 3:32 PM Zhe Hu 
> wrote:
> > > > > > >
> > > > > > >> Hi, Rui.
> > > > > > >> CALCITE-4860 will be fixed in next release, you can pass it
> > > > currently.
> > > > > > >> Thanks for your work!
> > > > > > >>
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> ZheHu
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On 12/18/2021 06:08,Francesco Gini
> > > wrote:
> > > > > > >> https://issues.apache.org/jira/browse/CALCITE-4907 is in
> > review.
> > > I
> > > > > > believe
> > > > > > >> I have addressed all the comments. Let me know if I missed
> > > anything.
> > > > > > >>
> > > > > > >> On Fri, 17 Dec 2021 at 22:42, Jacques Nadeau <
> > jacq...@apache.org>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> I've merged CALCITE-4948 to mitigate the concerns around
> > > > CALCITE-4898
> > > > > > and
> > > > > > >> maven repositories and category x licenses.
> > > > > > >>
> > > > > > >> On Fri, Dec 17, 2021 at 1:02 PM Rui Wang <
> amaliu...@apache.org>
> > > > > wrote:
> > > > > > >>
> > > > > > >> For the list of 1.29 fix Jira, several Jira have been marked
> as
> > > > > resolved
> > > > > > >> since this Monday. I will do another pass today but probably
> not
> > > all
> > > > > > >> those
> > > > > > >> will make the 1.29.0 release.
> > > > > > >>
> > > > > > >>
> > > > > > >> -Rui
> > > > > > >>
> > > > > > >> On Fri, Dec 17, 2021 at 11:57 AM Ruben Q L  >
> > > > wrote:
> > > > > > >>
> > > > > > >> Hel

Re: [Calcite Release] Preparing for 1.29.0 Release

2021-12-18 Thread Vladimir Ozerov
Hi Xiong,

I've just resolved the conflicts.

Regards,
Vladimir.

Sat, Dec 18, 2021 at 10:48, xiong duan :

> Hi, Vladimir.
>
> The PR[https://github.com/apache/calcite/pull/2464] looks good. If you
> have
> free time Please resolve the conflicts. Then we can merge it.
>
> Vladimir Ozerov  wrote on Sat, Dec 18, 2021 at 15:20:
>
> > Hi,
> >
> > Could you please also merge
> > https://github.com/apache/calcite/pull/2464 ? I guess it got stuck
> > somehow,
> > as all the comments are addressed. This issue forced us to create a copy
> of
> > the AggregateUnionTransposeRule in our project.
> >
> > Regards,
> > Vladimir
> >
> > Sat, Dec 18, 2021 at 09:51, Rui Wang :
> >
> > > Thanks Julian!
> > >
> > > Now we will just decide whether to merge CALCITE-4907 or not, then I
> will
> > > lock the main branch and start the release process.
> > >
> > >
> > > -Rui
> > >
> > > On Fri, Dec 17, 2021 at 10:29 PM Julian Hyde 
> > > wrote:
> > >
> > > > Rui, I have merged CALCITE-4946.
> > > >
> > > > > On Dec 17, 2021, at 10:21 PM, Rui Wang 
> wrote:
> > > > >
> > > > > Thanks Zhe.
> > > > >
> > > > > Hi Francesco,
> > > > >
> > > > > I left a comment in your PR. Can you take a look?
> > > > >
> > > > > -Rui
> > > > >
> > > > > On Fri, Dec 17, 2021 at 3:32 PM Zhe Hu  wrote:
> > > > >
> > > > >> Hi, Rui.
> > > > >> CALCITE-4860 will be fixed in next release, you can pass it
> > currently.
> > > > >> Thanks for your work!
> > > > >>
> > > > >>
> > > > >> Best,
> > > > >> ZheHu
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 12/18/2021 06:08,Francesco Gini
> wrote:
> > > > >> https://issues.apache.org/jira/browse/CALCITE-4907 is in review.
> I
> > > > believe
> > > > >> I have addressed all the comments. Let me know if I missed
> anything.
> > > > >>
> > > > >> On Fri, 17 Dec 2021 at 22:42, Jacques Nadeau 
> > > > wrote:
> > > > >>
> > > > >> I've merged CALCITE-4948 to mitigate the concerns around
> > CALCITE-4898
> > > > and
> > > > >> maven repositories and category x licenses.
> > > > >>
> > > > >> On Fri, Dec 17, 2021 at 1:02 PM Rui Wang 
> > > wrote:
> > > > >>
> > > > >> For the list of 1.29 fix Jira, several Jira have been marked as
> > > resolved
> > > > >> since this Monday. I will do another pass today but probably not
> all
> > > > >> those
> > > > >> will make the 1.29.0 release.
> > > > >>
> > > > >>
> > > > >> -Rui
> > > > >>
> > > > >> On Fri, Dec 17, 2021 at 11:57 AM Ruben Q L 
> > wrote:
> > > > >>
> > > > >> Hello,
> > > > >>
> > > > >> FYI https://issues.apache.org/jira/browse/CALCITE-4737 has been
> > > > >> merged.
> > > > >>
> > > > >> Regards,
> > > > >> Ruben
> > > > >>
> > > > >>
> > > > >> On Fri, Dec 17, 2021 at 7:40 PM Julian Hyde <
> jhyde.apa...@gmail.com
> > >
> > > > >> wrote:
> > > > >>
> > > > >> Rui,
> > > > >>
> > > > >> Even though https://issues.apache.org/jira/browse/CALCITE-4898 is
> > > > >> marked fixed,
> > > > >> there’s some fallout regarding maven repositories and category X.
> > > > >> That
> > > > >> needs to be sorted out before RC0. See [1].
> > > > >>
> > > > >> I’m working on getting
> > > > >> https://issues.apache.org/jira/browse/CALCITE-4946 done today.
> > > > >> Please
> > > > >> wait for that.
> > > > >>
> > > > >> https:/

Committer permissions

2021-12-18 Thread Vladimir Ozerov
Hi,

It seems that I do not have committer permissions on the repo: I cannot
merge PRs either from GitHub or manually. Could you please grant me the
required access rights? Please let me know if any additional information is
needed.

Regards,
Vladimir.


Re: [Calcite Release] Preparing for 1.29.0 Release

2021-12-17 Thread Vladimir Ozerov
Hi,

Could you please also merge
https://github.com/apache/calcite/pull/2464 ? I guess it got stuck somehow,
as all the comments are addressed. This issue forced us to create a copy of
the AggregateUnionTransposeRule in our project.

Regards,
Vladimir

Sat, Dec 18, 2021 at 09:51, Rui Wang :

> Thanks Julian!
>
> Now we will just decide whether to merge CALCITE-4907 or not, then I will
> lock the main branch and start the release process.
>
>
> -Rui
>
> On Fri, Dec 17, 2021 at 10:29 PM Julian Hyde 
> wrote:
>
> > Rui, I have merged CALCITE-4946.
> >
> > > On Dec 17, 2021, at 10:21 PM, Rui Wang  wrote:
> > >
> > > Thanks Zhe.
> > >
> > > Hi Francesco,
> > >
> > > I left a comment in your PR. Can you take a look?
> > >
> > > -Rui
> > >
> > > On Fri, Dec 17, 2021 at 3:32 PM Zhe Hu  wrote:
> > >
> > >> Hi, Rui.
> > >> CALCITE-4860 will be fixed in next release, you can pass it currently.
> > >> Thanks for your work!
> > >>
> > >>
> > >> Best,
> > >> ZheHu
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 12/18/2021 06:08,Francesco Gini wrote:
> > >> https://issues.apache.org/jira/browse/CALCITE-4907 is in review. I
> > believe
> > >> I have addressed all the comments. Let me know if I missed anything.
> > >>
> > >> On Fri, 17 Dec 2021 at 22:42, Jacques Nadeau 
> > wrote:
> > >>
> > >> I've merged CALCITE-4948 to mitigate the concerns around CALCITE-4898
> > and
> > >> maven repositories and category x licenses.
> > >>
> > >> On Fri, Dec 17, 2021 at 1:02 PM Rui Wang 
> wrote:
> > >>
> > >> For the list of 1.29 fix Jira, several Jira have been marked as
> resolved
> > >> since this Monday. I will do another pass today but probably not all
> > >> those
> > >> will make the 1.29.0 release.
> > >>
> > >>
> > >> -Rui
> > >>
> > >> On Fri, Dec 17, 2021 at 11:57 AM Ruben Q L  wrote:
> > >>
> > >> Hello,
> > >>
> > >> FYI https://issues.apache.org/jira/browse/CALCITE-4737 has been
> > >> merged.
> > >>
> > >> Regards,
> > >> Ruben
> > >>
> > >>
> > >> On Fri, Dec 17, 2021 at 7:40 PM Julian Hyde 
> > >> wrote:
> > >>
> > >> Rui,
> > >>
> > >> Even though https://issues.apache.org/jira/browse/CALCITE-4898 is
> > >> marked fixed,
> > >> there’s some fallout regarding maven repositories and category X.
> > >> That
> > >> needs to be sorted out before RC0. See [1].
> > >>
> > >> I’m working on getting
> > >> https://issues.apache.org/jira/browse/CALCITE-4946 done today.
> > >> Please
> > >> wait for that.
> > >>
> > >> https://issues.apache.org/jira/browse/CALCITE-4885 is a big
> > >> refactoring
> > >> and, though I’d love to get it in before the release, it isn’t going
> > >> to
> > >> happen. Don’t wait for it.
> > >>
> > >> I also see the following as open or in progress and fix-in-1.29:
> > >> * https://issues.apache.org/jira/browse/CALCITE-4872 (Will Noble)
> > >> * https://issues.apache.org/jira/browse/CALCITE-4908 (Woonsan Koo)
> > >> * https://issues.apache.org/jira/browse/CALCITE-4704 (Stamatis)
> > >> * https://issues.apache.org/jira/browse/CALCITE-4737 (Thomas Rebele)
> > >> * https://issues.apache.org/jira/browse/CALCITE-4860 (Zhe Hu / Duan Xiong)
> > >> * https://issues.apache.org/jira/browse/CALCITE-4907 (Francesco Gini)
> > >>
> > >> What’s the status on those?
> > >>
> > >> Julian
> > >>
> > >> [1] https://github.com/apache/calcite/pull/2622
> > >>
> > >> PS It feels churlish to bring this up, but in time-honored tradition
> > >> the
> > >> email subject has been ‘[DISCUSS] Towards Calcite x.y.z’.
> > >>
> > >>
> > >> On Dec 14, 2021, at 10:37 AM, Rui Wang 
> > >> wrote:
> > >>
> > >> Hi community,
> > >>
> > >> To prepare for Calcite 1.29.0 release, I will start to check the
> > >> JIRA
> > >> list
> > >> [1] which marks its fix version as 1.29.0 (9 JIRA as of now remains
> > >> open/in
> > >> progress) and also the pending PRs which looks promising to be
> > >> merged.
> > >> Please help resolve/review/merge JIRA/PRs if you want some fixes to
> > >> be
> > >> released.
> > >>
> > >>
> > >> My current goal is to start freezing the main branch of Calcite by
> > >> the
> > >> end
> > >> of this week.
> > >>
> > >> Please let me know if you have any questions.
> > >>
> > >> [1]:
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> >
> https://issues.apache.org/jira/browse/CALCITE-4941?jql=project%20%3D%20CALCITE%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20fixVersion%20%3D%201.29.0
> > >>
> > >>
> > >> -Rui

Re: RelNode semantic evolution vs adapter implementations

2021-12-13 Thread Vladimir Ozerov
Practically speaking, Apache Calcite doesn’t have a public API. Real
projects usually override not only public extension points, but also
internals, such as SqlValidatorImpl or VolcanoPlanner.

Given the flexibility that Apache Calcite provides, this lack of a public
API surface is not necessarily a bad thing, since you may change almost
everything without forking the project. On the other hand, almost any
change to any part of the Calcite code base may cause regressions in
downstream projects.

I am not sure there is an ideal way of versioning rels because even if you
do so, some direct or indirect changes (metadata, rules, validator, etc.)
may cause changes to plans around the given RelNode in the downstream
project.

Maybe instead of providing versions manually, we may expose some sort of
generated signature for each component - rels, rules, metadata handlers,
golden plans, etc. Then we may provide the ability to verify expected and
actual signatures, e.g., using code generation: the user invokes a command
with the list of interesting classes and gets the source code of a test to
include in their project and CI. When the user migrates to a new version,
the tests for the changed entities will fail, and the user can investigate
the associated changes.
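
For illustration, the generated artifact could be as simple as a unit test
that pins the expected signature (the RelSignatures utility and the digest
below are hypothetical):

  // Hypothetical generated test, checked into the downstream project's CI.
  @Test public void logicalFilterSignatureUnchanged() {
    // The digest captures the fields and contracts of the component at
    // generation time; any semantic change to LogicalFilter changes it.
    assertEquals("LogicalFilter:9f3a01",
        RelSignatures.digest(LogicalFilter.class));
  }

When the user upgrades Calcite and, say, LogicalFilter gains a new field,
the digest no longer matches, the test fails, and the user knows exactly
which component to review.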

The main differences from the original proposal:
1. Wider scope, because validation of rels alone is often not sufficient.
2. Automation, because as an implementor you cannot always predict the
effect of your changes (e.g., on metadata).

WDYT?

Tue, Dec 14, 2021 at 01:49, Konstantin Orlov :

> > The case there is that "correlation variables" are added to the Logical*
> nodes (e.g. LogicalFilter, LogicalProject).
>
> BTW, "correlation variables" were added to a LogicalFilter in CALCITE-816.
> So what is wrong with CALCITE-3183?
>
> Tue, Dec 14, 2021 at 01:24, Konstantin Orlov :
>
> > Vladimir, could you please clarify in what way the PR#2623 changes
> > the semantics?
> >
> > The correlated project is already possible in the master. The
> MongoProject
> > already discards variablesSet (simply because it's currently not stored
> > for
> > project node) and either fails or returns invalid results. This behavior
> > (alas,
> > incorrect) will be preserved after this patch.
> >
> > Mon, Dec 13, 2021 at 17:55, Vladimir Sitnikov <
> > sitnikov.vladi...@gmail.com>:
> >
> >> Hi,
> >>
> >> It turns out Calcite's nodes can change semantics without notification
> for
> >> the end-users.
> >>
> >> Here are the release notes for Calcite 1.21, and it says **nothing**
> like
> >> "Ensure you handle Filter#variablesSet in case you implement Filter or
> in
> >> case you transform LogicalFilter in your rules"
> >> https://calcite.apache.org/news/2019/09/11/release-1.21.0/
> >>
> >> On top of that, the in-core adapters fail to handle that properly. For
> >> example, MongoFilter discards Filter#variablesSet.
> >>
> >> Can we please stop changing the semantics of RelNodes or can we have a
> >> better way to detect the changes in the client code?
> >>
> >> What if we add a "version" property to the corresponding RelNodes, and
> we
> >> increment it every time the semantic changes?
> >> Then client APIs could be coded like "ok, I'm prepared to handle Project
> >> v4, and Filter v5" (e.g. make "version" required when registering a
> rule),
> >> and there will be a runtime error in case Calcite generates Filter v6 in
> >> runtime.
> >>
> >> ---
> >>
> >> Sample case:
> >> CALCITE-3183 Trimming method for Filter rel uses wrong traitSet
> >> CALCITE-4913 Correlated variables in a select list are not deduplicated
> >>
> >> The case there is that "correlation variables" are added to the Logical*
> >> nodes (e.g. LogicalFilter, LogicalProject).
> >> Unfortunately, that change is hard to notice: there's no compilation
> >> failure, and there's no runtime error.
> >>
> >> The old client code just discards "correlation variables", and I guess
> it
> >> would result in wrong results or something like that.
> >> Silent wrong results is a really sad outcome from the database.
> >>
> >> CALCITE-4913 / PR#2623 adds the Project#variablesSet property as well,
> >> and I guess it would result in the same hidden semantic loss.
> >>
> >> Vladimir
> >>
> >
> >
> > --
> > Regards,
> > Konstantin Orlov
> >
>
>
> --
> Regards,
> Konstantin Orlov
>


Re: [DISCUSS] Apache Calcite Online Meetup January 2022

2021-11-18 Thread Vladimir Ozerov
+1

From our side (Querify Labs), we may present one of the difficult use cases
we faced this year, such as large-scale join order planning, experience
with Cascades integration, planning for heterogeneous systems, advanced
statistics management, etc. We may need some time to figure out what can
and cannot be presented due to NDAs. Once the dates are known, we will
start putting together a list of possible topics.

I already presented at the previous meetup. So if there are more proposals
than slots, please prefer other proposals, and we will move our talk to
future meetups.

Regards,
Vladimir.

Thu, Nov 18, 2021 at 00:16, Alessandro Solimando <
alessandro.solima...@gmail.com>:

> I'd be interested too!
>
> On Wed, Nov 17, 2021, 19:26 Jacques Nadeau  wrote:
>
> > Great idea. +1 from me.
> >
> > On Wed, Nov 17, 2021 at 8:22 AM Stamatis Zampetakis 
> > wrote:
> >
> > > Hi all,
> > >
> > > Last meetup [1] was roughly 10 months ago. How about organising another
> > one
> > > around mid January 2022?
> > >
> > > Organising an in-person event would be nice but I think it is still a
> bit
> > > complicated so I was gonna suggest a virtual one (via Zoom meeting).
> > >
> > > The format could be ~1h for presentations and ~1h for open discussion.
> > >
> > > Are the people willing to give a talk around Calcite?
> > >
> > > What do people think in terms of timing?
> > >
> > > Best,
> > > Stamatis
> > >
> > > [1] https://www.meetup.com/Apache-Calcite/
> > >
> >
>


Re: DISCUSS: merge calcite-avatica and calcite repositories

2021-11-09 Thread Vladimir Ozerov
+1 for a single repo.

Tue, Nov 9, 2021 at 19:22, Vladimir Sitnikov :

> Michael> is not proposing to change the
> Michael>structure of modules within both projects, merely to have the code
> for
> Michael>both in a single repository.
>
> I propose to integrate them into a single build, and keep the set of the
> published jars.
> However, the modules and dependency structure could be kept as is.
>
> We might want to rename folders like
> calcite-core
> calcite-linq4j
> ..
> avatica-core
> avatica-server
>
> However, I am not sure it is that important to discuss folder names now.
> The idea is that as you "open Calcite in IDE, you see both Avatica and
> Calcite modules"
>
> Michael>Is there any reason we couldn't have a separate release schedule if
> Michael>both projects are in the same repository?
>
> A different schedule means Calcite must support at least two different
> Avatica versions.
> In other words, if we allow clients to update Avatica at their will, then
> we should allow them building Calcite with different Avatica versions,
> which implies Calcite test code should succeed for multiple different
> Avatica versions.
>
> That makes it harder to write tests: we have to execute tests with two
> different Avatica releases (or even more than two).
>
> There are at least two sources for complexity:
>
> a) We have to write tests that tolerate multiple versions. For instance,
> "if (avatica.18+) {...}" and so on.
> That is not really trivial, especially taking into account some of the
> tests are created with non-yet-popular
> technologies like Quidem where you can't really google solutions. So the
> "trivial" task of "making a test to expect two possible outcomes"
> becomes less trivial as you try to pass the version from GitHub Action to
> Gradle to JUnit to Quidem to no-one-knows-which class.
> If we support one Avatica version only, that is not needed. We just patch
> the test in Avatica and Calcite and that is it.
> Single repo avoids "Gradle vs Quidem" dance.
>
> b) If we claim that we support 5 different Guava versions, 3 different JDK
> versions, 2 different Avatica versions,
> then we have to execute 5*3*2 = 30 combinations of the tests.
> That is not really a full matrix, however, things get way easier if we
> support one Avatica version only.
> The amount of tests we need to do during a proper release is much less, and
> it is easier to commit
> changes that touch Avatica and Calcite at the same time.
>
>
> Vladimir
>


[jira] [Created] (CALCITE-4700) AggregateUnionTransposeRule produces wrong group sets for the top Aggregate

2021-07-21 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4700:


 Summary: AggregateUnionTransposeRule produces wrong group sets for 
the top Aggregate
 Key: CALCITE-4700
 URL: https://issues.apache.org/jira/browse/CALCITE-4700
 Project: Calcite
  Issue Type: Bug
  Components: core
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov








Re: [ANNOUNCE] New committer: Vladimir Ozerov

2021-06-25 Thread Vladimir Ozerov
Thank you, everybody!

While I enjoy hacking on Calcite's codebase, my colleagues and I observe a
consistent signal from the "field": lack of documentation blocks many
attempts to integrate Apache Calcite. Engineers simply get lost, with no
idea how to move forward. So I increasingly believe that (usually boring)
documentation-related efforts might boost Apache Calcite adoption
dramatically. Hopefully, we will find time to invest in them.

Regards,
Vladimir.

Fri, Jun 25, 2021 at 09:52, Fan Liya :

> Congratulations, Vladimir!
> Thanks for the good articles.
>
> Best,
> Liya Fan
>
> On Fri, Jun 25, 2021 at 9:07 AM Julian Hyde 
> wrote:
>
> > Welcome, Vladimir!
> >
> > > On Jun 24, 2021, at 6:00 PM, Albert  wrote:
> > >
> > > Congrats.
> > > just found the blog:
> https://www.querifylabs.com/author/vladimir-ozerov
> > >
> > > On Thu, Jun 24, 2021 at 2:27 PM Alessandro Solimando <
> > > alessandro.solima...@gmail.com> wrote:
> > >
> > >> Congratulations Vladimir, well deserved, I had the chance to read some
> > >> of the blog posts and I have appreciated them very much.
> > >>
> > >> Best regards,
> > >> Alessandro
> > >>
> > >> On Thu, 24 Jun 2021 at 07:58, Viliam Durina 
> > wrote:
> > >>>
> > >>> Congratulations!
> > >>>
> > >>> Viliam
> > >>>
> > >>> On Thu, 24 Jun 2021 at 06:58, Forward Xu 
> > wrote:
> > >>>
> > >>>> Congratulations!
> > >>>>
> > >>>>
> > >>>> Best,
> > >>>>
> > >>>> Forward
> > >>>>
> > >>>>> Danny Chan  wrote on Thu, Jun 24, 2021 at 11:51 AM:
> > >>>>
> > >>>>> Congrats, Vladimir!
> > >>>>>
> > >>>>> Best,
> > >>>>> Danny Chan
> > >>>>>
> > >>>>>> Yanjing Wang  wrote on Thu, Jun 24, 2021 at 11:41 AM:
> > >>>>>
> > >>>>>> Congrats, Vladimir!
> > >>>>>>
> > >>>>>>> Roman Kondakov  wrote on Thu, Jun 24, 2021 at 11:22 AM:
> > >>>>>>
> > >>>>>>> Congratulations, Vladimir!
> > >>>>>>>
> > >>>>>>> Roman Kondakov
> > >>>>>>>
> > >>>>>>> On 24.06.2021 12:23, 段雄 wrote:
> > >>>>>>>> Congratulations!
> > >>>>>>>>
> > >>>>>>>>> XING JIN  wrote on Thu, Jun 24, 2021 at 10:21 AM:
> > >>>>>>>>
> > >>>>>>>>> Congratulations ~
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Jin
> > >>>>>>>>>
> > >>>>>>>>>> guangyuan wang  wrote on Thu, Jun 24, 2021 at 9:50 AM:
> > >>>>>>>>>
> > >>>>>>>>>> Congratulations!
> > >>>>>>>>>>
> > >>>>>>>>>>> Francis Chuang  wrote on Thu, Jun 24, 2021 at 6:39 AM:
> > >>>>>>>>>>
> > >>>>>>>>>>> Congrats, Vladimir!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Francis
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 24/06/2021 7:48 am, Haisheng Yuan wrote:
> > >>>>>>>>>>>> Congratulations and thanks for your contributions,
> > >> Vladimir!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Regards,
> > >>>>>>>>>>>> Haisheng
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 2021/06/23 21:34:40, Stamatis Zampetakis <
> > >> zabe...@gmail.com
> > >>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>> Apache Calcite's Project Management Committee (PMC) has
> > >>>> invited
> > >>>>>>>>>> Vladimir
> > >>>>>>>>>>>>> Ozerov to
> > >>>>>>>&

Re: Trait propagation guidelines

2021-06-25 Thread Vladimir Ozerov
n(s) do you request?
> Just request either one. I already stated in the email [1], but seems like
> you missed the 5th paragraph.
>
> > 2. In some systems, the distribution may also define the distribution
> > function, e.g., a number of shards. A UNION DISTINCT of two tables with
> the
> > same sharding key, but the different numbers of shards must yield an
> > exchange. The parent operator cannot know the number of shards of the
> input
> > in advance and cannot define the proper trait set in the "passThrough"
> > method.
> The parent operator doesn't need to know what number of shards to request,
> just request hash distribution with shard number 0 or -1 or what ever to
> indicate shard number not decided yet. Later the child operator will tell
> parent operator the exact distribution through "derive".
>
> In Alibaba MaxCompute, we have customized hash distribution, which
> contains number of buckets, hash function, null collation, we also support
> range distribution, which contains range bucket boundaries. All of these
> can work under current framework. With all that being said, distribution is
> nothing special than collation, it all depends on whether you design the
> operator "passthrough" and "derive" strategy correctly.
>
> [1]
> https://lists.apache.org/thread.html/r36b25cbe4ca05fb1262c432ad9103f4126b654698481fca0d2a01fe7%40%3Cdev.calcite.apache.org%3E
>
> Thanks,
> Haisheng Yuan
>
> On 2021/06/14 08:26:31, Vladimir Ozerov  wrote:
> > Hi Haisheng,
> >
> > The observation is that parent operators sometimes do not know the exact
> > traits they will have for the given child traits. Several examples:
> > 1. PhysicalAggregate1 may request both HASH[b,c] or HASH[c,b]. Contrary
> to
> > the default Apache Calcite implementation, in many systems, these are two
> > different distributions - which one should I request? To make things
> worse,
> > some operators may have strict requirements to the order (Join, Union),
> > whilst others do not care about the order (Aggregate, Window).
> > 2. In some systems, the distribution may also define the distribution
> > function, e.g., a number of shards. A UNION DISTINCT of two tables with
> the
> > same sharding key, but the different numbers of shards must yield an
> > exchange. The parent operator cannot know the number of shards of the
> input
> > in advance and cannot define the proper trait set in the "passThrough"
> > method.
> >
> > We will miss the optimization opportunity in all these cases unless we
> can
> > clarify the real traits in the "derive" phase. But to do this, we need to
> > know the original optimization request.
> >
> > Regards,
> > Vladimir.
> >
> >
> > вс, 13 июн. 2021 г. в 22:17, Haisheng Yuan :
> >
> > > How does it relate with "derive" to get the desired plan?
> > >
> > > Initially PhysicalAggregate1 requests HASH[b,c], PhysicalAggregate2
> > > requests HASH[a,b,c]. PhysicalAggregate2 is called on "passthrough" by
> > > passing HASH[b,c], then generate another PhysicalAggregate2 with trait
> > > HASH[b,c]. You don't need the involvement of "derive".
> > >
> > > Haisheng Yuan
> > >
> > > On 2021/06/13 16:58:53, Vladimir Ozerov  wrote:
> > > > Hi,
> > > >
> > > > I tried to apply different approaches, but eventually, I failed to
> > > achieve
> > > > my goals. It seems that the current implementation cannot handle the
> > > > required scenario, as explained below.
> > > >
> > > > Consider the following tree:
> > > > LogicalAggregate1[group=[b,c]]
> > > >   LogicalAggregate2[group=[a,b,c]]
> > > > LogicalInput
> > > >
> > > > I want to find the plan to do these two aggregations without an
> exchange
> > > in
> > > > between because they may have compatible distributions. Example:
> > > > PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
> > > >   PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
> > > > Exchange// SHARDED[b,c]
> > > >   PhysicalInput // SHARDED[?]
> > > >
> > > > The fundamental problem is that it is impossible to save the
> optimization
> > > > request and resolve traits in the "derive" phase afterward. What we
> need
> > > is
> > > > to send the optimization request "SHARDED by [b,c] in any order" to
> 

Re: Trait propagation guidelines

2021-06-14 Thread Vladimir Ozerov
Hi Haisheng,

The observation is that parent operators sometimes do not know the exact
traits they will have for the given child traits. Several examples:
1. PhysicalAggregate1 may request either HASH[b,c] or HASH[c,b]. Contrary
to the default Apache Calcite implementation, in many systems these are two
different distributions - which one should I request? To make things worse,
some operators have strict requirements on the key order (Join, Union),
whilst others do not care about the order (Aggregate, Window).
2. In some systems, the distribution may also define the distribution
function, e.g., the number of shards. A UNION DISTINCT of two tables with
the same sharding key but different numbers of shards must yield an
exchange. The parent operator cannot know the number of shards of the input
in advance and cannot define the proper trait set in the "passThrough"
method.
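
To make (1) concrete: in such systems the distribution trait compares the
key list positionally, roughly like this (a simplified sketch; Calcite's
built-in RelDistribution, by contrast, treats the two key orders as
equivalent):

  class ShardedTrait implements RelTrait {
    // Rows are routed by the hash of the tuple (keys[0], keys[1], ...),
    // so SHARDED[b,c] and SHARDED[c,b] route the same row differently.
    final List<Integer> keys;

    @Override public boolean satisfies(RelTrait trait) {
      return trait instanceof ShardedTrait
          && keys.equals(((ShardedTrait) trait).keys);
    }

    // Constructor and the remaining RelTrait methods are omitted.
  }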

We will miss the optimization opportunity in all these cases unless we can
clarify the real traits in the "derive" phase. But to do this, we need to
know the original optimization request.

Regards,
Vladimir.


вс, 13 июн. 2021 г. в 22:17, Haisheng Yuan :

> How does it relate with "derive" to get the desired plan?
>
> Initially PhysicalAggregate1 requests HASH[b,c], PhysicalAggregate2
> requests HASH[a,b,c]. PhysicalAggregate2 is called on "passthrough" by
> passing HASH[b,c], then generate another PhysicalAggregate2 with trait
> HASH[b,c]. You don't need the involvement of "derive".
>
> Haisheng Yuan
>
> On 2021/06/13 16:58:53, Vladimir Ozerov  wrote:
> > Hi,
> >
> > I tried to apply different approaches, but eventually, I failed to
> achieve
> > my goals. It seems that the current implementation cannot handle the
> > required scenario, as explained below.
> >
> > Consider the following tree:
> > LogicalAggregate1[group=[b,c]]
> >   LogicalAggregate2[group=[a,b,c]]
> > LogicalInput
> >
> > I want to find the plan to do these two aggregations without an exchange
> in
> > between because they may have compatible distributions. Example:
> > PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
> >   PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
> > Exchange// SHARDED[b,c]
> >   PhysicalInput // SHARDED[?]
> >
> > The fundamental problem is that it is impossible to save the optimization
> > request and resolve traits in the "derive" phase afterward. What we need
> is
> > to send the optimization request "SHARDED by [b,c] in any order" to
> > PhysicalAggregate2, and use it in the derive phase so that the new
> > PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without
> > [a]. Unfortunately, this doesn't work because the nodes emitted from the
> > pass-through do not participate in the "derive" phase.
> >
> > This could be fixed with a trivial change - to allow certain nodes
> emitted
> > from the "passThrough" to participate in "derive". We can do that using a
> > marker interface or an extension to a PhysicalRel interface. For example:
> > interface PhysicalRel {
> > boolean enforceDerive();
> > }
> >
> > When set to "true", the node would not be added to the pass-through
> cache.
> > This way, we may use this node as *storage* for the optimization request.
> > When the "derive" is called later, we know both the parent requirements
> and
> > the child traits. This would be sufficient to solve my problem. I already
> > tried to do this by disabling the pass-through cache completely and
> > confirmed that the required plan is found.
> >
> > Do you have any objections to such a change?
> >
> > Regards,
> > Vladimir.
> >
> > Sat, May 29, 2021 at 11:59, Vladimir Ozerov :
> >
> > > Hi Haisheng, Jinpeng
> > >
> > > I think we are more or less on the same page:
> > >
> > >1. The current implementation of Apache Calcite may generate
> wasteful
> > >alternatives because rules lack the optimization context.
> > >2. But the actual impact on efficiency is not clear.
> > >
> > > The (2) is essential to understand whether my efforts make any
> practical
> > > sense. And so far, I have only a vague common sense and some simple
> > > examples in mind, which is not sufficient to make any claims.
> > >
> > > Nevertheless, I've checked the source code of the original Columbia
> > > optimizer. I was wrong in my original claim that Columbia doesn't pass
> > > optimization context to rules. It d

Re: Possibly incorrect assertion in the TopDownRuleDriver.DeriveTrait.derive

2021-06-13 Thread Vladimir Ozerov
Thanks, I created an issue [1] to improve the assertion.

[1] https://issues.apache.org/jira/browse/CALCITE-4650

Mon, Jun 7, 2021 at 23:30, Haisheng Yuan :

> > Shouldn't we remove the assertion above?
> Perhaps.
>
> Or perhaps the rel2Subset mapping is not up to date.
>
> Regards,
> Haisheng Yuan
>
> On 2021/06/06 13:09:16, Vladimir Ozerov  wrote:
> > Hi,
> >
> > When doing a trait derivation in the non-OMAKASE mode, the following
> lines
> > of code are invoked:
> > 1: RelSubset relSubset = planner.register(newRel, rel);
> > 2: assert relSubset.set == planner.getSubset(rel).set;
> >
> > The assertion on the second line may fail because the "newRel" is
> assigned
> > not the "rel" set, but "rel" *canonical set*, which might be different.
> >
> > As a workaround, we may change the derive mode to OMAKASE. In this case,
> we
> > do not hit the assertion and planning completes successfully.
> >
> > Shouldn't we remove the assertion above?
> >
> > Regards,
> > Vladimir.
> >
>


[jira] [Created] (CALCITE-4650) TopDownRuleDriver.DeriveTrait.derive produces wrong AssertionError when registering new nodes

2021-06-13 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4650:


 Summary: TopDownRuleDriver.DeriveTrait.derive produces wrong 
AssertionError when registering new nodes
 Key: CALCITE-4650
 URL: https://issues.apache.org/jira/browse/CALCITE-4650
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.27.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov
 Fix For: next


{{TopDownRuleDriver.DeriveTrait.derive}} asserts that a derived node belongs to 
the same {{RelSet}} as the original node. 

If the original node {{N1}} refers to set {{S1}}, the derived node {{N2}} might 
be registered in the set {{S2}}, if {{S1}} was merged to {{S2}} between the 
registration of {{N1}} and derivation of {{N2}}. 

Currently, we compare the original set of the original node {{N1}} with the 
canonical set of the derived node {{N2}}. This leads to false assertions.





Re: Trait propagation guidelines

2021-06-13 Thread Vladimir Ozerov
Hi,

I tried to apply different approaches, but eventually, I failed to achieve
my goals. It seems that the current implementation cannot handle the
required scenario, as explained below.

Consider the following tree:
LogicalAggregate1[group=[b,c]]
  LogicalAggregate2[group=[a,b,c]]
LogicalInput

I want to find a plan that performs these two aggregations without an
exchange in between, because they may have compatible distributions. Example:
PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
  PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
Exchange// SHARDED[b,c]
  PhysicalInput // SHARDED[?]

The fundamental problem is that it is impossible to save the optimization
request and resolve traits in the "derive" phase afterward. What we need is
to send the optimization request "SHARDED by [b,c] in any order" to
PhysicalAggregate2, and use it in the derive phase so that the new
PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without
[a]. Unfortunately, this doesn't work because the nodes emitted from the
pass-through do not participate in the "derive" phase.

This could be fixed with a trivial change - to allow certain nodes emitted
from the "passThrough" to participate in "derive". We can do that using a
marker interface or an extension to a PhysicalRel interface. For example:
interface PhysicalRel {
    boolean enforceDerive();
}

When set to "true", the node would not be added to the pass-through cache.
This way, we may use this node as *storage* for the optimization request.
When the "derive" is called later, we know both the parent requirements and
the child traits. This would be sufficient to solve my problem. I already
tried to do this by disabling the pass-through cache completely and
confirmed that the required plan is found.
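
For instance, the aggregate could implement "passThrough" roughly as
follows (a sketch with simplified signatures):

  // PhysicalAggregate2.passThrough: do not resolve the distribution yet,
  // just store the parent's request in a copy of the node.
  @Override public RelNode passThrough(RelTraitSet required) {
    // Because enforceDerive() returns true for this node, the copy
    // bypasses the pass-through cache and is revisited in derive(),
    // where the child traits are known, and [b,c] vs [c,b] can finally
    // be resolved.
    return copy(required, getInputs());
  }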

Do you have any objections to such a change?

Regards,
Vladimir.

Sat, May 29, 2021 at 11:59, Vladimir Ozerov :

> Hi Haisheng, Jinpeng
>
> I think we are more or less on the same page:
>
>1. The current implementation of Apache Calcite may generate wasteful
>alternatives because rules lack the optimization context.
>2. But the actual impact on efficiency is not clear.
>
> The (2) is essential to understand whether my efforts make any practical
> sense. And so far, I have only a vague common sense and some simple
> examples in mind, which is not sufficient to make any claims.
>
> Nevertheless, I've checked the source code of the original Columbia
> optimizer. I was wrong in my original claim that Columbia doesn't pass
> optimization context to rules. It does [1]. The context consists of
> required traits and cost budget. In Apache Calcite terms, the context is
> passed to both "RelRule.matches" and "RelRule.onMatch", so that the rule
> may decide on the optimization strategy based on parent request. This is
> exactly what I was trying to achieve in my system with some hacks around
> derive/passThrough.
>
> Regarding the example with join, my proposal is not likely to make any
> difference because the tables are not co-located on the join key, and hence
> join may emit several distributions. Consider the different situation -
> data is already collocated. Without the context, I will emit both 1-phase
> and 2-phase aggregates because I do not know which distributions are
> available below. With the context available, I can collect propagate
> promising optimization requests from Aggregate rules (1-phase, 2-phase).
> Then wait for input optimization and check what is returned. If only
> [dist=a] is returned, I can skip the 2-phase aggregate completely.
> Aggregate[group=a]
>   Join[foo.a=bar.b]
> Input(foo, dist=a)
> Input(bar, dist=b)
>
> Another possible use case is join on several keys. By issuing a
> context-aware optimization request [dist a1] from Aggregate to Join, we
> can establish tight cost bounds on Aggregate and Join equivalence groups
> very early so that all other options (broadcasts, sharding in [a1,a2], ...)
> would be pruned without even entering MEMO.
> Aggregate[group=a1]
>   Join[foo.a1=bar.b1 AND foo.a2=bar.b2]
> Input(foo, dist=a1)
> Input(bar, dist=b2)
>
> As far as Jinpeng's example with logical multi-phase aggregates - I think
> this is a great example of why logical split might be useful. Thank you for
> that. This reminded me about another concerning use case. Consider an
> Aggregate on top of a UnionAll:
> LogicalAggregate[group=a, COUNT(b)]
>   UnionAll
> Input1
> Input2
>
> With Calcite rules, we may push the aggregate down:
> LogicalAggregate[group=a, SUM(COUNT)]
>   UnionAll
> LogicalAggregate[group=a, COUNT(b)] // <-- Possible exchange here
>   Input1
> LogicalAg

Re: Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-06-13 Thread Vladimir Ozerov
Please disregard this email. I sent it to the wrong thread, sorry.

Sun, Jun 13, 2021 at 19:56, Vladimir Ozerov :

> Hi,
>
> I tried to apply different approaches, but eventually, I failed to achieve
> my goals. It seems that the current implementation cannot handle the
> required scenario, as explained below.
>
> Consider the following tree:
> LogicalAggregate1[group=[b,c]]
>   LogicalAggregate2[group=[a,b,c]]
> LogicalInput
>
> I want to find the plan to do these two aggregations without an exchange
> in between because they may have compatible distributions. Example:
> PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
>   PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
> Exchange// SHARDED[b,c]
>   PhysicalInput // SHARDED[?]
>
> The fundamental problem is that it is impossible to save the optimization
> request and resolve traits in the "derive" phase afterward. What we need is
> to send the optimization request "SHARDED by [b,c] in any order" to
> PhysicalAggregate2, and use it in the derive phase so that the new
> PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without
> [a]. Unfortunately, this doesn't work because the nodes emitted from the
> pass-through do not participate in the "derive" phase.
>
> This could be fixed with a trivial change - to allow certain nodes emitted
> from the "passThrough" to participate in "derive". We can do that using a
> marker interface or an extension to a PhysicalRel interface. For example:
> interface PhysicalRel {
> boolean enforceDerive();
> }
>
> When set to "true", the node would not be added to the pass-through cache.
> This way, we may use this node as *storage* for the optimization request.
> When the "derive" is called later, we know both the parent requirements and
> the child traits. This would be sufficient to solve my problem. I already
> tried to do this by disabling the pass-through cache completely and
> confirmed that the required plan is found.
>
> Do you have any objections to such a change?
>
> Regards,
> Vladimir.
>
>
> Sun, May 30, 2021 at 12:54, Jinpeng Wu :
>
>> Hi, Vladimir. I mean the error part only.  I am comfortable with the other
>> changes.
>>
>> If changing isTransformationRule could have unexpected consequences, one
>> possible way is reserving current logic and only adding a newline checking
>> the "implements TransformationRule''. Even though we remove the original
>> logic completely, users who prefer legacy logic to avoid risks can
>> overwrite the method by simply copying several lines of code. That's
>> totally acceptable. But if errors are issued, that's no longer a choice.
>>
>> In any case, if errors should be reported, we should provide an option to
>> suppress the errors.
>>
>> Thanks,
>> Jinpeng
>>
>> On Sun, May 30, 2021 at 4:59 PM Vladimir Ozerov 
>> wrote:
>>
>> > Hi Jinpeng,
>> >
>> > Do you mean the whole change or the error part only?
>> >
>> > My concern is that if we change the implementation of
>> > VolcanoPlanner.isTransformationRule, then some transformation rules that
>> > are not marked as "implements TransformationRule" will be treated as
>> > implementation rules, which may lead to some other hidden negative
>> > consequences.
>> >
>> > Ease of upgrade and predictable behavior is often in conflict with each
>> > other when planning migration paths. I am not insisting on the error,
>> but
>> > personally, I am more comfortable with products that fail fast, forcing
>> me
>> > to do the right things rather as early as possible.
>> >
>> > Regards,
>> > Vladimir.
>> >
>> > [1]
>> >
>> >
>> https://github.com/apache/calcite/blob/calcite-1.26.0/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoRuleCall.java#L96-L100
>> >
>> > Sun, May 30, 2021 at 10:45, Jinpeng Wu :
>> >
>> > > Hi.
>> > >
>> > > Warnings and Fixing isTransformationRule are good solutions. But I am
>> > still
>> > > concerned about reporting an error. It will force users to do a large
>> > > refactoring on existing codes before they can migrate to the new rule
>> > > driver. Any refactoring can be risky, especially for those critical
>> > > services. It leaves those systems no choice but to keep using old
>> > versions
>> > > of calcite. However, we usually can still get a good plan 

Re: Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-06-13 Thread Vladimir Ozerov
Hi,

I tried to apply different approaches, but eventually, I failed to achieve
my goals. It seems that the current implementation cannot handle the
required scenario, as explained below.

Consider the following tree:
LogicalAggregate1[group=[b,c]]
  LogicalAggregate2[group=[a,b,c]]
LogicalInput

I want to find a plan that performs these two aggregations without an
exchange in between, because they may have compatible distributions. Example:
PhysicalAggregate1[group=[b,c]] // SHARDED[b,c]
  PhysicalAggregate2[group=[a,b,c]] // SHARDED[b,c]
Exchange// SHARDED[b,c]
  PhysicalInput // SHARDED[?]

The fundamental problem is that it is impossible to save the optimization
request and resolve traits in the "derive" phase afterward. What we need is
to send the optimization request "SHARDED by [b,c] in any order" to
PhysicalAggregate2, and use it in the derive phase so that the new
PhysicalAggregate2 is created with [b,c] or [c,b], but strictly without
[a]. Unfortunately, this doesn't work because the nodes emitted from the
pass-through do not participate in the "derive" phase.

This could be fixed with a trivial change - to allow certain nodes emitted
from the "passThrough" to participate in "derive". We can do that using a
marker interface or an extension to a PhysicalRel interface. For example:
interface PhysicalRel {
    boolean enforceDerive();
}

When set to "true", the node would not be added to the pass-through cache.
This way, we may use this node as *storage* for the optimization request.
When the "derive" is called later, we know both the parent requirements and
the child traits. This would be sufficient to solve my problem. I already
tried to do this by disabling the pass-through cache completely and
confirmed that the required plan is found.

Do you have any objections to such a change?

Regards,
Vladimir.


Sun, May 30, 2021 at 12:54, Jinpeng Wu :

> Hi, Vladimir. I mean the error part only.  I am comfortable with the other
> changes.
>
> If changing isTransformationRule could have unexpected consequences, one
> possible way is reserving current logic and only adding a newline checking
> the "implements TransformationRule''. Even though we remove the original
> logic completely, users who prefer legacy logic to avoid risks can
> overwrite the method by simply copying several lines of code. That's
> totally acceptable. But if errors are issued, that's no longer a choice.
>
> In any case, if errors should be reported, we should provide an option to
> suppress the errors.
>
> Thanks,
> Jinpeng
>
> On Sun, May 30, 2021 at 4:59 PM Vladimir Ozerov 
> wrote:
>
> > Hi Jinpeng,
> >
> > Do you mean the whole change or the error part only?
> >
> > My concern is that if we change the implementation of
> > VolcanoPlanner.isTransformationRule, then some transformation rules that
> > are not marked as "implements TransformationRule" will be treated as
> > implementation rules, which may lead to some other hidden negative
> > consequences.
> >
> > Ease of upgrade and predictable behavior is often in conflict with each
> > other when planning migration paths. I am not insisting on the error, but
> > personally, I am more comfortable with products that fail fast, forcing
> me
> > to do the right things rather as early as possible.
> >
> > Regards,
> > Vladimir.
> >
> > [1]
> >
> >
> https://github.com/apache/calcite/blob/calcite-1.26.0/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoRuleCall.java#L96-L100
> >
> > Sun, May 30, 2021 at 10:45, Jinpeng Wu :
> >
> > > Hi.
> > >
> > > Warnings and Fixing isTransformationRule are good solutions. But I am
> > still
> > > concerned about reporting an error. It will force users to do a large
> > > refactoring on existing codes before they can migrate to the new rule
> > > driver. Any refactoring can be risky, especially for those critical
> > > services. It leaves those systems no choice but to keep using old
> > versions
> > > of calcite. However, we usually can still get a good plan even without
> > > correct rule types. I don't think it worthwhile to introduce that
> change.
> > >
> > > Thanks
> > > Jinpeng Wu
> > >
> > > On Sun, May 30, 2021 at 5:19 AM Vladimir Ozerov 
> > > wrote:
> > >
> > > > Great. Does the community have any objections to the following fix?
> > > > 1. Add a flag to a rule call instance with the expected node type. In
> > the
> > > > case of a mismatch, we will print a warning once per ru

Possibly incorrect assertion in the TopDownRuleDriver.DeriveTrait.derive

2021-06-06 Thread Vladimir Ozerov
Hi,

When doing a trait derivation in the non-OMAKASE mode, the following lines
of code are invoked:
1: RelSubset relSubset = planner.register(newRel, rel);
2: assert relSubset.set == planner.getSubset(rel).set;

The assertion on the second line may fail because "newRel" is assigned not
to the "rel" set but to the "rel" *canonical* set, which might be different.

As a workaround, we may change the derive mode to OMAKASE. In this case, we
do not hit the assertion and planning completes successfully.

Shouldn't we remove the assertion above?

Regards,
Vladimir.


Re: Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-05-30 Thread Vladimir Ozerov
Hi Jinpeng,

Do you mean the whole change or the error part only?

My concern is that if we change the implementation of
VolcanoPlanner.isTransformationRule, then some transformation rules that
are not marked as "implements TransformationRule" will be treated as
implementation rules, which may lead to some other hidden negative
consequences.
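
For example, a downstream rule like the following (hypothetical) is a
transformation rule by intent, but it would be silently reclassified as an
implementation rule because it lacks the marker interface:

  // Written against an older Calcite, before the TransformationRule
  // marker interface existed; matches and produces logical nodes only.
  public class MyAggregateSplitRule extends RelOptRule {
    // ...
  }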

Ease of upgrade and predictable behavior are often in conflict with each
other when planning migration paths. I am not insisting on the error, but
personally, I am more comfortable with products that fail fast, forcing me
to do the right thing as early as possible.

Regards,
Vladimir.

[1]
https://github.com/apache/calcite/blob/calcite-1.26.0/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoRuleCall.java#L96-L100

Sun, May 30, 2021 at 10:45, Jinpeng Wu :

> Hi.
>
> Warnings and Fixing isTransformationRule are good solutions. But I am still
> concerned about reporting an error. It will force users to do a large
> refactoring on existing codes before they can migrate to the new rule
> driver. Any refactoring can be risky, especially for those critical
> services. It leaves those systems no choice but to keep using old versions
> of calcite. However, we usually can still get a good plan even without
> correct rule types. I don't think it worthwhile to introduce that change.
>
> Thanks
> Jinpeng Wu
>
> On Sun, May 30, 2021 at 5:19 AM Vladimir Ozerov 
> wrote:
>
> > Great. Does the community have any objections to the following fix?
> > 1. Add a flag to a rule call instance with the expected node type. In the
> > case of a mismatch, we will print a warning once per rule per JVM
> instance
> > to avoid too many messages. Alternatively, we may print a warning per
> rule
> > per VolcanoPlanner, but I am concerned with too many repetitive messages
> > because VolcanoPlanner is usually instantiated per SQL query.
> > 2. In the next release, (1) replace the warning with an error, (2) change
> > VolcanoPlanner.isTransformationRule as discussed above.
> >
> >
> >
> > Fri, May 28, 2021 at 21:27, Haisheng Yuan :
> >
> > > Great, that is the correct way to change it and that should be the
> > default
> > > implementation.
> > >
> > > On 2021/05/28 17:41:15, Vladimir Ozerov  wrote:
> > > > BTW, I tried to change the implementation to:
> > > >
> > > >  1: protected boolean isTransformationRule(VolcanoRuleCall match) {
> > > >  2:return match.getRule() instanceof TransformationRule;
> > > >  3: }
> > > >
> > > > It solved my problem - plans returned to normal. In the Apache
> Calcite
> > > > repo, only 4 tests in the TopDownOptTest class failed due to a minor
> > > > operator reordering.
> > > >
> > > > Fri, May 28, 2021 at 20:37, Vladimir Ozerov :
> > > >
> > > > > Hi Jinpeng,
> > > > >
> > > > > Thank you for the clarification. When I saw the code in question
> for
> > > the
> > > > > first time, my first thought was that it was perhaps designed for
> > > gradual
> > > > > migration. The main problem is that the current implementation
> > discards
> > > > > parts of the plan *silently*, which might be difficult to spot. I
> > > > > only spotted the problem in my specific case because I had ~100
> tests
> > > with
> > > > > complex queries. Otherwise, I would happily proceed with the new
> rule
> > > > > without knowing that I lost important parts of the search space.
> > > > >
> > > > > That said, I think we can do the following:
> > > > >
> > > > >1. Emit a warning if or even throw an exception if the
> > > transformation
> > > > >rule produced a physical node. This should be trivial to
> implement
> > > - add an
> > > > >expected node type to VolcanoRuleCall (e.g., "logical",
> > "physical",
> > > "any").
> > > > >The warning/exception should contain a proper fix suggestion -
> to
> > > override
> > > > >the VolcanoPlanner.isTransformationRule.
> > > > >2. Alternatively - do a breaking change. Apache Calcite doesn't
> > > have a
> > > > >major release cadence. It is normal practice in many products to
> > do
> > > > >breaking changes in minor releases. Even popular products like
> > > Mongo or
> > > > >DataStax do it regularly. We may inform the u

Re: Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-05-29 Thread Vladimir Ozerov
Great. Does the community have any objections to the following fix?
1. Add a flag to a rule call instance with the expected node type. In the
case of a mismatch, we will print a warning once per rule per JVM instance
to avoid too many messages. Alternatively, we may print a warning per rule
per VolcanoPlanner, but I am concerned about too many repetitive messages
because VolcanoPlanner is usually instantiated per SQL query.
2. In the next release, (1) replace the warning with an error, (2) change
VolcanoPlanner.isTransformationRule as discussed above.
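
A rough sketch of the check in (1); the "expectedKind" flag and its wiring
are illustrative only:

  // In the rule call: "expectedKind" would be populated from
  // VolcanoPlanner.isTransformationRule when the rule is matched.
  if (expectedKind == ExpectedKind.LOGICAL && rel instanceof PhysicalNode) {
    LOGGER.warn("Transformation rule {} produced a physical node {};"
        + " consider overriding VolcanoPlanner.isTransformationRule",
        getRule(), rel);
  }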



Fri, May 28, 2021 at 21:27, Haisheng Yuan :

> Great, that is the correct way to change it and that should be the default
> implementation.
>
> On 2021/05/28 17:41:15, Vladimir Ozerov  wrote:
> > BTW, I tried to change the implementation to:
> >
> >  1: protected boolean isTransformationRule(VolcanoRuleCall match) {
> >  2:return match.getRule() instanceof TransformationRule;
> >  3: }
> >
> > It solved my problem - plans returned to normal. In the Apache Calcite
> > repo, only 4 tests in the TopDownOptTest class failed due to a minor
> > operator reordering.
> >
> > Fri, May 28, 2021 at 20:37, Vladimir Ozerov :
> >
> > > Hi Jinpeng,
> > >
> > > Thank you for the clarification. When I saw the code in question for
> the
> > > first time, my first thought was that it was perhaps designed for
> gradual
> > > migration. The main problem is that the current implementation discards
> > > parts of the plan *silently*, which might be difficult to spot. I
> > > only spotted the problem in my specific case because I had ~100 tests
> with
> > > complex queries. Otherwise, I would happily proceed with the new rule
> > > without knowing that I lost important parts of the search space.
> > >
> > > That said, I think we can do the following:
> > >
> > >1. Emit a warning if or even throw an exception if the
> transformation
> > >rule produced a physical node. This should be trivial to implement
> - add an
> > >expected node type to VolcanoRuleCall (e.g., "logical", "physical",
> "any").
> > >The warning/exception should contain a proper fix suggestion - to
> override
> > >the VolcanoPlanner.isTransformationRule.
> > >2. Alternatively - do a breaking change. Apache Calcite doesn't
> have a
> > >major release cadence. It is normal practice in many products to do
> > >breaking changes in minor releases. Even popular products like
> Mongo or
> > >DataStax do it regularly. We may inform the user in the first
> release and
> > >change to "rule instanceof TransformationRule" in the next release.
> > >
> > > Does it make sense?
> > >
> > > Regards,
> > > Vladimir.
> > >
> > > Fri, May 28, 2021 at 19:33, Jinpeng Wu :
> > >
> > >> Hi, Vladimir. Good catch! There could be some improvements here.
> > >>
> > >> Actually, this problem was discovered early when the top-down rule
> driver
> > >> was designed. At that time, no rule was annotated as
> "TransformationRule".
> > >> Moreover, it is impossible to ask every calcite user who designed
> their
> > >> own
> > >> rules to annotate the existing code. So the top-down rule driver was
> > >> designed so that it can:
> > >> 1. Work in chaos: even if there are no hints for rule types, it can
> still
> > >> work. Some opportunities may be lost, but NO failures, NO exceptions,
> and
> > >> NO worse than the original driver. People can migrate to the new
> driver
> > >> without concern.
> > >> 2. Be Improvable: Users can refactor their code, if they want, step by
> > >> step. As rule types become more and more accurate, the system achieves
> > >> more
> > >> and more benefits
> > >> 3. Be easy to customize: the default implementation is designed for
> the
> > >> most common cases, so that most users can benefit from it without much
> > >> effort. But it is not possible to fulfill all requirements as
> different
> > >> systems could have very different patterns to define logical and
> > >> physical. That's why the isTransformationRule method is put in
> > >> VolcanoPlanner and marked as protected: overwriting it can be very
> simple.
> > >>
> > >> Moreover, losing some "derive" opportunities is not as serious as
> > >> one might imagine. As I mention

Re: Trait propagation guidelines

2021-05-29 Thread Vladimir Ozerov
 +--  Broadcast
> > >+-- TableScan on bar
> > >
> > > Can you tell that the single phase aggregate plan is always better than
> > the 2 phase aggregate plan?
> > >
> > > > Therefore, the typical way to optimize
> > > > LogicalAggregate is to split in the physical phase (implementation
> > rule,
> > > > pass-through, derive). Practical systems like Dremio [1] and Flink
> [2]
> > > > work this way.
> > > Dremio and Flink work this way doesn't mean it is a good way. Greenplum
> > Orca and Alibaba MaxCompute optimizer work in another way. In Flink and
> > Dremio, they have HashAggPRule to generate 1 phase HashAgg and 2 phase
> > HashAgg, SortAggPRule to generate 1 phase SortAgg and 2 phase SortAgg.
> > However, do you think there is a possibility that the global SortAgg
> combined
> > with local HashAgg, or the global HashAgg combined with local SortAgg may
> > perform better in different cases? Are you going to generate all the 4
> > combinations in the implementation rule? There are some cases where we
> found
> > it better to split the aggregate into a 3-phase aggregate [1], in which case,
> > will the implementation rule generate 3 HashAggs or 3 SortAggs, or all
> the
> > 6 combinations?
> > >
> > > In our system, we have 1 phase, 2 phase, 3 phase logical aggregate
> rules
> > to transform the LogicalAggregate to another kind of logical aggregate(s)
> > with phase info, say LogicalXXXAggregate, then our physical aggregate
> rules
> > match this kind of node to generate HashAgg or StreamAgg. Of course, in
> the
> > logical rules, we can add business logic to guess the possible traits
> > delivered by child nodes to determine whether the rule definitely won't
> > generate a better alternative and may decide to abort this transformation
> > early. But I would rather let the cost model decide.
> > >
> > > Admittedly, the current top-down optimization is not pure on-demand
> > request oriented, because it will always generate a physical request
> > regardless of the parent nodes' trait request. For example, the following
> query
> > in a non-distributed environment:
> > > select a, b, c, max(d) from foo group by a, b, c order by a desc;
> > >
> > > It will first generate a StreamAgg[a ASC, b ASC, c ASC] no matter what
> > the parent node requires, then the "passThrough" tells StreamAgg that
> > parent requires [a DESC], we get a StreamAgg[a DESC, b ASC, c ASC]. It
> > would be ideal if we only generate StreamAgg[a DESC, b ASC, c ASC] by
> > request, but I don't think that will make much difference; the bottleneck
> > lies in the join order enumeration and the Project-related operations.
> > >
> > > Regards,
> > > Haisheng Yuan
> > >
> > > [1]
> >
> https://github.com/greenplum-db/gporca/blob/master/libgpopt/src/xforms/CXformSplitDQA.cpp
> > >
> > > On 2021/05/28 09:17:45, Vladimir Ozerov  wrote:
> > > > Hi Jinpeng, Haisheng,
> > > >
> > > > Thank you for your inputs. I really appreciate that. Let me try to
> > address
> > > > some of your comments and share some experience with the
> > implementation of
> > > > optimizers for a distributed engine I am currently working with.
> > > >
> > > > First of all, I would argue that multiple logical operators do not
> > have a
> > > > 1-1 mapping to physical operators, and Window is not special here.
> For
> > > > instance, LogicalAggregate doesn't have 1-1 mapping to physical
> > aggregates
> > > > because the physical implementation can be either 1-phase or 2-phase.
> > It
> > > > doesn't matter that the 2-phase aggregate is a composition of two
> > 1-phase
> > > > aggregates: the whole decision of whether to go with 1-phase or
> 2-phase
> > > > aggregate is a physical decision that should be made based on
> > available (or
> > > > assumed) input traits.
> > > >
> > > > Consider the following logical tree:
> > > > LogicalAggregate[group=$0, agg=SUM($1)]
> > > >   Input
> > > >
> > > > If I do the split on the logical phase with a separate transformation
> > rule,
> > > > I will get the following tree:
> > > > LogicalAggregate[group=$0, agg=SUM($1)]
> > > >   LogicalAggregate[group=$0, agg=SUM($1)]
> > > > Input
> > > >
> > > > Now we have an infinite loop 

Re: Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-05-28 Thread Vladimir Ozerov
BTW, I tried to change the implementation to:

protected boolean isTransformationRule(VolcanoRuleCall match) {
  return match.getRule() instanceof TransformationRule;
}

It solved my problem - plans returned to normal. In the Apache Calcite
repo, only 4 tests in the TopDownOptTest class failed due to a minor
operator reordering.

Fri, May 28, 2021 at 20:37, Vladimir Ozerov :

> Hi Jinpeng,
>
> Thank you for the clarification. When I saw the code in question for the
> first time, my first thought was that it was perhaps designed for gradual
> migration. The main problem is that the current implementation discards
> parts of the plan *silently*, which might be difficult to spot. I
> only spotted the problem in my specific case because I had ~100 tests with
> complex queries. Otherwise, I would happily proceed with the new rule
> without knowing that I lost important parts of the search space.
>
> That said, I think we can do the following:
>
>    1. Emit a warning or even throw an exception if the transformation
>rule produced a physical node. This should be trivial to implement - add an
>expected node type to VolcanoRuleCall (e.g., "logical", "physical", "any").
>The warning/exception should contain a proper fix suggestion - to override
>the VolcanoPlanner.isTransformationRule.
>2. Alternatively - do a breaking change. Apache Calcite doesn't have a
>major release cadence. It is normal practice in many products to do
>breaking changes in minor releases. Even popular products like Mongo or
>DataStax do it regularly. We may inform the user in the first release and
>change to "rule instanceof TransformationRule" in the next release.
>
> Does it make sense?
>
> Regards,
> Vladimir.
>
Fri, May 28, 2021 at 19:33, Jinpeng Wu :
>
>> Hi, Vladimir. Good catch! There could be some improvements here.
>>
>> Actually, this problem was discovered early when the top-down rule driver
>> was designed. At that time, no rule was annotated as "TransformationRule".
>> Moreover, it is impossible to ask every calcite user who designed their
>> own
>> rules to annotate the existing code. So the top-down rule driver was
>> designed so that it can:
>> 1. Work in chaos: even if there are no hints for rule types, it can still
>> work. Some opportunities may be lost, but NO failures, NO exceptions, and
>> NO worse than the original driver. People can migrate to the new driver
>> without concern.
>> 2. Be Improvable: Users can refactor their code, if they want, step by
>> step. As rule types become more and more accurate, the system achieves
>> more
>> and more benefits
>> 3. Be easy to customize: the default implementation is designed for the
>> most common cases, so that most users can benefit from it without much
>> effort. But it is not possible to fulfill all requirements as different
>> systems could have very different patterns to define logical and
>> physical. That's why the isTransformationRule method is put in
>> VolcanoPlanner and marked as protected: overriding it can be very simple.
>>
>> Moreover, losing some "derive" opportunities is not as serious as one
>> might imagine. As I mentioned in previous discussions, parents are in charge
>> of raising as many requirements as possible. During "derive", if specific
>> traits were not built by children, it means that no parents were requiring
>> that. And if parents finally require those traits later in the
>> optimization, passThrough methods get called, new physical nodes are
>> generated, and "derive" gets called again.
>> I tested it on millions of queries, with or without correct rule types, in
>> my own product. The performance of group pruning varies a lot. But the
>> output plans are almost the same. Only one obvious exception was
>> discovered: the spool node. That's because spool nodes cannot "passThrough"
>> parent traits (a spool node could have multiple parents, and the current
>> framework cannot handle such a situation) while it can "derive" input traits.
>>
>> Of course, this conclusion may not apply to your product as we could have
>> quite different rule sets. I am just sharing some of my experiences. Maybe
>> the current implementation of "isTransformationRule" is not good enough.
>> If
>> you have any better solutions, please share them.
>>
>> Thanks,
>> Jinpeng Wu
>>
>> On Fri, May 28, 2021 at 7:10 PM Vladimir Ozerov 
>> wrote:
>>
>> > Hi,
>> >
>> > I have an optimizer that uses top-down VolcanoPlanner and has a

Re: Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-05-28 Thread Vladimir Ozerov
Hi Jinpeng,

Thank you for the clarification. When I saw the code in question for the
first time, my first thought was that it was perhaps designed for gradual
migration. The main problem is that the current implementation discards
parts of the plan *silently*, which might be difficult to spot. I
only spotted the problem in my specific case because I had ~100 tests with
complex queries. Otherwise, I would happily proceed with the new rule
without knowing that I lost important parts of the search space.

That said, I think we can do the following:

   1. Emit a warning or even throw an exception if a transformation
   rule produced a physical node (a sketch follows after this list). This
   should be trivial to implement - add an expected node type to
   VolcanoRuleCall (e.g., "logical", "physical", "any"). The
   warning/exception should contain a proper fix suggestion - to override
   VolcanoPlanner.isTransformationRule.
   2. Alternatively - do a breaking change. Apache Calcite doesn't have a
   major release cadence. It is normal practice in many products to do
   breaking changes in minor releases. Even popular products like Mongo or
   DataStax do it regularly. We may inform the user in the first release and
   change to "rule instanceof TransformationRule" in the next release.

Does it make sense?

Regards,
Vladimir.

Fri, May 28, 2021 at 19:33, Jinpeng Wu :

> Hi, Vladimir. Good catch! There could be some improvements here.
>
> Actually, this problem was discovered early when the top-down rule driver
> was designed. At that time, no rule was annotated as "TransformationRule".
> Moreover, it is impossible to ask every calcite user who designed their own
> rules to annotate the existing code. So the top-down rule driver was
> designed so that it can:
> 1. Work in chaos: even if there are no hints for rule types, it can still
> work. Some opportunities may be lost, but NO failures, NO exceptions, and
> NO worse than the original driver. People can migrate to the new driver
> without concern.
> 2. Be Improvable: Users can refactor their code, if they want, step by
> step. As rule types become more and more accurate, the system achieves more
> and more benefits
> 3. Be easy to customize: the default implementation is designed for the
> most common cases, so that most users can benefit from it without much
> effort. But it is not possible to fulfill all requirements as different
> systems could have very different patterns to define logical and
> physical. That's why the isTransformationRule method is put in
> VolcanoPlanner and marked as protected: overriding it can be very simple.
>
> Moreover, losing some "derive" opportunities is not as serious as one
> might imagine. As I mentioned in previous discussions, parents are in charge
> of raising as many requirements as possible. During "derive", if specific
> traits were not built by children, it means that no parents were requiring
> that. And if parents finally require those traits later in the
> optimization, passThrough methods get called, new physical nodes are
> generated, and "derive" gets called again.
> I tested it on millions of queries, with or without correct rule types, in
> my own product. The performance of group pruning varies a lot. But the
> output plans are almost the same. Only one obvious exception was
> discovered: the spool node. That's because spool nodes cannot "passThrough"
> parent traits (a spool node could have multiple parents, and the current
> framework cannot handle such a situation) while it can "derive" input traits.
>
> Of course, this conclusion may not apply to your product as we could have
> quite different rule sets. I am just sharing some of my experiences. Maybe
> the current implementation of "isTransformationRule" is not good enough. If
> you have any better solutions, please share them.
>
> Thanks,
> Jinpeng Wu
>
> On Fri, May 28, 2021 at 7:10 PM Vladimir Ozerov 
> wrote:
>
> > Hi,
> >
> > I have an optimizer that uses top-down VolcanoPlanner and has a
> > ConverterRule for every LogicalNode. I have a new requirement when one of
> > the physical rules must emit several physical nodes instead of one. I
> tried
> > to convert a ConverterRule to a normal rule that emits physical nodes.
> Even
> > though the new rule has exactly the same pattern and logic as the
> previous
> > ConverterRule, plans changed. Analysis showed that this is likely due to a
> bug
> > in the top-down optimizer as explained below.
> >
> > When optimizing a logical node, the top-down first schedules the
> > transformation rules, and then implementation rules. The logic to check
> > whether the rule is transformation rule or not is located in
>

Top-down optimizer cannot explore the search space because physical rule is treated as transformation rule

2021-05-28 Thread Vladimir Ozerov
Hi,

I have an optimizer that uses top-down VolcanoPlanner and has a
ConverterRule for every LogicalNode. I have a new requirement when one of
the physical rules must emit several physical nodes instead of one. I tried
to convert a ConverterRule to a normal rule that emits physical nodes. Even
though the new rule has exactly the same pattern and logic as the previous
ConverterRule, plans changed. Analysis showed that this is likely due to a bug
in the top-down optimizer as explained below.

When optimizing a logical node, the top-down optimizer first schedules the
transformation rules, and then the implementation rules. The logic that checks
whether a rule is a transformation rule or not is located in
VolcanoPlanner.isTransformationRule [1]. The rule scheduling logic ensures
that a given rule is executed either as a transformation rule or as an
implementation rule, but not both. See TopDownRuleQueue.popMatch. The
top-down optimizer schedules tasks in a stack. So even though the
transformation rules are scheduled before the implementation rules, the latter
are executed first.

If an implementation rule produces a physical node, this node will be
notified about input traits in the "derive" phase. In contrast,
transformation rules produce logical nodes only, and this happens after the
derivation of the inputs is completed. Therefore, if the top-down optimizer
mistakenly treats an implementation rule as a transformation rule, "derive"
will not be called on the produced physical nodes, leading to incomplete
search space exploration.

It seems that this is exactly what happens in the current implementation.
The VolcanoPlanner.isTransformationRule looks like this:

 1: protected boolean isTransformationRule(VolcanoRuleCall match) {
 2:   if (match.getRule() instanceof SubstitutionRule) {
 3:     return true;
 4:   }
 5:   if (match.getRule() instanceof ConverterRule
 6:       && match.getRule().getOutTrait() == rootConvention) {
 7:     return false;
 8:   }
 9:   return match.getRule().getOperand().trait == Convention.NONE
10:       || match.getRule().getOperand().trait == null;
11: }

If the rule is a ConverterRule and it produces a node with the target
convention, it is treated as an implementation rule (lines 5-6). But if the
rule is not a ConverterRule, the method tries to deduce the rule's type
from the incoming convention (lines 9-10). In practice, implementation
rules either do not care about the incoming trait or expect the NONE trait.
Therefore, it seems that currently, the top-down optimizer treats many
implementation rules as transformation rules and, as a result, cannot notify
physical nodes produced from these rules about trait derivation.

This explains why in my case everything was ok when all implementation
rules were ConverterRules, and why I lost some optimal plans when the rule
was refactored to a non-converter variant.

Do you agree that this is a bug? If yes, shouldn't we refactor that code to
just check whether the rule is an instance of TransformationRule? Since
this is a breaking change, we may add a special flag that preserves the old
behavior by default but allows for the new behavior to overcome the
aforementioned problem.
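
For illustration, the flag-based variant might look roughly like this (the
flag name strictRuleTypeCheck is hypothetical, not existing Calcite API):

protected boolean isTransformationRule(VolcanoRuleCall match) {
  if (strictRuleTypeCheck) {
    // New behavior: trust the TransformationRule marker interface.
    return match.getRule() instanceof TransformationRule;
  }
  // Old behavior: deduce the rule type heuristically, as shown above.
  if (match.getRule() instanceof SubstitutionRule) {
    return true;
  }
  if (match.getRule() instanceof ConverterRule
      && match.getRule().getOutTrait() == rootConvention) {
    return false;
  }
  return match.getRule().getOperand().trait == Convention.NONE
      || match.getRule().getOperand().trait == null;
}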

Regards,
Vladimir.

[1]
https://github.com/apache/calcite/blob/calcite-1.26.0/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoPlanner.java#L1398-L1408


Re: Trait propagation guidelines

2021-05-28 Thread Vladimir Ozerov
optimizer, where
> there is exact match, partial satisfying, orderless match etc, Calcite's
> default implementation always requires exact satisfying. But we can still
> make use of "passThrough" and "derive" to achieve our goal. i.e. the
> aggregate generated by implementation rule requires itself and its child to
> deliver distribution on [a,b], but the "derive" method tells Aggregate
> that [b,a] is available, it can generate another option to require [b,a]
> instead.
> >
> > > In distributed engines, the nodes emitted from rules are basically
> "templates"
> > > that must be replaced with normal nodes.
> >
> > There is no difference between distributed and non-distributed engines
> when dealing with this. In Orca and CockroachDB optimizer, the nodes
> emitted from rules are operators without physical properties, the optimizer
> then request physical properties in top-down manner, either recursively or
> stack, or state machine. Calcite is quite different. when the physical
> operator is generated by implementation rule, the physical operator must
> has its own traits, at the same time, the traits that it expects its child
> operators to deliver. So in Calcite, they are not "templates". The
> difference is there since Calcite's inception.
> >
> > Regards,
> > Haisheng Yuan
> >
> > On 2021/05/27 08:59:33, Vladimir Ozerov  wrote:
> > > Hi Haisheng,
> > >
> > > Thank you for your inputs. They are really helpful. Let me summarize
> your
> > > feedback in my own words to verify that I understand it correctly.
> > >
> > >1. In distributed systems, an implementation rule may produce
> different
> > >physical operators depending on the input traits. Examples are
> Aggregate,
> > >Sort, Window.
> > >2. Since input traits are not known when the rule is fired, we must
> > >generate *all possible combinations* of physical operators that we
> may
> > >need. For LogicalAggregate, we must generate 1-phase and 2-phase
> > >alternatives. For LogicalSort, we also have 1-phase and 2-phase
> > >alternatives. Etc.
> > >3. If all combinations are generated, it is expected that
> "passThrough"
> > >and "derive" would be just trivial replacements of traits for most
> cases.
> > >This is why "passThroughTraits" and "deriveTraits" are recommended.
> A
> > >notable exception is TableScan that may emit alternative indexes in
> > >response to the pass-through requests.
> > >
> > > If my understanding is correct, then there are several issues with this
> > > approach still.
> > >
> > > 1. Unlike Aggregate or Sort, which may have only 1 or 2 phases, certain
> > > logical operators may have many physical alternatives. Consider the
> Window
> > > function:
> > > SELECT
> > >   AGG1 over (partition by a),
> > >   AGG2 over (partition by b),
> > >   AGG3 over (partition by c),
> > >   ...
> > > FROM input
> > >
> > > To calculate each aggregate, we need to re-shuffle the input based on
> the
> > > partition key. The key question is the order of reshuffling. If the
> input
> > > is sharded by [a], I want to calculate AGG1 locally and then re-shuffle
> the
> > > input to calculate other aggregates. For the remaining AGG2 and AGG3,
> the
> > > order is also important. If the parent demands sharding by [b], then
> the
> > > proper sequence is b-c-a:
> > > 1: Window[AGG2 over (partition by b)]     // SHARDED[b]
> > > 2:   Window[AGG3 over (partition by c)]   // SHARDED[c]
> > > 3:     Window[AGG1 over (partition by a)] // SHARDED[a]
> > > 4:       Input                            // SHARDED[a]
> > >
> > > But if the parent demands [c], the proper sequence is c-b-a. Since we
> do
> > > not know real distributions when the rule is fired, we must emit all
> the
> > > permutations to ensure that no optimization opportunity is missed. But
> with
> > > complex window aggregate, this might be impractical because we will
> emit
> > > lots of unnecessary nodes.
> > >
> > > 2. As input traits are not known when the rule is fired, the nodes
> emitted
> > > from the implementation rules most likely would not be used in the
> final
> > > plan. For example, I can create a physical aggregate that demands
> > > non-strict distribution {a,b} from its input, meaning that both [a,b]
> 

Re: Trait propagation guidelines

2021-05-27 Thread Vladimir Ozerov
 ...
>
> // Candidate 6. passThrough called on candidate2
> PhysicalAggregate[group=[a], F2_phase2(c)]
>   PhysicalAggregate[group=[a,b], F1_phase2(c)]  // deliver dist[a]
> Exchange[dist[a]]
>   PhysicalAggregate[group=[a,b], F1_phase1(c)]
> ...
>
> step 3:
> The cost model chooses the best candidate.
> Note that Candidate 5 is not always the best. For example, when it is
> detected, from stats or other sources, that data is skewed on key [a],
> Candidate 2 may be better. When it is detected that NDV(a, b) = 0.99 *
> ROWCOUNT(), Candidate 6 is preferred, as the partial aggregate can reduce
> little data. So it is not wasteful to build all those candidates.
>
> Most of the above work is done by the Calcite framework. Users only need to:
> 1. Fire both implementations when building aggregations.
> 2. Override the passThroughTraits method.
>
> Thanks,
> Jinpeng Wu
>
>
> On Thu, May 27, 2021 at 8:19 AM Haisheng Yuan  wrote:
>
> > Another point I would like to mention is that it is not recommended to
> > override the methods "passThrough" and "derive" directly; override
> > "passThroughTraits" and "deriveTraits" instead, so that we can make sure
> > only the same type of physical node is created and no nested relnodes or
> > additional RelSets are created, unless you know you have to create
> > different types of nodes. For example, if the table foo has a btree index
> > on column a, and the parent relnode is requesting ordering on column a,
> > then we may consider to override "passThrough" of TableScan to return an
> > IndexScan instead of a TableScan.
> >
> > Regards,
> > Haisheng Yuan
> > On 2021/05/26 22:45:20, Haisheng Yuan  wrote:
> > > Hi Vladimir,
> > >
> > > 1. You need a logical rule to split the aggregate into a local
> aggregate
> > and global aggregate, for example:
> > >
> >
> https://github.com/greenplum-db/gporca/blob/master/libgpopt/src/xforms/CXformSplitGbAgg.cpp
> > > Only implementation rules can convert a logical node to a physical node
> > or multiple physical nodes.
> > > After physical implementation, you have 2 physical alternatives:
> > > 1) single phase global physical aggregate,
> > > 2) 2 phase physical aggregate with local and global aggregate.
> > > It should be up to the cost to decide which one to choose.
> > >
> > > 2. Given a desired traitset from parent node, the current relnode only
> > needs to generate a single relnode after passing down the traitset.
> Given a
> > traitset delivered by child node, the current relnode only derives a
> single
> > relnode. Quite unlike other optimizers, in Calcite's top-down optimizer,
> you
> > don't need to worry about issuing multiple optimization requests to
> inputs,
> > which is handled by Calcite framework secretly. i.e.
> > > SELECT a, b, min(c) from foo group by a, b;
> > > In many other optimizers, we probably need to ask the aggregate to issue 3
> > distribution requests for tablescan on foo, which are
> > > 1) hash distributed by a,
> > > 2) hash distributed by b,
> > > 3) hash distributed by a, b
> > > However in Calcite top-down optimizer, your physical implementation
> rule
> > for global aggregate only needs to generate a single physical node with hash
> > distribution by a, b. In case the table foo happens to be distributed by
> a,
> > or b, the derive() method will tell you there is an opportunity. This is
> > the feature that Calcite's top-down optimizer excels over other
> optimizers,
> > because this can dramatically reduce the search space while keeping the
> > optimal optimization opportunity.
> > >
> > > 3. This is by design. Nodes produced from "passThrough" and "derive"
> are
> > just sibling physical nodes with different traitsets; we only need the
> > initial physical nodes after implementation to avoid unnecessary
> > operations. The fundamental reason is, unlike Orca optimizer where
> physical
> > node and physical property are separate things, Calcite's
> logical/physical
> > nodes contain traitsets. With regard to the latter question, can you give
> > an example?
> > >
> > > Regards,
> > > Haisheng Yuan
> > >
> > >
> > > On 2021/05/26 20:11:57, Vladimir Ozerov  wrote:
> > > > Hi,
> > > >
> > > > I tried to optimize a certain combination of operators for the
> > distributed
> > > > engine and got stuck with the trait propagation in the top-down
> > engine.

Re: Trait propagation guidelines

2021-05-27 Thread Vladimir Ozerov
> 2. Given a desired traitset from parent node, the current relnode only
> needs to generate a single relnode after passing down the traitset. Given a
> traitset delivered by child node, the current relnode only derives a single
> relnode. Quite unlike other optimizers, in Calcite's top-down optimizer, you
> don't need to worry about issuing multiple optimization requests to inputs,
> which is handled by Calcite framework secretly. i.e.
> > SELECT a, b, min(c) from foo group by a, b;
> > In many other optimizers, we probably need to ask the aggregate to issue 3
> distribution requests for tablescan on foo, which are
> > 1) hash distributed by a,
> > 2) hash distributed by b,
> > 3) hash distributed by a, b
> > However in Calcite top-down optimizer, your physical implementation rule
> for global aggregate only needs to generate a single physical node with hash
> distribution by a, b. In case the table foo happens to be distributed by a,
> or b, the derive() method will tell you there is an opportunity. This is
> the feature that Calcite's top-down optimizer excels over other optimizers,
> because this can dramatically reduce the search space while keeping the
> optimal optimization opportunity.
> >
> > 3. This is by design. Nodes produced from "passThrough" and "derive" are
> just sibling physical nodes with different traitsets; we only need the
> initial physical nodes after implementation to avoid unnecessary
> operations. The fundamental reason is, unlike Orca optimizer where physical
> node and physical property are separate things, Calcite's logical/physical
> nodes contain traitsets. With regard to the latter question, can you give
> an example?
> >
> > Regards,
> > Haisheng Yuan
> >
> >
> > On 2021/05/26 20:11:57, Vladimir Ozerov  wrote:
> > > Hi,
> > >
> > > I tried to optimize a certain combination of operators for the
> distributed
> > > engine and got stuck with the trait propagation in the top-down
> engine. I
> > > want to ask the community for advice on whether the problem is solvable
> > > with the current Apache Calcite implementation or not.
> > >
> > > Consider the following logical tree:
> > > 3: LogicalAggregate[group=[a], F2(c)]
> > > 2:  LogicalAggregate[group=[a,b], F1(c)]
> > > 1:    LogicalScan[t]
> > >
> > > Consider that these two aggregates cannot be merged or simplified for
> > > whatever reason. We have only a set of physical rules to translate this
> > > logical tree to a physical tree. Also, there could be any number of
> > > other operators between these two aggregates. We omit them for clarity,
> > > assuming that the distribution is not destroyed.
> > >
> > > In the distributed environment, non-collocated aggregates are often
> > > implemented in two phases: local pre-aggregation and final aggregation,
> > > with an exchange in between. Consider that the Scan operator is hash
> > > distributed by some key other than [a] or [b]. If we optimize operators
> > > without considering the whole plan, we may optimize each operator
> > > independently, which would give us the following plan:
> > > 3: PhysicalAggregate[group=[a], F2_phase2(c)]             // HASH_DISTRIBUTED [a]
> > > 3:   Exchange[a]                                          // HASH_DISTRIBUTED [a]
> > > 3:     PhysicalAggregate[group=[a], F2_phase1(c)]         // HASH_DISTRIBUTED [a,b]
> > > 2:       PhysicalAggregate[group=[a,b], F1_phase2(c)]     // HASH_DISTRIBUTED [a,b]
> > > 2:         Exchange[a, b]                                 // HASH_DISTRIBUTED [a,b]
> > > 2:           PhysicalAggregate[group=[a,b], F1_phase1(c)] // HASH_DISTRIBUTED [d]
> > > 1:             PhysicalScan[t]                            // HASH_DISTRIBUTED [d]
> > >
> > > This plan is not optimal, because we re-hash inputs twice. A better
> plan
> > > that we want to get:
> > > 3: PhysicalAggregate[group=[a], F2(c)]                // HASH_DISTRIBUTED [a]
> > > 2:   PhysicalAggregate[group=[a,b], F1_phase2(c)]     // HASH_DISTRIBUTED [a]
> > > 2:     Exchange[a]                                    // HASH_DISTRIBUTED [a]
> > > 2:       PhysicalAggregate[group=[a,b], F1_phase1(c)] // HASH_DISTRIBUTED [d]
> > > 1:         PhysicalScan[t]                            // HASH_DISTRIBUTED [d]
> > >
> > > In this case, we take advantag

Trait propagation guidelines

2021-05-26 Thread Vladimir Ozerov
Hi,

I tried to optimize a certain combination of operators for the distributed
engine and got stuck with the trait propagation in the top-down engine. I
want to ask the community for advice on whether the problem is solvable
with the current Apache Calcite implementation or not.

Consider the following logical tree:
3: LogicalAggregate[group=[a], F2(c)]
2:  LogicalAggregate[group=[a,b], F1(c)]
1:    LogicalScan[t]

Consider that these two aggregates cannot be merged or simplified for
whatever reason. We have only a set of physical rules to translate this
logical tree to a physical tree. Also, there could be any number of
other operators between these two aggregates. We omit them for clarity,
assuming that the distribution is not destroyed.

In the distributed environment, non-collocated aggregates are often
implemented in two phases: local pre-aggregation and final aggregation,
with an exchange in between. Consider that the Scan operator is hash
distributed by some key other than [a] or [b]. If we optimize operators
without considering the whole plan, we may optimize each operator
independently, which would give us the following plan:
3: PhysicalAggregate[group=[a], F2_phase2(c)]             // HASH_DISTRIBUTED [a]
3:   Exchange[a]                                          // HASH_DISTRIBUTED [a]
3:     PhysicalAggregate[group=[a], F2_phase1(c)]         // HASH_DISTRIBUTED [a,b]
2:       PhysicalAggregate[group=[a,b], F1_phase2(c)]     // HASH_DISTRIBUTED [a,b]
2:         Exchange[a, b]                                 // HASH_DISTRIBUTED [a,b]
2:           PhysicalAggregate[group=[a,b], F1_phase1(c)] // HASH_DISTRIBUTED [d]
1:             PhysicalScan[t]                            // HASH_DISTRIBUTED [d]

This plan is not optimal, because we re-hash inputs twice. A better plan
that we want to get:
3: PhysicalAggregate[group=[a], F2(c)]                // HASH_DISTRIBUTED [a]
2:   PhysicalAggregate[group=[a,b], F1_phase2(c)]     // HASH_DISTRIBUTED [a]
2:     Exchange[a]                                    // HASH_DISTRIBUTED [a]
2:       PhysicalAggregate[group=[a,b], F1_phase1(c)] // HASH_DISTRIBUTED [d]
1:         PhysicalScan[t]                            // HASH_DISTRIBUTED [d]

In this case, we take advantage of the fact that the distribution [a] is
compatible with [a,b]. Therefore we may enforce only [a], instead of doing
[a,b] and then [a]. Since exchange operators are very expensive, this
optimization may bring a significant boost to the query engine. Now the
question - how do we reach that state? Intuitively, a pass-through is
exactly what we need. We may pass the optimization request from top
aggregate to bottom aggregate to find physical implementations shared by
[a]. But the devil is in the details - when and how exactly to pass this
request?
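
As an aside, the compatibility check itself is cheap. A sketch in Java,
assuming sharding keys are modeled as arrays of column indexes (the method
is hypothetical, not existing Calcite API):

import java.util.Arrays;

// HASH_DISTRIBUTED[actual] satisfies HASH_DISTRIBUTED[required] when the
// actual keys are a subset of the required keys: rows that agree on all
// required keys also agree on the actual keys, so they are collocated.
static boolean satisfiesHash(int[] actualKeys, int[] requiredKeys) {
  return Arrays.stream(actualKeys)
      .allMatch(a -> Arrays.stream(requiredKeys).anyMatch(r -> r == a));
}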

Typically, we have a conversion rule that converts a logical aggregate to a
physical aggregate. We may invoke "convert" on the input to initiate the
pass-through:

RelNode convert(...) {
    return new PhysicalAggregate(
        convert(input, HASH_DISTRIBUTED[a])
    );
}

The first problem - we cannot create the normal physical aggregate here
because we do not know input traits yet. The final decision whether to do a
one-phase or two-phase aggregate can be made only in the
"PhysicalNode.derive" method when concrete input traits are resolved.
Therefore the converter rule should create a kind of "template" physical
operator, which would be used to construct the final operator(s) when input
traits are resolved. AFAIU Enumerable works similarly: we create operators
with virtually arbitrary traits taken from logical nodes in the conversion
rules. We only create the normal nodes later, in the derive() methods.
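
For example, such a template might finalize itself during derivation along
these lines (a sketch: createOnePhasePlan, createTwoPhasePlan, and
satisfiesGroupKeys are hypothetical helpers, not existing Calcite API):

// Inside the hypothetical "template" physical aggregate, which is emitted
// by the rule with placeholder traits and replaced here once real input
// traits arrive.
@Override public RelNode derive(RelTraitSet childTraits, int childId) {
  RelDistribution inputDist =
      childTraits.getTrait(RelDistributionTraitDef.INSTANCE);
  if (inputDist != null && satisfiesGroupKeys(inputDist)) {
    // The input is already distributed by the group keys: one phase suffices.
    return createOnePhasePlan(childTraits);
  }
  // Otherwise: local pre-aggregation + exchange + final aggregation.
  return createTwoPhasePlan(childTraits);
}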

The second problem - our top aggregate doesn't actually need the
HASH_DISTRIBUTED[a] input. Instead, it may accept inputs with any
distribution. What we really need is to inform the input (bottom aggregate)
that it should look for additional implementations that satisfy
HASH_DISTRIBUTED[a]. Therefore, enforcing a specific distribution on the
input using the "convert" method is not what we need because this
conversion might enforce unnecessary exchanges.

The third problem - derivation. Consider that we delivered the optimization
request to the bottom aggregate. As an implementor, what am I supposed to
do in this method? I cannot return the final aggregate from here because
the real input traits are not derived yet. Therefore, I can only return
another template, hoping that the "derive" method will be called on it.
However, this will not happen because trait derivation is skipped on the
nodes emitted from pass-through. See "DeriveTrait.perform" [1].

BottomAggregate {
    RelNode passThrough(distribution=HASH_DISTRIBUTED[a]) {
        // ???
    }
}

I feel that I am either going in the wrong direction, or some gaps in the
product disallow such optimization. So I would like to ask the community to
assist with the following questions:
1. In the top-down 

Re: AggregateUnionTransposeRule fails when some inputs have unique grouping key

2021-05-25 Thread Vladimir Ozerov
Done:
https://issues.apache.org/jira/browse/CALCITE-4616?focusedCommentId=17351269&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17351269

Fri, May 21, 2021 at 21:32, Julian Hyde :

> Can you add your proposed fix to the bug, and we can discuss there.
>
> > On May 21, 2021, at 2:33 AM, Vladimir Ozerov  wrote:
> >
> > Hi,
> >
> > I created two issues for two distinct bugs in
> AggregateUnionTransposeRule:
> >
> >   1. https://issues.apache.org/jira/browse/CALCITE-4616 - the problem
> with
> >   a partial Aggregate push-down
> >   2. https://issues.apache.org/jira/browse/CALCITE-4615 - the problem
> with
> >   an incorrect type of aggregate call (not discussed in this thread
> before)
> >
> > Regarding issue 1, my proposal is to remove the partial pushdown
> > optimization completely. We may do the following:
> >
> >   1. Iterate over all inputs, and check if they have a unique key on the
> >   Aggregate's group key.
> >   2. If all inputs are unique, return, as there is no point pushing the
> >   Aggregate past the Union.
> >   3. If at least one input is not unique, push aggregate to *all* inputs,
> >   to maintain the proper row type, assuming that the AggregateRemoveRule
> >   will eventually remove unnecessary aggregations.
> >
> > Regards,
> > Vladimir.
> >
Fri, May 21, 2021 at 09:43, JiaTao Tao :
> >
> >> Sounds like it makes sense
> >>
> >> Regards!
> >>
> >> Aron Tao
> >>
> >>
> >> On Wed, May 19, 2021 at 8:57 PM, Vladimir Ozerov wrote:
> >>
> >>> Hi,
> >>>
> >>> The AggregateUnionTransposeRule attempts to push the Aggregate below
> the
> >>> Union.
> >>>
> >>> Before:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Input1
> >>>     Input2
> >>>
> >>> After:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input1
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input2
> >>>
> >>> When pushing the Aggregate, it checks whether the input is definitively
> >>> unique on the grouping key. If yes, the Aggregate is not installed on
> top
> >>> of the input, assuming that the result would be the same as without the
> >>> Aggregate. This generates a type mismatch exception when aggregation is
> >>> pushed only to some of the inputs:
> >>> Aggregate[group=$0, agg=SUM($1)]
> >>>   Union[all]
> >>>     Aggregate[group=$0, agg=SUM($1)]
> >>>       Input1
> >>>     Input2
> >>>
> >>> It seems that the uniqueness check should not be in that rule at all,
> and
> >>> the aggregate should be pushed unconditionally. Motivation: we already
> >> have
> >>> AggregateRemoveRule that removes unnecessary aggregates. No need to
> >>> duplicate the same non-trivial logic twice.
> >>>
> >>> Does the proposal make sense to you?
> >>>
> >>> Regards,
> >>> Vladimir.
> >>>
> >>
>
>


Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov
Hi Haisheng,

Thank you for the advice. This is exactly how I designed distribution at
the moment (approach 2 from my original email) - as a List<int[]>
instead of just int[]. My main concern was the increased complexity of the
trait propagation/derivation, as I have to manage these nested lists by
hand. Nevertheless, it works well. So I hoped that there were better
built-in approaches that I might use. If the answer is negative, I'll
continue using the original approach, where multiple alternatives are
managed manually.
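
For reference, a minimal sketch of that representation in Java (the class
is hypothetical and omits the actual RelTrait contract):

import java.util.Arrays;
import java.util.List;

final class MultiShardedDistribution {
  // Each element is one equivalent key combination, e.g. [a1] and [b1]
  // after a collocated equijoin on a.a1 = b.b1.
  private final List<int[]> keys;

  MultiShardedDistribution(List<int[]> keys) {
    this.keys = keys;
  }

  // The trait satisfies a required sharding if any of its equivalent
  // key combinations matches the required one.
  boolean satisfies(int[] requiredKeys) {
    return keys.stream().anyMatch(k -> Arrays.equals(k, requiredKeys));
  }
}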

Regards,
Vladimir.

Tue, May 25, 2021 at 20:30, Haisheng Yuan :

> Hi Vladimir,
>
> Glad to see you raised the question.
>
> Here is the advice:
> Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> flawed and has many bugs. It can't work properly, whether for top-down or
> bottom-up.
>
> Instead, we need to add an equivalent-keys bitmap as a property of the
> physical trait, like RelCollation or RelDistribution.
>
> For example:
> class RelDistributionImpl {
>   // list of distribution keys
>   private ImmutableIntList keys;
>
>   // list of equivalent bitsets for each distribution key
>   private ImmutableList<ImmutableBitSet> equivBitSets;
> }
>
> In the trait satisfy and column remapping, we also need to take equivalent
> keys into consideration. Some of the work needs to be done in the Calcite
> core framework.
>
> Greenplum Orca optimizer has similar strategy:
>
> https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
>
> Regards,
> Haisheng Yuan
>
> On 2021/05/25 15:37:32, Vladimir Ozerov  wrote:
> > Hi,
> >
> > Consider the distributed SQL engine that uses a distribution property to
> > model exchanges. Consider the following physical tree. To do the
> > distributed join, we co-locate tuples using the equijoin key. Now the
> Join
> > operator has two equivalent distributions - [a1] and [b1]. It is critical
> > to expose both distributions so that the top Aggregate can take advantage
> > of the co-location.
> >
> > Aggregate[group=b1]
> >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> >     Input[a]                   // SHARDED[a1]
> >     Input[b]                   // SHARDED[b1]
> >
> > A similar example for the Project:
> > Aggregate[group=$1]
> >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> >     Input               // SHARDED[a]
> >
> > The question is how to model this situation properly?
> >
> > First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> > to handle this situation. However, I couldn't make them work with the
> > top-down optimizer. The reason is that when we register a RelNode with a
> > composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> > the default trait value in RelSet.add -> RelTraitSet.simplify. That is,
> the
> > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > original traits could not be derived in the PhysicalNode.derive methods.
> >
> > Second, we may try to model multiple sharding keys in a single trait. But
> > this complicates the implementation of PhysicalNode.passThrough/derive
> > significantly.
> > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> >
> > Third, we may expose multiple traits using metadata. RelMdDistribution
> > would not work, because it exposes only a single distribution. But a
> custom
> > handler may potentially fix that. However, it will not be integrated with
> > the top-down optimizer still, which makes the idea questionable.
> >
> > To summarize, it seems that currently there is no easy way to handle
> > composite traits with a top-down optimizer. I wonder whether someone from
> > the devlist already solved similar issues in Apache Calcite or other
> > optimizers. If so, what was the approach or best practices? Intuitively,
> it
> > seems that RelMultipleTrait/RelCompositeTrait approach might be the way
> to
> > go. But why do we replace the original composite trait set with the
> default
> > value in the RelTraitSet.simplify routine?
> >
> > Regards,
> > Vladimir.
> >
>


Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov
Hi Vladimir,

Thank you for the link. It is very relevant to my problem. I see that in
these discussions, there were several ideas and claims, such as that (1) we
can get rid of "simplify" altogether, (2) composite traits are rare in
practice, (3) composite traits are not designed well in the first place
[1]. I do not have the full picture in my head, so I'll try to share some
thoughts to advance the discussion.

Regarding (1), I think that the removal of "simplify" may help with my
particular (and pretty simple) test but might lead to some unpredictable
results for more complicated queries. Suppose that we generate two
equivalent nodes with different traits: [a] and [a][b]. Depending on the
nature of the trait def, these two nodes might or might not belong to the same
subset. For example, [a] and [a][b] are different subsets for
RelCollation. At the same time, [a] and [a][b] could belong to the same
subset for some distributions. That is, if the input is hash-distributed by
either [a] or [b], it might imply that a==b for every tuple (otherwise,
hashes will not match), and therefore every RelNode in the RelSet that is
sharded by [a] is also sharded by [b] and vice versa. The idea is similar to
transitive predicates. So ideally, we should let the RelTraitDef define how
to compare composite traits with other traits. Otherwise, we may lose some
optimization opportunities.

Regarding (2), perhaps the multi-collation nodes are really rare in
practice. But nodes with multiple hash distributions are widespread for
distributed engines. Because in distributed systems, the collocated hash
equijoin is the most common way of joining two inputs, and such join always
produces an additional distribution.

Regarding (3), it would be very interesting to hear suggestions and ideas
on the proper design of composite traits. The composite traits mechanics
mentioned in the RelSubset Javadoc is not a good design choice for
distribution traits. That is, if we have a node that is distributed by
[a][b], we cannot just put it into two subsets [a] and [b], because
operator parents may require both [a] and [b], otherwise unnecessary
exchanges could appear. That is, [a][b] should be propagated together. For
example, the removal of SHARDED[a1] from #1 would add the exchange between
#2 and #1, and the removal of SHARDED[b1] from #1 would add the exchange
between #3 and #2. Neither is optimal.
3: Aggregate[group=b1]
2:   Join[a.a1=c.c1]   // SHARDED[a1], SHARDED[b1], SHARDED[c1]
1:     Join[a.a1=b.b1]     // SHARDED[a1], SHARDED[b1]

@Haisheng Yuan , following your comment [1], would
you mind providing your ideas around the proper design of composite traits?
Are composite traits implemented in Orca?

Regards,
Vladimir.

[1]
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984

Tue, May 25, 2021 at 19:32, Vladimir Sitnikov :

> >VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify
>
> Vladimir, have you tried removing that RelTraitSet.simplify?
>
> I remember I have run into that multiple times already, and I suggested
> removing that "simplify".
> For example,
>
> https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377
>
>
> Vladimir
>


Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov
Hi,

Consider the distributed SQL engine that uses a distribution property to
model exchanges. Consider the following physical tree. To do the
distributed join, we co-locate tuples using the equijoin key. Now the Join
operator has two equivalent distributions - [a1] and [b1]. It is critical
to expose both distributions so that the top Aggregate can take advantage
of the co-location.

Aggregate[group=b1]
  DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
    Input[a]                   // SHARDED[a1]
    Input[b]                   // SHARDED[b1]

A similar example for the Project:
Aggregate[group=$1]
  Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
    Input               // SHARDED[a]

The question is how to model this situation properly?

First, it seems that RelMultipleTrait and RelCompositeTrait were designed
to handle this situation. However, I couldn't make them work with the
top-down optimizer. The reason is that when we register a RelNode with a
composite trait in MEMO, VolcanoPlanner flattens the composite trait into
the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
original traits could not be derived in the PhysicalNode.derive methods.

Second, we may try to model multiple sharding keys in a single trait. But
this complicates the implementation of PhysicalNode.passThrough/derive
significantly.
SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]

Third, we may expose multiple traits using metadata. RelMdDistribution
would not work, because it exposes only a single distribution. But a custom
handler may potentially fix that. However, it will not be integrated with
the top-down optimizer still, which makes the idea questionable.

To summarize, it seems that currently there is no easy way to handle
composite traits with a top-down optimizer. I wonder whether someone from
the devlist already solved similar issues in Apache Calcite or other
optimizers. If so, what was the approach or best practices? Intuitively, it
seems that RelMultipleTrait/RelCompositeTrait approach might be the way to
go. But why do we replace the original composite trait set with the default
value in the RelTraitSet.simplify routine?

Regards,
Vladimir.


Re: AggregateUnionTransposeRule fails when some inputs have unique grouping key

2021-05-21 Thread Vladimir Ozerov
Hi,

I created two issues for two distinct bugs in AggregateUnionTransposeRule:

   1. https://issues.apache.org/jira/browse/CALCITE-4616 - the problem with
   a partial Aggregate push-down
   2. https://issues.apache.org/jira/browse/CALCITE-4615 - the problem with
   an incorrect type of aggregate call (not discussed in this thread before)

Regarding issue 1, my proposal is to remove the partial pushdown
optimization completely. We may do the following:

   1. Iterate over all inputs, and check if they have a unique key on the
   Aggregate's group key.
   2. If all inputs are unique, return, as there is no point pushing the
   Aggregate past the Union.
   3. If at least one input is not unique, push the Aggregate to *all* inputs
   to maintain the proper row type, assuming that AggregateRemoveRule
   will eventually remove the unnecessary aggregations (a sketch of the
   gating check follows after this list).
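
The check in steps 1-2 might look like this (the helper name is
hypothetical; the surrounding rule code is omitted):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Aggregate;
import org.apache.calcite.rel.core.Union;
import org.apache.calcite.rel.metadata.RelMetadataQuery;

static boolean shouldPushAggregate(Aggregate aggregate, Union union,
    RelMetadataQuery mq) {
  for (RelNode input : union.getInputs()) {
    Boolean unique = mq.areColumnsUnique(input, aggregate.getGroupSet());
    if (unique == null || !unique) {
      // At least one non-unique input: push the Aggregate to all inputs.
      return true;
    }
  }
  // All inputs are unique on the group key: pushing would be a no-op.
  return false;
}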

Regards,
Vladimir.

Fri, May 21, 2021 at 09:43, JiaTao Tao :

> Sounds like it makes sense
>
> Regards!
>
> Aron Tao
>
>
> On Wed, May 19, 2021 at 8:57 PM, Vladimir Ozerov wrote:
>
> > Hi,
> >
> > The AggregateUnionTransposeRule attempts to push the Aggregate below the
> > Union.
> >
> > Before:
> > Aggregate[group=$0, agg=SUM($1)]
> >   Union[all]
> >     Input1
> >     Input2
> >
> > After:
> > Aggregate[group=$0, agg=SUM($1)]
> >   Union[all]
> >     Aggregate[group=$0, agg=SUM($1)]
> >       Input1
> >     Aggregate[group=$0, agg=SUM($1)]
> >       Input2
> >
> > When pushing the Aggregate, it checks whether the input is definitively
> > unique on the grouping key. If yes, the Aggregate is not installed on top
> > of the input, assuming that the result would be the same as without the
> > Aggregate. This generates a type mismatch exception when aggregation is
> > pushed only to some of the inputs:
> > Aggregate[group=$0, agg=SUM($1)]
> >   Union[all]
> >     Aggregate[group=$0, agg=SUM($1)]
> >       Input1
> >     Input2
> >
> > It seems that the uniqueness check should not be in that rule at all, and
> > the aggregate should be pushed unconditionally. Motivation: we already
> have
> > AggregateRemoveRule that removes unnecessary aggregates. No need to
> > duplicate the same non-trivial logic twice.
> >
> > Does the proposal make sense to you?
> >
> > Regards,
> > Vladimir.
> >
>


[jira] [Created] (CALCITE-4616) AggregateUnionTransposeRule causes row type mismatch when some inputs have unique grouping key

2021-05-21 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4616:


 Summary: AggregateUnionTransposeRule causes row type mismatch when 
some inputs have unique grouping key
 Key: CALCITE-4616
 URL: https://issues.apache.org/jira/browse/CALCITE-4616
 Project: Calcite
  Issue Type: Bug
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov


Consider the following relational tree:
{code}
1: Aggregate[group=a, SUM(b)]
2:   Union
3:     Input1[a, b, c]
4:     Input2[a, b, c]
{code}

It may happen that {{Input1}} has a unique key on {{[a]}}. In this case,
Apache Calcite will not install the {{Aggregate}} on top of it, which leads to
the following tree:
{code}
1: Aggregate[group=a, SUM(b)]
2:   Union
3:     Input1[a, b, c]
4:     Aggregate[group=a, SUM(b)]
5:       Input2[a, b, c]
{code}

Obviously, the tree is incorrect, because {{Union}} inputs now have different 
row types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CALCITE-4615) AggregateUnionTransposeRule may assign wrong aggregate function to input

2021-05-21 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4615:


 Summary: AggregateUnionTransposeRule may assign wrong aggregate 
function to input
 Key: CALCITE-4615
 URL: https://issues.apache.org/jira/browse/CALCITE-4615
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov


Consider the following query tree:
{code}
1: Aggregate[SUM($0):BIGINT]
2:   Union[$0:BIGINT]
3:     Input[$0:BIGINT]
4:     Input[$0:BIGINT NOT NULL]
{code}
The Union's row type is {{BIGINT}}, and the Aggregate's row type is {{BIGINT}}
as well.

When the {{AggregateUnionTransposeRule}} rule pushes {{Aggregate}} below 
{{Union}}, it installs the aggregate call from the original {{Aggregate}} into
the new aggregates:
{code}
1: Aggregate[SUM($0):BIGINT]
2:   Union[$0:BIGINT]
3:     Aggregate[SUM($0):BIGINT]
4:       Input[$0:BIGINT]
5:     Aggregate[SUM($0):BIGINT]
6:       Input[$0:BIGINT NOT NULL]
{code}
This leads to a Litmus failure in the {{Aggregate}} on line 5 because the 
aggregate call returns {{BIGINT}}, while it should return {{BIGINT NOT NULL}}.

The solution is to re-create aggregate calls for inputs and let Calcite deduce 
the return type.
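
A sketch of that fix (the variables {{call}}, {{groupCount}}, and {{input}} 
come from the surrounding rule code; passing a null type forces Calcite to 
re-deduce the return type, including nullability, from the actual input):
{code}
AggregateCall newCall =
    AggregateCall.create(call.getAggregation(), call.isDistinct(),
        call.isApproximate(), call.ignoreNulls(), call.getArgList(),
        call.filterArg, call.getCollation(), groupCount, input,
        /* type= */ null, call.getName());
{code}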



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


AggregateUnionTransposeRule fails when some inputs have unique grouping key

2021-05-19 Thread Vladimir Ozerov
Hi,

The AggregateUnionTransposeRule attempts to push the Aggregate below the
Union.

Before:
Aggregate[group=$0, agg=SUM($1)]
  Union[all]
    Input1
    Input2

After:
Aggregate[group=$0, agg=SUM($1)]
  Union[all]
    Aggregate[group=$0, agg=SUM($1)]
      Input1
    Aggregate[group=$0, agg=SUM($1)]
      Input2

When pushing the Aggregate, it checks whether the input is definitively
unique on the grouping key. If yes, the Aggregate is not installed on top
of the input, assuming that the result would be the same as without the
Aggregate. This generates a type mismatch exception when aggregation is
pushed only to some of the inputs:
Aggregate[group=$0, agg=SUM($1)]
  Union[all]
    Aggregate[group=$0, agg=SUM($1)]
      Input1
    Input2

It seems that the uniqueness check should not be in that rule at all, and
the aggregate should be pushed unconditionally. Motivation: we already have
AggregateRemoveRule that removes unnecessary aggregates. No need to
duplicate the same non-trivial logic twice.

Does the proposal make sense to you?

Regards,
Vladimir.


Re: Trait propagation in heterogeneous plans

2021-05-06 Thread Vladimir Ozerov
It may propagate the in-core distribution in theory, if the relevant code
exists. Practically, there is no such code. For example, consider
EnumerableProject:

   1. EnumerableProjectRule.convert doesn't propagate the input's distribution,
   thanks to EnumerableProject.create that uses RelOptCluster.traitSet.
   2. EnumerableProjectRule.derive also ignores all traits except for
   collation.

Therefore, irrespective of which trait set is present in the project's
input, the EnumerableProject will always have the default values for all
traits except for collation. This is what I refer to as "no trait
propagation". In this sense, EnumerableProject is an example of the correct
implementation wrt my proposal. But not all operators follow this, e.g.
EnumerableFilter.

Thu, May 6, 2021 at 14:39, Vladimir Sitnikov :

> >Enumerable in its current state cannot propagate any traits except for
> collation
>
> Enumerable can propagate in-core distribution trait.
>
> Vladimir
>


Re: Trait propagation in heterogeneous plans

2021-05-06 Thread Vladimir Ozerov
Hi,

I'd like to stress that I am not trying to argue about subjective
concepts at all. Quite the opposite - I would like to agree or disagree on
a set of objective facts and find the solution. Specifically, from what I
saw in Calcite's codebase and real projects, I assert the following:

   1. Calcite-based projects may use custom traits.
   2. Enumerable in its current state cannot propagate any traits except
   for collation. The relevant code is simply missing from the product; it was
   never implemented.
   3. Despite (2), Enumerable rules/operators may demand unsupported traits
   from inputs, or expose unsupported traits, which may lead to problems on
   the user side (an example is in the first message of this thread).

Do you agree with these points?

If we are in agreement here, then I propose only one thing - fix (3),
because it affects real-life integrations. The fix is trivial:

   - Make sure that Enumerable operators never set non-default trait values
   for anything except for collation. For example, EnumerableProjectRule
   creates an operator with the correct trait set, whilst
   EnumerableFilterRule propagates unsupported traits.
   - Replace RelNode.getTraitSet with RelOptCluster.traitSet when deducing
   the desired input trait set in Enumerable rules (see the sketch after this
   list).
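
A minimal sketch of the second fix (rel and input stand for variables of the
surrounding rule; real rules must also handle collation explicitly):

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;

// Start from the cluster's default trait set instead of copying the
// input's full trait set, so custom traits are not propagated:
RelTraitSet desiredTraits = rel.getCluster().traitSet()
    .replace(EnumerableConvention.INSTANCE);
RelNode convertedInput = convert(input, desiredTraits);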

These two fixes would ensure that we never have any non-default values of
any traits except for collation in Enumerable operators. On the one hand,
it fixes (3). On the other hand, it doesn't break anything, because thanks
to (2) there is nothing to break.

Does it make sense to you?

Regards,
Vladimir.


Thu, May 6, 2021 at 10:35, Vladimir Sitnikov :

> Vladimir,
>
> I generally agree with what you are saying,
>
> >Enumerable backend provides a clear and consistent contract: we support
> collation and reset everything
>
> That sounds like a way to go until there's a way to externalize "input
> trait enforcement" rules.
> "output" traits are simpler since they can be computed with metadataquery
> (however, we still hard-code the set of computed traits).
> It might be worth trying to compute all the traits known to the planner.
>
> However, Enumerable could play well with in-core distribution trait as
> well, so there's no need to limit enumerable to "collation only".
>
> If you don't like in-core distribution trait, you just do not use it.
> There's not much sense in limiting enumerable to collation only.
>
> Vladimir
>


Re: Trait propagation in heterogeneous plans

2021-05-05 Thread Vladimir Ozerov
Hi Vladimir, Julian,

I want to distinguish between two cases.

Some projects may decide to use Calcite's distribution trait. To my
knowledge, this is not a common pattern because it is not really integrated
into Calcite. It is not destroyed/adjusted in rules and operators as
needed, not integrated into EnumerableConvention.enforce, etc.

Other projects may decide to use a custom distribution trait. Examples are
Apache Flink, Hazelcast, and some other private projects we work on. There
are many reasons to do this. A couple of examples:
1. Calcite's distribution produces a logical exchange, while production-grade
optimizers are typically multi-phase and want the distribution
convention to produce physical exchanges in dedicated physical phase(s).
2. Some systems may have custom requirements for distribution, such as
propagating the number of shards, supporting multiple equivalent keys, etc.

But in both cases, the bottom line is that Enumerable currently cannot
work with either built-in or custom distributions, because the associated
code is not implemented in Calcite's core. And even if we add the
fully-fledged support of the built-in distribution to Enumerable, many
projects will continue using custom distribution traits because the
exchange is a physical operation with lots of backend-dependent specific
quirks, and any attempt to model it abstractly in Calcite's core is
unlikely to cover some edge cases.

The same applies to any other custom trait that depends on columns -
Enumerable will not be able to process it correctly.

Therefore, instead of having definitively broken code, it might be better
to apply a defensive approach where the whole Enumerable backend provides
a clear and consistent contract: we support collation and reset everything
else. IMO it is better because it matches the current behavior and would
never cause strange bugs in user code. If in the future we invest in the
proper integration of the built-in distribution or figure out how to
"externalize" the trait propagation for Enumerable operators, we may relax
this statement.

Please let me know if it makes any sense.

Regards,
Vladimir.

Tue, May 4, 2021 at 21:02, Julian Hyde :

> > I would say known in-core vs unknown trait is a reasonable approach to
> > distinguish traits.
>
> Easy, but not reasonable. It will make it very difficult to reuse
> existing rels and rules (e.g. Enumerable) in a downstream project that
> has defined its own traits.
>
> On Tue, May 4, 2021 at 10:44 AM Vladimir Sitnikov
>  wrote:
> >
> > > It seems arbitrary to include Collation but exclude other traits.
> >
> > I would say known in-core vs unknown trait is a reasonable approach to
> > distinguish traits.
> >
> > Vladimir
>


Re: Trait propagation in heterogeneous plans

2021-05-04 Thread Vladimir Ozerov
Hi Vladimir,

I can't share the reproducer, as it is behind an NDA. But the problem
is evident from the code.

There are two distinct issues actually:

   1. Propagation of unsupported traits in operators. EnumerableProject is
   not affected. Examples of the problem: EnumerableWindowRule,
   EnumerableFilterRule
   2. Incorrect enforcement of the input traits. Example:
   EnumerableProjectRule.convert. Imagine that I have an input with some
   custom trait, say, distribution. The EnumerableProjectRule may require
   the input to satisfy some specific distribution. But given that the
   distribution is not supported by Enumerable, I want to destroy the
   distribution in my convention enforcer. If I do so, I get the
   CannotPlanException, because the created EnumerableProject incorrectly
   requires the specific distribution from the input.

Regards,
Vladimir.

Tue, May 4, 2021 at 11:06, Vladimir Sitnikov :

> >First, the EnumerableProjectRule is executed. This rule propagates traits
> >from the input, replacing only convention.
>
> Vladimir, could you please share a reproducer?
>
> EnumerableProject#create explicitly resets all the traits for
> EnumerableProject except convention=enumerable, and
> collation=computed_with_metadataquery
> In practice, it could compute distribution traits as well; however, that is
> missing.
>
> Are you sure you get EnumerableProject with non-default distribution
> somehow?
>
> Vladimir
>


Re: Trait propagation in heterogeneous plans

2021-05-04 Thread Vladimir Ozerov
Hi Haisheng,

My original problem was with how Enumerable propagates traits. Many
Enumerable rules copy traits from the child operator. This seems wrong
because, as you mentioned, Enumerable supports only collation. Propagation
of the unsupported traits may lead to CannotPlanException as in the example
above when having a plan with multiple conventions.

Therefore, the proposal is to change Enumerable rules, so that they
propagate only collation, but not other traits. Does it make sense?

Regards,
Vladimir.

Wed, Apr 21, 2021 at 04:31, Haisheng Yuan :

> Hi Vladimir,
>
> > There are two problems here. First, the project operator potentially
> > destroys any trait which depends on column order, such as distribution or
> > collation. Therefore, EnumerableProject has an incorrect value of the
> > distribution trait.
>
> The enumerable convention is intended for in-memory, non-distributed
> environment.
> Therefore, we only consider 2 traits: collation and convention. Other
> traits are not
> guaranteed to work correctly. If you want it to work with distribution, you
> have to create
> your own operators, rules, either by extending or overriding, in which
> case, you will need
> to remap distribution columns to get the correct distribution trait, just
> like how collation does.
>
> > Second, which distribution should I assign to the CustomToEnumerable
> node?
> > As I know that parent convention cannot handle the distribution properly,
> > my natural thought is to set it to ANY.
>
> You can assume CustomToEnumerable to be an Enforcer operator, like Sort,
> Exchange.
> Sort only changes data collation, Exchange changes data distribution and
> collation, similarly
> CustomToEnumerable only changes convention, but retains collation and
> distribution, I assume.
> But in practice, it should be decided by the operator inventor and the
> underlying physical
> implementation.
>
> Hope that answers your question. Feel free to ask if you have more
> questions.
>
> Thanks,
> Haisheng Yuan
>
> On 2021/03/27 08:43:15, Vladimir Ozerov  wrote:
> > Hi,
> >
> > Apache Calcite supports heterogeneous optimization when nodes may have
> > different conventions. The Enumerable rules propagate all traits from
> > inputs. We have doubts whether this is correct or not.
> >
> > Consider the following initial plan, which was created by Apache Calcite
> > after sql-to-rel conversion and invocation of TranslatableTable.toRel.
> The
> > table is in the CUSTOM convention. In this convention, there is an
> > additional Distribution trait that tracks which attribute is used for
> > sharding. It could be either SHARDED or ANY. The latter is the default
> > distribution value which is used when the distribution is unknown.
> Suppose
> > that the table is distributed by the attribute $0.
> > LogicalProject [convention=NONE,   distribution=ANY]
> >   CustomTable  [convention=CUSTOM, distribution=SHARDED($0)]
> >
> > Now suppose that we run VolcanoPlanner with two rules:
> EnumerableProjectRule
> > and converter rules that translate the CUSTOM node to ENUMERABLE node.
> > First, the EnumerableProjectRule is executed. This rule propagates traits
> > from the input, replacing only convention. Notice how it propagated the
> > distribution trait.
> > EnumerableProject [convention=ENUMERABLE, distribution=SHARDED($0)]
> >   CustomTable [convention=CUSTOM, distribution=SHARDED($0)]
> >
> > Next, the converter will be invoked, yielding the following final plan:
> > EnumerableProject[convention=ENUMERABLE, distribution=SHARDED($0)]
> >   CustomToEnumerable [convention=ENUMERABLE, distribution=???]
> > CustomTable  [convention=CUSTOM, distribution=SHARDED($0)]
> >
> > There are two problems here. First, the project operator potentially
> > destroys any trait which depends on column order, such as distribution or
> > collation. Therefore, EnumerableProject has an incorrect value of the
> > distribution trait.
> > Second, which distribution should I assign to the CustomToEnumerable
> node?
> > As I know that parent convention cannot handle the distribution properly,
> > my natural thought is to set it to ANY. However, at least in the top-down
> > optimizer, this will lead to CannotPlanException, unless I declare that
> [ANY
> > satisfies SHARDED($0)], which is not the case: ANY is unknown
> distribution,
> > so all distributions satisfy ANY, but not vice versa.
> >
> > My question is - shouldn't we ensure that only the collation trait is
> > propagated from child nodes in Enumerable rules? For example, in the
> > Enu

Re: Time zone management and DST in Avatica

2021-03-31 Thread Vladimir Ozerov
Hi Julian, Vladimir,

The Israel time zone is not fixed. Formally, you may find different
abbreviations, like IST and IDT. But in real systems, like Postgres, MS
SQL, or Java, this is a single time zone with DST changes.

The following query in Postgres will return 21:00, 21:00, 22:00 for 24, 25,
and 26 Oct respectively:
select
  cast('October 24 00:00:00 2020 Israel' as timestamp with time zone) at
time zone 'UTC',
  cast('October 25 00:00:00 2020 Israel' as timestamp with time zone) at
time zone 'UTC',
  cast('October 26 00:00:00 2020 Israel' as timestamp with time zone) at
time zone 'UTC'
from t;

Same for SQL Server, even though the zone is named "Israel *Standard* Time"

In Java, the relevant ZoneId is "Israel". And it also returns 21:00, 21:00,
22:00 for these dates:

import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

ZoneId utcZoneId = ZoneId.of("UTC");
ZoneId istZoneId = ZoneId.of("Israel");
LocalDate date24 = LocalDate.of(2020, 10, 24);
LocalDate date25 = LocalDate.of(2020, 10, 25);
LocalDate date26 = LocalDate.of(2020, 10, 26);
// Prints 21:00, 21:00 and 22:00 UTC for the three dates:
System.out.println(ZonedDateTime.ofInstant(
    date24.atStartOfDay(istZoneId).toInstant(), utcZoneId));
System.out.println(ZonedDateTime.ofInstant(
    date25.atStartOfDay(istZoneId).toInstant(), utcZoneId));
System.out.println(ZonedDateTime.ofInstant(
    date26.atStartOfDay(istZoneId).toInstant(), utcZoneId));

I do not have an isolated test for Avatica for now, but I observed the
problem in the real system. The problematic code is in
AbstractCursor.longToDate. The method accepts the absolute GMT time at the
start of the day. E.g., 25-Oct-2020 00:00 GMT. At this time Israel's offset
was +2:00. Then it subtracts that offset, assuming that  25-Oct-2020 00:00
IST == 24-Oct-2020 22:00 GMT. But it is not, because several hours earlier
the offset was +3:00 due to DST. So actually 25-Oct-2020 00:00 IST ==
24-Oct-2020 21:00 GMT. As a result, Avatica will return 22:00 from the
result set, while other databases and Java would return 21:00.

I can write a dedicated test for that, but it would require some time. But
I do not see much value in it during the discussion, because the problem is
relatively clear: the offset at time T2 cannot be used to deduce the offset
at time T1.

This is why other drivers often do some "magic" with Calendar to get the
correct time. Like in PG JDBC [1]. Notice, that for time zones without DST,
they just do some simple math, similarly to Avatica. But for time zones
with DST, they do more complicated calculations to get the correct result.

Regards,
Vladimir.

[1]
https://github.com/pgjdbc/pgjdbc/blob/866c6a9e4cc42d9c279d68b8c756f562eaf0f249/pgjdbc/src/main/java/org/postgresql/jdbc/TimestampUtils.java#L1329

Wed, Mar 31, 2021 at 23:25, Julian Hyde :

> Israel Standard Time is fixed at UTC+2. It does not observe daylight
> savings time. So maybe your wall clock should have been in the
> 'Asia/Jerusalem' time zone rather than Israel Standard Time.
>
> On Wed, Mar 31, 2021 at 12:23 PM Vladimir Sitnikov
>  wrote:
> >
> > >Let me provide the exact example
> >
> > Sorry for not being clear.
> > As far as I understand, Avatica means "Java JDBC API" or "something like
> Go
> > API".
> >
> > Could you please provide a test case via Avatica API (e.g. JDBC API)
> along
> > with the actual and expected result?
> > For example, org.apache.calcite.jdbc.CalciteRemoteDriverTest verifies the
> > behavior of Avatica+Calcite
> > integration.
> >
> > Vladimir
>


Re: Time zone management and DST in Avatica

2021-03-31 Thread Vladimir Ozerov
Hi,

Thank you for your feedback. Let me provide the exact example of when the
problem happened. Consider the date 2020 Oct 25 in Israel Standard Time
[1]. DST ended on this date. The date represents exactly 18560
days since 01 Jan 1970. When there was 2020 Oct 25 00:00 IST, it was 2020
Oct 24 21:00 GMT. Therefore, my expectation is that if I return 18560 to
Avatica, I would get 2020 Oct 24 21:00 GMT millis back. This is what at
least PG and SQL Server return. You may check it in SQLFiddle [2]
quickly:

Schema:
create table t(c int primary key);
insert into t values (1);

Postgres:
select
  cast('October 25 00:00:00 2020 Israel' as timestamp with time zone) at
time zone 'UTC'
from t;

SQL Server:
select
  ({d'2020-10-25'}  AT TIME ZONE 'Israel Standard Time') AT TIME ZONE 'UTC'
from t;

However, Avatica returns 2020 Oct 24 22:00 GMT. This is wrong because at
22:00 GMT it was 01:00 on a wall clock in Israel.
The problem is that we cannot use the current time to derive the offset
that should be applied to the past time.

Regards,
Vladimir.

[1] https://www.timeanddate.com/time/zone/israel/jerusalem
[2] http://sqlfiddle.com/

Wed, Mar 31, 2021 at 10:38, Vladimir Sitnikov :

> Julian>you should be able to come up with a test case
> Julian>against a reputable database (e.g. PostgreSQL
>
> pgjdbc committer here :)
>
> There are timezone-related tests in
>
> https://github.com/pgjdbc/pgjdbc/blob/c633cc6cf5295bfbd5b6a79bb45dff863c5056f5/pgjdbc/src/test/java/org/postgresql/test/jdbc2/TimezoneTest.java
>
> and
>
> https://github.com/pgjdbc/pgjdbc/blob/c633cc6cf5295bfbd5b6a79bb45dff863c5056f5/pgjdbc/src/test/java/org/postgresql/test/jdbc42/GetObject310Test.java
>
> It might be fun to run the tests through Avatica.
>
> Vladimir, frankly speaking, I don't follow what you mean by "Avatica
> assumes that the underlying engine operates on relative time"
> Could you please provide the exact API call sequence and the expected
> result?
>
> Vladimir
>


Time zone management and DST in Avatica

2021-03-28 Thread Vladimir Ozerov
Hi,

Avatica assumes that the underlying engine operates on relative time
without time zone as opposed to global time (UTC). When you set a temporal
value (e.g., a prepared statement parameter), Avatica adds the current
offset to the passed time. When you read a temporal value, Avatica
subtracts the current offset. This may lead to incorrect results if the DST
offset changes.

Consider a time zone with DST that works as follows. D1 and D2
are two consecutive days (e.g., 24 and 25 Oct):
D2 00:00 GMT+2 -> D1 22:00 GMT
D2 01:00 GMT+2 -> D1 23:00 GMT
D2 03:00 GMT+3 -> D2 00:00 GMT
D2 04:00 GMT+3 -> D2 01:00 GMT

Now consider, that we want to save D2 00:00 GMT+2 using Avatica. On write,
Avatica will advance the time by the TZ offset. On read, Avatica will
subtract the TZ offset. The problem is that different offsets will be used,
leading to the incorrect result. The associated logic is located in
AbstractCursor and TypedValue classes.

long initial = [D2 00:00 GMT+2].epochMillis(); // D1 22:00 GMT
long onWrite = initial + offsetAt(initial);    // D2 00:00 GMT (offset +2)
long onRead  = onWrite - offsetAt(onWrite);    // D1 21:00 GMT (offset +3)
assert initial == onRead; // Fails

The fundamental problem is that the current time offset is used, which
might differ before and after adjustment. One potential solution is to
enhance the reading part: check whether the offset after the
subtraction is still the same and, if not, apply an additional adjustment to
restore the proper time.
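
A minimal sketch of that reading-side check (offsetAt stands for the
time-zone offset lookup; the class and helper are illustrative):

import java.util.TimeZone;

final class DstSafeRead {
  private static final TimeZone TZ = TimeZone.getDefault();

  private static long offsetAt(long utcMillis) {
    return TZ.getOffset(utcMillis);
  }

  /** Undoes the write-side shift, re-checking the offset after subtraction. */
  static long readAdjusted(long onWrite) {
    long candidate = onWrite - offsetAt(onWrite);
    long offsetAtCandidate = offsetAt(candidate);
    if (offsetAtCandidate != offsetAt(onWrite)) {
      // A DST transition lies between the two instants: re-subtract using
      // the offset that was actually in effect at the restored time.
      candidate = onWrite - offsetAtCandidate;
    }
    return candidate;
  }
}

In the example above: onWrite is D2 00:00 GMT, the first subtraction uses +3
and yields D1 21:00 GMT, but the offset at D1 21:00 GMT is +2, so the
re-subtraction restores the original D1 22:00 GMT.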

Do you have any objections to the proposed change?

Regards,
Vladimir.


Trait propagation in heterogeneous plans

2021-03-27 Thread Vladimir Ozerov
Hi,

Apache Calcite supports heterogeneous optimization when nodes may have
different conventions. The Enumerable rules propagate all traits from
inputs. We have doubts whether this is correct or not.

Consider the following initial plan, which was created by Apache Calcite
after sql-to-rel conversion and invocation of TranslatableTable.toRel. The
table is in the CUSTOM convention. In this convention, there is an
additional Distribution trait that tracks which attribute is used for
sharding. It could be either SHARDED or ANY. The latter is the default
distribution value which is used when the distribution is unknown. Suppose
that the table is distributed by the attribute $0.
LogicalProject [convention=NONE,   distribution=ANY]
  CustomTable  [convention=CUSTOM, distribution=SHARDED($0)]

Now suppose that we run VolcanoPlanner with two rules: EnumerableProjectRule
and converter rules that translate the CUSTOM node to ENUMERABLE node.
First, the EnumerableProjectRule is executed. This rule propagates traits
from the input, replacing only convention. Notice how it propagated the
distribution trait.
EnumerableProject [convention=ENUMERABLE, distribution=SHARDED($0)]
  CustomTable [convention=CUSTOM, distribution=SHARDED($0)]

Next, the converter will be invoked, yielding the following final plan:
EnumerableProject[convention=ENUMERABLE, distribution=SHARDED($0)]
  CustomToEnumerable [convention=ENUMERABLE, distribution=???]
CustomTable  [convention=CUSTOM, distribution=SHARDED($0)]

There are two problems here. First, the project operator potentially
destroys any trait which depends on column order, such as distribution or
collation. Therefore, EnumerableProject has an incorrect value of the
distribution trait.
Second, which distribution should I assign to the CustomToEnumerable node?
As I know that parent convention cannot handle the distribution properly,
my natural thought is to set it to ANY. However, at least in the top-down
optimizer, this will lead to CannotPlanException, unless I declare that [ANY
satisfies SHARDED($0)], which is not the case: ANY is an unknown distribution,
so all distributions satisfy ANY, but not vice versa.

My question is - shouldn't we ensure that only the collation trait is
propagated from child nodes in Enumerable rules? For example, in the
EnumerableProjectRule instead of doing:
input.getTraitSet()
  .replace(EnumerableConvention.INSTANCE)
  .replace(collation)

we may do:
RelOptCluster.traitSet()
  .replace(EnumerableConvention.INSTANCE)
  .replace(collation)

This would ensure that all other traits are set to the default value. The
generalization of this idea is that every convention has a set of supported
traits. Every unsupported trait should be set to the default value.

I would appreciate your feedback on the matter.

Regards,
Vladimir.


[jira] [Created] (CALCITE-4543) Interval literal loses a fractional second when it has scale greater than 3

2021-03-19 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4543:


 Summary: Interval literal loses a fractional second when it has 
scale greater than 3
 Key: CALCITE-4543
 URL: https://issues.apache.org/jira/browse/CALCITE-4543
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov


The allowed fractional second precision for literals is between 1 and 9, with 
the default value set to 6. The relevant constants are defined in the 
{{SqlTypeName}} class, see {{DEFAULT_INTERVAL_FRACTIONAL_SECOND_PRECISION}} and 
{{MAX_INTERVAL_FRACTIONAL_SECOND_PRECISION}}.

At the same time, the {{DAY-SECOND}} literals are converted to milliseconds 
during actual processing, see {{SqlParserUtil.intervalToMillis}}. As a result, 
the sub-millisecond part is lost silently:
{code}
CAST('2021-01-01 10:00:00' as TIMESTAMP) + INTERVAL '0.001' SECOND => 
2021-01-01 10:00:00.001
CAST('2021-01-01 10:00:00' as TIMESTAMP) + INTERVAL '0.0001' SECOND => 
2021-01-01 10:00:00.0
{code}

There are two possible solutions here, which are orthogonal to each other:
# Since Apache Calcite claims to support up to 9 fractional positions, it 
should work with nanoseconds, rather than milliseconds.
# Provide a way for products to override the default and maximum scale for 
intervals.
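
For illustration, the difference between the two units (a sketch of the 
arithmetic, not the actual {{SqlParserUtil}} code):
{code}
java.math.BigDecimal seconds = new java.math.BigDecimal("0.0001");
long millis = seconds.movePointRight(3).longValue(); // 0: the 100 microseconds are lost
long nanos = seconds.movePointRight(9).longValue();  // 100000: preserved
{code}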



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Default disable the RexNode normalization(or operands reorder)

2021-03-13 Thread Vladimir Ozerov
Hi Julian,

I agree that in your example normalization may have different concerns
compared to simplification. However, normalization and simplification
sometimes address similar problems as well. For example, simplification
may decrease the search space, but so does normalization: e.g.,
normalized reordering of operands in a join condition may allow for the
merge of equivalent nodes that otherwise would be considered
non-equivalent. Do any of the currently implemented rules depend on some
normalized representation?

Also, as many rules (such as join reorder rules) generate filters, I would
argue that moving the normalization to a separate phase might cause the
unnecessary expansion of the search space.

The idea I expressed above is inspired by CockroachDB (again :-)). In
CockroachDB, expressions are part of the MEMO and treated similarly to
relational operators, which allows for the unified rule infrastructure for
both operators and expressions. Expressions are created using a
context-aware builder, which knows the set of active normalization rules.
Whenever a builder is to create a new expression (not necessarily
the top-level), the normalization rules are invoked in a heuristic manner.
The code generation is used to build the heuristic rule executor. Both
normalization and simplification (in our terms) rules are invoked here. For
example, see [1] (normalization) and [2] (simplification). Finally, the
expression is registered in MEMO. As a result, every expression ever
produced is always in a normalized/simplified form.

I am not saying that we should follow this approach. But IMO (1) unified
handling of simplification and normalization through rules and (2) a single
entry point for all normalization (builder) are interesting design
decisions, as they offer both flexibility and convenience.

Regards,
Vladimir.

[1]
https://github.com/cockroachdb/cockroach/blob/release-21.1/pkg/sql/opt/norm/rules/scalar.opt#L8
[2]
https://github.com/cockroachdb/cockroach/blob/release-21.1/pkg/sql/opt/norm/rules/bool.opt#L30

Fri, Mar 12, 2021 at 07:15, Julian Hyde :

> Without simplifications, many trivial RelNodes would be produced. It is
> beneficial to have those in RelBuilder; if they were in rules, the trivial
> RelNodes (and equivalence sets) would still be present, increasing the size
> of the search space.
>
> I want to draw a distinction between simplification and normalization. A
> normalized form is relied upon throughout the system. Suppose for example,
> that we always normalize ‘RexLiteral = RexInputRef’ to ‘RexInputRef =
> RexLiteral’. If a rule encountered the latter case, it would not be a bug
> if the rule failed with, say, a ClassCastException.
>
> So, I disagree with Vladimir that 'RexSimplify may also be considered a
> “normalization”’. If simplification is turned off, each rule must be able
> to deal with the unsimplified expressions.
>
> Also, the very idea of normalizations being optional, enabled by system
> properties or other config, is rather disturbing, because the rules
> probably don’t know that the normalization has been turned off.
>
> The only place for normalization, in my opinion, is explicitly, in a
> particular planner phase. For example, pulling up all filters before
> attempting to match materialized views.
>
> Julian
>
> > On Mar 11, 2021, at 10:37 AM, Vladimir Ozerov 
> wrote:
> >
> > In our practice, we also had some problems with normalization. First, we
> > observed problems with the unwanted (and sometimes
> > incorrect) simplification of expressions with CASTs and literals which
> came
> > from RexSimplify. I couldn't find an easy way to disable that behavior.
> > Note, that RexSimplify may also be considered a "normalization". Second,
> we
> > implemented a way to avoid Project when doing join reordering but had
> some
> > issues with operator signatures due to lack of automatic normalization
> for
> > expressions for permuted inputs. These two cases demonstrate two opposite
> > views: sometimes you want a specific normalization to happen
> automatically,
> > but sometimes you want to disable it.
> >
> > Perhaps an alternative approach could be to unify all simplification and
> > normalization logic and split it into configurable rules. Then, we may
> add
> > these rules as a separate rule set to the planner, which would be invoked
> > heuristically every time an operator with expressions is registered in
> > MEMO. In this case, a user would not need to bother about RexNode
> > constructors. To clarify, under "rules" I do not mean heavy-weight rules
> > similar to normal rules. Instead, it might be simple pattern+method
> pairs,
> > that could even be compiled into a static program using Janino. This
> 

Re: Make RelNode attribute order part of MEMO

2021-03-13 Thread Vladimir Ozerov
Hi Julian,

Thank you for sharing these issues. We ended up with almost the same ideas.
We attempted to add an input permutation to Join, which allowed us to avoid
Projects. However, that complicates the integration with other rules, just
as you mention in [2]. Globally unique column IDs seem like a better option
from that perspective. Moreover, unique IDs may simplify the implementation
of other optimizations. For example, many join enumeration techniques,
whether DP-based or top-down, require the decomposition of the join graph
into independent vertices (inputs) and edges (conditions) and careful
reconstruction of the alternative join trees, and RexInputRef is not very
suitable for that process. Another possible example is recently reported
[3].

There are hundreds of usages of the RexInputRef, so the implementation of
this idea might be prohibitively expensive. But putting this problem aside,
do you envision any other potential blockers for the implementation of that
idea?

Regards,
Vladimir.

[1] https://github.com/apache/calcite/pull/2359
[2] https://issues.apache.org/jira/browse/CALCITE-62
[3] https://issues.apache.org/jira/browse/CALCITE-4534

Tue, Mar 9, 2021 at 22:40, Julian Hyde :

> I investigated something similar a long time ago. We noticed that a
> lot of trivial Project operators were being generated to compensate
> for field re-ordering due to join transposition. And so the idea was
> to allow each RelNode (and especially Join) to permute its output
> fields.
>
> Here is the case: https://issues.apache.org/jira/browse/CALCITE-62.
> https://issues.apache.org/jira/browse/CALCITE-55 is related.
>
> The problem, as I noted in CALCITE-62, is that people writing rules
> have another mapping to deal with.
>
> I believe that other systems, such as Spark's Catalyst planner, use
> globally unique IDs for columns (as opposed to Calcite, whose column
> references are only locally unique, ordinals of the input
> operator(s)). Globally unique IDs would be superior for this problem
> but perhaps introduce other challenges.
>
> Julian
>
> On Sun, Mar 7, 2021 at 11:25 PM Vladimir Ozerov 
> wrote:
> >
> > Hi,
> >
> > Currently, the order of attributes is used to define operators
> equivalence.
> > This leads to several interesting problems, such as possible duplication
> of
> > expressions (e.g., "a,a,b") or additional work in rules to detect trivial
> > projects and/or input permutations (e.g. "a,b -> b,a").
> >
> > But the biggest problem is the join order planning. In Calcite, AxB is
> not
> > equivalent to BxA:
> >
> >1. It makes the ruleset [JoinCommuteRule, JoinAssociateRule]
> >insufficient to explore all bushy trees because the commute operation
> adds
> >a project on top of the new join (AxB -> project(BxA)), thus not
> >allowing for the associate rule to be executed on the upper join. The
> >solution is to add the ProjectJoinTransposeRule to the ruleset, but
> this
> >increases the planning time dramatically, making Apache Calcite
> unsuitable
> >for the cost-based join planning even for relatively simple join
> graphs.
> >2. It increases the number of join-related rule calls, which
> complicates
> >the implementation of the new join enumeration planning rules (e.g.
> >top-down enumerators) because duplicate derivations compromise
> performance.
> >
> > My question is - has the community considered an idea to make the order
> of
> > columns a common property of all operators, somewhat similar to the
> trait,
> > but without an explicit enforcer?
> >
> > For example, consider the following MEMO which the planner creates when
> > trying to transform the join AxB to BxA:
> >
> > G1: { Scan1[table=t1, cols=a,b] }
> > G2: { Scan2[table=t2, cols=c,d] }
> > G3: { AxB[G1, G2], Project[G4, cols=$2,$3,$0,$1] }
> > G4: { BxA[G2, G1] }
> >
> > However, if we make the column order part of the MEMO, we may potentially
> > have something like this:
> >
> > G1: { Scan1[table=t1, cols=a,b] }
> > G2: { Scan2[table=t2, cols=c,d] }
> > G3 [cols=G1.$0, G1.$1, G2.$0, G2.$1]: { AxB[G1, G2], BxA[G2, G1] }
> >
> > Notice how we were able to put AxB and BxA to the same equivalence
> group.
> > To my knowledge, CockroachDB uses a somewhat similar design.
> >
> > I realize that this is rather a radical idea, and more design work is
> > required to come up with a proper proposal. At this point, I just would like
> > to kindly ask the community to share high-level feedback on that. Were
> > similar ideas proposed before?
> >
> > Thank you,
> > Vladimir.
>


[jira] [Created] (CALCITE-4535) ServerDdlExecutor cannot execute DROP commands with qualified object names

2021-03-13 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4535:


 Summary: ServerDdlExecutor cannot execute DROP commands with 
qualified object names
 Key: CALCITE-4535
 URL: https://issues.apache.org/jira/browse/CALCITE-4535
 Project: Calcite
  Issue Type: Improvement
  Components: server
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov


Consider the query {{DROP TABLE s.t}}. The {{ServerDdlExecutor}} will attempt 
to resolve the default schema path, and then would try to invoke 
{{SqlIdentifier.getSimple}} on the {{s.t}} identifier. When assertions are 
enabled this would lead to an AssertionError. When assertions are disabled this 
would lead to an incorrectly resolved table name: {{s}} instead of {{t}}.
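
A possible direction for the fix (a sketch, not the final patch): take the 
last component of the identifier instead of calling {{getSimple}}:
{code}
SqlIdentifier id = ...; // "s.t" or "t"
String name = Util.last(id.names); // "t" in both cases
{code}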



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Default disable the RexNode normalization(or operands reorder)

2021-03-11 Thread Vladimir Ozerov
In our practice, we also had some problems with normalization. First, we
observed problems with the unwanted (and sometimes
incorrect) simplification of expressions with CASTs and literals which came
from RexSimplify. I couldn't find an easy way to disable that behavior.
Note, that RexSimplify may also be considered a "normalization". Second, we
implemented a way to avoid Project when doing join reordering but had some
issues with operator signatures due to lack of automatic normalization for
expressions for permuted inputs. These two cases demonstrate two opposite
views: sometimes you want a specific normalization to happen automatically,
but sometimes you want to disable it.

Perhaps an alternative approach could be to unify all simplification and
normalization logic and split it into configurable rules. Then, we may add
these rules as a separate rule set to the planner, which would be invoked
heuristically every time an operator with expressions is registered in
MEMO. In this case, a user would not need to bother about RexNode
constructors. To clarify, under "rules" I do not mean heavy-weight rules
similar to normal rules. Instead, it might be simple pattern+method pairs,
that could even be compiled into a static program using Janino. This
approach could be very flexible and convenient: a single place in the code
where all rewrite happens, complete control of the optimization rules,
modular rules instead of monolithic code (like in RexSimplify). The obvious
downside - it would require more time to implement than other proposed
approaches.
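
To illustrate, a minimal sketch of such a pattern+method pair (the
RexRewriteRule interface is hypothetical; only the Rex* classes and
SqlStdOperatorTable are existing Calcite API):

import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexLiteral;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;

/** Hypothetical lightweight rewrite rule invoked on expression creation. */
interface RexRewriteRule {
  /** Returns the rewritten node, or null if the pattern does not match. */
  RexNode apply(RexBuilder builder, RexCall call);
}

/** Example: normalize "literal = inputRef" to "inputRef = literal". */
class SwapEqualsOperands implements RexRewriteRule {
  @Override public RexNode apply(RexBuilder builder, RexCall call) {
    if (call.getOperator() == SqlStdOperatorTable.EQUALS
        && call.getOperands().get(0) instanceof RexLiteral
        && call.getOperands().get(1) instanceof RexInputRef) {
      return builder.makeCall(SqlStdOperatorTable.EQUALS,
          call.getOperands().get(1), call.getOperands().get(0));
    }
    return null; // no match: leave the expression as-is
  }
}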

What do you think about that?

Regards,
Vladimir.

Thu, Mar 11, 2021 at 13:33, Vladimir Sitnikov :

> Stamatis>just the option to use it or not in a more friendly way
> Stamatis>than a system property.
>
> As far as I remember, the key issue here is that new RexBuilder(...) is
> quite a common pattern,
> and what you suggest looks like "everyone would have to provide extra
> argument when creating RexBuilder".
>
> On top of that, there are use cases like "new RexCall(...)" in the static
> context (see org.apache.calcite.rex.RexUtil#not).
>
> Making the uses customizable adds significant overhead with doubtful gains.
>
> I have not explored the route though, so there might be solutions.
> For instance, it might work if we have an in-core dependency injection that
> would hide the complexity
> when coding :core, however, I don't think we could expose DI to Calcite
> users.
>
> Vladimir
>


[jira] [Created] (CALCITE-4533) Fix handling of REPLACE and IF NOT EXISTS keywords for CREATE TABLE/SCHEMA commands

2021-03-11 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4533:


 Summary: Fix handling of REPLACE and IF NOT EXISTS keywords for 
CREATE TABLE/SCHEMA commands
 Key: CALCITE-4533
 URL: https://issues.apache.org/jira/browse/CALCITE-4533
 Project: Calcite
  Issue Type: Improvement
  Components: server
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov


Presently, the {{ServerDdlExecutor}} incorrectly processes some keywords:
# {{CREATE SCHEMA IF NOT EXISTS}} overwrites the existing schema
# {{CREATE OR REPLACE TABLE}} doesn't replace the table if the table exists

The goal of this ticket is to unify the handling of {{IF NOT EXISTS}} and 
{{REPLACE}} keywords for these two commands.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Question about parallel query planning

2021-03-09 Thread Vladimir Ozerov
*at such = at such scale

Tue, Mar 9, 2021 at 11:01, Vladimir Ozerov :

> Hi Jihoon,
>
> I would say that 5 sec could be actually a pretty good result at such. Did
> you have a chance to check which exact rules contributed to the planning
> time? You may inject a listener to VolcanoPlanner to check that.
>
> Regards,
> Vladimir
>
> Tue, Mar 9, 2021 at 05:37, Jihoon Son :
>
>> Hi all,
>>
>> I posted the same question on the ASF slack channel, but am posting
>> here as well to get a quicker response.
>>
>> I'm seeing an issue in query planning that it takes a long time (+5
>> sec) for a giant union query that has 120 subqueries in it. I captured
>> a flame graph (attached in this email) to see where the bottleneck is,
>> and based on the flame graph, I believe the query planner spent most
>> of time to explore the search space of candidate plans to find the
>> best plan. This seems because of those many subqueries in the same
>> union query. Is my understanding correct? If so, for this particular
>> case, it seems possible to parallelize exploring the search space. Do
>> you have any plan for parallelizing this part? I'm not sure whether
>> it's already done though in the master branch. I tried to search for a
>> jira ticket on https://issues.apache.org/jira/browse/CALCITE, but
>> couldn't find anything with my search skill.
>>
>> Thanks,
>> Jihoon
>>
>


Re: Question about parallel query planning

2021-03-09 Thread Vladimir Ozerov
Hi Jihoon,

I would say that 5 sec could be actually a pretty good result at such. Did
you have a chance to check which exact rules contributed to the planning
time? You may inject a listener to VolcanoPlanner to check that.
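
For example, a rough sketch of a listener that accumulates per-rule time
(assuming rule attempts are not nested; the event types come from
org.apache.calcite.plan.RelOptListener):

import java.util.HashMap;
import java.util.Map;
import org.apache.calcite.plan.RelOptListener;

class RuleTimingListener implements RelOptListener {
  private final Map<String, Long> totalNanos = new HashMap<>();
  private long start;

  @Override public void ruleAttempted(RuleAttemptedEvent event) {
    String rule = event.getRuleCall().getRule().toString();
    if (event.isBefore()) {
      start = System.nanoTime();
    } else {
      // Accumulate the time between the "before" and "after" events.
      totalNanos.merge(rule, System.nanoTime() - start, Long::sum);
    }
  }

  @Override public void relEquivalenceFound(RelEquivalenceEvent event) {}
  @Override public void ruleProductionSucceeded(RuleProductionEvent event) {}
  @Override public void relDiscarded(RelDiscardedEvent event) {}
  @Override public void relChosen(RelChosenEvent event) {}

  Map<String, Long> timings() {
    return totalNanos;
  }
}

Register it before planning with planner.addListener(new
RuleTimingListener()) and inspect timings() afterwards.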

Regards,
Vladimir

Tue, Mar 9, 2021 at 05:37, Jihoon Son :

> Hi all,
>
> I posted the same question on the ASF slack channel, but am posting
> here as well to get a quicker response.
>
> I'm seeing an issue in query planning that it takes a long time (+5
> sec) for a giant union query that has 120 subqueries in it. I captured
> a flame graph (attached in this email) to see where the bottleneck is,
> and based on the flame graph, I believe the query planner spent most
> of time to explore the search space of candidate plans to find the
> best plan. This seems because of those many subqueries in the same
> union query. Is my understanding correct? If so, for this particular
> case, it seems possible to parallelize exploring the search space. Do
> you have any plan for parallelizing this part? I'm not sure whether
> it's already done though in the master branch. I tried to search for a
> jira ticket on https://issues.apache.org/jira/browse/CALCITE, but
> couldn't find anything with my search skill.
>
> Thanks,
> Jihoon
>


Make RelNode attribute order part of MEMO

2021-03-07 Thread Vladimir Ozerov
Hi,

Currently, the order of attributes is used to define operators equivalence.
This leads to several interesting problems, such as possible duplication of
expressions (e.g., "a,a,b") or additional work in rules to detect trivial
projects and/or input permutations (e.g. "a,b -> b,a").

But the biggest problem is the join order planning. In Calcite, AxB is not
equivalent to BxA:

   1. It makes the ruleset [JoinCommuteRule, JoinAssociateRule]
   insufficient to explore all bushy trees because the commute operation adds
   a project on top of the new join (AxB -> project(BxA)), thus not
   allowing for the associate rule to be executed on the upper join. The
   solution is to add the ProjectJoinTransposeRule to the ruleset, but this
   increases the planning time dramatically, making Apache Calcite unsuitable
   for the cost-based join planning even for relatively simple join graphs.
   2. It increases the number of join-related rule calls, which complicates
   the implementation of the new join enumeration planning rules (e.g.
   top-down enumerators) because duplicate derivations compromise performance.

My question is - has the community considered an idea to make the order of
columns a common property of all operators, somewhat similar to the trait,
but without an explicit enforcer?

For example, consider the following MEMO which the planner creates when
trying to transform the join AxB to BxA:

G1: { Scan1[table=t1, cols=a,b] }
G2: { Scan2[table=t2, cols=c,d] }
G3: { AxB[G1, G2], Project[G4, cols=$2,$3,$0,$1] }
G4: { BxA[G2, G1] }

However, if we make the column order part of the MEMO, we may potentially
have something like this:

G1: { Scan1[table=t1, cols=a,b] }
G2: { Scan2[table=t2, cols=c,d] }
G3 [cols=G1.$0, G1.$1, G2.$0, G2.$1]: { AxB[G1, G2], BxA[G2, G1] }

Notice how we were able to put AxB and BxA to the same equivalence group.
To my knowledge, CockroachDB uses a somewhat similar design.

I realize that this is rather a radical idea, and more design work is
required to come up with a proper proposal. At this point, I just would like
to kindly ask the community to share high-level feedback on that. Were
similar ideas proposed before?

Thank you,
Vladimir.


Non-additive costs in heterogeneous engines

2021-02-26 Thread Vladimir Ozerov
Hi,

Several products that utilize Apache Calcite for query optimization might
use multiple execution units to execute physical operators concurrently.
Think of a heterogeneous engine that might split execution between a CPU
and a co-processor (GPU, FGPA, etc), or just a multi-core machine in the
simplest case. In such systems, the cumulative cost of an operator is not
additive. That is, cost(A,B) != cost(A) + cost(B).

Consider a theoretical system that might execute operations on either CPU
or GPU. There are Scan and Join operators. For every operator, there are
two physical alternatives - execute on CPU or execute on GPU. We also have
a valid cost model that provides comparable costs for both CPU and GPU
backends. Both CPU and GPU could execute one operator at a time.

Now consider that we have the following logical plan:
LogicalJoin
  LogicalScan[a]
  LogicalScan[b]

We then expand the MEMO with physical alternatives (omitting some
alternatives for clarity):
Set#1 {
  Subset#1[CPU]: CpuJoin[Subset#3, Subset#5]
}
Set#2 {
  Subset#3[CPU]: {
CpuScan[a], cost=[100] (best)
GpuToCpu[Subset#4], cost=[170]
  }
  Subset#4[GPU]: {
GpuScan[a], cost=[150]
  }
}
Set#3 {
  Subset#5[CPU]: {
CpuScan[b], cost=[100] (best)
GpuToCpu[Subset#6], cost=[170]
  }
  Subset#6[GPU]: {
GpuScan[b], cost=[150]
  }
}

With the current model, Apache Calcite will only consider "best" rels from
each subset when constructing the final tree, which means that the
following tree would be the winner:
CpuJoin, cost=[200]
  CpuScan[a], cost=[100] (best from Subset#3)
  CpuScan[b], cost=[100] (best from Subset#5)

However, the better plan might be the one that utilizes cross-device
parallelism:
CpuJoin, cost=MAX[100, 170]
  CpuScan[a], cost=[100] (best from Subset#3)
  GpuToCpu[b], cost=[170] (*not best* from Subset#5)
GpuScan[b], cost=[150]
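
In other words, a toy calculation with the numbers above (join self-cost
omitted, as in the MEMO sketch):

double cpuScanA = 100, cpuScanB = 100, gpuPathB = 170;
// Additive model: each subset exposes only its local best, so the planner
// sums the best inputs and settles on the all-CPU plan.
double allCpu = cpuScanA + cpuScanB;          // 200
// Non-additive model: the branches run on different devices in parallel,
// so the subtree cost is the maximum of the branches.
double mixed = Math.max(cpuScanA, gpuPathB);  // 170 -> the better plan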

It seems that to support this kind of optimization with non-additive costs,
we need to switch from cost calculation on the operator level (local) to a
cost calculation that considers alternative paths for all subtrees
(global). Obviously, this could lead to a combinatorial explosion easily.
On the other hand, the new top-down optimizer with a more predictable rule
invocation order might significantly amortize the additional complexity.

I want to ask whether similar ideas were discussed in the community before?
Are you aware of any practical systems that do such kind of optimization?
Or any fundamental results that prove this idea to be infeasible?
I would appreciate any hints on the matter.

Regards,
Vladimir.


Re: Add cross-join suppression to JoinAssociateRule

2021-02-26 Thread Vladimir Ozerov
I created the PR [1]. It adds a property that prunes new join trees if
there is an "always true" condition. Strictly speaking, this is not a
suppression of all possible cross-joins. But such a check is cheap and should
work well in practical optimizers, assuming that filter pushdown is
applied and joins contain only those conditions that couldn't be
pushed down.

[1] https://github.com/apache/calcite/pull/2359
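
The guard itself boils down to something like the following sketch (names
are illustrative; see the PR for the actual change):

import org.apache.calcite.rel.core.Join;

final class CrossJoinGuard {
  private CrossJoinGuard() {}

  /** Returns whether a transformed join should be pruned under the
   * "always true condition" heuristic. */
  static boolean shouldPrune(Join newJoin, boolean suppressionEnabled) {
    // isAlwaysTrue() is a cheap structural check; it does not attempt to
    // prove that cross-products hide behind complex predicates.
    return suppressionEnabled && newJoin.getCondition().isAlwaysTrue();
  }
}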


Sat, Feb 20, 2021 at 20:34, Vladimir Ozerov :

> Thank you for the feedback. I'll try to prototype it for all the affected
> rules.
>
> Regards,
> Vladimir
>
> Thu, Feb 18, 2021 at 13:02, Stamatis Zampetakis :
>
>> Hi Vladimir,
>>
>> Thanks for bringing up this topic.
>>
>> There are various works that highlight the importance of cartesian products
>> for optimality and try to derive techniques for exploring the complete
>> search space efficiently.
>> Having said that indeed not considering cartesian products is a common &
>> popular heuristic for queries with many relations so I find it a good idea
>> to give to the users the option to use it or not.
>>
>> Best,
>> Stamatis
>>
>> On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov 
>> wrote:
>>
>> > Hi Julian,
>> >
>> > First of all, I want to clarify that if we decide to do the cross-join
>> > suppression, we would have to do that for JoinCommuteRule as well. I
>> missed
>> > that in the original email.
>> >
>> > Now back to your comments. The proposal assumes that we start with a
>> > connected join graph without the cross-products and never generate other
>> > cross-products. This covers many common queries, but not all. First, as
>> you
>> > mentioned, some queries may potentially benefit from cross-products. I
>> do
>> > not have specific queries in mind, but I cannot prove that such queries
>> > don't exist either. Second, we may have cross-products in an otherwise
>> > connected join graph if the join condition is located in the WHERE
>> clause.
>> > For such queries, we may need to do some pre-processing, like filter
>> > pushdown. Therefore, I assume that real-world products would have to do
>> > some preliminary analysis and/or processing to decide whether it is
>> safe to
>> > use the cross-join suppression. This might not be very convenient for
>> new
>> > users, but IMO perfectly fine for production-grade systems because they
>> > typically do complex multi-phase optimization anyway. For example,
>> recall
>> > how VoltDB decides whether to perform the join planning in the physical
>> > phase depending on the number of joins in the query [1].
>> >
>> > Regarding cardinality estimations, the main problem is that the built-in
>> > Calcite mechanism is pretty imprecise because they use mainly row count
>> and
>> > magic numbers. For example, in TPC-H queries, the cardinality
>> estimations
>> > easily reach billions and billions of rows. And even if the system has
>> more
>> > advanced cardinality estimators, such as histograms, the estimation
>> errors
>> > are likely to build up quickly still [2]. That said, it could be
>> difficult
>> > to design a robust heuristic to suppress particular joins. The
>> advantage of
>> > a straightforward cross-join suppression heuristic is that it is a good
>> > choice for a good fraction of real-world queries.
>> >
>> > To clarify, the proposal is not to make the cross-join suppression the
>> > default behavior. Rather, we may add it as a configuration property to
>> join
>> > planning rules, that could be used by advanced users.
>> >
>> > Regards,
>> > Vladimir.
>> >
>> > [1]
>> >
>> >
>> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
>> > [2] https://dl.acm.org/doi/10.14778/2850583.2850594
>> >
>> > Sat, Feb 13, 2021 at 23:05, Julian Hyde :
>> >
>> > > Two separate points.
>> > >
>> > > 1. I recall that there is an important class of plans that first forms
>> > the
>> > > Cartesian product of some very small relations, and then joins
>> everything
>> > > else to it. Does anyone else encounter these plans? Would these plans
>> be
>> > > unavailable if we made your proposed change?
>> > >
>> > > 2. If I join on a low cardinality attribute (eg customers 

[jira] [Created] (CALCITE-4515) Do not generate the new join tree from commute/associate rules if it contains "always true" condition

2021-02-26 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4515:


 Summary: Do not generate the new join tree from commute/associate 
rules if it contains "always true" condition
 Key: CALCITE-4515
 URL: https://issues.apache.org/jira/browse/CALCITE-4515
 Project: Calcite
  Issue Type: Improvement
  Components: core
Affects Versions: 1.27.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov


{{JoinCommuteRule}} and {{JoinAssociateRule}} are the complete ruleset to 
generate all bushy trees (plus join-project transposes to counter new projects 
created during commute).

These rules do not check whether the new join trees contain an always {{TRUE}} 
condition, which indicates a cross-product. Such join trees are not optimal in the 
majority of cases, and practical optimizers might want to suppress them.

In this ticket, I propose to add a configuration property to 
{{JoinCommuteRule}} and {{JoinAssociateRule}} to disable the creation of trees 
with always {{TRUE}} condition:
 # The property should be disabled by default because sometimes cross-products 
might be present in the query on purpose (e.g., see 
[https://dl.acm.org/doi/10./645916.671976], page 315)
 # We consider the "always {{TRUE}}" condition only, even though it covers 
only part of all cross-products. The reason is that it might be difficult and 
expensive to understand whether the condition is cross-join or not (e.g., for 
hypergraphs). On the other hand, the check for "always true" is cheap and 
robust. We assume that the user might need to do some pre-processing (e.g., 
filter push-down) to take advantage of the new property. 
# We do not extend the {{JoinPushThroughJoinRule}} for now, since the [commute 
+ associate] ruleset already explores all bushy trees and testing of the new 
property for {{JoinPushThroughJoinRule}} is difficult. We may extend this rule 
in a separate ticket if needed.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Add cross-join suppression to JoinAssociateRule

2021-02-20 Thread Vladimir Ozerov
Thank you for the feedback. I'll try to prototype it for all the affected
rules.

Regards,
Vladimir

Thu, Feb 18, 2021 at 13:02, Stamatis Zampetakis :

> Hi Vladimir,
>
> Thanks for bringing up this topic.
>
> There are various works that highlight the importance of cartesian products
> for optimality and try to derive techniques for exploring the complete
> search space efficiently.
> Having said that indeed not considering cartesian products is a common &
> popular heuristic for queries with many relations so I find it a good idea
> to give to the users the option to use it or not.
>
> Best,
> Stamatis
>
> On Sun, Feb 14, 2021 at 9:58 AM Vladimir Ozerov 
> wrote:
>
> > Hi Julian,
> >
> > First of all, I want to clarify that if we decide to do the cross-join
> > suppression, we would have to do that for JoinCommuteRule as well. I
> missed
> > that in the original email.
> >
> > Now back to your comments. The proposal assumes that we start with a
> > connected join graph without the cross-products and never generate other
> > cross-products. This covers many common queries, but not all. First, as
> you
> > mentioned, some queries may potentially benefit from cross-products. I do
> > not have specific queries in mind, but I cannot prove that such queries
> > don't exist either. Second, we may have cross-products in an otherwise
> > connected join graph if the join condition is located in the WHERE
> clause.
> > For such queries, we may need to do some pre-processing, like filter
> > pushdown. Therefore, I assume that real-world products would have to do
> > some preliminary analysis and/or processing to decide whether it is safe
> to
> > use the cross-join suppression. This might not be very convenient for new
> > users, but IMO perfectly fine for production-grade systems because they
> > typically do complex multi-phase optimization anyway. For example, recall
> > how VoltDB decides whether to perform the join planning in the physical
> > phase depending on the number of joins in the query [1].
> >
> > Regarding cardinality estimations, the main problem is that the built-in
> > Calcite mechanism is pretty imprecise because they use mainly row count
> and
> > magic numbers. For example, in TPC-H queries, the cardinality estimations
> > easily reach billions and billions of rows. And even if the system has
> more
> > advanced cardinality estimators, such as histograms, the estimation
> errors
> > are likely to build up quickly still [2]. That said, it could be
> difficult
> > to design a robust heuristic to suppress particular joins. The advantage
> of
> > a straightforward cross-join suppression heuristic is that it is a good
> > choice for a good fraction of real-world queries.
> >
> > To clarify, the proposal is not to make the cross-join suppression the
> > default behavior. Rather, we may add it as a configuration property to
> join
> > planning rules, that could be used by advanced users.
> >
> > Regards,
> > Vladimir.
> >
> > [1]
> >
> >
> https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
> > [2] https://dl.acm.org/doi/10.14778/2850583.2850594
> >
> Sat, Feb 13, 2021 at 23:05, Julian Hyde :
> >
> > > Two separate points.
> > >
> > > 1. I recall that there is an important class of plans that first forms
> > the
> > > Cartesian product of some very small relations, and then joins
> everything
> > > else to it. Does anyone else encounter these plans? Would these plans
> be
> > > unavailable if we made your proposed change?
> > >
> > > 2. If I join on a low cardinality attribute (eg customers to suppliers
> on
> > > state) the result is almost as bad a cross join. I presume you would
> want
> > > to treat this join similarly. If so, it would make sense to have the
> > rules
> > > on selectivity (or similar) rather than structure alone.
> > >
> > > Julian
> > >
> > > > On Feb 12, 2021, at 23:04, Vladimir Ozerov 
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Join order planning is an important optimization problem. The
> > widely-used
> > > > heuristic is to consider all bushy trees without cross-joins. There
> is
> > > > proof [1] that a pair of commute and associate rules is a complete
> > > ruleset
> > > > to explore all bushy trees without the cross-joins if the initial
> join
> &g

[jira] [Created] (CALCITE-4501) Clear costs in VolcanoPlanner when setting the new root

2021-02-16 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4501:


 Summary: Clear costs in VolcanoPlanner when setting the new root
 Key: CALCITE-4501
 URL: https://issues.apache.org/jira/browse/CALCITE-4501
 Project: Calcite
  Issue Type: Improvement
  Components: core
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov


In production-grade systems, the optimization is typically performed in several 
phases. The same instance of {{VolcanoPlanner}} tends to be re-used between the 
phases. 

In some sophisticated scenarios, we may want to alter the cost function between 
phases. 
For example, to adjust the weights of some operators depending on some 
heuristics, or even switch to a completely different cost model. 

One way to do that is to re-instantiate the cluster and the planner with the 
new cost function. 
However, it requires copying of the current rel tree, because its nodes still 
point to the old cluster. This may negatively affect the performance. 

The alternative approach could be to change the state of the cost factory. 
However, it would not work for two reasons:
# The cluster object has a reference to {{MetadataQuery}} with possibly cached 
costs. This may lead to a mix of different cost models.
# {{VolcanoPlanner}} caches zero and infinite costs in a final variable.

In this issue, I propose to introduce two slight modifications to the 
{{VolcanoPlanner}}:
# Clear the metadata query from within {{setRoot}} method.
# Make {{infCost}} and {{zeroCost}} non-final, and also re-initialize them in 
the {{setRoot}} method.

The {{setRoot}} method is convenient for these purposes because it is invoked 
from the {{RuleSetProgram}}, so the desired cost cleanup will work with no 
modifications to the application's code.
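
A sketch of the intended change (illustrative, not the actual patch):
{code}
// Inside VolcanoPlanner:
@Override public void setRoot(RelNode rel) {
  // (1) Drop cached metadata: cached costs may belong to the old cost model.
  rel.getCluster().invalidateMetadataQuery();
  // (2) Re-derive the cached cost bounds from the (possibly replaced) cost
  // factory; this requires making the zeroCost/infCost fields non-final.
  this.zeroCost = this.costFactory.makeZeroCost();
  this.infCost = this.costFactory.makeInfiniteCost();
  // ... existing setRoot logic ...
}
{code}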

Please let me know if you think that this change is not valid.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Add cross-join suppression to JoinAssociateRule

2021-02-14 Thread Vladimir Ozerov
Hi Julian,

First of all, I want to clarify that if we decide to do the cross-join
suppression, we would have to do that for JoinCommuteRule as well. I missed
that in the original email.

Now back to your comments. The proposal assumes that we start with a
connected join graph without the cross-products and never generate other
cross-products. This covers many common queries, but not all. First, as you
mentioned, some queries may potentially benefit from cross-products. I do
not have specific queries in mind, but I cannot prove that such queries
don't exist either. Second, we may have cross-products in an otherwise
connected join graph if the join condition is located in the WHERE clause.
For such queries, we may need to do some pre-processing, like filter
pushdown. Therefore, I assume that real-world products would have to do
some preliminary analysis and/or processing to decide whether it is safe to
use the cross-join suppression. This might not be very convenient for new
users, but IMO perfectly fine for production-grade systems because they
typically do complex multi-phase optimization anyway. For example, recall
how VoltDB decides whether to perform the join planning in the physical
phase depending on the number of joins in the query [1].

Regarding cardinality estimations, the main problem is that the built-in
Calcite mechanism is pretty imprecise because they use mainly row count and
magic numbers. For example, in TPC-H queries, the cardinality estimations
easily reach billions and billions of rows. And even if the system has more
advanced cardinality estimators, such as histograms, the estimation errors
are likely to build up quickly still [2]. That said, it could be difficult
to design a robust heuristic to suppress particular joins. The advantage of
a straightforward cross-join suppression heuristic is that it is a good
choice for a good fraction of real-world queries.

To clarify, the proposal is not to make the cross-join suppression the
default behavior. Rather, we may add it as a configuration property to join
planning rules, that could be used by advanced users.

Regards,
Vladimir.

[1]
https://github.com/VoltDB/voltdb/blob/0f2993cb9e1efe7c2c95cf68b83f10903e2697d3/src/frontend/org/voltdb/compiler/PlannerTool.java#L274
[2] https://dl.acm.org/doi/10.14778/2850583.2850594

Sat, Feb 13, 2021 at 23:05, Julian Hyde :

> Two separate points.
>
> 1. I recall that there is an important class of plans that first forms the
> Cartesian product of some very small relations, and then joins everything
> else to it. Does anyone else encounter these plans? Would these plans be
> unavailable if we made your proposed change?
>
> 2. If I join on a low cardinality attribute (eg customers to suppliers on
> state) the result is almost as bad a cross join. I presume you would want
> to treat this join similarly. If so, it would make sense to base the rules
> on selectivity (or similar) rather than structure alone.
>
> Julian
>
> > On Feb 12, 2021, at 23:04, Vladimir Ozerov  wrote:
> >
> > Hi,
> >
> > Join order planning is an important optimization problem. The widely-used
> > heuristic is to consider all bushy trees without cross-joins. There is
> > proof [1] that a pair of commute and associate rules is a complete
> ruleset
> > to explore all bushy trees without the cross-joins if the initial join
> tree
> > has no cross-joins.
> >
> > In Apache Calcite, we do not suppress cross-products. As a result, even
> > simple join topologies, like a chain, cannot be planned in a reasonable
> > time for more than 6-7 tables. How would you feel about adding an optional
> > cross-join suppression to the JoinAssociateRule, and possibly to the
> > JoinPushThroughJoinRule?
> >
> > The cross-join suppression is not the only problem with the exhaustive
> join
> > planning in Apache Calcite. But at the very least, it is simple to
> > implement, and it extends the number of tables in a join that could be
> > planned exhaustively for some topologies by an additional 2-5 tables.
> >
> > Regards,
> > Vladimir.
> >
> > [1] https://dl.acm.org/doi/10.14778/2732977.2732997
>


Add cross-join suppression to JoinAssociateRule

2021-02-12 Thread Vladimir Ozerov
Hi,

Join order planning is an important optimization problem. The widely-used
heuristic is to consider all bushy trees without cross-joins. There is
proof [1] that a pair of commute and associate rules is a complete ruleset
to explore all bushy trees without the cross-joins if the initial join tree
has no cross-joins.
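
For reference, the rule pair in question, as it would be registered in a
planner (a minimal sketch using the CoreRules constants; the "planner"
variable stands for any RelOptPlanner set up elsewhere):

// The commute + associate pair discussed above.
planner.addRule(CoreRules.JOIN_COMMUTE);
planner.addRule(CoreRules.JOIN_ASSOCIATE);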

In Apache Calcite, we do not suppress cross-products. As a result, even
simple join topologies, like a chain, cannot be planned in a reasonable
time for more than 6-7 tables. How would you feel about adding an optional
cross-join suppression to the JoinAssociateRule, and possibly to the
JoinPushThroughJoinRule?

The cross-join suppression is not the only problem with the exhaustive join
planning in Apache Calcite. But at the very least, it is simple to
implement, and it extends the number of tables in a join that could be
planned exhaustively for some topologies by an additional 2-5 tables.

Regards,
Vladimir.

[1] https://dl.acm.org/doi/10.14778/2732977.2732997


[jira] [Created] (CALCITE-4476) DateTimeUtils.timeStringToUnixDate may produce wrong time

2021-01-23 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4476:


 Summary: DateTimeUtils.timeStringToUnixDate may produce wrong time
 Key: CALCITE-4476
 URL: https://issues.apache.org/jira/browse/CALCITE-4476
 Project: Calcite
  Issue Type: Bug
  Components: avatica
Affects Versions: avatica-1.17.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov
 Fix For: next


The query {{SELECT CAST('00:00' AS TIME)}} returns {{00:00:01}} in Calcite. 
The problem is located in the {{DateTimeUtils.timeStringToUnixDate}} method, 
which mistakenly assigns {{1}} to seconds and minutes in some cases.
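
A minimal reproduction sketch (assuming the current signature of the Avatica 
utility method):

{code:java}
// "00:00" should map to 0 milliseconds since midnight (00:00:00),
// but a value corresponding to 00:00:01 may be returned instead.
final int millis = DateTimeUtils.timeStringToUnixDate("00:00");
{code}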





[jira] [Created] (CALCITE-4466) Do not call RelTraitDef.convert if the "from" trait already satisfies the "to" trait.

2021-01-13 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4466:


 Summary: Do not call RelTraitDef.convert if the "from" trait already 
satisfies the "to" trait.
 Key: CALCITE-4466
 URL: https://issues.apache.org/jira/browse/CALCITE-4466
 Project: Calcite
  Issue Type: Task
  Components: core
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov
 Fix For: 1.27.0


Consider that we have two trait defs {{A_def}} and {{B_def}}. We have a node 
with the trait set {{[A_from, B_from]}} that we want to convert to a node with 
the trait set {{[A_to, B_to]}}. Suppose that:
* A_from.satisfies(A_to) = false
* B_from.satisfies(B_to) = true

Currently, the {{VolcanoPlanner}} will invoke {{convert}} for both trait defs, 
because it skips the conversion only when the two traits are *equal*. In our 
example, this leads to an unnecessary call to {{B_def.convert(node, B_to)}}. 
This is not a big problem, but it forces implementors of custom traits to 
double-check whether the passed node already satisfies the required trait or 
not.

The proposal is to replace {{fromTrait.equals(toTrait)}} with 
{{fromTrait.satisfies(toTrait)}}.
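
In code, the proposed check could look roughly like this (a simplified sketch 
of the planner's per-trait conversion loop, not the exact Calcite code):

{code:java}
for (int i = 0; i < fromTraits.size(); i++) {
  RelTrait fromTrait = fromTraits.getTrait(i);
  RelTrait toTrait = toTraits.getTrait(i);
  if (fromTrait.satisfies(toTrait)) {
    // Already satisfied - no need to call RelTraitDef.convert.
    continue;
  }
  rel = fromTrait.getTraitDef().convert(planner, rel, toTrait, true);
}
{code}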





Re: [DISCUSS] Apache Calcite Online Meetup January 2021

2021-01-12 Thread Vladimir Ozerov
*Talk duration*: 30 min.

On Wed, Jan 13, 2021 at 10:27, Vladimir Ozerov wrote:

> Hi Stamatis, Julian,
>
> Thank you for releasing the slot. Just to be crystal clear - I was very
> late with my proposal, and have no urge to present in January. Both January
> and April are perfectly fine with me. So please prioritize this talk over
> others only if you see a really good reason to do so. In any case, I
> confirm that I can present in January. Please find the talk details below.
>
> *Speaker:* Vladimir Ozerov
> *Title:* Apache Calcite integration in Hazelcast IMDG
> *Abstract: *
> Hazelcast IMDG is a distributed in-memory key-value store. In this talk, I
> will present how we used Apache Calcite to create a new distributed SQL
> engine that queries Hazelcast IMDG data.
> We start with motivation and general design. Then we examine how Hazelcast
> IMDG leverages Apache Calcite for query parsing, validation, and
> optimization, and why we decided not to use Apache Calcite for JDBC and
> query execution. We also discuss several Apache Calcite problems that
> negatively affect the integration and propose possible future improvements.
>
> Regards,
> Vladimir.
>
> On Wed, Jan 13, 2021 at 02:48, Stamatis Zampetakis wrote:
>
>> Yesterday, I updated our website to also include Vladimir's talk about
>> Hazelcast so I think it is better to keep it that way.
>> Initially I had in mind something between 2 and 4 talks for this meetup
>> (2-3h) so I think we are good to go.
>>
>> Let's now freeze the agenda to avoid changing it till the last minute.
>> I will open up the discussion for the next meetup in another email so we
>> start filling the slots for April.
>>
>> It's definitely useful to have the talks on camera. I will record them via
>> Zoom and split them after the meetup.
>> If the presenters give their consent I can upload them afterwards to
>> Youtube/Vimeo.
>>
>> Best,
>> Stamatis
>>
>>
>> On Wed, Jan 13, 2021 at 12:15 AM Julian Hyde  wrote:
>>
>> > I'm happy to yield my spot (and speak instead in April) if Vladimir is
>> > still prepared to talk. I'd love to get his work on Hazelcast
>> > integration on the record!
>> >
>> > By the way, will it be possible to record these talks and post them
>> > *as separate items* to (say) YouTube or Vimeo? I think it's important
>> > that they are separate, so that people can find the particular talk
>> > they are interested in.
>> >
>> > Julian
>> >
>> > On Tue, Jan 12, 2021 at 12:17 AM Vladimir Ozerov 
>> > wrote:
>> > >
>> > > Hi Stamatis,
>> > >
>> > > Sorry, my previous email was not clear enough. To clarify - I was
>> > thinking
>> > > about the mentioned next meetup in April because the schedule for the
>> > > January meetup appears to be too packed already (3 talks in 1:30h).
>> > > I am ready to talk in both January and April. Just want to ensure
>> that we
>> > > do not overwhelm our attendees.
>> > >
>> > > I'll provide the talk details in the nearest day.
>> > >
>> > > Regards,
>> > > Vladimir.
>> > >
>> > > On Tue, Jan 12, 2021 at 01:41, Stamatis Zampetakis wrote:
>> > >
>> > > > That would be great Vladimir, I will update the agenda.
>> > > >
>> > > > If possible please provide a title, duration, and abstract.
>> > > >
>> > > > Best,
>> > > > Stamatis
>> > > >
>> > > > On Mon, Jan 11, 2021 at 8:35 PM Vladimir Ozerov > >
>> > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > I can share our experience with Apache Calcite  integration into
>> > > > Hazelcast
>> > > > > distributed SQL engine.
>> > > > >
>> > > > > Regards,
>> > > > > Vladimir
>> > > > >
>> > > > > On Tue, Jan 5, 2021 at 00:48, Vineet G wrote:
>> > > > >
>> > > > > > Hi Stamatis,
>> > > > > >
>> > > > > > Something has come up and unfortunately I will not be able to
>> > present
>> > > > the
>> > > > > > talk.
>> > > > > >
>> > > > > > Vineet
>> > > > > >
>> > > > > > > On Jan 3, 2021, at 1:37 PM, Stamatis Zampetakis <
>> > zabe...@gmail.com>

Re: [DISCUSS] Apache Calcite Online Meetup January 2021

2021-01-12 Thread Vladimir Ozerov
Hi Stamatis, Julian,

Thank you for releasing the slot. Just to be crystal clear - I was very
late with my proposal, and have no urge to present in January. Both January
and April are perfectly fine with me. So please prioritize this talk over
others only if you see a really good reason to do so. In any case, I
confirm that I can present in January. Please find the talk details below.

*Speaker:* Vladimir Ozerov
*Title:* Apache Calcite integration in Hazelcast IMDG
*Abstract: *
Hazelcast IMDG is a distributed in-memory key-value store. In this talk, I
will present how we used Apache Calcite to create a new distributed SQL
engine that queries Hazelcast IMDG data.
We start with motivation and general design. Then we examine how Hazelcast
IMDG leverages Apache Calcite for query parsing, validation, and
optimization, and why we decided not to use Apache Calcite for JDBC and
query execution. We also discuss several Apache Calcite problems that
negatively affect the integration and propose possible future improvements.

Regards,
Vladimir.

On Wed, Jan 13, 2021 at 02:48, Stamatis Zampetakis wrote:

> Yesterday, I updated our website to also include Vladimir's talk about
> Hazelcast so I think it is better to keep it that way.
> Initially I had in mind something between 2 and 4 talks for this meetup
> (2-3h) so I think we are good to go.
>
> Let's now freeze the agenda to avoid changing it till the last minute.
> I will open up the discussion for the next meetup in another email so we
> start filling the slots for April.
>
> It's definitely useful to have the talks on camera. I will record them via
> Zoom and split them after the meetup.
> If the presenters give their consent I can upload them afterwards to
> Youtube/Vimeo.
>
> Best,
> Stamatis
>
>
> On Wed, Jan 13, 2021 at 12:15 AM Julian Hyde  wrote:
>
> > I'm happy to yield my spot (and speak instead in April) if Vladimir is
> > still prepared to talk. I'd love to get his work on Hazelcast
> > integration on the record!
> >
> > By the way, will it be possible to record these talks and post them
> > *as separate items* to (say) YouTube or Vimeo? I think it's important
> > that they are separate, so that people can find the particular talk
> > they are interested in.
> >
> > Julian
> >
> > On Tue, Jan 12, 2021 at 12:17 AM Vladimir Ozerov 
> > wrote:
> > >
> > > Hi Stamatis,
> > >
> > > Sorry, my previous email was not clear enough. To clarify - I was
> > thinking
> > > about the mentioned next meetup in April because the schedule for the
> > > January meetup appears to be too packed already (3 talks in 1:30h).
> > > I am ready to talk in both January and April. Just want to ensure that
> we
> > > do not overwhelm our attendees.
> > >
> > > I'll provide the talk details in the nearest day.
> > >
> > > Regards,
> > > Vladimir.
> > >
> > > On Tue, Jan 12, 2021 at 01:41, Stamatis Zampetakis wrote:
> > >
> > > > That would be great Vladimir, I will update the agenda.
> > > >
> > > > If possible please provide a title, duration, and abstract.
> > > >
> > > > Best,
> > > > Stamatis
> > > >
> > > > On Mon, Jan 11, 2021 at 8:35 PM Vladimir Ozerov 
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I can share our experience with Apache Calcite  integration into
> > > > Hazelcast
> > > > > distributed SQL engine.
> > > > >
> > > > > Regards,
> > > > > Vladimir
> > > > >
> > > > > On Tue, Jan 5, 2021 at 00:48, Vineet G wrote:
> > > > >
> > > > > > Hi Stamatis,
> > > > > >
> > > > > > Something has come up and unfortunately I will not be able to
> > present
> > > > the
> > > > > > talk.
> > > > > >
> > > > > > Vineet
> > > > > >
> > > > > > > On Jan 3, 2021, at 1:37 PM, Stamatis Zampetakis <
> > zabe...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > I updated the agenda on meetup to include Julian's talk around
> > > > spatial
> > > > > > > queries.
> > > > > > >
> > > > > > > So far we have four presentations lasting approximately 1h45
> so I
> > > > still
> > > > > > > find the duration reasonable.
> > > > > > >
> > > > >

Re: [DISCUSS] Apache Calcite Online Meetup January 2021

2021-01-12 Thread Vladimir Ozerov
Hi Stamatis,

Sorry, my previous email was not clear enough. To clarify - I was thinking
about the mentioned next meetup in April because the schedule for the
January meetup appears to be too packed already (3 talks in 1:30h).
I am ready to talk in both January and April. Just want to ensure that we
do not overwhelm our attendees.

I'll provide the talk details within the next day.

Regards,
Vladimir.

On Tue, Jan 12, 2021 at 01:41, Stamatis Zampetakis wrote:

> That would be great Vladimir, I will update the agenda.
>
> If possible please provide a title, duration, and abstract.
>
> Best,
> Stamatis
>
> On Mon, Jan 11, 2021 at 8:35 PM Vladimir Ozerov 
> wrote:
>
> > Hi,
> >
> > I can share our experience with Apache Calcite  integration into
> Hazelcast
> > distributed SQL engine.
> >
> > Regards,
> > Vladimir
> >
> > On Tue, Jan 5, 2021 at 00:48, Vineet G wrote:
> >
> > > Hi Stamatis,
> > >
> > > Something has come up and unfortunately I will not be able to present
> the
> > > talk.
> > >
> > > Vineet
> > >
> > > > On Jan 3, 2021, at 1:37 PM, Stamatis Zampetakis 
> > > wrote:
> > > >
> > > > I updated the agenda on meetup to include Julian's talk around
> spatial
> > > > queries.
> > > >
> > > > So far we have four presentations lasting approximately 1h45 so I
> still
> > > > find the duration reasonable.
> > > >
> > > > Of course if there are more people interested to present something we
> > can
> > > > schedule another meetup in April as Julian suggested.
> > > > I am always happy to see what other people are working on and more
> > > Calcite
> > > > use-cases.
> > > >
> > > > Best,
> > > > Stamatis
> > > >
> > > > On Sun, Jan 3, 2021 at 2:09 AM Julian Hyde 
> > > wrote:
> > > >
> > > >> In other news I’ll be co-presenting (with Mosha Pasumansky) a
> talk
> > > >> “Open source SQL - beyond parsers: ZetaSQL and Apache Calcite” at
> the
> > > >> Northwest Database Society Annual Meeting on January 29th. It’s
> > virtual
> > > and
> > > >> free, but you must sign up to attend.
> > > >>
> > > >> Julian
> > > >>
> > > >> [1] https://sites.google.com/view/nwds2021
> > > >>
> > > >>> On Jan 2, 2021, at 12:47 PM, Julian Hyde 
> > > wrote:
> > > >>>
> > > >>> I can give a talk “Implementing spatial queries using algebra
> > > >> rewrites”, 20 minutes.
> > > >>>
> > > >>> But if that makes the meetup too long, I am equally happy to
> postpone
> > > >> the talk. How about scheduling another meetup  in say April?
> > > >>>
> > > >>> Julian
> > > >>>
> > > >>>> On Dec 31, 2020, at 3:10 AM, Stamatis Zampetakis <
> zabe...@gmail.com
> > >
> > > >> wrote:
> > > >>>>
> > > >>>> I just published the event on Meetup [1].
> > > >>>>
> > > >>>> The agenda is not yet finalized so if there are people who would
> > like
> > > to
> > > >>>> give a talk or add/remove things from the agenda please reply to
> > this
> > > >>>> thread.
> > > >>>>
> > > >>>> Best,
> > > >>>> Stamatis
> > > >>>>
> > > >>>> [1] https://www.meetup.com/Apache-Calcite/events/275461117/
> > > >>>>
> > > >>>>>> On Mon, Nov 30, 2020 at 12:37 AM Rui Wang  >
> > > >> wrote:
> > > >>>>>
> > > >>>>> Title: event timestamp semantic based streaming SQL
> > > >>>>> Abstract: this talk will cover in Calcite Streaming SQL case, how
> > to
> > > >> reason
> > > >>>>> data completeness in terms of event timestamp semantic and how to
> > > >> control
> > > >>>>> materialization latency given unbounded input data (in Calcite
> > > roadmap
> > > >> but
> > > >>>>> not implemented yet).
> > > >>>>>
> > > >>>>> Duration: 20~30 mins
> > > >>>>>
> > > >>>>>> On Tue, Nov

Re: [DISCUSS] Apache Calcite Online Meetup January 2021

2021-01-11 Thread Vladimir Ozerov
Hi,

I can share our experience with Apache Calcite integration into Hazelcast
distributed SQL engine.

Regards,
Vladimir

On Tue, Jan 5, 2021 at 00:48, Vineet G wrote:

> Hi Stamatis,
>
> Something has come up and unfortunately I will not be able to present the
> talk.
>
> Vineet
>
> > On Jan 3, 2021, at 1:37 PM, Stamatis Zampetakis 
> wrote:
> >
> > I updated the agenda on meetup to include Julian's talk around spatial
> > queries.
> >
> > So far we have four presentations lasting approximately 1h45 so I still
> > find the duration reasonable.
> >
> > Of course if there are more people interested to present something we can
> > schedule another meetup in April as Julian suggested.
> > I am always happy to see what other people are working on and more
> Calcite
> > use-cases.
> >
> > Best,
> > Stamatis
> >
> > On Sun, Jan 3, 2021 at 2:09 AM Julian Hyde 
> wrote:
> >
> >> In other news I’ll be co-presenting (with Mosha Pasumansky) a talk
> >> “Open source SQL - beyond parsers: ZetaSQL and Apache Calcite” at the
> >> Northwest Database Society Annual Meeting on January 29th. It’s virtual
> and
> >> free, but you must sign up to attend.
> >>
> >> Julian
> >>
> >> [1] https://sites.google.com/view/nwds2021
> >>
> >>> On Jan 2, 2021, at 12:47 PM, Julian Hyde 
> wrote:
> >>>
> >>> I can give a talk “Implementing spatial queries using algebra
> >> rewrites”, 20 minutes.
> >>>
> >>> But if that makes the meetup too long, I am equally happy to postpone
> >> the talk. How about scheduling another meetup  in say April?
> >>>
> >>> Julian
> >>>
>  On Dec 31, 2020, at 3:10 AM, Stamatis Zampetakis 
> >> wrote:
> 
>  I just published the event on Meetup [1].
> 
>  The agenda is not yet finalized so if there are people who would like
> to
>  give a talk or add/remove things from the agenda please reply to this
>  thread.
> 
>  Best,
>  Stamatis
> 
>  [1] https://www.meetup.com/Apache-Calcite/events/275461117/
> 
> >> On Mon, Nov 30, 2020 at 12:37 AM Rui Wang 
> >> wrote:
> >
> > Title: event timestamp semantic based streaming SQL
> > Abstract: this talk will cover in Calcite Streaming SQL case, how to
> >> reason
> > data completeness in terms of event timestamp semantic and how to
> >> control
> > materialization latency given unbounded input data (in Calcite
> roadmap
> >> but
> > not implemented yet).
> >
> > Duration: 20~30 mins
> >
> >> On Tue, Nov 24, 2020 at 8:56 AM Slim Bouguerra 
> >> wrote:
> >>
> >> this is a great idea thanks @Statmatis looking forward to learning
> >> more
> >> about Calcite especially the Streaming work.
> >>
> >>> On Mon, Nov 23, 2020 at 2:19 PM Rui Wang 
> >> wrote:
> >>
> >>> Sorry for the late reply Statmatis. I have recently been pretty
> busy
> >> on
> >>> work as it is approaching the end of the year.
> >>>
> >>> The time in [1] works perfectly for me. I will share the abstract
> and
> >>> expected duration soon (should within this week).
> >>>
> >>>
> >>> -Rui
> >>>
> >>> On Fri, Nov 20, 2020 at 2:11 AM Stamatis Zampetakis <
> >> zabe...@gmail.com
> >>
> >>> wrote:
> >>>
>  That would be great Vineet!
> 
>  @Julian, @Rui, @Vineet:
>  Can you share a small abstract (2-3 sentences) and expected
> >> duration?
>  Can you check if the date/times proposed previously [1] work for
> >> you.
> >> If
>  not feel free to propose another slot.
> 
>  Best,
>  Stamatis
> 
>  [1] https://s.apache.org/uhrzo
> 
>  On Thu, Nov 19, 2020 at 6:18 PM Vineet Garg 
> > wrote:
> 
> > I think this is a great idea. +1 for the online meetup.
> >
> > If there are slots left I can also talk about how Hive leverages
> >>> Calcite
>  to
> > do query optimization.
> >
> > -Vineet
> >
> > On Fri, Nov 6, 2020 at 7:21 AM Stamatis Zampetakis <
> >> zabe...@gmail.com>
> > wrote:
> >
> >> Let's try to fix the date/time and tentative agenda so that we
> > can
> >>> add
> > some
> >> information on meetup [1].
> >>
> >> So far we have three presenters, Julian, Rui, and myself. We can
> >>> start
> > like
> >> that and if in the process there are more people interested to
> >> give a
> > small
> >> talk we can update the program.
> >>
> >> Let's try to get a date in the last two weeks of January to give
> >> us a
>  bit
> >> more time to prepare. Personally, I don't have a preference for
> >> that
> > being
> >> a business day or not and I am in UTC+1.
> >> For instance, how do you feel about Wednesday, 20 January 2021,
>  18:00:00
> > to
> >> 21:00 UTC+1 [2] ?
> 

[jira] [Created] (CALCITE-4461) Do not cast to logical node inside Enumerable rules

2021-01-11 Thread Vladimir Ozerov (Jira)
Vladimir Ozerov created CALCITE-4461:


 Summary: Do not cast to logical node inside Enumerable rules
 Key: CALCITE-4461
 URL: https://issues.apache.org/jira/browse/CALCITE-4461
 Project: Calcite
  Issue Type: Task
  Components: core
Affects Versions: 1.26.0
Reporter: Vladimir Ozerov
Assignee: Vladimir Ozerov
 Fix For: 1.27.0


Currently, some `Enumerable` rules work with the base operator classes, such as 
`Join`, while others cast to `Logical` counterparts, such as `LogicalJoin`, 
`LogicalProject`, etc. 

This makes it impossible to convert custom non-logical nodes into `Enumerable` 
using the built-in rules.

The proposal is to change all existing rules so that they work with the base 
`RelNode` classes.
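
For illustration, the kind of change involved (a sketch, not the exact rule 
code):

{code:java}
// Before: the rule both matches and casts to a specific logical class,
// which fails for custom non-logical join nodes.
Join join = (LogicalJoin) rel;

// After: only the base operator class is required.
Join join = (Join) rel;
{code}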





Re: Contributor rights

2021-01-11 Thread Vladimir Ozerov
Thank you very much, Francis!

On Sun, Jan 10, 2021 at 00:45, Francis Chuang wrote:

> Hey Vladimir,
>
> I've added you to the contributor role in jira.
>
> Francis
>
> On 9/01/2021 8:47 pm, Vladimir Ozerov wrote:
> > Hi,
> >
> > Could you please grant me contributor rights in Calcite JIRA? My username
> > is "vozerov".
> >
> > Thank you.
> > Vladimir.
> >
>


Contributor rights

2021-01-09 Thread Vladimir Ozerov
Hi,

Could you please grant me contributor rights in Calcite JIRA? My username
is "vozerov".

Thank you.
Vladimir.


Re: Need a JSON output of physical tree

2020-12-21 Thread Vladimir Ozerov
Hi Liya,

This will not work AFAIK. Consider the query "SELECT a FROM t WHERE b>1".
The top-level operator has only one column:

Root[$0]
  Project[$0]
Filter[$1>1]
  Scan[$0=a, $1=b]

If you invoke RelMdColumnOrigins on Root[$0], you will get [t.a], but miss
[t.b].
To my knowledge, rules are the only way to reliably constrain the columns
returned from the scan.
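
For reference, the metadata call being discussed (a minimal sketch; "root" is
the top operator of the plan above):

// Ask the metadata system for the origins of output column 0 of the root.
// Columns used only inside filters (like t.b above) are not reported.
RelMetadataQuery mq = root.getCluster().getMetadataQuery();
Set<RelColumnOrigin> origins = mq.getColumnOrigins(root, 0);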

Regards,
Vladimir.

On Tue, Dec 22, 2020 at 05:14, Fan Liya wrote:

> Hi Bhavya,
>
> Sorry I am not sure if I fully understand your question.
>
> Let me try to answer it according to my understanding:
>
> 1. Through RelMdColumnOrigins, we can get the RelColumnOrigin object, which
> includes a RelOptTable object.
> 2. The table scan also has a RelOptTable object, and all table scans of the
> plan can be found (e.g. through a visitor)
> 3. With the information of 1 and 2, given any output column, we can get to
> know it is derived from which columns from which table scans.
> 4. With the information of 3, given a table scan, we can get to know which
> column is never used in any output columns, and such columns can be pruned.
>
> Best,
> Liya Fan
>
>
> On Mon, Dec 21, 2020 at 11:31 PM Bhavya Aggarwal 
> wrote:
>
> > Hi Liya,
> >
> > I had a look at the RelMdColumnOrigins and it is useful in determining
> > which columns are from which table but still I am not sure how can I get
> > the column information for TableScan without the rules. If you have any
> > specific example where we have used this approach will be really helpful
> to
> > me.
> >
> > Thanks and regards
> > Bhavya
> >
> > On Mon, Dec 21, 2020 at 5:53 PM Fan Liya  wrote:
> >
> > > Hi Bhavya,
> > >
> > > IMO, to solve the problem from a global view, the following steps needs
> > to
> > > be taken:
> > >
> > > 1. Generate a physical plan in the original way (without considering
> > column
> > > pruning in the table scan)
> > > 2. Modify all the table scans in the plan with the RelMdColumnOrigins
> > > utility (the details have been described above)
> > > 3. Post process the plan with one of the following ways:
> > > a) a plan visitor that adjusts other operators in the tree.
> > > b) a light-weight planner (Hep or Volcano with limited rule sets)
> > >
> > > Run the query with the finally generated plan.
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > > On Mon, Dec 21, 2020 at 3:33 PM Bhavya Aggarwal 
> > > wrote:
> > >
> > > > Hi Fan,
> > > >
> > > > I looked at the class RelMdColumnOrigins and it is giving me the
> origin
> > > of
> > > > the column, but even if I want to take it as a global decision I am
> not
> > > > sure how to proceed. Can you please elaborate on how to achieve this
> ?
> > I
> > > am
> > > > literally stuck as I do not want to use so many rules as in any case
> I
> > > have
> > > > to pass these to the TableScan, even if the user does a select * from
> > > > table, I need to add all those columns to the table scan.
> > > >
> > > > Regards
> > > > Bhavya
> > > >
> > > > --
> > > > Your feedback matters - At Knoldus we aim to be very professional in
> > our
> > > > quality of work, commitment to results, and proactive communication.
> If
> > > > you
> > > > feel otherwise please share your feedback
> > > >  and we would work on it.
> > > >
> > >
> >
> >
> > --
> > *Bhavya Aggarwal*
> > CTO & Partner
> > Knoldus Inc. 
> > +91-9910483067
> > Canada - USA - India - Singapore
> >   >
> >  
> >
> > --
> > Your feedback matters - At Knoldus we aim to be very professional in our
> > quality of work, commitment to results, and proactive communication. If
> > you
> > feel otherwise please share your feedback
> >  and we would work on it.
> >
>


Re: Need a JSON output of physical tree

2020-12-18 Thread Vladimir Ozerov
Hi Liya,

I do not think this is a global decision. If you go bottom-up in the plan
and find a Project operator, then no parent operator will use any columns
from the lower scans other than those present in the Project.

It is a common practice for Calcite integrations to restrict the number of
columns returned from scans. It is achieved as follows:
1) There should be rules to extract projections from some operators (e.g.
Joins)
2) There should be rules to transpose projects with child operators
3) Finally, you should create a rule to move the project into the scan for
the pattern [Project <- Scan]

When all three prerequisites are ready, the optimizer will produce the missing
projects (if any), then push them down the operator tree, and finally merge
them into the scan.
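
A sketch of how these pieces could be registered (built-in rule names taken
from recent Calcite releases; the Project-into-Scan rule is typically custom,
so "MyProjectIntoScanRule" here is hypothetical):

// 1) and 2): produce projects and transpose them with child operators.
planner.addRule(CoreRules.PROJECT_JOIN_TRANSPOSE);
planner.addRule(CoreRules.PROJECT_FILTER_TRANSPOSE);
planner.addRule(CoreRules.PROJECT_MERGE);
// 3): custom rule matching [Project <- Scan] that rewrites the scan
// to return only the projected columns.
planner.addRule(MyProjectIntoScanRule.INSTANCE);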

Regards,
Vladimir

On Fri, Dec 18, 2020 at 13:34, Fan Liya wrote:

> Hi Bhavya,
>
> Thanks for your explanation.
>
> I do not think a rule like ProjectFilterTableScan solves the problem.
> To see this, please note that the parent operator of the LogicalProject may
> need other columns from the table
> (e.g. the parent operator may be a join, which uses some other columns as
> the join key).
>
> So the column pruning of table scan should be a global decision, depending
> on the whole plan.
>
> Calcite has RelMdColumnOrigins, which can help in this scenario.
> In particular, given an output column in the output operator, it gives the
> dependent columns in the underlying source tables.
> So columns not referenced from the final output can be pruned from the
> source table.
>
> Best,
> Liya Fan
>
>
> On Fri, Dec 18, 2020 at 4:43 PM Bhavya Aggarwal 
> wrote:
>
> > Hi Liya,
> >
> > Please see below , I have a query as given below
> >
> > SELECT t_1.ID FROM USERS t_1 where NAME = 'HELLO';
> >
> > Now I know that I have to fetch two columns from my columnar database
> i.e.
> > ID, Name, so need to pass these parameters in my custom Table Scan.
> > But when I convert the LogicalTableScan to my custom Table scan I do not
> > have that information, please see the tree below, in the Logical Table
> Scan
> > the Input is null, I was hoping that I will have the columns in the
> > LogicalTableScan, The question is what do I need to do in Calcite to pass
> > that information to the LogicalTableScan or in my physical operator. The
> > way I have found is that I create a Rule like ProjectFilterTableScan rule
> > and then extract what information I need. Is there a better way to do
> this?
> >
> > -Logical JSON Plan 
> > > {
> > >   "rels": [
> > > {
> > >   "id": "0",
> > >   "relOp": "LogicalTableScan",
> > >   "table": [
> > > "USERS"
> > >   ],
> > >   "inputs": []
> > > },
> > > {
> > >   "id": "1",
> > >   "relOp": "LogicalFilter",
> > >   "condition": {
> > > "op": {
> > >   "name": "=",
> > >   "kind": "EQUALS",
> > >   "syntax": "BINARY"
> > > },
> > > "operands": [
> > >   {
> > > "op": {
> > >   "name": "CAST",
> > >   "kind": "CAST",
> > >   "syntax": "SPECIAL"
> > > },
> > > "operands": [
> > >   {
> > > "input": 1,
> > > "name": "$1"
> > >   }
> > > ],
> > > "type": {
> > >   "type": "CHAR",
> > >   "nullable": true,
> > >   "precision": 5
> > > }
> > >   },
> > >   {
> > > "literal": "HELLO",
> > > "type": {
> > >   "type": "CHAR",
> > >   "nullable": false,
> > >   "precision": 5
> > > }
> > >   }
> > > ]
> > >   }
> > > },
> > > {
> > >   "id": "2",
> > >   "relOp": "LogicalProject",
> > >   "fields": [
> > > "ID"
> > >   ],
> > >   "exprs": [
> > > {
> > >   "input": 0,
> > >   "name": "$0"
> > > }
> > >   ]
> > > }
> > >   ]
> > > }
> > >
> > >> ---
> > >
> > >
> > Thanks and regards
> > Bhavya
> >
> > On Fri, Dec 18, 2020 at 11:54 AM Fan Liya  wrote:
> >
> > > Hi Bhavya,
> > >
> > > Sorry I do not understand your question. Why is it difficult to pass
> sth.
> > > to child operators?
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > > On Fri, Dec 18, 2020 at 12:38 PM Bhavya Aggarwal 
> > > wrote:
> > >
> > > > Hi Liya,
> > > >
> > > > Actually the question is how to pass the projection columns to Table
> > Scan
> > > > as right now in the LogicalTableScan there are no projection columns
> > > being
> > > > passed so when I am trying to create my custom JSON , I do not have
> > those
> > > > projected columns or columns that are being used in the query. I want
> > to
> > > > understand what is the calcite mechanism for passing it to child
> > > operators
> > > > without using Rules 

Re: Decouple core from linq4j and Avatica

2020-11-27 Thread Vladimir Ozerov
Hi colleagues,

Thank you for the valuable feedback. The problem is indeed complex. I share
the worry that complete decoupling might be too disruptive for users, since
they will observe compilation problems when migrating to the newer version,
and will have to update their dependencies, which could also be problematic
(e.g. due to security concerns). So I'd like to propose a slightly
different approach that should not cause any problems for the existing
users. We change the goal from complete decoupling to the *isolation* of
dependent classes.

Let me explain it with Avatica as an example. There are two classes of
Avatica-related dependencies in the core: (1) utilities (e.g. classes from
org.apache.calcite.avatica.util), and (2) logic (e.g. classes from
org.apache.calcite.jdbc, org.apache.calcite.adapter.jdbc). The first class
is very easy to eliminate. The second class cannot be eliminated without
serious repackaging of the whole Calcite. So we can do the following:

1. Introduce the "commons" module, and move utilities there, thus solving
(1).
2. Shade the "commons" module into the "core" during the build - if we do
this, the existing users will not have to change their dependencies, so
this is a critically important step (at least for now). An alternative to
this is just to copy-paste utility classes into the "core" module,
violating DRY.
3. Confine the outstanding Avatica dependencies to a couple of JDBC-related
packages, and add a static analysis rule to disallow Avatica classes in any
other package. This may require some advanced refactoring (e.g.
CalciteConnectionConfig)
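
Such a static check could be expressed, for example, with ArchUnit (an
assumption for illustration; any dependency-analysis tool would do, and the
package names are illustrative):

// Fail the build if Avatica is referenced outside the designated
// JDBC-related packages.
ArchRule rule = ArchRuleDefinition.noClasses()
    .that().resideOutsideOfPackage("org.apache.calcite.jdbc..")
    .should().dependOnClassesThat()
    .resideInAPackage("org.apache.calcite.avatica..");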

As a result, the Avatica dependency is reduced to a handful of packages, and
existing applications will work mostly seamlessly during migration. Now we
can do one of two things:
1. Either create a separate reduced artifact "core-reduced" without
Avatica-dependent packages
2. Since many products shade Calcite during the build, we can advise them
to just exclude Avatica-dependent packages when shading

How does it sound?

Regards,
Vladimir


On Wed, Nov 25, 2020 at 10:48, Chunwei Lei wrote:

> I like the idea. But I have the same worry as Haisheng.
>
>
> Best,
> Chunwei
>
>
> On Wed, Nov 25, 2020 at 3:07 PM Xin Wang  wrote:
>
> > +1 for this idea. We only use the parser/optimizer part.
> >
> > On Wed, Nov 25, 2020 at 2:38 PM, JiaTao Tao wrote:
> >
> > > +1 for this idea, I have been developing Calcite for a long
> time(counting
> > > during project Kylin), we all treat calcite as an optimizer, but we
> need
> > to
> > > consider overhead.
> > >
> > > I aggre with Stamatis: "since those dependencies were not causing any
> > real
> > > trouble."
> > >
> > >
> > > What really troubling me is that when we do some in logical, we may
> have
> > to
> > > consider the implemnt, for an example, we used keep "In", not convert
> to
> > > join or "OR", but calcite have no impl about "In".
> > >
> > >
> > > Regards!
> > >
> > > Aron Tao
> > >
> > >
> > >
> > > On Wed, Nov 25, 2020 at 12:57 PM, Haisheng Yuan wrote:
> > >
> > > > > I would like to propose to decouple the "core" module from "ling4j"
> > and
> > > > Avatica.
> > > > I like the idea.
> > > >
> > > > Moving Enumerable out of core may be time consuming and disruptive,
> > > > because many core tests are using Enumerable to verify plan quality
> and
> > > > correctness.
> > > >
> > > > Best,
> > > > Haisheng
> > > >
> > > > On 2020/11/24 23:42:19, Stamatis Zampetakis 
> wrote:
> > > > > Hi Vladimir,
> > > > >
> > > > > Personally, I like the idea.
> > > > > I had similar thoughts in the past but it didn't try to break it
> down
> > > > since
> > > > > those dependencies were not causing any real trouble.
> > > > >
> > > > > Let's see what the others think.
> > > > >
> > > > > Best,
> > > > > Stamatis
> > > > >
> > > > >
> > > > > On Tue, Nov 24, 2020 at 7:30 PM Vladimir Ozerov <
> ppoze...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi colleagues,
> > > > > >
> > > > > > Many Calcite integrations use only part of the framework.
> > > > Specifically, it
> > > > > > is common to use only the parser/optimizer part. JDBC and runtime
> > are

Decouple core from linq4j and Avatica

2020-11-24 Thread Vladimir Ozerov
Hi colleagues,

Many Calcite integrations use only part of the framework. Specifically, it
is common to use only the parser/optimizer part. JDBC and runtime are used
less frequently because they are not very well suited for mature processing
engines (e.g. Enumerable runs out of memory easily).

However, in order to use the parser/optimizer from the core module, you
also need to add the "linq4j" and Avatica modules to the classpath, which is
not convenient - why include modules that you do not use?

It turns out that most of the dependencies are indeed leaky abstractions
that could be decoupled easily. For example, the RelOptUtil class from the
"core" depends on ... two string constants from the Avatica module.

I would like to propose to decouple the "core" module from "linq4j" and
Avatica. For example, we may introduce a new "common" module that will
hold common constants, utility classes, and interfaces (e.g. Meta). Then,
we can organize the dependencies so that the other modules depend on it:
core -> common
linq4j -> common
Avatica -> common

Finally, we may shade and relocate the "common" module into the "core"
during the build. In the end, we will have two fewer runtime dependencies
with relatively little effort. In principle, the same approach could be
applied to the Janino and Jackson dependencies, but that would be more
complex, so my proposal is only about "linq4j" and Avatica.

How do you feel about it? Does this proposal make sense to the community? If
yes, I can try implementing a POC for this.

Regards,
Vladimir.


RelFieldTrimmer misses optimization opportunity for always false condition

2020-08-11 Thread Vladimir Ozerov
Hi colleagues,

Consider the following query. If the RelFieldTrimmer is applied to this,
then the expression is reduced to an empty LogicalValues:
SELECT field FROM table WHERE TRUE IS FALSE

However, the following query will not be simplified, leaving a table scan
with an "always false" filter:
SELECT * FROM table WHERE TRUE IS FALSE

After some debugging, I found that the problem is in the following piece of
code in the RelFieldTrimmer:

// If the input is unchanged, and we need to project all columns,
// there's nothing we can do.
if (newInput == input
&& fieldsUsed.cardinality() == fieldCount) {
  return result(filter, Mappings.createIdentity(fieldCount));
}

My question - is it a known issue? Looks like this early return from the
method misses an important optimization opportunity. Can this check be
removed completely?

Regards,
Vladimir


How to support case-insensitive identifier resolution?

2020-06-16 Thread Vladimir Ozerov
Hi colleagues,

I am trying to implement case-insensitive resolution of column/table/schema
names for our Apache Calcite integration in Hazelcast and got stuck. I hope
that the community might help me.

Consider that we have a table source that contains the following Java class:

class Employee {
    LocalDate birthDate;
}

I would like to have an SQL engine that could resolve identifiers in
accordance with ANSI SQL standard, which basically says that unquoted
identifiers should be compared in a case-insensitive way, while quoted
identifiers should be compared in a case-sensitive way. Let's focus on
columns only for now:

SELECT birthDate FROM employee // OK
SELECT birthdate FROM employee // OK
SELECT BIRTHDATE FROM employee // OK
SELECT `birthDate` FROM employee // OK
SELECT `birthdate` FROM employee // Fail
SELECT `BIRTHDATE` FROM employee // Fail

That is, my source is a collection of Java objects, and the natural name of
the column is "birthDate". But I would like it to be accessible as
"birthDate", birthDate, BIRTHDate, etc.

My problem comes from the fact that the casing configuration is applied during
parsing, and by the time the table is asked for a column, the information
about whether the user's identifier was quoted or not is lost. Consider that I
have a table RelDataType["birthDate"]. Now consider what happens with
different combinations of the casing configuration:
1) [unquoted=UNCHANGED, quoted=UNCHANGED]: "SELECT BIRTHDATE" doesn't work,
obviously
2) [unquoted=UPPER, quoted=UNCHANGED]: "SELECT BIRTHDATE" doesn't work
again, because the parser normalizes the unquoted identifier to upper case,
but the RelDataType has a column "birthDate"
3) Same as p.2, but with manual normalization of the RelDataType
to RelDataType["BIRTHDATE"]: "SELECT BIRTHDATE" works now, but "SELECT
`birthDate`" doesn't!

Is there any built-in solution to the above problem?

Regards,
Vladimir.


Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Vladimir Ozerov
 regular indexes. In this case, and many others, it is worth
> > thinking about the problem at a high level. For example, combining two
> > bitmap indexes can be modeled as a join, where the join keys are the
> > record ids, and the record ids are sorted within each index value.
> > Thinking at the high level can find plans that the rule-based approach
> > will never find.
> >
> > Indexes-as-MVs, indexes-as-tables, and
> > index-filtered-table-scan-as-join are other examples of the high-level
> > approach.
> >
> > Julian
> >
> >
> >
> > On Mon, Jun 1, 2020 at 12:00 PM Roman Kondakov
> >  wrote:
> > >
> > > Hi Xiening,
> > >
> > > the example was synthetic. What is still not clear for me is how to
> > > exploit index sortedness with a rule based approach. As far as I
> > > understand with this approach we need to write complex rules (for
> > > example [1]) that should decide which index is useful and which is not
> > > useful. These rules should take into account both statistics and
> > > collations, so they do some part of work that should be done by a cost
> > > model. And it makes writing such rules quite a difficult task.
> > >
> > > With a materialized views approach we can register all indexes as
> scans,
> > > push filters to them if needed. And the cost model, not a rule, will
> > > decide which index is better based on its cost and output collation.
> > >
> > > So the benefits of rule based approach are not so obvious to me. I
> would
> > > really appreciate if you could tell me in what cases rule-based
> approach
> > > is better. I understand that its definitely better in scenarios when
> the
> > > number of indexes is very high. But may be there are some other
> advantages?
> > >
> > > Thank you!
> > >
> > > [1]
> > >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/index/rules/DbScanToIndexScanPrule.java
> > >
> > > --
> > > Kind Regards
> > > Roman Kondakov
> > >
> > >
> > > On 01.06.2020 21:00, Xiening Dai wrote:
> > > > Hi Roman,
> > > >
> > > > The example you mentioned is an advanced scenario. Note that there
> are different types of index, such as clustered index, secondary index,
> covered and non-covered index. In your case, typical OLTP/OLAP optimizer
> would create an index-based join on top of the range table scan (or
> FilteredTableScan in your term). And these transformation can definitely be
> based on rules. But the difficult part is actually the statistics
> estimation and cost calculation. You could end up with higher runtime cost
> with index based join when join cardinality is high.
> > > >
> > > > But back to the original question, if we’d like to leverage index on
> table scan, I think simple rule would serve the purpose. In fact, we have
> FilterTableScanPredicatePushdownRule in our system which does exactly the
> same thing.
> > > >
> > > >> On May 31, 2020, at 12:45 PM, Roman Kondakov
>  wrote:
> > > >>
> > > >> Hi Vladimir,
> > > >>
> > > >> thank you for sharing your point. Could you please clarify some
> details
> > > >> with a rule-based index selection? You said
> > > >>
> > > >>> the fundamental problem with "indexes as materialized
> > > >>> views" approach is that you have to register them beforehand,
> instead of
> > > >>> using them only when needed.
> > > >>
> > > >> I agree, it's kind of a problem. What is not clear for me with
> > > >> IndexScanRule-based approach is how to decide when and which index
> we
> > > >> need? I understand that it is pretty easy to do in the case like
> this:
> > > >>
> > > >> Filter
> > > >>  Scan
> > > >>
> > > >> we can match the IndexScanRule on this pattern and do an index
> lookup
> > > >> using filter condition. But what to do in the more complex
> scenarios?
> > > >> Let's consider an example
> > > >>
> > > >> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
> > > >>
> > > >> where A.a, A.c and B.b are indexed fields. The logical plan for this
> > > >> query might look like this:
> > > >>
> > > >> LogicalJoin(A.a=B.b)
> > > 

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread Vladimir Ozerov
Hi Roman,

This heavily depends on the architecture of the planner. In Hazelcast, we
have separate logical and physical phases. The goal of the logical phase is
the normalization of the relational tree. In this case, your example is
converted to:

LogicalJoin
  LogicalConstrainedScan(A, c>100)
  LogicalScan(B)

Next, during physical planning, different implementations are considered.
For example, if table A has many indexes - sorted(a), hash(b), sorted(c) -
then it is possible to prune unnecessary access methods. E.g., hash(b) is
not considered because it doesn't add any interesting physical property and
is costlier than the other methods. At the same time, sorted(a) is still
considered, even though it has a higher cost than sorted(c), because it
provides an interesting collation.

That is the key difference from materialized views - indexes are considered
as needed.
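
To illustrate the pruning criterion (a toy sketch with hypothetical helper
types; not actual Hazelcast or Calcite code):

// Keep an index access path only if it is the cheapest option, or if it
// provides a collation that may be interesting to parent operators.
static boolean keep(IndexCandidate idx, List<IndexCandidate> all) {
  boolean cheapest =
      all.stream().allMatch(other -> idx.cost() <= other.cost());
  return cheapest || idx.providesInterestingCollation();
}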

On Sun, May 31, 2020 at 22:46, Roman Kondakov wrote:

> Hi Vladimir,
>
> thank you for sharing your point. Could you please clarify some details
> with a rule-based index selection? You said
>
> > the fundamental problem with "indexes as materialized
> > views" approach is that you have to register them beforehand, instead of
> > using them only when needed.
>
> I agree, it's kind of a problem. What is not clear for me with
> IndexScanRule-based approach is how to decide when and which index we
> need? I understand that it is pretty easy to do in the case like this:
>
> Filter
>   Scan
>
> we can match the IndexScanRule on this pattern and do an index lookup
> using filter condition. But what to do in the more complex scenarios?
> Let's consider an example
>
> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
>
> where A.a, A.c and B.b are indexed fields. The logical plan for this
> query might look like this:
>
> LogicalJoin(A.a=B.b)
>   LogicalFilter(A.c > 100)
> LogicalScan(A)
>   LogicalScan(B)
>
> as I understand (please correct me if I'm wrong), with the rule-based
> approach, after applying IndexScanRule the plan will look like this:
>
> LogicalJoin(A.a=B.b)
>   PhysicalIndexScan(A.c, lower bound = 100)
>   PhysicalTableScan(B)
>
> But in this case we lose the possibility of using index scans over A.a
> and B.b and joining them with MergeJoin, which can be more efficient
> plan in terms of the cost.
>
> My question is: how rule-based approach handle this scenario? Will it
> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-based
> approach?
>
> Thank you in advance!
>
>
> --
> Kind Regards
> Roman Kondakov
>
>
> On 31.05.2020 21:39, Vladimir Ozerov wrote:
> > As already mentioned, the fundamental problem with "indexes as
> materialized
> > views" approach is that you have to register them beforehand, instead of
> > using them only when needed. On the other hand, the complexity of index
> > planning comes from cost estimation and predicate splitting.
> > Materializations cannot help you with that anyhow. This is why I call
> this
> > approach (not materialized views per se) "hacky" - you reuse several
> simple
> > parts of the Calcite infrastructure at the cost of loss in the
> flexibility
> > of the planner, while the most complicated parts still need to be
> > implemented by hand.
> >
> > Materialized views could be a good fit e.g for partial indexes because in
> > this case, Calcite could help you with complex subsumption mechanics. But
> > for standard indexes, the pros/cons balance is not that obvious.
> >
> > вс, 31 мая 2020 г. в 19:28, xu :
> >
> >> Hi Tim,
> >>
> >> I am working on MySQL InnoDB adapter and trying to introduce this to
> >> Calcite, currently it is only in early stage, and not approved/reviewed
> by
> >> committers yet. Anyway, we are facing the same problem like what index
> to
> >> use, how to push down order by operation, etc. I have developed a simple
> >> rule based adapter to be "index aware" and being able to leverage a
> MySQL
> >> InnoDB storage engine written in Java. Hope this will help you to
> explore
> >> more options.
> >>
> >> https://issues.apache.org/jira/browse/CALCITE-4034
> >>
> >> Thanks,
> >> Xu
> >>
> >>> On Sun, May 31, 2020 at 10:06 PM, Haisheng Yuan wrote:
> >>
> >>> Hi Roman,
> >>>
> >>> Thank you for sharing your thoughts.
> >>>
> >>>> It can be very tricky because the rule should consider not
> >>>> only filters, but also collations. This leads to increa
