Rewrite rule to convert self-joins into scans

2023-04-15 Thread Julian Hyde
Ian Bertolacci recently logged
https://issues.apache.org/jira/browse/CALCITE-5631, to convert

  select
 (select numarrayagg(C5633_203) from T893 where C5633_586 = T895.id),
 (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id)
  from T895

into

  select agg.agg1,
  agg.agg2
  from T895
  left join (
select C5633_586,
numarrayagg(C5633_203) as agg1,
numarrayagg(C5633_170) as agg2
from T893
where C5633_586 is not null
group by C5633_586) as agg
  on agg.C5633_586 = T895.id

This seems to me an interesting and important problem. But it's also a
hard problem, and it's not clear to me which approach is the best.
Does anyone have any ideas for how to approach it?

Also, we could use more example queries that illustrate the general
pattern.  (Preferably in terms of simple databases such as EMP and
DEPT.)

In Calcite rewrite rules (RelRule) are usually the preferred approach.
Because the common relational expressions scans can be an arbitrary
distance apart in the RelNode tree, RelRule doesn't seem suitable.

There seem to be some similarities to algorithms to use materialized
views, which use bottom-up unification.

Ian's original query actually has correlated scalar sub-queries rather
than explicit joins. Would it be better to target common sub-queries
rather than joins?

Lastly, there are similarities with the WinMagic algorithm, which
converts correlated sub-queries into window aggregates. Is that a
useful direction? (My implementation of measures in CALCITE-4496
naturally creates correlated scalar sub-queries that can be inlined in
the enclosing query if simple, or converted to window aggregates if
more complex.)

Julian


Re: Template SQL

2023-04-15 Thread Soumyadeep Mukhopadhyay
Hello Askar,

Thank you so much for taking the time to compile this for me.

I shall be indebted to you for this effort. Please let me know if I can be
of any assistance ever to repay your kindness.

I appreciate all your help and guidance. Thank you once again. I shall go
through the same and will get back to you as soon as possible. Thank you
again sir! :)

With best regards,
Soumyadeep Mukhopadhyay.


On Sat, 15 Apr 2023 at 11:52 PM, Askar Bozcan  wrote:

> Hello again!
> I have prepared this gist to perhaps help you understand how to traverse
> through a Sql parse tree:
> https://gist.github.com/askarbozcan/6ffc01b465550e171a95074308cab40f
>
> I have implemented a Filterer class that returns false on any queries which
> have 'HAVING COUNT > x' where x is <= 100, and true for all other queries.
> I have used Kotlin, so if you have difficulties understanding it (as you're
> using Scala), check out Kotlin's smart cast and nullability.
>
> Good luck and hopefully that was helpful,
> Askar Bozcan
>
> On Wed, 12 Apr 2023 at 23:37, Soumyadeep Mukhopadhyay <
> soumyamy...@gmail.com>
> wrote:
>
> > Hi Askar,
> >
> > I have gone through your email a couple of times. I am trying to
> > understand it bit by bit.
> >
> > I have a request, please feel free to say no, is there any code base
> where
> > I can refer and understand how to implement my own filter rule?
> > What I realised I asked wrong is I wanted to implement "(HAVING COUNT(*)
> > > 100)" as a global rule, if there are any aggregation queries.
> >
> > It is a bit difficult for me to understand where I should start.
> > For example, if I follow this
> https://github.com/zabetak/calcite-tutorial where
> > shall I start looking at?
> > In the below image I feel like some kind of comparison is going on,
> should
> > I devise my code like this? (in order to implement something like
> "(HAVING
> > COUNT(*) > 100)")
> > [image: Screenshot 2023-04-11 at 5.38.43 PM.png]
> >
> > Also is there any documentation I can go through regarding how I can
> > traverse through the AST?
> > Probably I am not looking in the right places but so far I could only go
> > through Tab9 code examples (or the documentation provided by the Calcite
> > website) and things did not seem to be clear.
> > I realise it is probably a lot to ask, so whatever you share will be a
> lot
> > of help for me.
> >
> > Thanks again for your time, patience and help!
> >
> > With regards,
> > Soumyadeep Mukhopadhyay.
> >
> > On Fri, Apr 7, 2023 at 8:33 PM Askar Bozcan 
> wrote:
> >
> >> Hey,
> >> You can use Planner.reset(). Note that it has mutable state inside, so
> do
> >> not reuse the same planner instance in any multi-threading environment.
> >>
> >> (I am assuming you have access to table metadata so that you will be
> able
> >> to convert your SqlNode tree into RelNode tree, relational expression
> >> tree)
> >> - Only return results above a certain threshold when using GROUP BY, for
> >> example (HAVING COUNT(col1) > 100).
> >>
> >> I'm not quite sure I understand this question is HAVING part of the
> query?
> >>
> >> - Restrict the column on which joins can happen, or else throw an error
> >> (almost like analysisException in Spark)
> >>
> >> Do you have access to table/schema metadata?
> >> If you do:
> >> 1) Convert your parsed syntax tree (SqlNode) into a logical relational
> >> tree
> >> (RelNode).
> >> Watch this tutorial by Stamatis:
> >> https://www.youtube.com/watch?v=p1O3E33FIs8.
> >> It will explain way better the usage than I can in an email. (And it
> what
> >> introduced me to Calcite's basics :))
> >> 2) Traverse your relational tree by implementing RelShuttle
> >> <
> >>
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/RelShuttle.html
> >> >.
> >> (Look up Visitor pattern to better understand how this recursive
> traversal
> >> works)
> >> 3) Ignore any RelNode's which are not LogicalJoin.
> >> 4) When you encounter LogicalJoin, traverse its children with getLeft
> and
> >> getRight
> >> 5) When you encounter LogicalTableScan,  You can get its RelOptTable
> with
> >> getTable, and then RelOptTable.getRowType() to find the fields.
> >> (Not 100% about step 5, but should be close to it)
> >>
> >>
> >> - Restrict the columns that can be passed in a select statement or else
> >> throw an error (like in the point above)
> >> Same logic as above. But instead of TableScan, look at the root
> >> LogicalProject.
> >>
> >> All of the above should be doable with just the parse tree (SqlNode) and
> >> without access to table metadata using SqlShuttle; however, it's easier
> >> and
> >> less error-prone with relational tree IMO.
> >>
> >> - Askar
> >>
> >>
> >> On Thu, 6 Apr 2023 at 23:31, Soumyadeep Mukhopadhyay <
> >> soumyamy...@gmail.com>
> >> wrote:
> >>
> >> > Hey Askar,
> >> >
> >> > It worked exactly how you suggested.
> >> >
> >> > ```
> >> >
> >> > private val sqlQuery: String = "SELECT list_of_columns FROM table_name
> >> > WHERE predicate_column = 

Re: Template SQL

2023-04-15 Thread Askar Bozcan
Hello again!
I have prepared this gist to perhaps help you understand how to traverse
through a Sql parse tree:
https://gist.github.com/askarbozcan/6ffc01b465550e171a95074308cab40f

I have implemented a Filterer class that returns false on any queries which
have 'HAVING COUNT > x' where x is <= 100, and true for all other queries.
I have used Kotlin, so if you have difficulties understanding it (as you're
using Scala), check out Kotlin's smart cast and nullability.

Good luck and hopefully that was helpful,
Askar Bozcan

On Wed, 12 Apr 2023 at 23:37, Soumyadeep Mukhopadhyay 
wrote:

> Hi Askar,
>
> I have gone through your email a couple of times. I am trying to
> understand it bit by bit.
>
> I have a request, please feel free to say no, is there any code base where
> I can refer and understand how to implement my own filter rule?
> What I realised I asked wrong is I wanted to implement "(HAVING COUNT(*)
> > 100)" as a global rule, if there are any aggregation queries.
>
> It is a bit difficult for me to understand where I should start.
> For example, if I follow this https://github.com/zabetak/calcite-tutorial 
> where
> shall I start looking at?
> In the below image I feel like some kind of comparison is going on, should
> I devise my code like this? (in order to implement something like "(HAVING
> COUNT(*) > 100)")
> [image: Screenshot 2023-04-11 at 5.38.43 PM.png]
>
> Also is there any documentation I can go through regarding how I can
> traverse through the AST?
> Probably I am not looking in the right places but so far I could only go
> through Tab9 code examples (or the documentation provided by the Calcite
> website) and things did not seem to be clear.
> I realise it is probably a lot to ask, so whatever you share will be a lot
> of help for me.
>
> Thanks again for your time, patience and help!
>
> With regards,
> Soumyadeep Mukhopadhyay.
>
> On Fri, Apr 7, 2023 at 8:33 PM Askar Bozcan  wrote:
>
>> Hey,
>> You can use Planner.reset(). Note that it has mutable state inside, so do
>> not reuse the same planner instance in any multi-threading environment.
>>
>> (I am assuming you have access to table metadata so that you will be able
>> to convert your SqlNode tree into RelNode tree, relational expression
>> tree)
>> - Only return results above a certain threshold when using GROUP BY, for
>> example (HAVING COUNT(col1) > 100).
>>
>> I'm not quite sure I understand this question is HAVING part of the query?
>>
>> - Restrict the column on which joins can happen, or else throw an error
>> (almost like analysisException in Spark)
>>
>> Do you have access to table/schema metadata?
>> If you do:
>> 1) Convert your parsed syntax tree (SqlNode) into a logical relational
>> tree
>> (RelNode).
>> Watch this tutorial by Stamatis:
>> https://www.youtube.com/watch?v=p1O3E33FIs8.
>> It will explain way better the usage than I can in an email. (And it what
>> introduced me to Calcite's basics :))
>> 2) Traverse your relational tree by implementing RelShuttle
>> <
>> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/RelShuttle.html
>> >.
>> (Look up Visitor pattern to better understand how this recursive traversal
>> works)
>> 3) Ignore any RelNode's which are not LogicalJoin.
>> 4) When you encounter LogicalJoin, traverse its children with getLeft and
>> getRight
>> 5) When you encounter LogicalTableScan,  You can get its RelOptTable with
>> getTable, and then RelOptTable.getRowType() to find the fields.
>> (Not 100% about step 5, but should be close to it)
>>
>>
>> - Restrict the columns that can be passed in a select statement or else
>> throw an error (like in the point above)
>> Same logic as above. But instead of TableScan, look at the root
>> LogicalProject.
>>
>> All of the above should be doable with just the parse tree (SqlNode) and
>> without access to table metadata using SqlShuttle; however, it's easier
>> and
>> less error-prone with relational tree IMO.
>>
>> - Askar
>>
>>
>> On Thu, 6 Apr 2023 at 23:31, Soumyadeep Mukhopadhyay <
>> soumyamy...@gmail.com>
>> wrote:
>>
>> > Hey Askar,
>> >
>> > It worked exactly how you suggested.
>> >
>> > ```
>> >
>> > private val sqlQuery: String = "SELECT list_of_columns FROM table_name
>> > WHERE predicate_column = 'predicate_value'"
>> > private val frameworkConfig: FrameworkConfig =
>> > Frameworks.newConfigBuilder.build()
>> > private val planner: Planner = Frameworks.getPlanner(frameworkConfig)
>> > private val planner2: Planner = Frameworks.getPlanner(frameworkConfig)
>> > private val planner3: Planner = Frameworks.getPlanner(frameworkConfig)
>> > private val sqlNode: SqlNode = planner.parse(sqlQuery)
>> > println(sqlNode.getKind)
>> > private val sqlSelectStmt: SqlSelect = sqlNode.asInstanceOf[SqlSelect]
>> >
>> > private val setSelectColumnsQuery = "SELECT age"
>> > private val selectList =
>> >
>> planner2.parse(setSelectColumnsQuery).asInstanceOf[SqlSelect].getSelectList
>> > private val setFromTableQuery = "SELECT employee"
>> > 

[jira] [Created] (CALCITE-5652) How to add a database-specific comment symbol

2023-04-15 Thread Zine eddine Zidane (Jira)
Zine eddine Zidane created CALCITE-5652:
---

 Summary: How to add a database-specific comment symbol
 Key: CALCITE-5652
 URL: https://issues.apache.org/jira/browse/CALCITE-5652
 Project: Calcite
  Issue Type: New Feature
  Components: core
Affects Versions: 1.34.0
Reporter: Zine eddine Zidane


Is there a clean way to add a database-specific comment symbol? for example, 
the hash symbol "#" is used for inline comments on Mysql, Postgres, and Oracle.

A straightforward way is to just add it to the regular expression definition of 
the SINGLE_LINE_COMMENT keyword below but is this will make it a valid symbol 
regardless of a database which is not the intent
{code:java}
 SKIP :
{

} {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Running Sql Logic Tests for Calcite

2023-04-15 Thread Michael Mior
Very cool! One approach could be to add set these tests to run periodically
(daily/weekly) as opposed to being part of the CI pipeline. That way we
still have a mechanism to keep tabs on bugs but the whole build isn't
slow/broken until this is fixed.

On Fri, Apr 14, 2023, 15:20 Mihai Budiu  wrote:

> Hello all,
>
> I have submitted a PR for Calcite with a standalone executable that runs
> the Sql Logic Test suite of 7+ million tests from sqlite.
>
> This is the JIRA case: https://issues.apache.org/jira/browse/CALCITE-5615
> And this is the PR: https://github.com/apache/calcite/pull/3145
>
> As Stamatis pointed out, the PR isn't really specific to Calcite, it is a
> general framework in Java to run these tests on any JDBC compliant
> executor. So a question is whether this belongs to the Calcite project, or
> some place else. sqlite is a C project, I didn't see any Java in their
> source tree.
>
> Please note that SQLite is in the public domain, so their licensing terms
> are not an obstacle to using the test scripts.
>
> The submitted code runs Calcite in its default configuration, but the
> intent is for other projects that build Calcite-based compilers to be able
> to test them by subclassing the "TestExecutors". In our own project (
> https://github.com/vmware/sql-to-dbsp-compiler) we have done exactly that,
> and we are not using the JDBC API.
>
> The testsuite does find bugs in Calcite, both crashes and incorrect
> results. So I think it's usefulness is not debated.
>
> The second question is about the packaging of this program; right now it
> has a main() entry point and it prints the results to stderr for human
> consumption and triage. It is not clear to me how it should be inserted in
> a CI infrastructure, since running all 7 million tests could take a long
> time. One possible extension would be to have the program generate a
> regression test for Calcite for each bug it finds, but I haven't
> implemented this feature yet (and many failures could be due to the same
> bug). But even that mode would not naturally integrate in a CI
> infrastructure.
>
> A simple possibility is for me to just publish the code as an independent
> project on github with an MIT license (the code is derived from our
> MIT-licensed project) and just advertise it to the Calcite community.
>
> I would very much appreciate guidance.
>
> Mihai Budiu
>