Re: [VOTE] Release Apache Calcite 1.34.0 (release candidate 0)

2023-03-13 Thread Scott Reynolds
 +1 (non-binding)

 * Downloaded
 * checked LICENSE, NOTICE, README
 * compiled and ran tests on MacOS
 * reviewed release notes
 * Ran Twilio's internal test suite

Twilio's internal test suite is affected by CALCITE-5510 -- Order by using
literals in RelToSqlConverter -- but we're happy to make this change. Thanks
for the release!

On Fri, Mar 10, 2023 at 3:33 AM Stamatis Zampetakis 
wrote:

> Hi all,
>
> I have created a build for Apache Calcite 1.34.0, release
> candidate 0.
>
> Thanks to everyone who has contributed to this release.
>
> You can read the release notes here:
>
> https://github.com/apache/calcite/blob/calcite-1.34.0-rc0/site/_docs/history.md
>
> The commit to be voted upon:
>
> https://gitbox.apache.org/repos/asf?p=calcite.git;a=commit;h=7dfd641baeb0e1b26dec04da5241c3999fe0ac6a
>
> Its hash is 7dfd641baeb0e1b26dec04da5241c3999fe0ac6a
>
> Tag:
> https://github.com/apache/calcite/tree/calcite-1.34.0-rc0
>
> The artifacts to be voted on are located here:
> https://dist.apache.org/repos/dist/dev/calcite/apache-calcite-1.34.0-rc0
> (revision 60513)
>
> The hashes of the artifacts are as follows:
>
> 1b4733229b67e3241329c24c52a79122e1834a4ef1db9c25856281de7e6db8a80a6ae6ac07d28dd05d3ee43c0a79587cc280ebfc6025b4ba3d974e38e804b47b
> *apache-calcite-1.34.0-src.tar.gz
>
> A staged Maven repository is available for review at:
>
> https://repository.apache.org/content/repositories/orgapachecalcite-1200/org/apache/calcite/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/zabetak.asc
> https://www.apache.org/dist/calcite/KEYS
>
> To create the jars and test Apache Calcite: "gradle build"
> (requires an appropriate Gradle/JDK installation)
>
> Please vote on releasing this package as Apache Calcite 1.34.0.
>
> The vote is open for the next 96 hours (due to the weekend) and passes if a
> majority of at
> least three +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Calcite 1.34.0
> [ ]  0 I don't feel strongly about it, but I'm okay with the release
> [ ] -1 Do not release this package because...
>
> Here is my vote:
>
> +1 (binding)
>
> Stamatis
>


Re: Output column names of Calc discarded by VolcanoPlanner in Calcite 2.22+

2023-02-03 Thread Scott Reynolds
Could you explain more how this affects Beam? Can Beam capture the row type
from the logical plan and pass it through to the output phase?

In my company's project, that is what we did. The project crafts the
logical plan, grabs the output type, executes the planner and passes the
resulting RelNode and the output type from the logical plan into the
execution phase.
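
In rough outline it looks something like the sketch below (the helper
methods are illustrative placeholders, not our actual API; the key point is
that the row type is captured from the logical plan before optimization):

    // capture the declared output type before the planner rewrites the tree
    RelNode logicalPlan = buildLogicalPlan(sql);          // hypothetical helper
    RelDataType outputType = logicalPlan.getRowType();    // field names still intact here
    RelNode physicalPlan = runPlanner(logicalPlan);       // hypothetical helper
    // field names come from outputType, rows come from the physical plan
    execute(physicalPlan, outputType);                    // hypothetical helper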

On Thu, Feb 2, 2023 at 1:56 PM Andrew Pilloud 
wrote:

> A Beam SQL user found an issue where we are discarding their output field
> names that appears to be related to a change to the VolcanoPlanner to treat
> rel nodes that only differ by field names as equivalent:
>
> https://github.com/apache/calcite/commit/1e9b4da0573ec73d332d4e65fb7fd30491b4318d#diff-008c6d52bfd93bbe963a23c264bc412c68cac3b4837e3f10b8d5e4858cd4acb8L1136
>
> This creates an interaction between CalcMergeRule and the VolcanoPlanner
> when there are two equivalence sets with a Calc that only differs by output
> column name. The sets are merged and the output column rename is lost. This
> occurs with both Calcite's LogicalCalc and BeamCalc with a series of three
> mergeable Calc nodes.
>
> The minimal reproduction I've derived in Beam involves running the planner
> with CoreRules.FILTER_TO_CALC, CoreRules.PROJECT_TO_CALC,
> CoreRules.CALC_MERGE, and BeamCalcRule.INSTANCE using this test query:
> WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM
> PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable WHERE
> id >= 1
>
> It creates a logical plan like this:
> LogicalProject(fout_int=[$0], fout_string=[$1])
>   LogicalFilter(condition=[>=($0, 1)])
> LogicalProject(id=[$0], v=[$1])
>   BeamIOSourceRel(table=[[beam, PCOLLECTION]])
>
> And the planner creates a physical plan like this (with what started as the
> top LogicalProject dropped):
> BeamCalcRel(expr#0..1=[{inputs}], expr#2=[1], expr#3=[>=($t0, $t2)],
> proj#0..1=[{exprs}], $condition=[$t3])
>   BeamIOSourceRel(table=[[beam, PCOLLECTION]])
>
> I was able to work around it by limiting calc merging rules to our BeamCalc
> and adding field names back to the equivalency check there.
>
> https://github.com/apache/beam/pull/25290/files#diff-ead622461b5c25264d0c680fcacde454ff457b8c05dc73164cafd298573f56bcR58
>
> This seems like a bug to me but we've previously been told that Calcite
> doesn't promise to retain field names so I assume that Calcite doesn't
> consider this a bug? See
>
> https://issues.apache.org/jira/browse/CALCITE-1584?focusedCommentId=16031351&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16031351
>


[jira] [Created] (CALCITE-5391) JoinOnUniqueToSemiJoinRule removes a Projection's aliases

2022-11-18 Thread Scott Reynolds (Jira)
Scott Reynolds created CALCITE-5391:
---

 Summary: JoinOnUniqueToSemiJoinRule removes a Projection's aliases
 Key: CALCITE-5391
 URL: https://issues.apache.org/jira/browse/CALCITE-5391
 Project: Calcite
  Issue Type: Bug
Affects Versions: 1.32.0
Reporter: Scott Reynolds
Assignee: Scott Reynolds


The new rule pushes a copy of the [original project without the alias
names|https://github.com/apache/calcite/pull/2848/files#r1026959819]:

{code:java}
builder.project(project.getProjects());
{code}

This results in a new SQL plan without the field names -- they become {{$fN}}
fields in the resulting plans.

A small change is required to apply the names from the {{RelDataType}} of the
{{Project}} similar to the other {{Semijoin}} rule.
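
A sketch of the one-line fix (assuming the rule keeps a reference to the original 
{{Project}}; {{RelBuilder.project}} accepts field names alongside the expressions):

{code:java}
// carry the original aliases through the rewrite
builder.project(project.getProjects(),
    project.getRowType().getFieldNames());
{code}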



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (CALCITE-5226) Resolve security Vuln in Commons-DBCP

2022-08-03 Thread Scott Reynolds (Jira)
Scott Reynolds created CALCITE-5226:
---

 Summary: Resolve security Vuln in Commons-DBCP
 Key: CALCITE-5226
 URL: https://issues.apache.org/jira/browse/CALCITE-5226
 Project: Calcite
  Issue Type: Bug
Reporter: Scott Reynolds


In DBCP-562, information is leaked. This was fixed in 2021. We should bump
to the latest version to resolve it.

 

[https://ossindex.sonatype.org/vulnerability/sonatype-2020-1349?component-type=maven&component-name=org.apache.commons%2Fcommons-dbcp2&utm_source=ossindex-client&utm_medium=integration&utm_content=1.7.0]

[https://ossindex.sonatype.org/vulnerability/sonatype-2020-0460?component-type=maven&component-name=org.apache.commons%2Fcommons-dbcp2&utm_source=ossindex-client&utm_medium=integration&utm_content=1.7.0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (CALCITE-5025) Update commons-io:commons-io Directory Traversal vulnerability

2022-02-26 Thread Scott Reynolds (Jira)
Scott Reynolds created CALCITE-5025:
---

 Summary: Update commons-io:commons-io Directory Traversal
vulnerability
 Key: CALCITE-5025
 URL: https://issues.apache.org/jira/browse/CALCITE-5025
 Project: Calcite
  Issue Type: Bug
Reporter: Scott Reynolds


Calcite depends on commons-io:commons-io 2.4 -- which was released on
{{2012-06-12}} -- and which can be exploited to access parent directories. In
recent months, there have been a fair number of releases for this package and
[Snyk lists this as the only vulnerability it has
seen|https://snyk.io/vuln/maven:commons-io:commons-io].

The task is simple: bump the version to 2.7 or higher -- if I may suggest, just
go to 2.11.0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: dynamic reflective schema

2022-02-22 Thread Scott Reynolds
Whenever I see "at Baz.bind(Unknown Source)" I am reminded how important it
is, when debugging, to set the System Property calcite.debug = true
https://github.com/apache/calcite/blob/a8a6569e6ba75efe9d5725c49338a7f181d3ab5c/core/src/main/java/org/apache/calcite/config/CalciteSystemProperty.java#L59

That will get you a good stack trace and will help you figure out what the
generated code is attempting to do.
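
For example (a minimal sketch; the equivalent JVM flag is -Dcalcite.debug=true):

    // print significantly more debug information, including the generated code
    System.setProperty("calcite.debug", "true");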

On Tue, Feb 22, 2022 at 5:35 AM xiaobo  wrote:

> we have static data combination use cases, reflective schema is the
> best choice!
>
>
>
> ---Original---
> From: "Gavin Ray" Date: Mon, Feb 21, 2022 01:52 AM
> To: "dev" Subject: Re: dynamic reflective schema
>
>
> Ah, you don't want to use ReflectiveSchema, it's a simple schema type meant
> for easily making test schemas
>
> You want to extend from "AbstractSchema" and override the function
> "Map For the "Table" type, you probably want to use "JsonScannableTable"
>
> The CsvSchema example does exactly this, if you want to see an example
> implementation:
>
> https://github.com/apache/calcite/blob/4bc916619fd286b2c0cc4d5c653c96a68801d74e/example/csv/src/main/java/org/apache/calcite/adapter/csv/CsvSchema.java#L69-L106
>
> Hope this helps =)
>
>
>
> On Sat, Feb 19, 2022 at 11:03 PM xiaobo  wrote:
>
> > Hi,
> > When using reflectiveSchema we must define a static Java class for the
> > schema object, but in our use case the actual data to query is dynamic,
> > and defining a Java class for each data combination is impossible. We have
> > an idea: can we make a JSONSchema which accepts a Map<String,
> > List<JSONObject>>, with each JSONObject mapped to a row and the JSONObject
> > keys as the column names? Is there anything similar to this?
> >
> >
> > Regards


Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

2022-02-17 Thread Scott Reynolds
Is it feasible to prevent the filter pushdown in cases like this (detect
potential division by zero)? What other exceptions could happen?

Should it only push down some filters and preserve the complex filter above
the join?

Regarding a Jira ticket: generating an exception when the query should produce
valid results has got to be a Jira case, IMHO.




On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis  wrote:

> Hi all,
>
> Till today, I believed that pushing a filter below an inner join is not
> strictly related to the operators (equals, plus, minus, cast, division)
> present in the filter.
>
> However, the query below will return some results if the filter is executed
> after the join or it will fail with an exception if the filter is pushed
> below the join.
>
> EMP [EMPNO, NAME, DEPTNO]
> 0   | Alex | 0
> 10 | Bob | 1
>
> DEP [DEPTNO]
> 1
>
> SELECT e.name
> FROM emp e
> INNER JOIN dep d ON e.deptno = d.deptno
> WHERE (10 / e.empno) = 1
>
> It seems that SQL standard (Section 7.4 general rule 1) mandates that WHERE
> should be applied to the result of FROM so in the case above pushing
> filters below the join seems to violate the standard.
>
> Currently, in Calcite we are going to push the division, cast, and
> basically any kind of operator below the join. Many well-known databases
> e.g., Postgres, do that as well.
>
> Should we log it as a bug and try to do something about it or let it be as
> it is?
>
> Best,
> Stamatis
>


Re: Apache Calcite - Generated code

2022-01-07 Thread Scott Reynolds
I am going to attempt to answer a few of your questions. The Enumerable
implementation generates Java code as a String. There is a Java Property that
controls a cache keyed by that unique Java code String; therefore, when the
same Java code is generated, the same compiled byte code will be reused.

Where the Property is used
https://github.com/apache/calcite/blob/9c0e3130e6692d1960a34a680dc13d11083ff1c8/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableInterpretable.java#L159-L166

The Property definition:
https://github.com/apache/calcite/blob/a8a6569e6ba75efe9d5725c49338a7f181d3ab5c/core/src/main/java/org/apache/calcite/config/CalciteSystemProperty.java#L353

Therefore, if you set the calcite.bindable.cache.maxSize Java Property (however
you choose to set the Property), you will get a cache of the byte code.
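
A minimal sketch of setting it programmatically (the equivalent JVM flag is
-Dcalcite.bindable.cache.maxSize=1000):

    // cache up to 1,000 compiled Bindable classes, keyed by the generated code string
    System.setProperty("calcite.bindable.cache.maxSize", "1000");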

>requires changing the query and the data at run time

I am unclear on what this means, so let me explain. Calcite takes in a
Logical Query and will "rewrite it" (better known as optimize it) via
RelRules (
https://github.com/apache/calcite/blob/a8a6569e6ba75efe9d5725c49338a7f181d3ab5c/core/src/main/java/org/apache/calcite/plan/RelRule.java#L112).
Once the query is optimized, the Java code String is generated, compiled,
and cached. This means every query goes through the optimization process,
so if you want to change how it fetches the data, the most straightforward
place is your own RelRule.

On Fri, Jan 7, 2022 at 9:11 AM M Singh  wrote:

> Hi:
> I am working on a project that requires changing the query and the data at
> run time.  The data to be processed will be stored in memory as a list of
> strings.  I am using java 8 at the moment.
> I wanted to understand how the classes are generated in Calcite using
> janino at run time.
> Questions:
> 1. If the query is executed on the same data twice, does it generate the
> code twice? If so, are all the classes regenerated or only specific ones?
> 2. If the query changes, are all the classes regenerated?
> 3. If the process keeps running, will the regenerated classes cause OOM?
> If so, is there any way to avoid this?
> 4. Is there a way to remove the generated classes at runtime?
> 5. Is there any way in Calcite to avoid generating the classes if the data
> or query changes while the process is running?
> I tried one of the csv example tests at added the following sql line twice
> (as shown in the snippet below) (
> https://github.com/apache/calcite/blob/master/example/csv/src/test/java/org/apache/calcite/test/CsvTest.java#L351)
> and it does appear to generate some classes twice but please feel free to
> correct me if I am mistaken.
> 
>   @Test void testFilterableWhereTwice() throws SQLException {
>     final String sql =
>         "select empno, gender, name from EMPS where name = 'John'";
>     sql("filterable-model", sql)
>         .returns("EMPNO=110; GENDER=M; NAME=John").ok();
>     sql("filterable-model", sql)
>         .returns("EMPNO=110; GENDER=M; NAME=John").ok();
>   }
> 
>
> If there is any documentation, example, or advice, on how code generation
> works, is there a way to avoid it, please let me know.
> Thanks


Re: [DISCUSS] SBOM (Software Bill of Materials)

2021-12-30 Thread Scott Reynolds
When I was dealing with Log4j, I discovered uzaygezen-core is pretty old
and pulls in log4j1.x:
https://mvnrepository.com/artifact/com.google.uzaygezen/uzaygezen-core/0.2

But it doesn't actually use log4j (
https://github.com/aioaneid/uzaygezen/search?q=log4j). For our project, I
excluded log4j1.x from this dependency (
https://github.com/twilio/calcite-kudu/pull/48/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R142-R156).


This is an example of a dependency we could replace with our own
implementation.

On Thu, Dec 30, 2021 at 3:55 PM Julian Hyde  wrote:

> Regarding dependencies. Here are the runtime dependencies from
> core/build.gradle.kts (ignoring test and annotation libraries):
>
>  * api("com.esri.geometry:esri-geometry-api")
>  * api("com.fasterxml.jackson.core:jackson-annotations")
>  * api("com.google.guava:guava")
>  * api("org.apache.calcite.avatica:avatica-core")
>  * api("org.slf4j:slf4j-api")
>  * implementation("com.fasterxml.jackson.core:jackson-core")
>  * implementation("com.fasterxml.jackson.core:jackson-databind")
>  *
> implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-yaml")
>  * implementation("com.google.uzaygezen:uzaygezen-core")
>  * implementation("com.jayway.jsonpath:json-path")
>  * implementation("com.yahoo.datasketches:sketches-core")
>  * implementation("commons-codec:commons-codec")
>  * implementation("net.hydromatic:aggdesigner-algorithm")
>  * implementation("org.apache.commons:commons-dbcp2")
>  * implementation("org.apache.commons:commons-lang3")
>  * implementation("commons-io:commons-io")
>  * implementation("org.codehaus.janino:commons-compiler")
>  * implementation("org.codehaus.janino:janino")
>
> A few libraries are used only for a narrow range of functionality:
>  * esri-geometry and uzaygezen-core are used by geospatial functions;
>  * sketches-core is used by the HLL aggregate functions;
>  * json-path is used by some JSON functions;
>  * jackson-core, jackson-databind, jackson-dataformat-yaml are used to
> load models, and to serialize RelNodes to and from JSON;
>  * commons-lang3, commons-codec, commons-io are probably only used in one
> or two places each;
>  * aggdesigner-algorithm is used for recommending materialized views.
>
> So, the easiest way to reduce dependencies would be to make certain
> classes of SQL functions optional (i.e. move them out of core).
>
> Julian
>
>
>
> > On Dec 29, 2021, at 1:30 PM, Jacques Nadeau  wrote:
> >
> > WRT SBOM (Julian): My general experience is that most large orgs use
> > scanners now (either open or closed) and they will scan whether you have
> a
> > bill of materials or not. I wouldn't worry about adding something
> > additional.
> >
> > WRT too many dependencies (Gunnar): I completely agree with the general
> > feeling of too many (and with Guava, jackson less so). I think the core
> > challenge (no pun intended) is that calcite-core is really a lot of
> > different components. For example, I have frequently wished that parser,
> > planner and enumerable were separate modules. And if they were, I'd guess
> > that each would have a narrower dependency range. I've also wished many
> > times that runtime compilation was an optional addon as opposed to
> > required/coupled in the core...
> >
> > When I've thought about how to dissect in the past, I think the big
> > challenge would be tests, where things are sometimes mixed together.
> > Breaking change possibilities could be at least somewhat mitigated by
> > moving classes but not packages.
> >
> > On Wed, Dec 29, 2021 at 1:51 AM Gunnar Morling
> >  wrote:
> >
> >> Hi,
> >>
> >> In a way, Calcite's build configuration as well as the published POM
> could
> >> be considered as such an SBOM? In particular when looking at the latter
> >> through services like mvnrepository [1], you get quite a good view on
> the
> >> dependency versions, licenses, any potential CVEs, etc. I think this
> should
> >> satisfy most user needs around this? Or are you referring to the notion
> of
> >> Maven BOM POMs specifically [2], i.e. the notion of publishing a POM
> with
> >> all the Calcite component versions which people can then use with
> Maven's
> >> import scope (there should be something comparable for Gradle)? If so,
> that
> >> could be useful for users working with multiple Calcite components,
> though
> >> I think the usability improvement provided by such BOM POM wouldn't be
> >> huge.
> >>
> >> I wanted to bring up a related matter though. Coming to Calcite as a
> user
> >> just recently (loving the possibilities it provides!), I was surprised
> by
> >> the large number of dependencies of the project. It looks like 1.29
> >> improves that a little bit (no more kotlin-stdlib, no more transitive
> >> dependency to log4j 1.x), but the transitive hull of all dependencies of
> >> calcite-core still is quite big. I lack insight about what the different
> >> dependencies are used for; but as an application developer, Guava for
> >> ins

Materialized View and Lattice Suggester help

2021-08-10 Thread Scott Reynolds
Hi Calcite devs,

Today our team deploys a Calcite service that uses Daily OLAP Cube
tables and fine grain Fact tables. Daily OLAP Cube tables are used to
vastly improve latency when querying for a wide date range but come
with a restriction – the date range *must* be at UTC day granularity. When
a request comes into the Cube endpoint, our service picks the best Cube,
creates a Logical Plan for that query and lets Calcite optimize it into
Enumerable. Our Service has matured to the point where we would like to
*remove* the UTC day granularity restriction and would like to do Query
Rewriting and I am writing to you all to figure out the best way to do
that. Calcite supports Materialized Views and Lattices and I haven't
been able to pull this together. Here is a motivating example:

┌
│ SELECT "SIM_ID", SUM("BYTES_DOWNLINK") + SUM(BYTES_UPLINK) as "SUM_BYTES"
│ FROM "kudu"."WirelessEvents"
│ WHERE "ACCOUNT_ID" = 18789 AND ("START_TIME" >= TIMESTAMP '2021-07-31
11:30:01' AND "START_TIME" < TIMESTAMP '2021-09-01 03:00:00')
│ GROUP BY "SIM_ID"
│ ORDER BY "SIM_ID" DESC;
└

This is a query we would like to rewrite to have it use the UTC Daily
OLAP Cube. One possible rewrite of this:

┌
│ SELECT "SIM_ID", SUM("SUM_BYTES_DOWNLINK") + SUM("SUM_BYTES_UPLINK") AS
"SUM_BYTES"
│
│ FROM (
│ /* Cube Table query with all the filters on dimensions */
│ /* First UTC Day and last UTC Day */
│ SELECT "SIM_ID", "SUM_BYTES_DOWNLINK", SUM_BYTES_UPLINK
│ FROM "kudu"."Wireless-Daily-Aggregation"
│ WHERE "ACCOUNT_ID" = 18789 AND ("START_TIME" >= TIMESTAMP '2021-08-01
00:00:00' AND "START_TIME" < TIMESTAMP '2021-08-31 23:59:59')
│
│ UNION ALL
│
│ /* Fact Table query with all the filters on dimensions */
│ /* Aliasing measure fields into their Cube table values */
│ /* Date filter is from startDate to the first UTC day in query */
│ SELECT "SIM_ID", "BYTES_DOWNLINK" AS "SUM_BYTES_DOWNLINK", "BYTES_UPLINK"
AS "SUM_BYTES_UPLINK"
│ FROM "kudu"."WirelessEvents"
│ WHERE "ACCOUNT_ID" = 18789 AND ("START_TIME" >= TIMESTAMP '2021-07-31
11:30:01' AND "START_TIME" < TIMESTAMP '2021-08-01 00:00:00')
│
│ UNION ALL
│
│ /* Fact Table query with all the filters on dimensions  */
│ /* Aliasing measure fields into their Cube table values */
│ /* Date filter is from last UTC day to endDate of the query */
│ SELECT "SIM_ID", "BYTES_DOWNLINK" AS "SUM_BYTES_DOWNLINK", "BYTES_UPLINK"
AS "SUM_BYTES_UPLINK"
│ FROM "kudu"."WirelessEvents"
│ WHERE "ACCOUNT_ID" = 18789 AND ("START_TIME" >= TIMESTAMP '2021-08-31
23:59:59' AND "START_TIME" < TIMESTAMP '2021-09-01 03:00:00')
│ )
│
│ GROUP BY "SIM_ID"
│ ORDER BY "SIM_ID" DESC;
└

In this instance, the Materialized View rule will need to:
1. Create two scans (or one with a Disjunction) on the ranges outside of
   UTC boundary
2. Create `Projection' that changes the names of the Fact columns to
   match the measure names – `BYTES_UPLINK' becomes `SUM_BYTES_UPLINK'

I am overwhelmed by [Materialized View] rules and [Lattice Suggester] and
hoping you all could describe in broad terms how you would approach this
task. What do I need to configure? What do I need to build?

Thanks so much !


[Materialized View]
https://calcite.apache.org/docs/materialized_views.html

[Lattice Suggester] https://calcite.apache.org/docs/lattice.html


Re: [DISCUSS] New RexNode: RexListCmp

2020-07-21 Thread Scott Reynolds
I wanted to reply and share our recent requirement for handling SQL like
the following `error_code IN (3002, 3030)' and the challenges we
faced. For our implementation on top of Apache Kudu, each disjunction
creates a `Scanner' – a resource we need to limit as it represents a
denial of service attack vector (e.g. too many scanners, heap fills
up). The good news for us is that Kudu ships with an [`inListPredicate'] and we
expected the plan to include `SqlKind.IN' as the function, which we
could translate into an `inListPredicate'. We were surprised when it didn't
do that. We did eventually make this work for our customers with the hack
below, but it is not a valid plan – for instance, unparsing the plan
produces an invalid SQL query – and it is therefore brittle (but it *works*
:fingers-crossed:).

┌
│ // This is not the correct use of Array.
│ final RelDataType listType =
│     builder.getTypeFactory().createArrayType(fieldType, -1);
│ return builder.call(SqlStdOperatorTable.IN,
│     builder.field(conditionTableName, columnName),
│     rexBuilder.makeLiteral(resultValue, listType, true));
└
We filed a ticket to do it the correct way, which is to take all the
disjunctions, and "un-parse" them into `inListPredicate' calls *if
possible*. This struck us as pretty dense code *but* would apply to
other disjunctions.
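
Roughly, the correct version would look like the sketch below (assuming the
caller has already verified that every operand of the OR is an equality
between the same INT64 column and a literal; the ColumnSchema lookup and
handling for other column types are elided):

    // collapse OR(=(col, lit1), =(col, lit2), ...) into one Kudu IN-list predicate
    static KuduPredicate toInListPredicate(RexCall disjunction, ColumnSchema column) {
      final List<Long> values = new ArrayList<>();
      for (RexNode operand : disjunction.getOperands()) {
        final RexCall equals = (RexCall) operand;
        final RexLiteral literal = (RexLiteral) equals.getOperands().get(1);
        values.add(literal.getValueAs(Long.class));
      }
      return KuduPredicate.newInListPredicate(column, values);
    }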

It would be *great* if Calcite shipped with a `RexCall' that our
implementation could translate with little effort into an
`inListPredicate'.


[`inListPredicate']
https://kudu.apache.org/apidocs/org/apache/kudu/client/KuduPredicate.html#newInListPredicate-org.apache.kudu.ColumnSchema-java.util.List-

On Mon, Jul 20, 2020 at 3:09 PM Stamatis Zampetakis 
wrote:

> Another quick thought as far as it concerns the IN operator would be to use
> RexCall as it is right now where the first operand in the list is a
> RexInputRef for instance and the rest are the literals.
> I assume that taking this direction would need to change a bit the
> respective SqlOperator.
>
> I haven't thought of this thoroughly so maybe there are important things
> that I am missing.
>
> Best,
> Stamatis
>
>
> On Tue, Jul 21, 2020 at 12:41 AM Julian Hyde  wrote:
>
> > The name isn't very intuitive.
> >
> > The concept of a list and a comparison operator seems OK. As Vladimir
> > points out, it is somewhat similar to RexSubQuery, so maybe this could
> > be a sub-class (but organizing the data a bit more efficiently).
> >
> > I would be very wary of null semantics. RexNode scalar operators are
> > forced to do 3-valued logic, but this is almost a relational operator
> > and it would be better without that burden.
> >
> > Julian
> >
> >
> >
> > On Mon, Jul 20, 2020 at 3:45 AM Vladimir Sitnikov
> >  wrote:
> > >
> > > >Do you know what is the impact on Enumerable implementation?
> > >
> > > I guess there are plenty of options there.
> > >
> > > The key question regarding RexListCmp is as we introduce a new Rex
> node,
> > > all the planning rules and all engines
> > > must support it somehow.
> > >
> > > Technically speaking, we have RexSubQuery.
> > > Haisheng, have you considered an option to stick with RexSubQuery to
> > avoid
> > > having two more-or-less the same rex classes?
> > >
> > > Vladimir
> >
>


Re: RelMetadataQuery.getRowCount stackoverflow

2020-04-20 Thread Scott Reynolds
I have had this happen numerous times when writing new planner rules. Most
of the time my rule is missing some boolean logic to prevent itself from
transforming the call. This results in the rule continuously transforming
its previous transformations.

I can usually see this happening when I add a
System.out.println(RelOptUtil.dumpPlan()) to the line before the
call.transformTo(newRelationNode)
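
A minimal sketch of the guard I mean (the matched rel and the
alreadyTransformed/transform helpers are hypothetical):

    @Override public void onMatch(RelOptRuleCall call) {
      final LogicalFilter filter = call.rel(0);
      if (alreadyTransformed(filter)) {
        // without a check like this the rule keeps firing on its own output
        // and the planner (and its metadata queries) can recurse forever
        return;
      }
      final RelNode newNode = transform(filter);
      System.out.println(RelOptUtil.dumpPlan("before transformTo", newNode,
          SqlExplainFormat.TEXT, SqlExplainLevel.DIGEST_ATTRIBUTES));
      call.transformTo(newNode);
    }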

On Mon, Apr 20, 2020 at 3:13 AM JiaTao Tao  wrote:

> Hi
> Has anyone encountered this problem before? Just a simple query(no more
> than 20 lines, two joins, no union).
>
> And I see this ticket: https://issues.apache.org/jira/browse/CALCITE-2057,
> but there's no follow-up; also I see Flink may hit this problem (
> https://developer.aliyun.com/ask/129548)
>
> java.lang.StackOverflowError
> at java.util.HashMap.hash(HashMap.java:339)
> at java.util.HashMap.put(HashMap.java:612)
> at
> com.google.common.collect.StandardTable.getOrCreate(StandardTable.java:165)
> at com.google.common.collect.StandardTable.put(StandardTable.java:174)
> at com.google.common.collect.HashBasedTable.put(HashBasedTable.java:55)
> at GeneratedMetadataHandler_RowCount.getRowCount(Unknown Source)
> at
> org.apache.calcite.rel.metadata.RelMetadataQuery.getRowCount(RelMetadataQuery.java:208)
> at
> org.apache.calcite.rel.metadata.RelMdRowCount.getRowCount(RelMdRowCount.java:72)
> at GeneratedMetadataHandler_RowCount.getRowCount_$(Unknown Source)
> at GeneratedMetadataHandler_RowCount.getRowCount(Unknown Source)
> at ...
>
> Regards!
>
> Aron Tao
>


Re: "lazy" optmization?

2020-03-07 Thread Scott Reynolds
Can this be achieved with EnumerableBatchNestedLoopJoin? You would need to make
the JdbcFilterRule and its relation handle correlation variables so they can
push the filters down into the JDBC and Elasticsearch RPCs. The
EnumerableBatchNestedLoopJoinRule [1] pushes a filter relation on top of the
right-hand scan, and if JdbcFilterRule translates correlation variables
into conditions in the JDBC query string I think you can make it all
work.

I spent this past week doing something similar though not precisely this
thing. I wrote our own Rule and Relation that calls correlateBatchJoin [3]
directly

[1]
https://github.com/apache/calcite/blob/b80bb1cbceb11ed31b73e419916b5cc98610503e/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableBatchNestedLoopJoinRule.java#L124-L126
[2]
https://github.com/apache/calcite/blob/b80bb1cbceb11ed31b73e419916b5cc98610503e/core/src/main/java/org/apache/calcite/rex/RexCorrelVariable.java
[3]
https://github.com/apache/calcite/blob/b80bb1cbceb11ed31b73e419916b5cc98610503e/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L1492

On Sat, Mar 7, 2020 at 7:11 AM Stamatis Zampetakis 
wrote:

> Hi Yang,
>
> Another term that is used for the optimization that you mention is
> "selective join pushdown" which essentially relies on Bloom/Cuckoo and
> other probabilistic filters. You can check [1] for more details about this
> kind of techniques.
>
> In the example that you outlined between JDBC and Elastic maybe you could
> achieve the same result with a slightly different approach by using a
> correlated join. If the scan + filter on Elastic does not bring back many
> results then you could use this results to probe the JDBC datasource. For
> more details check the discussion in [2], I think it refers to the same
> problem.
>
> Best,
> Stamatis
>
> [1] http://www.vldb.org/pvldb/vol12/p502-lang.pdf
> [2]
>
> https://lists.apache.org/thread.html/d9f95683e66009872a53e7e617295158b98746b550d2bf68230b3096%40%3Cdev.calcite.apache.org%3E
>
> On Sat, Mar 7, 2020 at 4:16 AM Yang Liu  wrote:
>
> > Thanks all!
> >
> > @Julian is the “split processing into phases” you are referring to like
> > this?
> >
> > with t1 as (select * from es_table where xxx limit xxx);
> > select * from t2 join t1 on (t2.key = t1.key) where t2.key in (select key
> > from t1)
> >
> > which means the SQL writer need to adapt to this specific form of SQL for
> > better performance? And Calcite will cache the t1 right?
> >
> > Or, maybe I can implement a RelRunner or EnumerableHashJoin myself to
> have
> > the specific rule: the query result of right table can be used as filters
> > for the left table?
> >
> > Thanks!
> >
> >
> > Julian Hyde  于2020年3月7日周六 上午1:48写道:
> >
> > > Runtime optimization is always necessary, because you just don’t have
> the
> > > stats until you run the query. The best DB algorithms are adaptive, and
> > > therefore hard to write. The adaptations require a lot of tricky
> support
> > > from the runtime - e.g. propagating bloom filters against the flow of
> > data.
> > >
> > > Calcite can still help a little.
> > >
> > > One runtime optimization is where you split processing into phases.
> Only
> > > optimize the first part of your query. Build temp tables, analyze them,
> > and
> > > use those stats to optimize the second part of your query.
> > >
> > > Another technique is to gather stats when as you run the query today,
> so
> > > that when you run it tomorrow Calcite can do a better job.
> > >
> > > Julian
> > >
> > >
> > > > On Mar 6, 2020, at 5:52 AM, Danny Chan  wrote:
> > > >
> > > > Sorry to tell that Calcite runtime does not support this, the
> "dynamic
> > > > partition pruning" or "runtime filter" called in Impala, would build
> a
> > > > bloom filter for the join keys for the build side table and push it
> > down
> > > to
> > > > the probe table source, thus, in some cases, it can reduce the data.
> > > >
> > > > Yang Liu  于2020年3月6日周五 下午6:54写道:
> > > >
> > > >> discussed with one of our user groups, in Spark 3.0, this is called
> > > >> "dynamic
> > > >> partition pruning"
> > > >>
> > > >> Yang Liu  于2020年3月6日周五 下午6:12写道:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> I am wondering if Calcite will support "lazy optimization"
> (execution
> > > >> time
> > > >>> optimization / runtime optimization).
> > > >>>
> > > >>> For example, we want to do an inner join between an Elasticsearch
> > table
> > > >>> and a MySQL table, like this:
> > > >>>
> > > >>> WITH logic_table_2 AS
> > > >>>  (SELECT _MAP['status'] AS "status",
> > > >>>  _MAP['user'] AS "user"
> > > >>>   FROM "es"."insight-by-sql-v3"
> > > >>>   LIMIT 12345)
> > > >>> SELECT *
> > > >>> FROM "insight_user"."user_tab" AS t1
> > > >>> JOIN logic_table_2 AS t2 ON t1."email" = t2."user"
> > > >>> WHERE t2."status" = 'fail'
> > > >>> LIMIT 10
> > > >>>
> > > >>> t2 is a ES table and t1 is a MySQL table, and it may generate a
> > > execution
> > > >>> plan like this:
> > > >>>
> > > >>> EnumerableProjec

Re: Why can't RexLiteral.getValue2() get a double/float value?

2020-03-05 Thread Scott Reynolds
Use a switch statement on literal.getTypeName(). Then, for each SqlTypeName,
call literal.getValueAs() with the appropriate Java class type.

Been meaning to update the Cassandra Adapter for this as well:
https://github.com/apache/calcite/blob/master/cassandra/src/main/java/org/apache/calcite/adapter/cassandra/CassandraFilter.java#L182
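
Something along these lines (a minimal sketch; extend the cases to whichever
SqlTypeNames your adapter supports):

    static Object literalValue(RexLiteral literal) {
      switch (literal.getTypeName()) {
      case DOUBLE:
      case FLOAT:
      case DECIMAL:
        return literal.getValueAs(Double.class);
      case BIGINT:
      case INTEGER:
      case SMALLINT:
      case TINYINT:
        return literal.getValueAs(Long.class);
      case CHAR:
      case VARCHAR:
        return literal.getValueAs(String.class);
      case BOOLEAN:
        return literal.getValueAs(Boolean.class);
      default:
        return literal.getValue2();
      }
    }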

On Thu, Mar 5, 2020 at 7:41 AM Xiangwei Wei  wrote:

> Maybe I don't express myself clearly. If I input a literal 1.23, I just
> want to get 1.23 whatever its type is, but the getValue2() returns 123 AS
> long. Is the only way to judge its type before and invoke
> literal.getValueAs(Double.class)?
>
> Danny Chan  于2020年3月5日周四 下午10:20写道:
>
> > A literal actually does not have an exact data type; 1.23 can be both float
> > or decimal, and in order not to lose precision we represent it as decimal
> > internally.
> >
> > Xiangwei Wei 于2020年3月5日 周四下午5:09写道:
> >
> > > Thank you for reply. What I want is a getValue() method which returns
> > > exactly the value of literal based on its type. You are right that I
> can
> > > use literal.getValueAs(Double.class) to get a Double value, but maybe
> > it's
> > > better to provide a method like what I say. What do you think?
> > >
> > > Julian Hyde  于2020年3月5日周四 下午2:20写道:
> > >
> > > > As the java doc says, RexLiteral.getValue2 returns values in the form
> > > that
> > > > the calculator builder needs them. That may not be the form that you
> > need
> > > > them. In which case, don’t use that method.
> > > >
> > > > If you want a Double, have you tried
> literal.getValueAs(Double.class)?
> > > >
> > > > (We don’t tend to use Float and Double much because when we’re
> > compiling
> > > > queries we can’t afford any loss of precision. If the user typed 1.1
> we
> > > > want exactly 1.1, whereas Decimal might be something like
> > > 1.0987.)
> > > >
> > > > > On Mar 4, 2020, at 9:43 PM, Xiangwei Wei 
> > > wrote:
> > > > >
> > > > > In RexLiteral.getValue2(), it treats Decimal just by `return
> > > > > getValueAs(Long.class);`.
> > > > > In this instance, we can't get a correct double/float value because
> > > it's
> > > > > considered as Decimal type here. Is this a problem or is there a
> > reason
> > > > > here?
> > > > >
> > > > > --
> > > > > Best,
> > > > > Xiangwei Wei
> > > >
> > > >
> > >
> > > --
> > > Best,
> > > Xiangwei Wei
> > >
> >
>
>
> --
> Best,
> Xiangwei Wei
>


Increasing number of concurrent queries

2020-03-04 Thread Scott Reynolds
List,

Our team's current application runs a thread pool for handling queries.
Each time a query comes in, it is placed in the run queue and executed
outside of the HTTP RPC thread. This allows us to handle N concurrent
queries where N is the size of the thread pool. I would like to change this
to an N:M system allowing our service to handle more concurrent queries than
the thread pool size. Our TableScan uses Netty RPCs and queues to fetch
results from the remote database, giving us the opportunity to park the
current query while awaiting RPC responses.

Has anyone been able to achieve this? Is the best way to create our own Convention?

Our current service doesn't use JDBC and therefore has direct access to
the compiled Enumerable. One solution I haven't been able to get working is
using InterruptedException (or something similar -- Kafka created a
WakeupException), which would allow a sub-enumerable the ability to signal
upward that it needs to save its state and yield.

Is this something that could be added to EnumerableDefaults? If not, would
it be an InterruptableEnumerable convention instead?


Question about SortJoinTransposeRule and Inner Joins

2019-11-25 Thread Scott Reynolds
The performance of our queries is dependent on our ability to push the
filter and sort into the RPC layer. Today the planner's
SortJoinTransposeRule
pushes the sort through the join for LEFT OUTER and RIGHT OUTER joins. The
logic comment states the following:
// 1) If join is not a left or right outer, we bail out
// 2) If sort is not a trivial order-by, and if there is
// any sort column that is not part of the input where the
// sort is pushed, we bail out
// 3) If sort has an offset, and if the non-preserved side
// of the join is not count-preserving against the join
// condition, we bail out

I am wondering if we can actually push the sort down through the INNER JOIN
if all the sort conditions are on one side of the join.

SELECT b.title, b.published_date, b.sales
FROM Book b
INNER JOIN Author a ON b.author = a.id
ORDER BY b.published_date, b.sales, b.title


[jira] [Created] (CALCITE-3508) Strengthen JOINs when Filter enforces the nullable side(s) to non-nullable

2019-11-15 Thread Scott Reynolds (Jira)
Scott Reynolds created CALCITE-3508:
---

 Summary: Strengthen JOINs when Filter enforces the nullable 
side(s) to non-nullable
 Key: CALCITE-3508
 URL: https://issues.apache.org/jira/browse/CALCITE-3508
 Project: Calcite
  Issue Type: Improvement
Reporter: Scott Reynolds


Today, given an Outer Join, FilterJoinRule strengthens it to an Inner Join
when the nullable side has an IS_NOT_NULL filter. Below is the code.
{code:java}
for (RexNode filter : aboveFilters) {
  if (joinType.generatesNullsOnLeft()
  && Strong.isNotTrue(filter, leftBitmap)) {
joinType = joinType.cancelNullsOnLeft();
  }
  if (joinType.generatesNullsOnRight()
  && Strong.isNotTrue(filter, rightBitmap)) {
joinType = joinType.cancelNullsOnRight();
  }
  if (!joinType.isOuterJoin()) {
break;
  }
}
{code}
This code looks at each filter to determine whether it can ever be true when
the columns on the null-generating side are null; if not, it alters the join
type by cancelling the nulls on that side.

We can see this in the following test, RelOptRulesTest#testStrengthenJoinType, which
executes the following SQL that transforms from a LEFT OUTER JOIN to an INNER 
JOIN
{code:sql}
select *
from dept left join emp on dept.deptno = emp.deptno
where emp.deptno is not null and emp.sal > 100
{code}
This ticket is about broadening the application of this rule to SQL like the
following:
{code:sql}
select *
from dept left join emp on dept.deptno = emp.deptno
where emp.sal > 100
{code}
 This originally came up on the mailing list: 
[https://mail-archives.apache.org/mod_mbox/calcite-dev/201909.mbox/browser]

and in that thread it was pointed out that there are filters that prevent this 
from being applied:
{code:sql}
SELECT b.title
FROM Book b
LEFT JOIN Author a ON b.author = a.id
WHERE a.name <> 'Victor'
{code}
This means we need to ensure that the OUTER JOIN doesn't contain – for lack
of a different term – negation filters. If there is a negation – like NOT_EQUAL 
– the JOIN cannot be strengthened.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Query Compilation happening more often than expected

2019-09-18 Thread Scott Reynolds
Hi all,

I have submitted my pull request on GitHub and wrote up a design document
as a comment on CALCITE-963.

My implementation deviates considerably from what is being discussed in
this thread, so I rushed it out so you all can let me know what I am missing.
Here is a excerpt from the design document:

Goal

When a query is issued to Calcite it is parsed, optimized, and then a String
of a Java class that implements Bindable is generated.
EnumerableInterpretable creates this String and checks to see if it already
exists in a com.google.common.cache cache; if it doesn't, it calls
into a Java compiler. The compilation process can take a considerable amount of
time; Apache Kylin reported 50 to 150 ms of additional computation time.
Today, Apache Calcite generates a unique Java class String whenever any
part of the query changes. This document details the design and
implementation of a hoisting technique within Apache Calcite. This design
and implementation greatly increases the cache hit rate of
EnumerableInterpretable's BINDABLE_CACHE.
Non Goals

This implementation is not designed to change the planning process. It does
not transform RexLiteral into RexDynamicParam, and doesn't change the cost
calculation of the query.

Please take a moment to read the design document let me know what you think.

On Tue, Sep 17, 2019 at 12:27 PM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Stamatis>Out of curiosity does anybody know if popular DBMS (Postgres)
> support "hoisting"?
>
> PostgreSQL does support it, and here's a reproducible case when that
> feature makes the query 300 times slower:
> https://gist.github.com/vlsi/df08cbef370b2e86a5c1
>
> Vladimir
>


Query Compilation happening more often than expected

2019-09-13 Thread Scott Reynolds
Hi,

Spent a bunch of time researching and staring at code today to understand
the code compilation path within Calcite. I started down this path because
we noticed whenever we changed the `startDate` or `endDate` for the query
it went through the compilation process again. We expected it to use the
previously compiled class and `bind` it with the new RexLiterals. I was *hoping* the
RexLiterals were passed into the `bind()` method but that does not appear
to be the main goal of `DataContext` objects.

We also found the trick Kylin did to improve their query compilation with
prepared statements:
https://issues.apache.org/jira/browse/KYLIN-3434 but PreparedStatement is
stateful and I don't believe it is a good way to solve this issue.

I would like to propose a change to Calcite so that Filters are passed into
the `bind()` call alongside or within DataContext. This would allow the
`EnumerableRel` implementations to reference the `Filters` as arguments.
This -- I believe -- would cause any change to the filters to use
the previously compiled class instead of generating a brand new one.
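
Conceptually, instead of baking the literal into the generated code, the
generated Bindable would read the value through the DataContext at bind time,
something like the following (purely illustrative; the "?0" name here just
stands in for however the filter value would be keyed):

    // today the generated code bakes the literal in, e.g.
    //   ... current.startDate >= 1568332800000L ...
    // with the proposed change it would look the value up at bind time instead:
    final long startDateFilter = (Long) root.get("?0");
    //   ... current.startDate >= startDateFilter ...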

I am emailing everyone on this list for two reasons:
1. Is this a bad idea?
2. I don't have a design yet, so I would love any ideas. Should we stick more
stuff into `DataContext`? Should `EnumerableRel` have another method that
is used to gather these RexLiterals?


percentile_cont function -- can't get my query to parse

2019-09-13 Thread Scott Reynolds
Sorry for such a super lame message; I spent way too much time trying to get
it to work. This query throws a parse error:

select dataset,
percentile_cont (0.25) WITHIN GROUP (order by rpc_time DESC)
OVER(PARTITION BY dataset) as percentile_25 from QueryLogs;

Fails with:
Encountered "percentile_cont" at line 1, column 17.

What am I missing? https://calcite.apache.org/docs/reference.html lists the
function as hypothetical? It is listed in the parser template:
https://github.com/apache/calcite/blob/705be2771f284533a998e941c9dc3b285f1d7351/core/src/main/codegen/templates/Parser.jj#L6839

Thanks!