Re: Using Calcite with Python

2022-01-31 Thread Gavin Ray
I have nothing of value to add, but:

> [5] https://github.com/oap-project/gazelle-jni/tree/velox_dev

Hot damn this is neat

On Mon, Jan 31, 2022 at 7:58 PM Jacques Nadeau  wrote:

> A couple of related (possibly useful?) pointers here:
>
>- Dask-sql [1] uses Calcite in a python context. Might be some good
>stuff to leverage there.
>- I'm working on compiling Calcite as a GraalVM shared native library
>[2] as part of Substrait [3] with the goal of ultimately having a
> friendly
>C binding [4] for use in non-jvm worlds. This connects to work being
> done
>by others to support tools like Arrow and Velox [5] as Substrait targets
>(and thus completing the path from c interface to native execution via
>Calcite).
>
>
> [1] https://github.com/dask-contrib/dask-sql
> [2] https://issues.apache.org/jira/browse/CALCITE-4786
> [3] https://github.com/substrait-io/substrait/pull/120
> [4] https://github.com/jacques-n/substrait/pull/3
> [5] https://github.com/oap-project/gazelle-jni/tree/velox_dev
>
> On Mon, Jan 31, 2022 at 3:32 PM Nicola Vitucci 
> wrote:
>
> > Hi Eugen, Michael, Gavin,
> >
> > Thank you very much for your input. Answering to your suggestions:
> >
> > - Phoenix client: I saw it but decided not to use it because it does not
> > seem very active and up to date (its Avatica version is 1.10, while
> latest
> > is 1.20). I may still give it a try though.
> > - Arrow Flight: I think it can be very useful especially, like Michael
> > mentioned, if it were integrated with Avatica as a transport; at the
> > moment, though, it is not.
> >
> > I am basically looking for a (relatively) easy and ready to implement,
> easy
> > to keep up to date, and reasonably performant solution. Although it
> incurs
> > some overhead, a solution based on Python + Java seems to me the most
> > reasonable for the time being. Do you have any other suggestions or
> > recommendations?
> >
> > Thanks again,
> >
> > Nicola
> >
> >
> >
> > On Mon, Jan 31, 2022 at 5:04 PM Michael Mior 
> > wrote:
> >
> > > Flight is definitely another consideration for the future. Personally I
> > > think it would be most interesting to integrate Flight with Avatica as
> an
> > > alternative transport. But it would certainly also be useful to allow
> the
> > > Arrow adapter to connect to any Flight endpoint.
> > >
> > > --
> > > Michael Mior
> > > mm...@apache.org
> > >
> > >
> > > Le lun. 31 janv. 2022 à 10:00, Gavin Ray  a
> > écrit :
> > >
> > > > This is really interesting stuff you've done in the example notebooks
> > > >
> > > > Nicola & Michael, I wonder if you could benefit from the
> > > recently-released
> > > > Arrow Flight SQL?
> > > >
> > > >
> > >
> >
> https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/
> > > >
> > > > I have asked Jacques about this a bit -- it's meant to be a
> > > standardization
> > > > for communicating SQL queries and metadata with Arrow.
> > > > I'm not intimately familiar with it, but it seems like it could be a
> > good
> > > > base to build a Calcite backend for Arrow from?
> > > >
> > > > They have a pretty thorough Java example in the repository:
> > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180
> > > >
> > > > On Mon, Jan 31, 2022 at 8:47 AM Michael Mior 
> wrote:
> > > >
> > > > > You may want to keep an eye on CALCITE-2040 (
> > > > > https://issues.apache.org/jira/browse/CALCITE-2040). I have a
> > student
> > > > who
> > > > > is working on a Calcite adapter for Apache Arrow. We're basically
> > hung
> > > up
> > > > > waiting on the Arrow team to release a compatible JAR. This still
> > won't
> > > > > fully solve your problem though as the first version of the adapter
> > is
> > > > only
> > > > > capable of reading from Arrow files. However, the goal is
> eventually
> > to
> > > > > allow passing a memory reference into the adapter so that it would
> be
> > > > > possible to make use of Arrow data which is constructed in-memory
> > > > > elsewhere.
> > > > > --
> > > > > Michael Mior
> > > > > mm...@apache.org
> > > > >
> > > > >
> > > > > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci <
> > > nicola.vitu...@gmail.com>
> > > > a
> > > > > écrit :
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > What would be the best way to use Calcite with Python? I've come
> up
> > > > with
> > > > > > two potential solutions:
> > > > > >
> > > > > > - using the jaydebeapi package, to connect via the JDBC driver
> > > directly
> > > > > > from a JVM created via jpype;
> > > > > > - using Apache Arrow via the pyarrow package, to connect in
> > basically
> > > > the
> > > > > > same way but creating Arrow objects with JdbcToArrowUtils (and
> > > > optionally
> > > > > > converting them to Pandas).
> > > > > >
> > > > > > Although the former is more straightforward, 

Re: Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"

2022-01-31 Thread Gavin Ray
Stamatis, thank you!! Based on your comments about the naming/identity, I
was able to figure out what was happening and fix it =D

In hindsight it should have been obvious. I was iterating over all schemas
in the DB, but I was not creating unique names. So in this test, "DB1" and
"DB2" both have a schema named "PUBLIC".

When I created the JdbcSchema, I was just passing "PUBLIC" as the name. The
solution was to make the name "$databaseName_$schemaName". Now it works!

My "CalciteSchemaManager.addDatabase()" functionality looks like this:

fun addDatabase(dbName: String, ds: DataSource): SchemaPlus {
    val schemas = getSubSchemas(ds.connection)
    if (schemas.isEmpty()) {
        rootSchema.add(dbName, JdbcSchema.create(rootSchema, dbName, ds, null, null))
    } else {
        val dbSchema = rootSchema.add(dbName, EMPTY_SCHEMA)
        for (schemaName in schemas) {
            dbSchema.add(
                schemaName,
                JdbcSchema.create(dbSchema, schemaName, ds, null, schemaName)
            )
        }
    }
    return rootSchema
}
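
A dependency-free sketch of the collision and the fix described above (the
"uniqueName" helper and the sample values are hypothetical, for illustration
only):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Two databases that both contain a schema named "PUBLIC" end up with the
// same name if registered bare; qualifying the name with the database
// (as "$databaseName_$schemaName" above) makes each name unique.
public class SchemaNaming {
    public static String uniqueName(String databaseName, String schemaName) {
        return databaseName + "_" + schemaName;
    }

    public static void main(String[] args) {
        List<String[]> schemas = List.of(
            new String[] {"DB1", "PUBLIC"},
            new String[] {"DB2", "PUBLIC"});

        Set<String> bare = new LinkedHashSet<>();
        Set<String> qualified = new LinkedHashSet<>();
        for (String[] s : schemas) {
            bare.add(s[1]);                        // collides across databases
            qualified.add(uniqueName(s[0], s[1])); // unique per database
        }
        System.out.println(bare);      // [PUBLIC]
        System.out.println(qualified); // [DB1_PUBLIC, DB2_PUBLIC]
    }
}
```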


On Mon, Jan 31, 2022 at 4:59 PM Stamatis Zampetakis 
wrote:

> Hi Gavin,
>
> For each JdbcSchema, there is a JdbcConvention and when the latter is
> registered in the planner it creates and registers a bunch of JdbcRules
> [1]. If we are talking about the same schema then I guess the description
> that is generated for each rule is identical so you end up with this
> exception.
>
> Are you using the same schema? Should you? If the use-case is valid then
> you may have bumped into a small bug and probably you need to figure out a
> way to make the rule description unique.
>
> Best,
> Stamatis
>
> [1]
>
> https://github.com/apache/calcite/blob/8570cf2b293e9acab9525fbe76709d970b9f7104/core/src/main/java/org/apache/calcite/adapter/jdbc/JdbcConvention.java#L65
>
> On Mon, Jan 31, 2022 at 10:56 PM Gavin Ray  wrote:
>
> > Not sure if this is helpful, but enabling TRACE logging shows this:
> >
> > [Test worker] DEBUG org.apache.calcite.sql2rel - Plan after converting
> > SqlNode to RelNode
> > LogicalProject(ID=[$0], USER_ID=[$1], TITLE=[$2], CONTENT=[$3], ID0=[$4],
> > POST_ID=[$5], CONTENT0=[$6])
> >   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
> > JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> > JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
> >
> > [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Subset cost
> > changed: subset [rel#11:RelSubset#0.JDBC.PUBLIC.[]] cost was {inf} now
> > {100.0 rows, 101.0 cpu, 0.0 io}
> > [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Register
> > rel#7:JdbcTableScan.JDBC.PUBLIC.[](table=[db1, PUBLIC, POSTS]) in
> > rel#11:RelSubset#0.JDBC.PUBLIC.[]
> > [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Rule-match
> > queued: rule
> [JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)]
> > rels [#7]
> >
> > java.lang.AssertionError: Rule's description should be unique; existing
> > rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE); new
> > rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
> >
> >
> > On Mon, Jan 31, 2022 at 4:42 PM Gavin Ray  wrote:
> >
> > > I took the "FrameworksTest#executeQuery()" code from the Calcite repo
> > here
> > > and have been using it:
> > >
> > >
> >
> https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477
> > >
> > > Have been getting an error I don't understand.
> > > I have a rootSchema, with two sub-schemas, that each have JdbcSchema's
> > > from hsqldb
> > >
> > >
> > >
> >
> https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78
> > >
> > > If I query just a single schema, it works fine:
> > > =
> > > Query: select * from db1.PUBLIC.POSTS
> > >
> > > -- Logical Plan
> > > LogicalProject(inputs=[0..3])
> > >   JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> > >
> > > -- Mid Plan
> > > LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
> > >   JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1,
> > > PUBLIC, POSTS]])
> > >
> > > -- Best Plan
> > > JdbcToEnumerableConverter
> > >   JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])
> > >
> > > If I add a JOIN though, it fails before reaching the mid-plan:
> > > =
> > > Query:
> > > select * from db1.PUBLIC.POSTS
> > > inner join db2.PUBLIC.COMMENTS
> > > on db1.PUBLIC.POSTS.ID =
> db2.PUBLIC.COMMENTS.POST_ID
> > >
> > > -- Logical Plan
> > > LogicalProject(inputs=[0..6])
> > >   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
> > > JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> > > JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
> > >
> > > java.lang.AssertionError: Rule's description should be unique;
> > > existing
> > 

Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

2022-01-31 Thread Gavin Ray
Jacques, thank you. I appreciate the weigh-in with your experience, and that
link is brilliant.

This is exactly what I am trying to do. I tried to write something like
this, but it felt shaky; I wasn't confident in it.

It took a JDBC "DataSource" and then used metadata to iterate
catalogs/schemas and add them to itself as child "JdbcSchema" instances,
returning them from "getSubSchemaMap()" and creating a wrapping
default/root catalog if necessary.

The code here is a much better version of what I was trying to do, it seems.
I am going to unabashedly steal the overall implementation/design patterns
here =)
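
For reference, the catalog/schema normalization can be modeled without a
live connection. The sketch below stands in for what "DatabaseMetaData"
reports, a list of (catalog, schema) pairs where catalog may be null, and
wraps catalog-less schemas in a fake "root" (class and method names are
hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CatalogNormalizer {
    // Given (catalog, schema) pairs as a JDBC driver might report them
    // (catalog may be null for one-level databases), produce a uniform
    // two-level map, wrapping catalog-less schemas in a fake "root".
    public static Map<String, List<String>> normalize(List<String[]> pairs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] p : pairs) {
            String catalog = (p[0] == null || p[0].isEmpty()) ? "root" : p[0];
            result.computeIfAbsent(catalog, k -> new ArrayList<>()).add(p[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a Postgres-like two-level source and a
        // MySQL-like one-level source (null catalog).
        List<String[]> pairs = List.of(
            new String[] {"pg_db_1", "public"},
            new String[] {null, "mysql_db_1"});
        System.out.println(normalize(pairs));
        // {pg_db_1=[public], root=[mysql_db_1]}
    }
}
```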

On Mon, Jan 31, 2022 at 8:06 PM Jacques Nadeau  wrote:

> If I recall correctly, my experience is you have to code per underlying
> database. I believe there are only a few patterns but I don't think there
> is sufficient odbc/jdbc info to answer behavior reliably (or at least in a
> way that feels correct/native to each database). For example, I believe
> some databases require catalog selection at the connection level and so
> while a catalog concept exists, you have to use different connections for
> each catalog whereas other databases expose catalog within a connection.
>
> This is all quite old thinking. I remember writing some simpler logic for
> this years ago at [1]. Note that the code has long since changed, but I
> wanted to call it out.
>
> [1]
>
> https://github.com/apache/drill/blob/18a1ae4d31cd502d2f792b331fefeb0ed2106c53/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L301
>
>
> On Thu, Jan 27, 2022 at 7:38 PM Gavin Ray  wrote:
>
> > The filesystem hierarchy is a great analogy, I understand it much better
> > now I think -- thank you.
> >
> > This seems like a problem with potentially very brittle solutions. Using
> > your explanation I was able to get it to work,
> > with very terrible logic that says "If catalogs are all null, then it's
> > one-level so make a fake "root" schema to hold it in the rootSchema"
> >
> > But this doesn't feel so good. Not 100% sure what the best thing to do
> here
> > is, but at least you've cleared up what's going on.
> >
> > On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde 
> > wrote:
> >
> > > Let’s not worry about the names, and say that some DBs have two
> namespace
> > > levels and others have just one.
> > >
> > > Calcite’s word for a namespace is ’schema’. Calcite schemas are
> arranged
> > > in a hierarchy, like a filesystem, so there is no preferred depth. Any
> > > schema can contain both tables and (sub)schemas. So you can easily build a
> > > one- or two-level namespace structure, or whatever you want.
> > >
> > > Calcite’s catalog has a single ‘root schema’ (analogous to the root
> > > directory, ‘/‘ in file systems), and you can get to anything else from
> > > there.
> > >
> > > In JDBC parlance, a level 1 namespace is called ‘catalog’, and a
> level
> > 2
> > > namespace is a ’schema’. If a DB has a one-level namespace then catalog
> > > will be null, or the empty string, or something.
> > >
> > > If you’re running an Avatica JDBC server backed by a particular Calcite
> > > root schema, and you want your database to look like a one-level or
> > > two-level database, we probably don’t make it particularly easy.
> > >
> > > Julian
> > >
> > >
> > > > On Jan 27, 2022, at 7:25 AM, Gavin Ray 
> wrote:
> > > >
> > > > My RDBMS experience is nearly exclusively Postgres
> > > > While working on this project, I've made the assumption that the
> > > structure
> > > > of a database is:
> > > >
> > > > Database -> Schema -> Table
> > > >
> > > > It turns out that this isn't accurate. In MySQL for instance,
> "Schema"
> > is
> > > > an alias for "DB".
> > > > From the below StackOverflow answer, it seems like this is all over
> the
> > > > place:
> > > >
> > > > https://stackoverflow.com/a/7944489/13485494
> > > >
> > > > I have a "CalciteSchemaManager" object which has a "rootSchema" to
> > which
> > > > all datasources are attached
> > > > This "rootSchema" is used to generate the GraphQL API and types
> > > >
> > > > It seems like I have two options, and I'm not sure which is a better
> > > design:
> > > >
> > > > 1. Force all datasources to conform to (Database -> Schema -> Table)
> > > >
> > > > This means that adding a new MySQL database, would generate
> ("mysql_db"
> > > ->
> > > > "root" (fake schema) -> "some_table")
> > > > Adding a CSV schema too, would be something like ("csv_datasource" ->
> > > > "root" -> "some_csv_file")
> > > >
> > > > 2. Have an irregular data shape. Datasources can be of arbitrary
> > > sub-schema
> > > > depth.
> > > >
> > > > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > > > Example MySQL:   ("mysql_db_1" -> "user")
> > > > Example CSV: ("some_csv_file") or maybe ("csv_source_1" ->
> > > "some_csv_file")
> > > >
> > > > What do you folks think I ought to do?
> > > > Thank you =)
> > >
> > >
> >
>


Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?

2022-01-31 Thread Jacques Nadeau
If I recall correctly, my experience is that you have to code per underlying
database. I believe there are only a few patterns, but I don't think there
is sufficient ODBC/JDBC info to determine behavior reliably (or at least in
a way that feels correct/native to each database). For example, I believe
some databases require catalog selection at the connection level, and so
while a catalog concept exists, you have to use different connections for
each catalog, whereas other databases expose catalogs within a connection.

This is all quite old thinking. I remember writing some simpler logic for
this years ago at [1]. Note that the code has long since changed, but I
wanted to call it out.

[1]
https://github.com/apache/drill/blob/18a1ae4d31cd502d2f792b331fefeb0ed2106c53/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L301


On Thu, Jan 27, 2022 at 7:38 PM Gavin Ray  wrote:

> The filesystem hierarchy is a great analogy, I understand it much better
> now I think -- thank you.
>
> This seems like a problem with potentially very brittle solutions. Using
> your explanation I was able to get it to work,
> with very terrible logic that says "If catalogs are all null, then it's
> one-level so make a fake "root" schema to hold it in the rootSchema"
>
> But this doesn't feel so good. Not 100% sure what the best thing to do here
> is, but at least you've cleared up what's going on.
>
> On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde 
> wrote:
>
> > Let’s not worry about the names, and say that some DBs have two namespace
> > levels and others have just one.
> >
> > Calcite’s word for a namespace is ’schema’. Calcite schemas are arranged
> > in a hierarchy, like a filesystem, so there is no preferred depth. Any
> > schema can contain both tables and (sub)schemas. So you can easily build a
> > one- or two-level namespace structure, or whatever you want.
> >
> > Calcite’s catalog has a single ‘root schema’ (analogous to the root
> > directory, ‘/‘ in file systems), and you can get to anything else from
> > there.
> >
> > In JDBC parlance, a level 1 namespace is called ‘catalog’, and a level
> 2
> > namespace is a ’schema’. If a DB has a one-level namespace then catalog
> > will be null, or the empty string, or something.
> >
> > If you’re running an Avatica JDBC server backed by a particular Calcite
> > root schema, and you want your database to look like a one-level or
> > two-level database, we probably don’t make it particularly easy.
> >
> > Julian
> >
> >
> > > On Jan 27, 2022, at 7:25 AM, Gavin Ray  wrote:
> > >
> > > My RDBMS experience is nearly exclusively Postgres
> > > While working on this project, I've made the assumption that the
> > structure
> > > of a database is:
> > >
> > > Database -> Schema -> Table
> > >
> > > It turns out that this isn't accurate. In MySQL for instance, "Schema"
> is
> > > an alias for "DB".
> > > From the below StackOverflow answer, it seems like this is all over the
> > > place:
> > >
> > > https://stackoverflow.com/a/7944489/13485494
> > >
> > > I have a "CalciteSchemaManager" object which has a "rootSchema" to
> which
> > > all datasources are attached
> > > This "rootSchema" is used to generate the GraphQL API and types
> > >
> > > It seems like I have two options, and I'm not sure which is a better
> > design:
> > >
> > > 1. Force all datasources to conform to (Database -> Schema -> Table)
> > >
> > > This means that adding a new MySQL database, would generate ("mysql_db"
> > ->
> > > "root" (fake schema) -> "some_table")
> > > Adding a CSV schema too, would be something like ("csv_datasource" ->
> > > "root" -> "some_csv_file")
> > >
> > > 2. Have an irregular data shape. Datasources can be of arbitrary
> > sub-schema
> > > depth.
> > >
> > > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > > Example MySQL:   ("mysql_db_1" -> "user")
> > > Example CSV: ("some_csv_file") or maybe ("csv_source_1" ->
> > "some_csv_file")
> > >
> > > What do you folks think I ought to do?
> > > Thank you =)
> >
> >
>


Re: Using Calcite with Python

2022-01-31 Thread Jacques Nadeau
A couple of related (possibly useful?) pointers here:

   - Dask-sql [1] uses Calcite in a python context. Might be some good
   stuff to leverage there.
   - I'm working on compiling Calcite as a GraalVM shared native library
   [2] as part of Substrait [3] with the goal of ultimately having a friendly
   C binding [4] for use in non-jvm worlds. This connects to work being done
   by others to support tools like Arrow and Velox [5] as Substrait targets
   (and thus completing the path from c interface to native execution via
   Calcite).


[1] https://github.com/dask-contrib/dask-sql
[2] https://issues.apache.org/jira/browse/CALCITE-4786
[3] https://github.com/substrait-io/substrait/pull/120
[4] https://github.com/jacques-n/substrait/pull/3
[5] https://github.com/oap-project/gazelle-jni/tree/velox_dev

On Mon, Jan 31, 2022 at 3:32 PM Nicola Vitucci 
wrote:

> Hi Eugen, Michael, Gavin,
>
> Thank you very much for your input. Answering to your suggestions:
>
> - Phoenix client: I saw it but decided not to use it because it does not
> seem very active and up to date (its Avatica version is 1.10, while latest
> is 1.20). I may still give it a try though.
> - Arrow Flight: I think it can be very useful especially, like Michael
> mentioned, if it were integrated with Avatica as a transport; at the
> moment, though, it is not.
>
> I am basically looking for a (relatively) easy and ready to implement, easy
> to keep up to date, and reasonably performant solution. Although it incurs
> some overhead, a solution based on Python + Java seems to me the most
> reasonable for the time being. Do you have any other suggestions or
> recommendations?
>
> Thanks again,
>
> Nicola
>
>
>
> On Mon, Jan 31, 2022 at 5:04 PM Michael Mior 
> wrote:
>
> > Flight is definitely another consideration for the future. Personally I
> > think it would be most interesting to integrate Flight with Avatica as an
> > alternative transport. But it would certainly also be useful to allow the
> > Arrow adapter to connect to any Flight endpoint.
> >
> > --
> > Michael Mior
> > mm...@apache.org
> >
> >
> > On Mon, Jan 31, 2022 at 10:00 AM Gavin Ray  wrote:
> >
> > > This is really interesting stuff you've done in the example notebooks
> > >
> > > Nicola & Michael, I wonder if you could benefit from the
> > recently-released
> > > Arrow Flight SQL?
> > >
> > >
> >
> https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/
> > >
> > > I have asked Jacques about this a bit -- it's meant to be a
> > standardization
> > > for communicating SQL queries and metadata with Arrow.
> > > I'm not intimately familiar with it, but it seems like it could be a
> good
> > > base to build a Calcite backend for Arrow from?
> > >
> > > They have a pretty thorough Java example in the repository:
> > >
> > >
> >
> https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180
> > >
> > > On Mon, Jan 31, 2022 at 8:47 AM Michael Mior  wrote:
> > >
> > > > You may want to keep an eye on CALCITE-2040 (
> > > > https://issues.apache.org/jira/browse/CALCITE-2040). I have a
> student
> > > who
> > > > is working on a Calcite adapter for Apache Arrow. We're basically
> hung
> > up
> > > > waiting on the Arrow team to release a compatible JAR. This still
> won't
> > > > fully solve your problem though as the first version of the adapter
> is
> > > only
> > > > capable of reading from Arrow files. However, the goal is eventually
> to
> > > > allow passing a memory reference into the adapter so that it would be
> > > > possible to make use of Arrow data which is constructed in-memory
> > > > elsewhere.
> > > > --
> > > > Michael Mior
> > > > mm...@apache.org
> > > >
> > > >
> > > > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci <
> > nicola.vitu...@gmail.com>
> > > a
> > > > écrit :
> > > >
> > > > > Hi all,
> > > > >
> > > > > What would be the best way to use Calcite with Python? I've come up
> > > with
> > > > > two potential solutions:
> > > > >
> > > > > - using the jaydebeapi package, to connect via the JDBC driver
> > directly
> > > > > from a JVM created via jpype;
> > > > > - using Apache Arrow via the pyarrow package, to connect in
> basically
> > > the
> > > > > same way but creating Arrow objects with JdbcToArrowUtils (and
> > > optionally
> > > > > converting them to Pandas).
> > > > >
> > > > > Although the former is more straightforward, the latter allows to
> > > achieve
> > > > > better performance (see [1] for instance) since it's exactly what
> > Arrow
> > > > is
> > > > > for. I've created two Jupyter notebooks [2] showing each solution.
> > What
> > > > > would you recommend? Is there an even better approach?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Nicola
> > > > >
> > > > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> > > > > [2]
> > > >
> 

Re: Using Calcite with Python

2022-01-31 Thread Nicola Vitucci
Hi Eugen, Michael, Gavin,

Thank you very much for your input. Answering your suggestions:

- Phoenix client: I saw it but decided not to use it because it does not
seem very active and up to date (its Avatica version is 1.10, while the latest
is 1.20). I may still give it a try though.
- Arrow Flight: I think it can be very useful especially, like Michael
mentioned, if it were integrated with Avatica as a transport; at the
moment, though, it is not.

I am basically looking for a solution that is (relatively) easy to implement,
easy to keep up to date, and reasonably performant. Although it incurs
some overhead, a solution based on Python + Java seems to me the most
reasonable for the time being. Do you have any other suggestions or
recommendations?

Thanks again,

Nicola



On Mon, Jan 31, 2022 at 5:04 PM Michael Mior  wrote:

> Flight is definitely another consideration for the future. Personally I
> think it would be most interesting to integrate Flight with Avatica as an
> alternative transport. But it would certainly also be useful to allow the
> Arrow adapter to connect to any Flight endpoint.
>
> --
> Michael Mior
> mm...@apache.org
>
>
> On Mon, Jan 31, 2022 at 10:00 AM Gavin Ray  wrote:
>
> > This is really interesting stuff you've done in the example notebooks
> >
> > Nicola & Michael, I wonder if you could benefit from the
> recently-released
> > Arrow Flight SQL?
> >
> >
> https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/
> >
> > I have asked Jacques about this a bit -- it's meant to be a
> standardization
> > for communicating SQL queries and metadata with Arrow.
> > I'm not intimately familiar with it, but it seems like it could be a good
> > base to build a Calcite backend for Arrow from?
> >
> > They have a pretty thorough Java example in the repository:
> >
> >
> https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180
> >
> > On Mon, Jan 31, 2022 at 8:47 AM Michael Mior  wrote:
> >
> > > You may want to keep an eye on CALCITE-2040 (
> > > https://issues.apache.org/jira/browse/CALCITE-2040). I have a student
> > who
> > > is working on a Calcite adapter for Apache Arrow. We're basically hung
> up
> > > waiting on the Arrow team to release a compatible JAR. This still won't
> > > fully solve your problem though as the first version of the adapter is
> > only
> > > capable of reading from Arrow files. However, the goal is eventually to
> > > allow passing a memory reference into the adapter so that it would be
> > > possible to make use of Arrow data which is constructed in-memory
> > > elsewhere.
> > > --
> > > Michael Mior
> > > mm...@apache.org
> > >
> > >
> > > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci <
> nicola.vitu...@gmail.com>
> > a
> > > écrit :
> > >
> > > > Hi all,
> > > >
> > > > What would be the best way to use Calcite with Python? I've come up
> > with
> > > > two potential solutions:
> > > >
> > > > - using the jaydebeapi package, to connect via the JDBC driver
> directly
> > > > from a JVM created via jpype;
> > > > - using Apache Arrow via the pyarrow package, to connect in basically
> > the
> > > > same way but creating Arrow objects with JdbcToArrowUtils (and
> > optionally
> > > > converting them to Pandas).
> > > >
> > > > Although the former is more straightforward, the latter allows to
> > achieve
> > > > better performance (see [1] for instance) since it's exactly what
> Arrow
> > > is
> > > > for. I've created two Jupyter notebooks [2] showing each solution.
> What
> > > > would you recommend? Is there an even better approach?
> > > >
> > > > Thanks,
> > > >
> > > > Nicola
> > > >
> > > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> > > > [2]
> > > https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
> > > >
> > >
> >
>


Re: Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"

2022-01-31 Thread Stamatis Zampetakis
Hi Gavin,

For each JdbcSchema, there is a JdbcConvention and when the latter is
registered in the planner it creates and registers a bunch of JdbcRules
[1]. If we are talking about the same schema then I guess the description
that is generated for each rule is identical so you end up with this
exception.

Are you using the same schema? Should you? If the use-case is valid then
you may have bumped into a small bug and probably you need to figure out a
way to make the rule description unique.

Best,
Stamatis

[1]
https://github.com/apache/calcite/blob/8570cf2b293e9acab9525fbe76709d970b9f7104/core/src/main/java/org/apache/calcite/adapter/jdbc/JdbcConvention.java#L65
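
For anyone curious, the check itself is easy to reproduce in isolation. The
sketch below mimics the behavior of "AbstractRelOptPlanner.addRule" with a
hypothetical "RuleRegistry" class (the real descriptions come from the
convention name, e.g.
"JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)"):

```java
import java.util.HashMap;
import java.util.Map;

public class RuleRegistry {
    // Mimics the planner's uniqueness check: adding two rules that share a
    // description throws, which is exactly what happens when two JdbcSchemas
    // named "PUBLIC" each register their own JdbcConvention rules.
    private final Map<String, String> rulesByDescription = new HashMap<>();

    public void addRule(String description) {
        String existing = rulesByDescription.putIfAbsent(description, description);
        if (existing != null) {
            throw new AssertionError("Rule's description should be unique; "
                + "existing rule=" + existing + "; new rule=" + description);
        }
    }

    public static void main(String[] args) {
        RuleRegistry planner = new RuleRegistry();
        // Two schemas both named "PUBLIC" generate identical descriptions.
        planner.addRule("JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)");
        try {
            planner.addRule("JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)");
        } catch (AssertionError e) {
            System.out.println("collision: " + e.getMessage());
        }
        // Qualified schema names avoid the collision.
        planner.addRule("JdbcToEnumerableConverterRule(in:JDBC.DB1_PUBLIC,out:ENUMERABLE)");
        planner.addRule("JdbcToEnumerableConverterRule(in:JDBC.DB2_PUBLIC,out:ENUMERABLE)");
        System.out.println("qualified names registered");
    }
}
```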

On Mon, Jan 31, 2022 at 10:56 PM Gavin Ray  wrote:

> Not sure if this is helpful, but enabling TRACE logging shows this:
>
> [Test worker] DEBUG org.apache.calcite.sql2rel - Plan after converting
> SqlNode to RelNode
> LogicalProject(ID=[$0], USER_ID=[$1], TITLE=[$2], CONTENT=[$3], ID0=[$4],
> POST_ID=[$5], CONTENT0=[$6])
>   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
> JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
>
> [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Subset cost
> changed: subset [rel#11:RelSubset#0.JDBC.PUBLIC.[]] cost was {inf} now
> {100.0 rows, 101.0 cpu, 0.0 io}
> [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Register
> rel#7:JdbcTableScan.JDBC.PUBLIC.[](table=[db1, PUBLIC, POSTS]) in
> rel#11:RelSubset#0.JDBC.PUBLIC.[]
> [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Rule-match
> queued: rule [JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)]
> rels [#7]
>
> java.lang.AssertionError: Rule's description should be unique; existing
> rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE); new
> rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
>
>
> On Mon, Jan 31, 2022 at 4:42 PM Gavin Ray  wrote:
>
> > I took the "FrameworksTest#executeQuery()" code from the Calcite repo
> here
> > and have been using it:
> >
> >
> https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477
> >
> > Have been getting an error I don't understand.
> > I have a rootSchema, with two sub-schemas, that each have JdbcSchema's
> > from hsqldb
> >
> >
> >
> https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78
> >
> > If I query just a single schema, it works fine:
> > =
> > Query: select * from db1.PUBLIC.POSTS
> >
> > -- Logical Plan
> > LogicalProject(inputs=[0..3])
> >   JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> >
> > -- Mid Plan
> > LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
> >   JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1,
> > PUBLIC, POSTS]])
> >
> > -- Best Plan
> > JdbcToEnumerableConverter
> >   JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])
> >
> > If I add a JOIN though, it fails before reaching the mid-plan:
> > =
> > Query:
> > select * from db1.PUBLIC.POSTS
> > inner join db2.PUBLIC.COMMENTS
> > on db1.PUBLIC.POSTS.ID = db2.PUBLIC.COMMENTS.POST_ID
> >
> > -- Logical Plan
> > LogicalProject(inputs=[0..6])
> >   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
> > JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> > JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
> >
> > java.lang.AssertionError: Rule's description should be unique;
> > existing
> rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
> > new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
> > at
> >
> org.apache.calcite.plan.AbstractRelOptPlanner.addRule(AbstractRelOptPlanner.java:163)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.addRule(VolcanoPlanner.java:418)
> > at
> >
> org.apache.calcite.adapter.jdbc.JdbcConvention.register(JdbcConvention.java:66)
> > at
> >
> org.apache.calcite.plan.AbstractRelOptPlanner.registerClass(AbstractRelOptPlanner.java:240)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1365)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
> > at
> >
> org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> > at
> >
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> > at
> >
> 

Re: Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"

2022-01-31 Thread Gavin Ray
Not sure if this is helpful, but enabling TRACE logging shows this:

[Test worker] DEBUG org.apache.calcite.sql2rel - Plan after converting
SqlNode to RelNode
LogicalProject(ID=[$0], USER_ID=[$1], TITLE=[$2], CONTENT=[$3], ID0=[$4],
POST_ID=[$5], CONTENT0=[$6])
  LogicalJoin(condition=[=($0, $5)], joinType=[inner])
JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])

[Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Subset cost
changed: subset [rel#11:RelSubset#0.JDBC.PUBLIC.[]] cost was {inf} now
{100.0 rows, 101.0 cpu, 0.0 io}
[Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Register
rel#7:JdbcTableScan.JDBC.PUBLIC.[](table=[db1, PUBLIC, POSTS]) in
rel#11:RelSubset#0.JDBC.PUBLIC.[]
[Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Rule-match
queued: rule [JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)]
rels [#7]

java.lang.AssertionError: Rule's description should be unique; existing
rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE); new
rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)


On Mon, Jan 31, 2022 at 4:42 PM Gavin Ray  wrote:

> I took the "FrameworksTest#executeQuery()" code from the Calcite repo here
> and have been using it:
>
> https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477
>
> I have been getting an error I don't understand.
> I have a rootSchema with two sub-schemas, each holding a JdbcSchema
> backed by hsqldb.
>
>
> https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78
>
> If I query just a single schema, it works fine:
> =
> Query: select * from db1.PUBLIC.POSTS
>
> -- Logical Plan
> LogicalProject(inputs=[0..3])
>   JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
>
> -- Mid Plan
> LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
>   JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1,
> PUBLIC, POSTS]])
>
> -- Best Plan
> JdbcToEnumerableConverter
>   JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])
>
> If I add a JOIN though, it fails before reaching the mid-plan:
> =
> Query:
> select * from db1.PUBLIC.POSTS
> inner join db2.PUBLIC.COMMENTS
> on db1.PUBLIC.POSTS.ID = db2.PUBLIC.COMMENTS.POST_ID
>
> -- Logical Plan
> LogicalProject(inputs=[0..6])
>   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
> JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
>
> java.lang.AssertionError: Rule's description should be unique;
> existing rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
> new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
> at
> org.apache.calcite.plan.AbstractRelOptPlanner.addRule(AbstractRelOptPlanner.java:163)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.addRule(VolcanoPlanner.java:418)
> at
> org.apache.calcite.adapter.jdbc.JdbcConvention.register(JdbcConvention.java:66)
> at
> org.apache.calcite.plan.AbstractRelOptPlanner.registerClass(AbstractRelOptPlanner.java:240)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1365)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
> at
> org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
> at
> org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.changeTraits(VolcanoPlanner.java:498)
> at CalciteUtils.executeQuery(CalciteUtils.kt:41)
> at ForeignKeyTest.testForeignKey(ForeignKeyTest.kt:82)
>


Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"

2022-01-31 Thread Gavin Ray
I took the "FrameworksTest#executeQuery()" code from the Calcite repo here
and have been using it:
https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477

I have been getting an error I don't understand.
I have a rootSchema with two sub-schemas, each holding a JdbcSchema backed by
hsqldb.

https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78

If I query just a single schema, it works fine:
=
Query: select * from db1.PUBLIC.POSTS

-- Logical Plan
LogicalProject(inputs=[0..3])
  JdbcTableScan(table=[[db1, PUBLIC, POSTS]])

-- Mid Plan
LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
  JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1,
PUBLIC, POSTS]])

-- Best Plan
JdbcToEnumerableConverter
  JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])

If I add a JOIN though, it fails before reaching the mid-plan:
=
Query:
select * from db1.PUBLIC.POSTS
inner join db2.PUBLIC.COMMENTS
on db1.PUBLIC.POSTS.ID = db2.PUBLIC.COMMENTS.POST_ID

-- Logical Plan
LogicalProject(inputs=[0..6])
  LogicalJoin(condition=[=($0, $5)], joinType=[inner])
JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])

java.lang.AssertionError: Rule's description should be unique;
existing rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
at
org.apache.calcite.plan.AbstractRelOptPlanner.addRule(AbstractRelOptPlanner.java:163)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.addRule(VolcanoPlanner.java:418)
at
org.apache.calcite.adapter.jdbc.JdbcConvention.register(JdbcConvention.java:66)
at
org.apache.calcite.plan.AbstractRelOptPlanner.registerClass(AbstractRelOptPlanner.java:240)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1365)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
at
org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
at
org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
at
org.apache.calcite.plan.volcano.VolcanoPlanner.changeTraits(VolcanoPlanner.java:498)
at CalciteUtils.executeQuery(CalciteUtils.kt:41)
at ForeignKeyTest.testForeignKey(ForeignKeyTest.kt:82)
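[Editorial note: the assertion itself is the planner enforcing that every registered rule carries a unique description string. A toy Python model of that invariant follows; the class and method names are invented for illustration and are not Calcite's actual API.]

```python
# Toy model of the planner invariant behind the error above.  The class
# below is invented for illustration; it mirrors the duplicate-description
# check that AbstractRelOptPlanner.addRule performs, not Calcite's code.

class PlannerRuleRegistry:
    def __init__(self):
        self.rules_by_description = {}

    def add_rule(self, description):
        existing = self.rules_by_description.get(description)
        if existing is not None:
            raise AssertionError(
                "Rule's description should be unique; "
                f"existing rule={existing}; new rule={description}")
        self.rules_by_description[description] = description

# Two JdbcSchemas whose sub-schemas are both named PUBLIC end up rendering
# the same converter-rule description, which is the collision in the trace.
registry = PlannerRuleRegistry()
registry.add_rule("JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)")
try:
    registry.add_rule("JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)")
except AssertionError as err:
    print("collision:", err)
```

If that reading is right, one workaround may be to register each JdbcSchema under a convention name that includes the database alias (db1/db2) rather than just the shared sub-schema name PUBLIC, so the two generated rule descriptions no longer collide.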


Re: Using Calcite with Python

2022-01-31 Thread Michael Mior
Flight is definitely another consideration for the future. Personally I
think it would be most interesting to integrate Flight with Avatica as an
alternative transport. But it would certainly also be useful to allow the
Arrow adapter to connect to any Flight endpoint.

--
Michael Mior
mm...@apache.org


Le lun. 31 janv. 2022 à 10:00, Gavin Ray  a écrit :

> This is really interesting stuff you've done in the example notebooks
>
> Nicola & Michael, I wonder if you could benefit from the recently-released
> Arrow Flight SQL?
>
> https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/
>
> I have asked Jacques about this a bit -- it's meant to be a standardization
> for communicating SQL queries and metadata with Arrow.
> I'm not intimately familiar with it, but it seems like it could be a good
> base to build a Calcite backend for Arrow from?
>
> They have a pretty thorough Java example in the repository:
>
> https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180
>
> On Mon, Jan 31, 2022 at 8:47 AM Michael Mior  wrote:
>
> > You may want to keep an eye on CALCITE-2040 (
> > https://issues.apache.org/jira/browse/CALCITE-2040). I have a student
> who
> > is working on a Calcite adapter for Apache Arrow. We're basically hung up
> > waiting on the Arrow team to release a compatible JAR. This still won't
> > fully solve your problem though as the first version of the adapter is
> only
> > capable of reading from Arrow files. However, the goal is eventually to
> > allow passing a memory reference into the adapter so that it would be
> > possible to make use of Arrow data which is constructed in-memory
> > elsewhere.
> > --
> > Michael Mior
> > mm...@apache.org
> >
> >
> > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci 
> a
> > écrit :
> >
> > > Hi all,
> > >
> > > What would be the best way to use Calcite with Python? I've come up
> with
> > > two potential solutions:
> > >
> > > - using the jaydebeapi package, to connect via the JDBC driver directly
> > > from a JVM created via jpype;
> > > - using Apache Arrow via the pyarrow package, to connect in basically
> the
> > > same way but creating Arrow objects with JdbcToArrowUtils (and
> optionally
> > > converting them to Pandas).
> > >
> > > Although the former is more straightforward, the latter can achieve
> > > better performance (see [1] for instance), since this is exactly what
> > > Arrow is for. I've created two Jupyter notebooks [2] showing each solution. What
> > > would you recommend? Is there an even better approach?
> > >
> > > Thanks,
> > >
> > > Nicola
> > >
> > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> > > [2]
> > https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
> > >
> >
>
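[Editorial note: on the performance point quoted above, the pyarrow route wins because data crosses the JVM/Python boundary in columnar batches rather than row-at-a-time JDBC tuples. A dependency-free toy illustration of the two layouts:]

```python
# Row-oriented: one tuple per record, the shape a JDBC cursor yields.
rows = [(1, "alice"), (2, "bob"), (3, "carol")]

def rows_to_columns(rows):
    # Column-oriented: one list per field, the shape Arrow batches use.
    # (On the Java side, JdbcToArrowUtils performs this pivot.)
    return [list(col) for col in zip(*rows)]

columns = rows_to_columns(rows)
assert columns == [[1, 2, 3], ["alice", "bob", "carol"]]

# A per-column aggregate now touches one contiguous list instead of
# hopping across every row tuple; Arrow adds zero-copy transfer and a
# fixed memory layout on top of this basic idea.
assert sum(columns[0]) == 6
```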


Re: Using Calcite with Python

2022-01-31 Thread Gavin Ray
This is really interesting stuff you've done in the example notebooks

Nicola & Michael, I wonder if you could benefit from the recently-released
Arrow Flight SQL?
https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/

I have asked Jacques about this a bit -- it's meant to be a standardization
for communicating SQL queries and metadata with Arrow.
I'm not intimately familiar with it, but it seems like it could be a good
base to build a Calcite backend for Arrow from?

They have a pretty thorough Java example in the repository:
https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180

On Mon, Jan 31, 2022 at 8:47 AM Michael Mior  wrote:

> You may want to keep an eye on CALCITE-2040 (
> https://issues.apache.org/jira/browse/CALCITE-2040). I have a student who
> is working on a Calcite adapter for Apache Arrow. We're basically hung up
> waiting on the Arrow team to release a compatible JAR. This still won't
> fully solve your problem though as the first version of the adapter is only
> capable of reading from Arrow files. However, the goal is eventually to
> allow passing a memory reference into the adapter so that it would be
> possible to make use of Arrow data which is constructed in-memory
> elsewhere.
> --
> Michael Mior
> mm...@apache.org
>
>
> Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci  a
> écrit :
>
> > Hi all,
> >
> > What would be the best way to use Calcite with Python? I've come up with
> > two potential solutions:
> >
> > - using the jaydebeapi package, to connect via the JDBC driver directly
> > from a JVM created via jpype;
> > - using Apache Arrow via the pyarrow package, to connect in basically the
> > same way but creating Arrow objects with JdbcToArrowUtils (and optionally
> > converting them to Pandas).
> >
> > Although the former is more straightforward, the latter can achieve
> > better performance (see [1] for instance), since this is exactly what
> > Arrow is for. I've created two Jupyter notebooks [2] showing each solution. What
> > would you recommend? Is there an even better approach?
> >
> > Thanks,
> >
> > Nicola
> >
> > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> > [2]
> https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
> >
>


Re: Using Calcite with Python

2022-01-31 Thread Michael Mior
You may want to keep an eye on CALCITE-2040 (
https://issues.apache.org/jira/browse/CALCITE-2040). I have a student who
is working on a Calcite adapter for Apache Arrow. We're basically hung up
waiting on the Arrow team to release a compatible JAR. This still won't
fully solve your problem though as the first version of the adapter is only
capable of reading from Arrow files. However, the goal is eventually to
allow passing a memory reference into the adapter so that it would be
possible to make use of Arrow data which is constructed in-memory elsewhere.
--
Michael Mior
mm...@apache.org


Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci  a
écrit :

> Hi all,
>
> What would be the best way to use Calcite with Python? I've come up with
> two potential solutions:
>
> - using the jaydebeapi package, to connect via the JDBC driver directly
> from a JVM created via jpype;
> - using Apache Arrow via the pyarrow package, to connect in basically the
> same way but creating Arrow objects with JdbcToArrowUtils (and optionally
> converting them to Pandas).
>
> Although the former is more straightforward, the latter can achieve
> better performance (see [1] for instance), since this is exactly what Arrow is
> for. I've created two Jupyter notebooks [2] showing each solution. What
> would you recommend? Is there an even better approach?
>
> Thanks,
>
> Nicola
>
> [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> [2] https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
>
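[Editorial note: the first option quoted above (jaydebeapi, which boots a JVM through jpype) reduces to roughly the sketch below. The driver class name and URL scheme are Calcite's documented ones, but the model path and jar location are placeholders, so treat this as an untested outline:]

```python
def calcite_jdbc_url(model_path):
    # Calcite's JDBC URLs point at a JSON model file that declares the schemas.
    return "jdbc:calcite:model=" + model_path

def query_calcite(model_path, sql, driver_jar):
    """Run one query through Calcite's JDBC driver from Python.

    Untested sketch: assumes the jaydebeapi package is installed and
    that driver_jar points at a Calcite jar (plus dependencies) on disk.
    """
    import jaydebeapi  # starts a JVM via jpype on first connect
    conn = jaydebeapi.connect(
        "org.apache.calcite.jdbc.Driver",
        calcite_jdbc_url(model_path),
        jars=driver_jar)
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()
```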


[jira] [Created] (CALCITE-4995) ArrayIndexOutOfBoundsException caused by RelFieldTrimmer on SEMI/ANTI join

2022-01-31 Thread Ruben Q L (Jira)
Ruben Q L created CALCITE-4995:
--

 Summary: ArrayIndexOutOfBoundsException caused by RelFieldTrimmer 
on SEMI/ANTI join
 Key: CALCITE-4995
 URL: https://issues.apache.org/jira/browse/CALCITE-4995
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.29.0
Reporter: Ruben Q L
Assignee: Ruben Q L


(Unit test to be provided)

It seems {{RelFieldTrimmer}} can cause an {{ArrayIndexOutOfBoundsException}} on 
certain plans involving SEMI/ANTI join (i.e. joins that do NOT project the RHS 
fields).
The root cause seems to be the "early return" in
{{RelFieldTrimmer#trimFields(Join join, ImmutableBitSet fieldsUsed,
Set<RelDataTypeField> extraFields)}} when nothing has been trimmed inside the
join's inputs (so the join itself can be returned as-is):
{code:java}
if (changeCount == 0
&& mapping.isIdentity()) {
  return result(join, Mappings.createIdentity(fieldCount));
}
{code}
The problem is that this {{fieldCount}} is the sum of LHS + RHS fields (+
system fields); but in the case of a SEMI/ANTI join the mapping to be returned
must not consider RHS fields (since they are not projected by these join types).

The problem only happens here (when the trimmer does not trim the join). Notice 
that, a few lines below, in the "other return scenario" of the method (when 
something has been trimmed), there is a special treatment of the mapping for 
ANTI/SEMI, so things will work fine in this case:
{code:java}
switch (join.getJoinType()) {
case SEMI:
case ANTI:
  // For SemiJoins and AntiJoins only map fields from the left-side
  if (join.getJoinType() == JoinRelType.SEMI) {
relBuilder.semiJoin(newConditionExpr);
  } else {
relBuilder.antiJoin(newConditionExpr);
  }
  Mapping inputMapping = inputMappings.get(0);
  mapping = Mappings.create(MappingType.INVERSE_SURJECTION,
  join.getRowType().getFieldCount(),
  newSystemFieldCount + inputMapping.getTargetCount());
  for (int i = 0; i < newSystemFieldCount; ++i) {
mapping.set(i, i);
  }
  offset = systemFieldCount;
  newOffset = newSystemFieldCount;
  for (IntPair pair : inputMapping) {
mapping.set(pair.source + offset, pair.target + newOffset);
  }
  break;
default:
  relBuilder.join(join.getJoinType(), newConditionExpr);
}
relBuilder.hints(join.getHints());
return result(relBuilder.build(), mapping);
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
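[Editorial note: the arity mismatch the ticket describes can be sketched in a few lines. The field counts are invented for illustration, and the helper below is a model of the row-type arithmetic, not Calcite code:]

```python
def join_field_count(join_type, lhs_fields, rhs_fields, sys_fields=0):
    """Field count of a join's row type (illustrative model, not Calcite code).

    SEMI and ANTI joins only project the left-hand side, so sizing a
    mapping as lhs + rhs (as the early return effectively does)
    over-counts for those join types.
    """
    if join_type in ("SEMI", "ANTI"):
        return sys_fields + lhs_fields
    return sys_fields + lhs_fields + rhs_fields

# An inner join of a 4-column LHS and a 3-column RHS exposes 7 fields,
# but a SEMI join of the same inputs exposes only 4.
assert join_field_count("INNER", 4, 3) == 7
assert join_field_count("SEMI", 4, 3) == 4

# The early return builds an identity mapping claiming 7 fields, while
# the SEMI join's actual row type has only 4; downstream lookups with
# mismatched arities are the ArrayIndexOutOfBounds scenario.
identity_mapping = list(range(join_field_count("INNER", 4, 3)))
semi_row_type = list(range(join_field_count("SEMI", 4, 3)))
assert len(identity_mapping) > len(semi_row_type)
```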


Re: Using Calcite with Python

2022-01-31 Thread Eugen Stan

Hi Nicola,

It's a question I was asking myself the other day.
I don't know the answer but I do have an exploration direction:
Avatica client.

There is some nice description and diagram here
https://calcite.apache.org/avatica/docs/

There is also a list of clients further down that page.

See 
https://calcite.apache.org/avatica/docs/#apache-phoenix-database-adapter-for-python


Please let me know how it goes and what you find out.
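[Editorial note: the Avatica route would look roughly like this with the phoenixdb client linked above. The server URL is a placeholder, and protocol compatibility between phoenixdb's Avatica version and a given Calcite server is an assumption to verify:]

```python
def avatica_url(host, port=8765):
    # Avatica is an HTTP wire protocol; 8765 is the port its docs commonly use.
    return f"http://{host}:{port}/"

def query_avatica(host, sql):
    """Query an Avatica server from Python via the phoenixdb package.

    Untested sketch: phoenixdb is built against Phoenix's bundled
    Avatica, so compatibility with any particular Calcite server
    version is an open question.
    """
    import phoenixdb
    conn = phoenixdb.connect(avatica_url(host), autocommit=True)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```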


On 31.01.2022 00:35, Nicola Vitucci wrote:

Hi all,

What would be the best way to use Calcite with Python? I've come up with
two potential solutions:

- using the jaydebeapi package, to connect via the JDBC driver directly
from a JVM created via jpype;
- using Apache Arrow via the pyarrow package, to connect in basically the
same way but creating Arrow objects with JdbcToArrowUtils (and optionally
converting them to Pandas).

Although the former is more straightforward, the latter can achieve
better performance (see [1] for instance), since this is exactly what Arrow is
for. I've created two Jupyter notebooks [2] showing each solution. What
would you recommend? Is there an even better approach?

Thanks,

Nicola

[1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
[2] https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python



Regards,
--
Eugen Stan

+40770 941 271 / https://www.netdava.com