Re: Using Calcite with Python
I have nothing of value to add, but:

> [5] https://github.com/oap-project/gazelle-jni/tree/velox_dev

Hot damn this is neat

On Mon, Jan 31, 2022 at 7:58 PM Jacques Nadeau wrote:
> A couple of related (possibly useful?) pointers here:
>
> - Dask-sql [1] uses Calcite in a python context. Might be some good
>   stuff to leverage there.
> - I'm working on compiling Calcite as a GraalVM shared native library
>   [2] as part of Substrait [3] with the goal of ultimately having a
>   friendly C binding [4] for use in non-jvm worlds. This connects to work
>   being done by others to support tools like Arrow and Velox [5] as
>   Substrait targets (and thus completing the path from c interface to
>   native execution via Calcite).
>
> [1] https://github.com/dask-contrib/dask-sql
> [2] https://issues.apache.org/jira/browse/CALCITE-4786
> [3] https://github.com/substrait-io/substrait/pull/120
> [4] https://github.com/jacques-n/substrait/pull/3
> [5] https://github.com/oap-project/gazelle-jni/tree/velox_dev
>
> On Mon, Jan 31, 2022 at 3:32 PM Nicola Vitucci wrote:
>
> > Hi Eugen, Michael, Gavin,
> >
> > Thank you very much for your input. Answering to your suggestions:
> >
> > - Phoenix client: I saw it but decided not to use it because it does not
> >   seem very active and up to date (its Avatica version is 1.10, while
> >   latest is 1.20). I may still give it a try though.
> > - Arrow Flight: I think it can be very useful especially, like Michael
> >   mentioned, if it were integrated with Avatica as a transport; at the
> >   moment, though, it is not.
> >
> > I am basically looking for a (relatively) easy and ready to implement,
> > easy to keep up to date, and reasonably performant solution. Although it
> > incurs some overhead, a solution based on Python + Java seems to me the
> > most reasonable for the time being. Do you have any other suggestions or
> > recommendations?
> > > > Thanks again, > > > > Nicola > > > > > > > > Il giorno lun 31 gen 2022 alle ore 17:04 Michael Mior > > ha > > scritto: > > > > > Flight is definitely another consideration for the future. Personally I > > > think it would be most interesting to integrate Flight with Avatica as > an > > > alternative transport. But it would certainly also be useful to allow > the > > > Arrow adapter to connect to any Flight endpoint. > > > > > > -- > > > Michael Mior > > > mm...@apache.org > > > > > > > > > Le lun. 31 janv. 2022 à 10:00, Gavin Ray a > > écrit : > > > > > > > This is really interesting stuff you've done in the example notebooks > > > > > > > > Nicola & Michael, I wonder if you could benefit from the > > > recently-released > > > > Arrow Flight SQL? > > > > > > > > > > > > > > https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/ > > > > > > > > I have asked Jacques about this a bit -- it's meant to be a > > > standardization > > > > for communicating SQL queries and metadata with Arrow. > > > > I'm not intimately familiar with it, but it seems like it could be a > > good > > > > base to build a Calcite backend for Arrow from? > > > > > > > > They have a pretty thorough Java example in the repository: > > > > > > > > > > > > > > https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180 > > > > > > > > On Mon, Jan 31, 2022 at 8:47 AM Michael Mior > wrote: > > > > > > > > > You may want to keep an eye on CALCITE-2040 ( > > > > > https://issues.apache.org/jira/browse/CALCITE-2040). I have a > > student > > > > who > > > > > is working on a Calcite adapter for Apache Arrow. We're basically > > hung > > > up > > > > > waiting on the Arrow team to release a compatible JAR. 
This still > > won't > > > > > fully solve your problem though as the first version of the adapter > > is > > > > only > > > > > capable of reading from Arrow files. However, the goal is > eventually > > to > > > > > allow passing a memory reference into the adapter so that it would > be > > > > > possible to make use of Arrow data which is constructed in-memory > > > > > elsewhere. > > > > > -- > > > > > Michael Mior > > > > > mm...@apache.org > > > > > > > > > > > > > > > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci < > > > nicola.vitu...@gmail.com> > > > > a > > > > > écrit : > > > > > > > > > > > Hi all, > > > > > > > > > > > > What would be the best way to use Calcite with Python? I've come > up > > > > with > > > > > > two potential solutions: > > > > > > > > > > > > - using the jaydebeapi package, to connect via the JDBC driver > > > directly > > > > > > from a JVM created via jpype; > > > > > > - using Apache Arrow via the pyarrow package, to connect in > > basically > > > > the > > > > > > same way but creating Arrow objects with JdbcToArrowUtils (and > > > > optionally > > > > > > converting them to Pandas). > > > > > > > > > > > > Although the former is more straightforward,
Re: Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"
Stamatis, thank you!! Based on your comments about the naming/identity, I was
able to figure out what was happening and fix it =D In hindsight it should
have been obvious.

I was iterating over all schemas in the DB, but I was not creating unique
names. So in this test, "DB1" and "DB2" both have a schema named "PUBLIC".
When I created the JdbcSchema, I was just passing "PUBLIC" as the name. The
solution was to make the name "${databaseName}_$schemaName". Now it works!

My "CalciteSchemaManager.addDatabase()" functionality looks like this:

fun addDatabase(dbName: String, ds: DataSource): SchemaPlus {
    val schemas = getSubSchemas(ds.connection)
    if (schemas.isEmpty()) {
        rootSchema.add(dbName, JdbcSchema.create(rootSchema, dbName, ds, null, null))
    } else {
        val dbSchema = rootSchema.add(dbName, EMPTY_SCHEMA)
        for (schemaName in schemas) {
            dbSchema.add(
                schemaName,
                JdbcSchema.create(dbSchema, schemaName, ds, null, schemaName)
            )
        }
    }
    return rootSchema
}

On Mon, Jan 31, 2022 at 4:59 PM Stamatis Zampetakis wrote:
> Hi Gavin,
>
> For each JdbcSchema, there is a JdbcConvention and when the latter is
> registered in the planner it creates and registers a bunch of JdbcRules
> [1]. If we are talking about the same schema then I guess the description
> that is generated for each rule is identical so you end up with this
> exception.
>
> Are you using the same schema? Should you? If the use-case is valid then
> you may have bumped into a small bug and probably you need to figure out a
> way to make the rule description unique.
> > Best, > Stamatis > > [1] > > https://github.com/apache/calcite/blob/8570cf2b293e9acab9525fbe76709d970b9f7104/core/src/main/java/org/apache/calcite/adapter/jdbc/JdbcConvention.java#L65 > > On Mon, Jan 31, 2022 at 10:56 PM Gavin Ray wrote: > > > Not sure if this is helpful, but enabling TRACE logging shows this: > > > > [Test worker] DEBUG org.apache.calcite.sql2rel - Plan after converting > > SqlNode to RelNode > > LogicalProject(ID=[$0], USER_ID=[$1], TITLE=[$2], CONTENT=[$3], ID0=[$4], > > POST_ID=[$5], CONTENT0=[$6]) > > LogicalJoin(condition=[=($0, $5)], joinType=[inner]) > > JdbcTableScan(table=[[db1, PUBLIC, POSTS]]) > > JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]]) > > > > [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Subset cost > > changed: subset [rel#11:RelSubset#0.JDBC.PUBLIC.[]] cost was {inf} now > > {100.0 rows, 101.0 cpu, 0.0 io} > > [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Register > > rel#7:JdbcTableScan.JDBC.PUBLIC.[](table=[db1, PUBLIC, POSTS]) in > > rel#11:RelSubset#0.JDBC.PUBLIC.[] > > [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Rule-match > > queued: rule > [JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)] > > rels [#7] > > > > java.lang.AssertionError: Rule's description should be unique; existing > > rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE); new > > rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE) > > > > > > On Mon, Jan 31, 2022 at 4:42 PM Gavin Ray wrote: > > > > > I took the "FrameworksTest#executeQuery()" code from the Calcite repo > > here > > > and have been using it: > > > > > > > > > https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477 > > > > > > Have been getting an error I don't understand. 
> > > I have a rootSchema, with two sub-schemas, that each have JdbcSchema's > > > from hsqldb > > > > > > > > > > > > https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78 > > > > > > If I query just a single schema, it works fine: > > > = > > > Query: select * from db1.PUBLIC.POSTS > > > > > > -- Logical Plan > > > LogicalProject(inputs=[0..3]) > > > JdbcTableScan(table=[[db1, PUBLIC, POSTS]]) > > > > > > -- Mid Plan > > > LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3]) > > > JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1, > > > PUBLIC, POSTS]]) > > > > > > -- Best Plan > > > JdbcToEnumerableConverter > > > JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]]) > > > > > > If I add a JOIN though, it fails before reaching the mid-plan: > > > = > > > Query: > > > select * from db1.PUBLIC.POSTS > > > inner join db2.PUBLIC.COMMENTS > > > on db1.PUBLIC.POSTS.ID = > db2.PUBLIC.COMMENTS.POST_ID > > > > > > -- Logical Plan > > > LogicalProject(inputs=[0..6]) > > > LogicalJoin(condition=[=($0, $5)], joinType=[inner]) > > > JdbcTableScan(table=[[db1, PUBLIC, POSTS]]) > > > JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]]) > > > > > > java.lang.AssertionError: Rule's description should be unique; > > > existing > >
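The collision Stamatis describes, and the naming fix above, can be modeled in a few lines. The sketch below is conceptual Python, not Calcite code: the `PlannerModel` class is invented for illustration, and only the shape of the description string is taken from the error message in this thread.

```python
# Conceptual model of the planner's rule-description uniqueness check.
# NOT Calcite code; the description format mirrors the log output above.

class PlannerModel:
    def __init__(self):
        self.rules = set()

    def add_rule(self, description: str) -> None:
        # Calcite's planner asserts that no two rules share a description.
        if description in self.rules:
            raise AssertionError(
                f"Rule's description should be unique; existing rule={description}"
            )
        self.rules.add(description)

def converter_rule_description(schema_name: str) -> str:
    # Each JdbcSchema's JdbcConvention registers rules whose description
    # embeds the schema name.
    return f"JdbcToEnumerableConverterRule(in:JDBC.{schema_name},out:ENUMERABLE)"

planner = PlannerModel()

# Before the fix: both databases register a schema literally named "PUBLIC",
# so the second registration produces an identical description.
planner.add_rule(converter_rule_description("PUBLIC"))
try:
    planner.add_rule(converter_rule_description("PUBLIC"))  # second JdbcSchema
    collided = False
except AssertionError:
    collided = True

# After the fix: prefixing the database name keeps the descriptions distinct.
fixed = PlannerModel()
fixed.add_rule(converter_rule_description("DB1_PUBLIC"))
fixed.add_rule(converter_rule_description("DB2_PUBLIC"))
```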
Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?
Jacques, thank you. I appreciate the weigh-in with your experience, and that
link is brilliant. This is exactly what I am trying to do.

I tried to write something like this but it felt shaky; I wasn't confident in
it. It took a JDBC "DataSource", used metadata to iterate catalogs/schemas,
and added them to itself as child "JdbcSchema"s, returning them in
"getSubSchemaMap()" and creating a wrapping default/root catalog if
necessary. The code here seems to be a much better version of what I was
trying to do. I am going to unabashedly steal the overall
implementation/design patterns here =)

On Mon, Jan 31, 2022 at 8:06 PM Jacques Nadeau wrote:
> If I recall correctly, my experience is you have to code per underlying
> database. I believe there are only a few patterns but I don't think there
> is sufficient odbc/jdbc info to answer behavior reliably (or at least in a
> way that feels correct/native to each database). For example, I believe
> some databases require catalog selection at the connection level and so
> while a catalog concept exists, you have to use different connections for
> each catalog whereas other databases expose catalog within a connection.
>
> This is all quite old thinking. I remember writing some simpler logic for
> this years ago at [1]. Note that the code has long since changed, but I
> wanted to call it out.
>
> [1]
> https://github.com/apache/drill/blob/18a1ae4d31cd502d2f792b331fefeb0ed2106c53/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L301
>
> On Thu, Jan 27, 2022 at 7:38 PM Gavin Ray wrote:
>
> > The filesystem hierarchy is a great analogy, I understand it much better
> > now I think -- thank you.
> >
> > This seems like a problem with potentially very brittle solutions.
> > Using your explanation I was able to get it to work, with very terrible
> > logic that says "If catalogs are all null, then it's one-level, so make
> > a fake 'root' schema to hold it in the rootSchema".
> >
> > But this doesn't feel so good. Not 100% sure what the best thing to do
> > here is, but at least you've cleared up what's going on.
> >
> > On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde wrote:
> >
> > > Let’s not worry about the names, and say that some DBs have two
> > > namespace levels and others have just one.
> > >
> > > Calcite’s word for a namespace is ’schema’. Calcite schemas are
> > > arranged in a hierarchy, like a filesystem, so there is no preferred
> > > depth. Any schema can contain both tables and (sub)schemas. So you
> > > can easily build a one- or two-level namespace structure, or whatever
> > > you want.
> > >
> > > Calcite’s catalog has a single ‘root schema’ (analogous to the root
> > > directory, ‘/‘ in file systems), and you can get to anything else
> > > from there.
> > >
> > > In JDBC parlance, a level-1 namespace is called ‘catalog’, and a
> > > level-2 namespace is a ’schema’. If a DB has a one-level namespace
> > > then catalog will be null, or the empty string, or something.
> > >
> > > If you’re running an Avatica JDBC server backed by a particular
> > > Calcite root schema, and you want your database to look like a
> > > one-level or two-level database, we probably don’t make it
> > > particularly easy.
> > >
> > > Julian
> > >
> > > > On Jan 27, 2022, at 7:25 AM, Gavin Ray wrote:
> > > >
> > > > My RDBMS experience is nearly exclusively Postgres.
> > > > While working on this project, I've made the assumption that the
> > > > structure of a database is:
> > > >
> > > > Database -> Schema -> Table
> > > >
> > > > It turns out that this isn't accurate. In MySQL, for instance,
> > > > "Schema" is an alias for "DB".
> > > > From the below StackOverflow answer, it seems like this is all over
> > > > the place:
> > > >
> > > > https://stackoverflow.com/a/7944489/13485494
> > > >
> > > > I have a "CalciteSchemaManager" object which has a "rootSchema" to
> > > > which all datasources are attached.
> > > > This "rootSchema" is used to generate the GraphQL API and types.
> > > >
> > > > It seems like I have two options, and I'm not sure which is a
> > > > better design:
> > > >
> > > > 1. Force all datasources to conform to (Database -> Schema -> Table)
> > > >
> > > > This means that adding a new MySQL database would generate
> > > > ("mysql_db" -> "root" (fake schema) -> "some_table").
> > > > Adding a CSV schema too would be something like ("csv_datasource" ->
> > > > "root" -> "some_csv_file").
> > > >
> > > > 2. Have an irregular data shape. Datasources can be of arbitrary
> > > > sub-schema depth.
> > > >
> > > > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > > > Example MySQL: ("mysql_db_1" -> "user")
> > > > Example CSV: ("some_csv_file") or maybe ("csv_source_1" ->
> > > > "some_csv_file")
> > > >
> > > > What do you folks think I ought to do?
> > > > Thank you =)
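Option 1 in the quoted question (forcing every datasource into Database -> Schema -> Table) amounts to a small normalization step. The sketch below is a hypothetical helper in Python, not Calcite API; the placeholder schema name "root" is taken from the discussion, and the None-means-missing convention mirrors how JDBC metadata reports an absent namespace level.

```python
from typing import List, Optional

def normalized_path(datasource: str, schema: Optional[str], table: str) -> List[str]:
    """Force every table to exactly three levels: datasource -> schema -> table.

    A datasource with only one namespace level (schema is None or empty)
    gets a fake "root" schema inserted, per option 1 in the thread.
    """
    fake_schema = "root"
    return [datasource, schema if schema else fake_schema, table]

# Postgres-style two-level namespace: kept as-is.
pg = normalized_path("pg_db_1", "public", "user")
# MySQL-style one-level namespace: gets the fake "root" schema.
my = normalized_path("mysql_db_1", None, "user")
# A bare CSV source is wrapped the same way.
csv = normalized_path("csv_datasource", None, "some_csv_file")
```

The cost of this uniformity, as noted above, is that one-level databases get a schema name their users never typed; option 2 avoids that at the price of irregular depth.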
Re: How to handle the fact that different databases have different ideas of what a "db"/"schema" are?
If I recall correctly, my experience is you have to code per underlying
database. I believe there are only a few patterns but I don't think there is
sufficient odbc/jdbc info to answer behavior reliably (or at least in a way
that feels correct/native to each database). For example, I believe some
databases require catalog selection at the connection level and so while a
catalog concept exists, you have to use different connections for each
catalog whereas other databases expose catalog within a connection.

This is all quite old thinking. I remember writing some simpler logic for
this years ago at [1]. Note that the code has long since changed, but I
wanted to call it out.

[1]
https://github.com/apache/drill/blob/18a1ae4d31cd502d2f792b331fefeb0ed2106c53/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcStoragePlugin.java#L301

On Thu, Jan 27, 2022 at 7:38 PM Gavin Ray wrote:
> The filesystem hierarchy is a great analogy, I understand it much better
> now I think -- thank you.
>
> This seems like a problem with potentially very brittle solutions. Using
> your explanation I was able to get it to work, with very terrible logic
> that says "If catalogs are all null, then it's one-level, so make a fake
> 'root' schema to hold it in the rootSchema".
>
> But this doesn't feel so good. Not 100% sure what the best thing to do
> here is, but at least you've cleared up what's going on.
>
> On Thu, Jan 27, 2022 at 6:41 PM Julian Hyde wrote:
>
> > Let’s not worry about the names, and say that some DBs have two
> > namespace levels and others have just one.
> >
> > Calcite’s word for a namespace is ’schema’. Calcite schemas are arranged
> > in a hierarchy, like a filesystem, so there is no preferred depth. Any
> > schema can contain both tables and (sub)schemas. So you can easily build
> > a one- or two-level namespace structure, or whatever you want.
> > Calcite’s catalog has a single ‘root schema’ (analogous to the root
> > directory, ‘/‘ in file systems), and you can get to anything else from
> > there.
> >
> > In JDBC parlance, a level-1 namespace is called ‘catalog’, and a level-2
> > namespace is a ’schema’. If a DB has a one-level namespace then catalog
> > will be null, or the empty string, or something.
> >
> > If you’re running an Avatica JDBC server backed by a particular Calcite
> > root schema, and you want your database to look like a one-level or
> > two-level database, we probably don’t make it particularly easy.
> >
> > Julian
> >
> >
> > > On Jan 27, 2022, at 7:25 AM, Gavin Ray wrote:
> > >
> > > My RDBMS experience is nearly exclusively Postgres.
> > > While working on this project, I've made the assumption that the
> > > structure of a database is:
> > >
> > > Database -> Schema -> Table
> > >
> > > It turns out that this isn't accurate. In MySQL, for instance,
> > > "Schema" is an alias for "DB".
> > > From the below StackOverflow answer, it seems like this is all over
> > > the place:
> > >
> > > https://stackoverflow.com/a/7944489/13485494
> > >
> > > I have a "CalciteSchemaManager" object which has a "rootSchema" to
> > > which all datasources are attached.
> > > This "rootSchema" is used to generate the GraphQL API and types.
> > >
> > > It seems like I have two options, and I'm not sure which is a better
> > > design:
> > >
> > > 1. Force all datasources to conform to (Database -> Schema -> Table)
> > >
> > > This means that adding a new MySQL database would generate
> > > ("mysql_db" -> "root" (fake schema) -> "some_table").
> > > Adding a CSV schema too would be something like ("csv_datasource" ->
> > > "root" -> "some_csv_file").
> > >
> > > 2. Have an irregular data shape. Datasources can be of arbitrary
> > > sub-schema depth.
> > >
> > > Example Postgres: ("pg_db_1" -> "public" -> "user")
> > > Example MySQL: ("mysql_db_1" -> "user")
> > > Example CSV: ("some_csv_file") or maybe ("csv_source_1" -> "some_csv_file")
> > >
> > > What do you folks think I ought to do?
> > > Thank you =)
Re: Using Calcite with Python
A couple of related (possibly useful?) pointers here:

- Dask-sql [1] uses Calcite in a Python context. Might be some good stuff to
  leverage there.
- I'm working on compiling Calcite as a GraalVM shared native library [2] as
  part of Substrait [3], with the goal of ultimately having a friendly C
  binding [4] for use in non-JVM worlds. This connects to work being done by
  others to support tools like Arrow and Velox [5] as Substrait targets (and
  thus completing the path from C interface to native execution via Calcite).

[1] https://github.com/dask-contrib/dask-sql
[2] https://issues.apache.org/jira/browse/CALCITE-4786
[3] https://github.com/substrait-io/substrait/pull/120
[4] https://github.com/jacques-n/substrait/pull/3
[5] https://github.com/oap-project/gazelle-jni/tree/velox_dev

On Mon, Jan 31, 2022 at 3:32 PM Nicola Vitucci wrote:
> Hi Eugen, Michael, Gavin,
>
> Thank you very much for your input. Answering to your suggestions:
>
> - Phoenix client: I saw it but decided not to use it because it does not
>   seem very active and up to date (its Avatica version is 1.10, while
>   latest is 1.20). I may still give it a try though.
> - Arrow Flight: I think it can be very useful especially, like Michael
>   mentioned, if it were integrated with Avatica as a transport; at the
>   moment, though, it is not.
>
> I am basically looking for a (relatively) easy and ready to implement,
> easy to keep up to date, and reasonably performant solution. Although it
> incurs some overhead, a solution based on Python + Java seems to me the
> most reasonable for the time being. Do you have any other suggestions or
> recommendations?
>
> Thanks again,
>
> Nicola
>
>
> Il giorno lun 31 gen 2022 alle ore 17:04 Michael Mior ha scritto:
>
> > Flight is definitely another consideration for the future. Personally I
> > think it would be most interesting to integrate Flight with Avatica as
> > an alternative transport.
But it would certainly also be useful to allow the > > Arrow adapter to connect to any Flight endpoint. > > > > -- > > Michael Mior > > mm...@apache.org > > > > > > Le lun. 31 janv. 2022 à 10:00, Gavin Ray a > écrit : > > > > > This is really interesting stuff you've done in the example notebooks > > > > > > Nicola & Michael, I wonder if you could benefit from the > > recently-released > > > Arrow Flight SQL? > > > > > > > > > https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/ > > > > > > I have asked Jacques about this a bit -- it's meant to be a > > standardization > > > for communicating SQL queries and metadata with Arrow. > > > I'm not intimately familiar with it, but it seems like it could be a > good > > > base to build a Calcite backend for Arrow from? > > > > > > They have a pretty thorough Java example in the repository: > > > > > > > > > https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180 > > > > > > On Mon, Jan 31, 2022 at 8:47 AM Michael Mior wrote: > > > > > > > You may want to keep an eye on CALCITE-2040 ( > > > > https://issues.apache.org/jira/browse/CALCITE-2040). I have a > student > > > who > > > > is working on a Calcite adapter for Apache Arrow. We're basically > hung > > up > > > > waiting on the Arrow team to release a compatible JAR. This still > won't > > > > fully solve your problem though as the first version of the adapter > is > > > only > > > > capable of reading from Arrow files. However, the goal is eventually > to > > > > allow passing a memory reference into the adapter so that it would be > > > > possible to make use of Arrow data which is constructed in-memory > > > > elsewhere. > > > > -- > > > > Michael Mior > > > > mm...@apache.org > > > > > > > > > > > > Le dim. 30 janv. 
2022 à 17:36, Nicola Vitucci < > > nicola.vitu...@gmail.com> > > > a > > > > écrit : > > > > > > > > > Hi all, > > > > > > > > > > What would be the best way to use Calcite with Python? I've come up > > > with > > > > > two potential solutions: > > > > > > > > > > - using the jaydebeapi package, to connect via the JDBC driver > > directly > > > > > from a JVM created via jpype; > > > > > - using Apache Arrow via the pyarrow package, to connect in > basically > > > the > > > > > same way but creating Arrow objects with JdbcToArrowUtils (and > > > optionally > > > > > converting them to Pandas). > > > > > > > > > > Although the former is more straightforward, the latter allows to > > > achieve > > > > > better performance (see [1] for instance) since it's exactly what > > Arrow > > > > is > > > > > for. I've created two Jupyter notebooks [2] showing each solution. > > What > > > > > would you recommend? Is there an even better approach? > > > > > > > > > > Thanks, > > > > > > > > > > Nicola > > > > > > > > > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html > > > > > [2] > > > > >
Re: Using Calcite with Python
Hi Eugen, Michael, Gavin,

Thank you very much for your input. Answering your suggestions:

- Phoenix client: I saw it but decided not to use it because it does not
  seem very active and up to date (its Avatica version is 1.10, while the
  latest is 1.20). I may still give it a try though.
- Arrow Flight: I think it can be very useful especially, like Michael
  mentioned, if it were integrated with Avatica as a transport; at the
  moment, though, it is not.

I am basically looking for a (relatively) easy to implement, easy to keep up
to date, and reasonably performant solution. Although it incurs some
overhead, a solution based on Python + Java seems to me the most reasonable
for the time being. Do you have any other suggestions or recommendations?

Thanks again,

Nicola


Il giorno lun 31 gen 2022 alle ore 17:04 Michael Mior ha scritto:

> Flight is definitely another consideration for the future. Personally I
> think it would be most interesting to integrate Flight with Avatica as an
> alternative transport. But it would certainly also be useful to allow the
> Arrow adapter to connect to any Flight endpoint.
>
> --
> Michael Mior
> mm...@apache.org
>
>
> Le lun. 31 janv. 2022 à 10:00, Gavin Ray a écrit :
>
> > This is really interesting stuff you've done in the example notebooks.
> >
> > Nicola & Michael, I wonder if you could benefit from the
> > recently-released Arrow Flight SQL?
> >
> > https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/
> >
> > I have asked Jacques about this a bit -- it's meant to be a
> > standardization for communicating SQL queries and metadata with Arrow.
> > I'm not intimately familiar with it, but it seems like it could be a
> > good base to build a Calcite backend for Arrow from?
> > > > They have a pretty thorough Java example in the repository: > > > > > https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180 > > > > On Mon, Jan 31, 2022 at 8:47 AM Michael Mior wrote: > > > > > You may want to keep an eye on CALCITE-2040 ( > > > https://issues.apache.org/jira/browse/CALCITE-2040). I have a student > > who > > > is working on a Calcite adapter for Apache Arrow. We're basically hung > up > > > waiting on the Arrow team to release a compatible JAR. This still won't > > > fully solve your problem though as the first version of the adapter is > > only > > > capable of reading from Arrow files. However, the goal is eventually to > > > allow passing a memory reference into the adapter so that it would be > > > possible to make use of Arrow data which is constructed in-memory > > > elsewhere. > > > -- > > > Michael Mior > > > mm...@apache.org > > > > > > > > > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci < > nicola.vitu...@gmail.com> > > a > > > écrit : > > > > > > > Hi all, > > > > > > > > What would be the best way to use Calcite with Python? I've come up > > with > > > > two potential solutions: > > > > > > > > - using the jaydebeapi package, to connect via the JDBC driver > directly > > > > from a JVM created via jpype; > > > > - using Apache Arrow via the pyarrow package, to connect in basically > > the > > > > same way but creating Arrow objects with JdbcToArrowUtils (and > > optionally > > > > converting them to Pandas). > > > > > > > > Although the former is more straightforward, the latter allows to > > achieve > > > > better performance (see [1] for instance) since it's exactly what > Arrow > > > is > > > > for. I've created two Jupyter notebooks [2] showing each solution. > What > > > > would you recommend? Is there an even better approach? 
> > > >
> > > > Thanks,
> > > >
> > > > Nicola
> > > >
> > > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> > > > [2] https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
Re: Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"
Hi Gavin,

For each JdbcSchema, there is a JdbcConvention, and when the latter is
registered in the planner it creates and registers a bunch of JdbcRules [1].
If we are talking about the same schema then I guess the description that is
generated for each rule is identical, so you end up with this exception.

Are you using the same schema? Should you? If the use-case is valid then you
may have bumped into a small bug, and probably you need to figure out a way
to make the rule description unique.

Best,
Stamatis

[1]
https://github.com/apache/calcite/blob/8570cf2b293e9acab9525fbe76709d970b9f7104/core/src/main/java/org/apache/calcite/adapter/jdbc/JdbcConvention.java#L65

On Mon, Jan 31, 2022 at 10:56 PM Gavin Ray wrote:
> Not sure if this is helpful, but enabling TRACE logging shows this:
>
> [Test worker] DEBUG org.apache.calcite.sql2rel - Plan after converting
> SqlNode to RelNode
> LogicalProject(ID=[$0], USER_ID=[$1], TITLE=[$2], CONTENT=[$3], ID0=[$4],
> POST_ID=[$5], CONTENT0=[$6])
>   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
>     JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
>     JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
>
> [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Subset cost
> changed: subset [rel#11:RelSubset#0.JDBC.PUBLIC.[]] cost was {inf} now
> {100.0 rows, 101.0 cpu, 0.0 io}
> [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Register
> rel#7:JdbcTableScan.JDBC.PUBLIC.[](table=[db1, PUBLIC, POSTS]) in
> rel#11:RelSubset#0.JDBC.PUBLIC.[]
> [Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Rule-match
> queued: rule [JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)]
> rels [#7]
>
> java.lang.AssertionError: Rule's description should be unique; existing
> rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE); new
> rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
>
>
> On Mon, Jan 31, 2022 at 4:42 PM Gavin Ray wrote:
>
> > I took the "FrameworksTest#executeQuery()" code from the Calcite repo
> > here and have been using it:
> >
> > https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477
> >
> > Have been getting an error I don't understand.
> > I have a rootSchema with two sub-schemas that each have JdbcSchemas
> > from hsqldb:
> >
> > https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78
> >
> > If I query just a single schema, it works fine:
> > =
> > Query: select * from db1.PUBLIC.POSTS
> >
> > -- Logical Plan
> > LogicalProject(inputs=[0..3])
> >   JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> >
> > -- Mid Plan
> > LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
> >   JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1,
> > PUBLIC, POSTS]])
> >
> > -- Best Plan
> > JdbcToEnumerableConverter
> >   JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])
> >
> > If I add a JOIN though, it fails before reaching the mid-plan:
> > =
> > Query:
> > select * from db1.PUBLIC.POSTS
> > inner join db2.PUBLIC.COMMENTS
> > on db1.PUBLIC.POSTS.ID = db2.PUBLIC.COMMENTS.POST_ID
> >
> > -- Logical Plan
> > LogicalProject(inputs=[0..6])
> >   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
> >     JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
> >     JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
> >
> > java.lang.AssertionError: Rule's description should be unique;
> > existing rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
> > new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
> > at org.apache.calcite.plan.AbstractRelOptPlanner.addRule(AbstractRelOptPlanner.java:163)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.addRule(VolcanoPlanner.java:418)
> > at org.apache.calcite.adapter.jdbc.JdbcConvention.register(JdbcConvention.java:66)
> > at org.apache.calcite.plan.AbstractRelOptPlanner.registerClass(AbstractRelOptPlanner.java:240)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1365)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
> > at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
> > at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
> > at
Re: Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"
Not sure if this is helpful, but enabling TRACE logging shows this:

[Test worker] DEBUG org.apache.calcite.sql2rel - Plan after converting SqlNode to RelNode
LogicalProject(ID=[$0], USER_ID=[$1], TITLE=[$2], CONTENT=[$3], ID0=[$4], POST_ID=[$5], CONTENT0=[$6])
  LogicalJoin(condition=[=($0, $5)], joinType=[inner])
    JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
    JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])

[Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Subset cost changed: subset [rel#11:RelSubset#0.JDBC.PUBLIC.[]] cost was {inf} now {100.0 rows, 101.0 cpu, 0.0 io}
[Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Register rel#7:JdbcTableScan.JDBC.PUBLIC.[](table=[db1, PUBLIC, POSTS]) in rel#11:RelSubset#0.JDBC.PUBLIC.[]
[Test worker] TRACE org.apache.calcite.plan.RelOptPlanner - Rule-match queued: rule [JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)] rels [#7]

java.lang.AssertionError: Rule's description should be unique;
existing rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)

On Mon, Jan 31, 2022 at 4:42 PM Gavin Ray wrote:
> I took the "FrameworksTest#executeQuery()" code from the Calcite repo here
> and have been using it:
>
> https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477
>
> Have been getting an error I don't understand.
> I have a rootSchema, with two sub-schemas, that each have JdbcSchema's
> from hsqldb
>
> https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78
>
> If I query just a single schema, it works fine:
> =
> Query: select * from db1.PUBLIC.POSTS
>
> -- Logical Plan
> LogicalProject(inputs=[0..3])
>   JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
>
> -- Mid Plan
> LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
>   JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1, PUBLIC, POSTS]])
>
> -- Best Plan
> JdbcToEnumerableConverter
>   JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])
>
> If I add a JOIN though, it fails before reaching the mid-plan:
> =
> Query:
> select * from db1.PUBLIC.POSTS
> inner join db2.PUBLIC.COMMENTS
> on db1.PUBLIC.POSTS.ID = db2.PUBLIC.COMMENTS.POST_ID
>
> -- Logical Plan
> LogicalProject(inputs=[0..6])
>   LogicalJoin(condition=[=($0, $5)], joinType=[inner])
>     JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
>     JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])
>
> java.lang.AssertionError: Rule's description should be unique;
> existing rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
> new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
>   at org.apache.calcite.plan.AbstractRelOptPlanner.addRule(AbstractRelOptPlanner.java:163)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.addRule(VolcanoPlanner.java:418)
>   at org.apache.calcite.adapter.jdbc.JdbcConvention.register(JdbcConvention.java:66)
>   at org.apache.calcite.plan.AbstractRelOptPlanner.registerClass(AbstractRelOptPlanner.java:240)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1365)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
>   at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
>   at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
>   at org.apache.calcite.plan.volcano.VolcanoPlanner.changeTraits(VolcanoPlanner.java:498)
>   at CalciteUtils.executeQuery(CalciteUtils.kt:41)
>   at ForeignKeyTest.testForeignKey(ForeignKeyTest.kt:82)
Getting "AssertionError: Rule's description should be unique" after adding JOIN to query. Fails on "optPlanner.changeTraits()"
I took the "FrameworksTest#executeQuery()" code from the Calcite repo here and have been using it:

https://github.com/apache/calcite/blob/de847c38f3544f9c7282984f32dc1093bdb2fb60/core/src/test/java/org/apache/calcite/tools/FrameworksTest.java#L444-L477

Have been getting an error I don't understand. I have a rootSchema, with two sub-schemas, that each have JdbcSchema's from hsqldb

https://gist.github.com/GavinRay97/cf39a9dd92a793df506ecc2135d6da0b#file-test-kt-L68-L78

If I query just a single schema, it works fine:
=
Query: select * from db1.PUBLIC.POSTS

-- Logical Plan
LogicalProject(inputs=[0..3])
  JdbcTableScan(table=[[db1, PUBLIC, POSTS]])

-- Mid Plan
LogicalProject(subset=[rel#8:RelSubset#1.ENUMERABLE.[]], inputs=[0..3])
  JdbcTableScan(subset=[rel#5:RelSubset#0.JDBC.PUBLIC.[]], table=[[db1, PUBLIC, POSTS]])

-- Best Plan
JdbcToEnumerableConverter
  JdbcTableScan(table=[[hsqldb1, PUBLIC, POSTS]])

If I add a JOIN though, it fails before reaching the mid-plan:
=
Query:
select * from db1.PUBLIC.POSTS
inner join db2.PUBLIC.COMMENTS
on db1.PUBLIC.POSTS.ID = db2.PUBLIC.COMMENTS.POST_ID

-- Logical Plan
LogicalProject(inputs=[0..6])
  LogicalJoin(condition=[=($0, $5)], joinType=[inner])
    JdbcTableScan(table=[[db1, PUBLIC, POSTS]])
    JdbcTableScan(table=[[db2, PUBLIC, COMMENTS]])

java.lang.AssertionError: Rule's description should be unique;
existing rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE);
new rule=JdbcToEnumerableConverterRule(in:JDBC.PUBLIC,out:ENUMERABLE)
  at org.apache.calcite.plan.AbstractRelOptPlanner.addRule(AbstractRelOptPlanner.java:163)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.addRule(VolcanoPlanner.java:418)
  at org.apache.calcite.adapter.jdbc.JdbcConvention.register(JdbcConvention.java:66)
  at org.apache.calcite.plan.AbstractRelOptPlanner.registerClass(AbstractRelOptPlanner.java:240)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1365)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
  at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:95)
  at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:274)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
  at org.apache.calcite.plan.volcano.VolcanoPlanner.changeTraits(VolcanoPlanner.java:498)
  at CalciteUtils.executeQuery(CalciteUtils.kt:41)
  at ForeignKeyTest.testForeignKey(ForeignKeyTest.kt:82)
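A guess at what is going on (not confirmed on the thread): the rule's description string embeds the name of the JDBC convention, and both JdbcSchemas here wrap an hsqldb schema named PUBLIC, so the two conventions (and hence the two converter rules) end up with identical descriptions, tripping the planner's uniqueness assertion. A stdlib-only Python sketch of the collision; the `rule_description` helper is hypothetical and just mirrors the format visible in the stack trace:

```python
# Hypothetical mirror of how the converter rule's description is built
# from the in/out convention names seen in the stack trace above.
def rule_description(in_convention: str, out_convention: str) -> str:
    return f"JdbcToEnumerableConverterRule(in:{in_convention},out:{out_convention})"

# Both sub-schemas expose an hsqldb schema named PUBLIC, so both JDBC
# conventions are rendered as "JDBC.PUBLIC" ...
db1_rule = rule_description("JDBC.PUBLIC", "ENUMERABLE")
db2_rule = rule_description("JDBC.PUBLIC", "ENUMERABLE")
assert db1_rule == db2_rule  # ... and the two rule descriptions collide

# Distinct schema names would presumably yield distinct descriptions:
assert (rule_description("JDBC.DB1_PUBLIC", "ENUMERABLE")
        != rule_description("JDBC.DB2_PUBLIC", "ENUMERABLE"))
```

If that reading is right, giving each JdbcSchema a distinct name when registering it would be the thing to try.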
Re: Using Calcite with Python
Flight is definitely another consideration for the future. Personally I think it would be most interesting to integrate Flight with Avatica as an alternative transport. But it would certainly also be useful to allow the Arrow adapter to connect to any Flight endpoint.

--
Michael Mior
mm...@apache.org

On Mon, Jan 31, 2022 at 10:00, Gavin Ray wrote:

> This is really interesting stuff you've done in the example notebooks
>
> Nicola & Michael, I wonder if you could benefit from the recently-released
> Arrow Flight SQL?
>
> https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/
>
> I have asked Jacques about this a bit -- it's meant to be a standardization
> for communicating SQL queries and metadata with Arrow.
> I'm not intimately familiar with it, but it seems like it could be a good
> base to build a Calcite backend for Arrow from?
>
> They have a pretty thorough Java example in the repository:
>
> https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180
>
> On Mon, Jan 31, 2022 at 8:47 AM Michael Mior wrote:
>
> > You may want to keep an eye on CALCITE-2040 (
> > https://issues.apache.org/jira/browse/CALCITE-2040). I have a student who
> > is working on a Calcite adapter for Apache Arrow. We're basically hung up
> > waiting on the Arrow team to release a compatible JAR. This still won't
> > fully solve your problem though as the first version of the adapter is only
> > capable of reading from Arrow files. However, the goal is eventually to
> > allow passing a memory reference into the adapter so that it would be
> > possible to make use of Arrow data which is constructed in-memory
> > elsewhere.
> > --
> > Michael Mior
> > mm...@apache.org
> >
> > On Sun, Jan 30, 2022 at 17:36, Nicola Vitucci wrote:
> >
> > > Hi all,
> > >
> > > What would be the best way to use Calcite with Python? I've come up with
> > > two potential solutions:
> > >
> > > - using the jaydebeapi package, to connect via the JDBC driver directly
> > > from a JVM created via jpype;
> > > - using Apache Arrow via the pyarrow package, to connect in basically the
> > > same way but creating Arrow objects with JdbcToArrowUtils (and optionally
> > > converting them to Pandas).
> > >
> > > Although the former is more straightforward, the latter allows to achieve
> > > better performance (see [1] for instance) since it's exactly what Arrow is
> > > for. I've created two Jupyter notebooks [2] showing each solution. What
> > > would you recommend? Is there an even better approach?
> > >
> > > Thanks,
> > >
> > > Nicola
> > >
> > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> > > [2] https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
Re: Using Calcite with Python
This is really interesting stuff you've done in the example notebooks Nicola & Michael, I wonder if you could benefit from the recently-released Arrow Flight SQL? https://www.dremio.com/subsurface/arrow-flight-and-arrow-flight-sql-accelerating-data-movement/ I have asked Jacques about this a bit -- it's meant to be a standardization for communicating SQL queries and metadata with Arrow. I'm not intimately familiar with it, but it seems like it could be a good base to build a Calcite backend for Arrow from? They have a pretty thorough Java example in the repository: https://github.com/apache/arrow/blob/968e6ea488c939c0e1f2bfe339a5a9ed1aed603e/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java#L169-L180 On Mon, Jan 31, 2022 at 8:47 AM Michael Mior wrote: > You may want to keep an eye on CALCITE-2040 ( > https://issues.apache.org/jira/browse/CALCITE-2040). I have a student who > is working on a Calcite adapter for Apache Arrow. We're basically hung up > waiting on the Arrow team to release a compatible JAR. This still won't > fully solve your problem though as the first version of the adapter is only > capable of reading from Arrow files. However, the goal is eventually to > allow passing a memory reference into the adapter so that it would be > possible to make use of Arrow data which is constructed in-memory > elsewhere. > -- > Michael Mior > mm...@apache.org > > > Le dim. 30 janv. 2022 à 17:36, Nicola Vitucci a > écrit : > > > Hi all, > > > > What would be the best way to use Calcite with Python? I've come up with > > two potential solutions: > > > > - using the jaydebeapi package, to connect via the JDBC driver directly > > from a JVM created via jpype; > > - using Apache Arrow via the pyarrow package, to connect in basically the > > same way but creating Arrow objects with JdbcToArrowUtils (and optionally > > converting them to Pandas). 
> > > > Although the former is more straightforward, the latter allows to achieve > > better performance (see [1] for instance) since it's exactly what Arrow > is > > for. I've created two Jupyter notebooks [2] showing each solution. What > > would you recommend? Is there an even better approach? > > > > Thanks, > > > > Nicola > > > > [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html > > [2] > https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python > > >
Re: Using Calcite with Python
You may want to keep an eye on CALCITE-2040 (https://issues.apache.org/jira/browse/CALCITE-2040). I have a student who is working on a Calcite adapter for Apache Arrow. We're basically hung up waiting on the Arrow team to release a compatible JAR. This still won't fully solve your problem though, as the first version of the adapter is only capable of reading from Arrow files. However, the goal is eventually to allow passing a memory reference into the adapter so that it would be possible to make use of Arrow data which is constructed in-memory elsewhere.

--
Michael Mior
mm...@apache.org

On Sun, Jan 30, 2022 at 17:36, Nicola Vitucci wrote:

> Hi all,
>
> What would be the best way to use Calcite with Python? I've come up with
> two potential solutions:
>
> - using the jaydebeapi package, to connect via the JDBC driver directly
> from a JVM created via jpype;
> - using Apache Arrow via the pyarrow package, to connect in basically the
> same way but creating Arrow objects with JdbcToArrowUtils (and optionally
> converting them to Pandas).
>
> Although the former is more straightforward, the latter allows to achieve
> better performance (see [1] for instance) since it's exactly what Arrow is
> for. I've created two Jupyter notebooks [2] showing each solution. What
> would you recommend? Is there an even better approach?
>
> Thanks,
>
> Nicola
>
> [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> [2] https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python
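For reference, the jaydebeapi route Nicola describes can be sketched roughly as below. This is an untested sketch, not a confirmed recipe: the driver class name and `jdbc:calcite:` URL format come from the Calcite JDBC docs, while the `model.json` and jar paths are placeholder assumptions. The third-party import is deferred into the function so the sketch itself is stdlib-only:

```python
def query_calcite(sql, model_path="model.json", jar_path="calcite-uber.jar"):
    """Run one SQL query against Calcite over JDBC from Python.

    Sketch only: assumes the jaydebeapi package is installed and that
    jar_path points at a jar containing org.apache.calcite.jdbc.Driver.
    """
    import jaydebeapi  # deferred: starts a JVM (via JPype) on first connect

    conn = jaydebeapi.connect(
        "org.apache.calcite.jdbc.Driver",    # Calcite's JDBC driver class
        f"jdbc:calcite:model={model_path}",  # schemas defined in a model file
        jars=jar_path,
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        rows = cur.fetchall()
        cur.close()
        return rows
    finally:
        conn.close()
```

The pyarrow variant would follow the same shape, but pull result sets through Arrow record batches on the Java side instead of row-by-row fetches.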
[jira] [Created] (CALCITE-4995) ArrayIndexOutOfBoundsException caused by RelFieldTrimmer on SEMI/ANTI join
Ruben Q L created CALCITE-4995:
--
Summary: ArrayIndexOutOfBoundsException caused by RelFieldTrimmer on SEMI/ANTI join
Key: CALCITE-4995
URL: https://issues.apache.org/jira/browse/CALCITE-4995
Project: Calcite
Issue Type: Bug
Components: core
Affects Versions: 1.29.0
Reporter: Ruben Q L
Assignee: Ruben Q L

(Unit test to be provided)

It seems {{RelFieldTrimmer}} can cause an {{ArrayIndexOutOfBoundsException}} on certain plans involving SEMI/ANTI join (i.e. joins that do NOT project the RHS fields). The root cause seems to be the "early return" in {{RelFieldTrimmer#trimFields(Join join, ImmutableBitSet fieldsUsed, Set extraFields)}} when nothing has been trimmed inside the join's inputs (so the join itself can be returned as is):

{code:java}
if (changeCount == 0 && mapping.isIdentity()) {
  return result(join, Mappings.createIdentity(fieldCount));
}
{code}

The problem is that this {{fieldCount}} is the sum of LHS + RHS fields (+ system fields); but in the case of a SEMI/ANTI join the returned mapping must not include RHS fields (since they are not projected by these join types). The problem only happens here (when the trimmer does not trim the join).
Notice that, a few lines below, in the "other return scenario" of the method (when something has been trimmed), there is a special treatment of the mapping for ANTI/SEMI, so things will work fine in this case:

{code:java}
switch (join.getJoinType()) {
case SEMI:
case ANTI:
  // For SemiJoins and AntiJoins only map fields from the left-side
  if (join.getJoinType() == JoinRelType.SEMI) {
    relBuilder.semiJoin(newConditionExpr);
  } else {
    relBuilder.antiJoin(newConditionExpr);
  }
  Mapping inputMapping = inputMappings.get(0);
  mapping = Mappings.create(MappingType.INVERSE_SURJECTION,
      join.getRowType().getFieldCount(),
      newSystemFieldCount + inputMapping.getTargetCount());
  for (int i = 0; i < newSystemFieldCount; ++i) {
    mapping.set(i, i);
  }
  offset = systemFieldCount;
  newOffset = newSystemFieldCount;
  for (IntPair pair : inputMapping) {
    mapping.set(pair.source + offset, pair.target + newOffset);
  }
  break;
default:
  relBuilder.join(join.getJoinType(), newConditionExpr);
}
relBuilder.hints(join.getHints());
return result(relBuilder.build(), mapping);
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
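The shape of the bug described above can be illustrated with a toy stdlib-only model (the field counts are made up for illustration): the early return hands back an identity mapping sized over LHS + RHS fields, but a SEMI/ANTI join's row type only contains the LHS fields, so any consumer following the mapping past the LHS width indexes beyond the join's actual output:

```python
# Toy model of the field counts involved (illustrative numbers only).
system_fields, lhs_fields, rhs_fields = 0, 4, 3

# What the early return uses: the sum of both inputs' field counts ...
field_count = system_fields + lhs_fields + rhs_fields
identity_mapping = list(range(field_count))  # ~ Mappings.createIdentity(fieldCount)

# ... but a SEMI/ANTI join's row type only has the left-side fields:
semi_row_type_width = system_fields + lhs_fields

# The mapping claims 7 output fields while the join really produces 4;
# any target at or past the real width is an out-of-bounds access.
bad_targets = [t for t in identity_mapping if t >= semi_row_type_width]
assert bad_targets == [4, 5, 6]
```

This matches the fix direction the ticket hints at: the early return would need the same left-side-only mapping that the "something was trimmed" branch already builds.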
Re: Using Calcite with Python
Hi Nicola,

It's a question I was asking myself the other day. I don't know the answer, but I do have an exploration direction: the Avatica client. There is a nice description and diagram here:

https://calcite.apache.org/avatica/docs/

And also a list of clients down below. See https://calcite.apache.org/avatica/docs/#apache-phoenix-database-adapter-for-python

Please let me know how it goes and what you find out.

On 31.01.2022 00:35, Nicola Vitucci wrote:

> Hi all,
>
> What would be the best way to use Calcite with Python? I've come up with
> two potential solutions:
>
> - using the jaydebeapi package, to connect via the JDBC driver directly
> from a JVM created via jpype;
> - using Apache Arrow via the pyarrow package, to connect in basically the
> same way but creating Arrow objects with JdbcToArrowUtils (and optionally
> converting them to Pandas).
>
> Although the former is more straightforward, the latter allows to achieve
> better performance (see [1] for instance) since it's exactly what Arrow is
> for. I've created two Jupyter notebooks [2] showing each solution. What
> would you recommend? Is there an even better approach?
>
> Thanks,
>
> Nicola
>
> [1] https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html
> [2] https://github.com/nvitucci/calcite-sparql/tree/v0.0.2/examples/python

Regards,
--
Eugen Stan
+40770 941 271 / https://www.netdava.com
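The Avatica direction Eugen points at can be sketched with the phoenixdb package (the Python client listed on the Avatica page). Untested sketch: the server URL and port are placeholders, the import is deferred so the sketch itself is stdlib-only, and (as noted elsewhere in the thread) the client tracks an older Avatica protocol version, so compatibility with a current Avatica server would need checking:

```python
def query_avatica(sql, url="http://localhost:8765/"):
    """Query a Calcite-backed Avatica HTTP server from Python.

    Sketch only: assumes phoenixdb is installed and an Avatica
    endpoint is listening at `url` (8765 is a placeholder port).
    """
    import phoenixdb  # deferred: speaks Avatica's protobuf-over-HTTP protocol

    conn = phoenixdb.connect(url, autocommit=True)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Compared with the jaydebeapi route, this keeps the JVM out of the Python process entirely; the trade-off is the extra HTTP hop and the client/server version question above.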