Hello Benjamin,

Can you share the reasons why Apache Calcite is not suitable for this case and why it was rejected? It has custom syntax support and a CBO, so I would be interested to see some technical details in the "Rejected Alternatives" section. I'm pretty sure the reasons exist, but they weren't mentioned there. And please don't take this as an ad :-)

In Apache Ignite, I worked on improving the query execution engine, and one of the reasons for moving from one query engine to another (to Calcite, to be precise) was that we had a problem with calculating memory quotas for queries and aborting a query when those quotas were exceeded. An engine can load and hold rows in memory, preventing the GC from collecting them, or hold objects that are too large, so the JVM can easily run out of memory. It is therefore important to have full control over the query execution path.

BTW, here is a Calcite adapter for Cassandra:
https://calcite.apache.org/docs/cassandra_adapter.html

On Wed, 13 Dec 2023 at 09:55, Benedict <bened...@apache.org> wrote:
>
> A CBO can only make worse decisions than the status quo for what I presume
> are the majority of queries - i.e. those that touch only primary indexes. In
> general, there are plenty of use cases that prefer determinism. So I agree
> that there should at least be a CBO implementation that makes the same
> decisions as the status quo, deterministically.
>
> I do support the proposal, but would like to see some elements discussed in
> more detail. The maintenance and distribution of summary statistics in
> particular is worthy of its own CEP, and it might be preferable to split it
> out. The proposal also seems to imply we are aiming for coordinators to all
> make the same decision for a query, which I think is challenging, and it
> would be worth fleshing out the design here a little (perhaps just in Jira).
>
> While I’m not a fan of ALLOW FILTERING, I’m not convinced that this CEP
> deprecates it. It is a concrete qualitative guard rail, that I expect some
> users will prefer to a cost-based guard rail. Perhaps this could be left to
> the CBO to decide how to treat.
>
> There’s also not much discussion of the execution model: I think it would
> make most sense for this to be independent of any cost and optimiser models
> (though they might want to operate on them), so that EXPLAIN and hints can
> work across optimisers (a suitable hint might essentially bypass the
> optimiser, if the optimiser permits it, by providing a standard execution
> model)
>
> I think it would be worth considering providing the execution plan to the
> client as part of query preparation, as an opaque payload to supply to
> coordinators on first contact, as this might simplify the problem of ensuring
> queries behave the same without adopting a lot of complexity for
> synchronising statistics (which will never provide strong guarantees). Of
> course, re-preparing a query might lead to a new plan, though any
> coordinators with the query in their cache should be able to retrieve it
> cheaply. If the execution model is efficiently serialised this might have the
> ancillary benefit of improving the occupancy of our prepared query cache.
>
> On 13 Dec 2023, at 00:44, Jon Haddad <j...@jonhaddad.com> wrote:
>
> I think it makes sense to see what the actual overhead is of CBO before
> making the assumption it'll be so high that we need to have two code paths.
> I'm happy to provide thorough benchmarking and analysis when it reaches a
> testing phase.
>
> I'm excited to see where this goes. I think it sounds very forward looking
> and opens up a lot of possibilities.
>
> Jon
>
> On Tue, Dec 12, 2023 at 4:25 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>
>> Nothing expresses my thoughts better than +1
>> ,It feels like it means a lot to Cassandra.
>>
>> I have a question. Is it easy to turn off cbo's optimizer or by pass in some
>> way? Because some simple read and write requests will have better
>> performance without cbo, which is also the advantage of Cassandra compared
>> to some rdbms.
>>
>> David Capwell <dcapw...@apple.com> wrote on Wed, 13 Dec 2023 at 3:37 AM:
>>>
>>> Overall LGTM.
>>>
>>> On Dec 12, 2023, at 5:29 AM, Benjamin Lerer <ble...@apache.org> wrote:
>>>
>>> Hi everybody,
>>>
>>> I would like to open the discussion on the introduction of a cost based
>>> optimizer to allow Cassandra to pick the best execution plan based on the
>>> data distribution. Therefore, improving the overall query performance.
>>>
>>> This CEP should also lay the groundwork for the future addition of features
>>> like joins, subqueries, OR/NOT and index ordering.
>>>
>>> The proposal is here:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
>>>
>>> Thank you in advance for your feedback.
>>>
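
To make the memory-quota point at the top of this message a bit more concrete, here is a minimal sketch of per-query memory accounting. It is not taken from Ignite, Calcite, or Cassandra; the class and method names (QueryMemoryTracker, reserve, release, QueryMemoryQuotaExceededException) are hypothetical, and it assumes the execution engine can estimate the size of a row before buffering it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-query memory tracker; all names are illustrative only. */
final class QueryMemoryTracker {
    private final long quotaBytes;
    private final AtomicLong reservedBytes = new AtomicLong();

    QueryMemoryTracker(long quotaBytes) {
        if (quotaBytes <= 0) {
            throw new IllegalArgumentException("quota must be positive");
        }
        this.quotaBytes = quotaBytes;
    }

    /**
     * Called by an operator before it buffers a row of the given estimated size.
     * Throws if the reservation would push the query over its quota, which is
     * the signal for the engine to abort the query.
     */
    void reserve(long bytes) {
        long updated = reservedBytes.addAndGet(bytes);
        if (updated > quotaBytes) {
            reservedBytes.addAndGet(-bytes); // roll back the failed reservation
            throw new QueryMemoryQuotaExceededException(updated, quotaBytes);
        }
    }

    /** Called when an operator drops buffered rows so they can be collected. */
    void release(long bytes) {
        reservedBytes.addAndGet(-bytes);
    }

    long reserved() {
        return reservedBytes.get();
    }
}

/** Hypothetical exception used to abort a query that exceeds its quota. */
final class QueryMemoryQuotaExceededException extends RuntimeException {
    QueryMemoryQuotaExceededException(long requiredBytes, long quotaBytes) {
        super("query needs " + requiredBytes + " bytes but its quota is " + quotaBytes);
    }
}
```

This only works if every operator that holds rows goes through the tracker, which is why control over the whole execution path matters.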