Re: [DISCUSS] CEP-39: Cost Based Optimizer

Benjamin Lerer Wed, 13 Dec 2023 09:08:24 -0800

One thing that I did not mention is that this CEP is only a high-level
proposal. There will be deeper discussions on the dev list around the
different parts of this proposal when we reach those parts and have enough
details to make those discussions more meaningful.



> The maintenance and distribution of summary statistics in particular is
> worthy of its own CEP, and it might be preferable to split it out.


For maintaining node statistics, the idea is to re-use the current
Memtable/SSTable mechanism and rely on mergeable statistics. That will
allow us to easily build node-level statistics for a given table by merging
all the statistics of its memtable and SSTables. For the distribution of
these node statistics, we are still exploring different options. We can come
back with a precise proposal once we have hammered out all the details.
Is this a blocker for the CEP for you, or do you just want to make sure that
this part is discussed in more detail before we implement it?
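
As a rough illustration of what "mergeable" means here, the sketch below
combines per-memtable and per-SSTable summaries into a node-level summary
for one table. It is a minimal sketch under stated assumptions, not the
design from the CEP: the class name TableStats, its fields, and mergeAll
are hypothetical and do not correspond to any existing Cassandra API.

```java
import java.util.List;

// Hypothetical summary of one column's values in a single memtable or SSTable.
// All names here are illustrative assumptions, not Cassandra classes.
final class TableStats {
    final long rowCount;  // rows seen in this memtable/SSTable
    final long minValue;  // smallest value observed
    final long maxValue;  // largest value observed

    // Identity element: merging with EMPTY leaves the other side unchanged.
    static final TableStats EMPTY = new TableStats(0, Long.MAX_VALUE, Long.MIN_VALUE);

    TableStats(long rowCount, long minValue, long maxValue) {
        this.rowCount = rowCount;
        this.minValue = minValue;
        this.maxValue = maxValue;
    }

    // Merge is associative and commutative, so the parts can be combined
    // in any order. Summing counts ignores rows that appear in more than
    // one SSTable, so the merged rowCount is only an upper bound.
    TableStats merge(TableStats other) {
        return new TableStats(rowCount + other.rowCount,
                              Math.min(minValue, other.minValue),
                              Math.max(maxValue, other.maxValue));
    }

    // Node-level statistics for a table: fold over all of its parts.
    static TableStats mergeAll(List<TableStats> parts) {
        TableStats result = EMPTY;
        for (TableStats part : parts) {
            result = result.merge(part);
        }
        return result;
    }
}
```

With this shape, a flush or compaction only needs to produce the summary
for the new SSTable; the node-level view is rebuilt by merging the
summaries that are currently live.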

>
> The proposal also seems to imply we are aiming for coordinators to all
> make the same decision for a query, which I think is challenging, and it
> would be worth fleshing out the design here a little (perhaps just in Jira).


The goal is that the large majority of nodes preparing a query at a given
point in time should make the same decision and that over time all nodes
should converge toward the same decision. This part is dependent on the
node statistics distribution, the cost model and the triggers for
re-optimization (that will require some experimentation).

> There’s also not much discussion of the execution model: I think it would
> make most sense for this to be independent of any cost and optimiser models
> (though they might want to operate on them), so that EXPLAIN and hints can
> work across optimisers (a suitable hint might essentially bypass the
> optimiser, if the optimiser permits it, by providing a standard execution
> model)
>

It is not clear to me what you mean by "a standard execution model".
Otherwise, we were not planning to have the execution model or the hints
depend on the optimizer.

> I think it would be worth considering providing the execution plan to the
> client as part of query preparation, as an opaque payload to supply to
> coordinators on first contact, as this might simplify the problem of
> ensuring queries behave the same without adopting a lot of complexity for
> synchronising statistics (which will never provide strong guarantees). Of
> course, re-preparing a query might lead to a new plan, though any
> coordinators with the query in their cache should be able to retrieve it
> cheaply. If the execution model is efficiently serialised this might have
> the ancillary benefit of improving the occupancy of our prepared query
> cache.
>

I am not sure that I understand your proposal. If two nodes build different
execution plans, how do you resolve that conflict?

On Wed, Dec 13, 2023 at 09:55, Benedict <[email protected]> wrote:

> A CBO can only make worse decisions than the status quo for what I presume
> are the majority of queries - i.e. those that touch only primary indexes.
> In general, there are plenty of use cases that prefer determinism. So I
> agree that there should at least be a CBO implementation that makes the
> same decisions as the status quo, deterministically.
>
>
> I do support the proposal, but would like to see some elements discussed
> in more detail. The maintenance and distribution of summary statistics in
> particular is worthy of its own CEP, and it might be preferable to split it
> out. The proposal also seems to imply we are aiming for coordinators to all
> make the same decision for a query, which I think is challenging, and it
> would be worth fleshing out the design here a little (perhaps just in Jira).
>
>
> While I’m not a fan of ALLOW FILTERING, I’m not convinced that this CEP
> deprecates it. It is a concrete qualitative guard rail, that I expect some
> users will prefer to a cost-based guard rail. Perhaps this could be left to
> the CBO to decide how to treat.
>
>
> There’s also not much discussion of the execution model: I think it would
> make most sense for this to be independent of any cost and optimiser models
> (though they might want to operate on them), so that EXPLAIN and hints can
> work across optimisers (a suitable hint might essentially bypass the
> optimiser, if the optimiser permits it, by providing a standard execution
> model)
>
>
> I think it would be worth considering providing the execution plan to the
> client as part of query preparation, as an opaque payload to supply to
> coordinators on first contact, as this might simplify the problem of
> ensuring queries behave the same without adopting a lot of complexity for
> synchronising statistics (which will never provide strong guarantees). Of
> course, re-preparing a query might lead to a new plan, though any
> coordinators with the query in their cache should be able to retrieve it
> cheaply. If the execution model is efficiently serialised this might have
> the ancillary benefit of improving the occupancy of our prepared query
> cache.
>
> On 13 Dec 2023, at 00:44, Jon Haddad <[email protected]> wrote:
>
> 
> I think it makes sense to see what the actual overhead is of CBO before
> making the assumption it'll be so high that we need to have two code
> paths.  I'm happy to provide thorough benchmarking and analysis when it
> reaches a testing phase.
>
> I'm excited to see where this goes.  I think it sounds very forward
> looking and opens up a lot of possibilities.
>
> Jon
>
> On Tue, Dec 12, 2023 at 4:25 PM guo Maxwell <[email protected]> wrote:
>
>> Nothing expresses my thoughts better than +1.
>> It feels like it means a lot to Cassandra.
>>
>> I have a question. Is it easy to turn off the CBO or bypass it in
>> some way? Some simple read and write requests will perform better
>> without a CBO, which is also an advantage of Cassandra compared
>> to some RDBMSs.
>>
>>
>> On Wed, Dec 13, 2023 at 3:37 AM, David Capwell <[email protected]> wrote:
>>
>>> Overall LGTM.
>>>
>>>
>>> On Dec 12, 2023, at 5:29 AM, Benjamin Lerer <[email protected]> wrote:
>>>
>>> Hi everybody,
>>>
>>> I would like to open the discussion on the introduction of a cost-based
>>> optimizer to allow Cassandra to pick the best execution plan based on the
>>> data distribution, thereby improving overall query performance.
>>>
>>> This CEP should also lay the groundwork for the future addition of
>>> features like joins, subqueries, OR/NOT and index ordering.
>>>
>>> The proposal is here:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
>>>
>>> Thank you in advance for your feedback.
>>>
>>>
>>>
