Hi,
  I understand... If we are overriding Calcite's cost function in an
adapter, should we also generate the cost as a record? How does Calcite
compare the costs of nodes in different conventions, and how does it
arrive at the cheapest plan? Is there a dedicated algorithm for that? I
would like to know more about that, because we would like to develop a
system in which Calcite plays the major role in optimization and
execution. For that we will be developing a cost function, and the costs
it estimates should be in a format comparable with Calcite's.
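For context on the comparison question: Calcite represents a cost as a record of (rows, cpu, io) via its RelOptCost interface, but the default VolcanoCost comparison is dominated by the row count. The following is a minimal self-contained sketch of that idea, not the real Calcite class; the cpu tie-break here is an illustrative assumption:

```java
// Simplified sketch of how a (rowCount, cpu, io) cost record can be compared.
// NOT the actual Calcite implementation: in calcite-core, costs implement
// RelOptCost, and VolcanoCost's ordering is effectively dominated by rowCount.
// The cpu tie-break below is an illustrative assumption, not Calcite behavior.
public class CostSketch {
    final double rowCount, cpu, io;

    CostSketch(double rowCount, double cpu, double io) {
        this.rowCount = rowCount;
        this.cpu = cpu;
        this.io = io;
    }

    /** "Less than or equal": rowCount decides, cpu breaks ties. */
    boolean isLe(CostSketch other) {
        return rowCount < other.rowCount
            || (rowCount == other.rowCount && cpu <= other.cpu);
    }

    public static void main(String[] args) {
        // Values taken from the plan discussed in this thread.
        CostSketch scan = new CostSketch(100.0, 101.0, 0.0);
        CostSketch aggregate = new CostSketch(203.75, 661.0, 0.0);
        System.out.println(scan.isLe(aggregate));  // true: lower row count wins
        System.out.println(aggregate.isLe(scan));  // false
    }
}
```

Because the comparison is dominated by row count, an adapter's custom cost function mainly needs to produce row-count estimates on the same scale as the built-in metadata providers for its plans to be comparable.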

Thank you so much for the support and suggestions....

On Thu, 7 Feb 2019, 4:07 pm Michael Mior <mm...@apache.org> wrote:

> I/O cost is always zero because Calcite itself doesn't try to estimate
> I/O cost. As I mentioned in my previous reply, this can be overridden by
> adapters. Row count is an estimate which starts off as the estimated
> number of rows in the table but then is impacted by the estimated
> filter factors of queries. So if the estimated number of rows is 100
> and filters are applied that remove an estimated 95% and 30% of rows,
> then the estimated rows produced by the query will be 3.5.
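The arithmetic in the quoted estimate can be made concrete with a small standalone example (plain Java, no Calcite dependency; the selectivity values are the ones from the paragraph above):

```java
// Worked version of the row-count estimate quoted above: a 100-row table with
// two filters whose estimated selectivities remove ~95% and ~30% of the rows.
// The base row count is multiplied by each filter's selectivity (the fraction
// of rows kept), which is why the estimates are doubles rather than integers.
public class SelectivityDemo {
    public static void main(String[] args) {
        double baseRows = 100.0;
        double sel1 = 1.0 - 0.95;  // first filter keeps ~5% of rows
        double sel2 = 1.0 - 0.30;  // second filter keeps ~70% of rows
        double estimate = baseRows * sel1 * sel2;
        System.out.println(estimate);  // approximately 3.5
    }
}
```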
>
> --
> Michael Mior
> mm...@apache.org
>
> Le jeu. 7 févr. 2019 à 04:55, Lekshmi <lekshmib...@gmail.com> a écrit :
> >
> > Dear Michael Mior,
> >    Thank you so much for the reply. But I'm still confused about the
> > following.
> >
> > When I connected PostgreSQL to Calcite and executed the TPC-H queries,
> > with the data residing in the Postgres database, I got the cheapest plan
> > below for TPC-H query 1.
> >
> > EnumerableProject(l_returnflag=[$0], l_linestatus=[$1], SUM_QTY=[CASE(=($3, 0), null, $2)], SUM_BASE_PRICE=[CASE(=($5, 0), null, $4)], SUM_DISC_PRICE=[CASE(=($7, 0), null, $6)], SUM_CHARGE=[CASE(=($9, 0), null, $8)], AVG_QTY=[/(CASE(=($3, 0), null, $2), $3)], AVG_PRICE=[/(CASE(=($5, 0), null, $4), $5)], AVG_DISC=[/(CASE(=($11, 0), null, $10), $11)], COUNT_ORDER=[$12]): rowcount = 1.0, cumulative cost = {216.75 rows, 1870.344248356904 cpu, 0.0 io}, id = 316
> >   EnumerableLimit(fetch=[1]): rowcount = 1.0, cumulative cost = {215.75 rows, 1860.344248356904 cpu, 0.0 io}, id = 315
> >     EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = 10.0, cumulative cost = {214.75 rows, 1859.344248356904 cpu, 0.0 io}, id = 314
> >       JdbcToEnumerableConverter: rowcount = 10.0, cumulative cost = {204.75 rows, 662.0 cpu, 0.0 io}, id = 313
> >         JdbcAggregate(group=[{0, 1}], SUM_QTY=[$SUM0($2)], agg#1=[COUNT($2)], SUM_BASE_PRICE=[$SUM0($3)], agg#3=[COUNT($3)], SUM_DISC_PRICE=[$SUM0($4)], agg#5=[COUNT($4)], SUM_CHARGE=[$SUM0($5)], agg#7=[COUNT($5)], agg#8=[$SUM0($6)], agg#9=[COUNT($6)], COUNT_ORDER=[COUNT()]): rowcount = 10.0, cumulative cost = {203.75 rows, 661.0 cpu, 0.0 io}, id = 312
> >           JdbcProject(l_returnflag=[$8], l_linestatus=[$9], l_quantity=[$4], l_extendedprice=[$5], $f4=[*($5, -(1, $6))], $f5=[*(*($5, -(1, $6)), +(1, $7))], l_discount=[$6]): rowcount = 100.0, cumulative cost = {180.0 rows, 661.0 cpu, 0.0 io}, id = 311
> >             JdbcTableScan(table=[[TPCH, lineitem]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 0
> >
> > But I don't understand why the I/O cost is always zero. And why is the
> > row count not an integer; is it not actually a count? What exactly is the
> > I/O cost: the number of I/O reads needed for the required input, or the
> > time taken to read it? The I/O value above is also not an integer.
> >
> >
> > Thanks and Regards
> >
> > Lekshmi B.G
> > Email: lekshmib...@gmail.com
> >
> >
> >
> >
> > On Thu, Feb 7, 2019 at 1:37 AM Michael Mior <mm...@apache.org> wrote:
> >
> > > You're correct that it's not a single value. That said, in practice,
> > > not much is actually done with CPU or I/O cost. You'll see in
> > > TableScan that computeSelfCost returns a cost which really only makes
> > > use of the number of rows in the table. Individual adapters may
> > > override this, but few do.
> > > --
> > > Michael Mior
> > > mm...@apache.org
> > >
> > > Le mer. 6 févr. 2019 à 05:22, Lekshmi <lekshmib...@gmail.com> a écrit
> :
> > > >
> > > > Hi,
> > > >  In [0], they suggest that "the optimizer implementer can choose the
> > > > cost to be a number or record". Which one does Apache Calcite use?
> > > > When I ran queries in debug mode, I found it to be a record. Is that
> > > > correct? Then can you please define (rows, CPU, I/O) in Calcite? Also,
> > > > when we are connecting Calcite across multiple systems, shouldn't the
> > > > I/O be greater than 0.0, since it needs to read the results of the
> > > > push-down operations? Can you please explain these in a bit more
> > > > detail? I appreciate your support and suggestions.
> > > >
> > > >
> > > > Thanks and Regards
> > > >
> > > > Lekshmi B.G
> > > > Email: lekshmib...@gmail.com
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Feb 5, 2019 at 11:30 PM Michael Mior <mm...@apache.org>
> wrote:
> > > >
> > > > > Calcite's optimizer is based on the Volcano optimizer [0]. In that
> > > > > paper you'll find an outline of the algorithm which is basically
> > > > > equivalent to what Calcite uses. Adding multiple systems doesn't
> > > > > complicate things very much. The main addition used by Calcite is
> what
> > > > > we call a "convention" trait that allows the optimizer to deal with
> > > > > expressions across multiple systems. More details are available in
> a
> > > > > recently published paper on Calcite [1].
> > > > >
> > > > > One important caveat to note is that the cost model used is not
> likely
> > > > > to reflect the actual cost of query execution in many cases. It's
> > > > > generally "good enough" in that the ordering of plans by cost will
> be
> > > > > approximately correct. So although the optimal plan will be
> selected
> > > > > according to the cost model, the plan which is actually the best in
> > > > > practice may not be selected. That said, I would expect Calcite
> will
> > > > > pick a plan which is generally quite close to the optimal, but we
> > > > > have no guarantee of this.
> > > > >
> > > > > [0]
> > > > >
> > >
> https://pdfs.semanticscholar.org/a817/a3e74d1663d9eb35b4baf3161ab16f57df85.pdf
> > > > > [1] https://arxiv.org/pdf/1802.10233.pdf
> > > > >
> > > > > --
> > > > > Michael Mior
> > > > > mm...@apache.org
> > > > >
> > > > > Le mar. 5 févr. 2019 à 15:52, Lekshmi <lekshmib...@gmail.com> a
> écrit
> > > :
> > > > > >
> > > > > > Hi,
> > > > > >    I would like to know about the Calcite CBO in detail,
> including
> > > how it
> > > > > > deals with global optimization when multiple processing systems
> are
> > > > > > associated with it. Any documentation, pointers are much
> appreciated.
> > > > > >
> > > > > >
> > > > > > Thanks and Regards
> > > > > >
> > > > > > Lekshmi B.G
> > > > > > Email: lekshmib...@gmail.com
> > > > >
> > >
>
