Thanks Ashutosh & Robert for the explanation.

On Mon, Mar 6, 2017 at 10:02 AM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Sat, Mar 4, 2017 at 2:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > On Thu, Mar 2, 2017 at 6:48 PM, Ashutosh Bapat
> > <ashutosh.ba...@enterprisedb.com> wrote:
> >> On Thu, Mar 2, 2017 at 6:06 PM, Rushabh Lathia <
> rushabh.lat...@gmail.com> wrote:
> >>> While reading through the cost_agg() I found that startup cost for the
> >>> group aggregate is not correctly assigned. Due to this explain plan is
> >>> not printing the correct startup cost.
> >>>
> >>> Without patch:
> >>>
> >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts
> where
> >>> filler like '%foo%' group by aid;
> >>>                                      QUERY PLAN
> >>> ------------------------------------------------------------
> -------------------------
> >>>  GroupAggregate  (cost=81634.33..85102.04 rows=198155 width=12)
> >>>    Group Key: aid
> >>>    ->  Sort  (cost=81634.33..82129.72 rows=198155 width=8)
> >>>          Sort Key: aid
> >>>          ->  Seq Scan on pgbench_accounts  (cost=0.00..61487.89
> rows=198155
> >>> width=8)
> >>>                Filter: (filler ~~ '%foo%'::text)
> >>> (6 rows)
> >>>
> >>> With patch:
> >>>
> >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts
> where
> >>> filler like '%foo%' group by aid;
> >>>                                      QUERY PLAN
> >>> ------------------------------------------------------------
> -------------------------
> >>>  GroupAggregate  (cost=82129.72..85102.04 rows=198155 width=12)
> >>>    Group Key: aid
> >>>    ->  Sort  (cost=81634.33..82129.72 rows=198155 width=8)
> >>>          Sort Key: aid
> >>>          ->  Seq Scan on pgbench_accounts  (cost=0.00..61487.89
> rows=198155
> >>> width=8)
> >>>                Filter: (filler ~~ '%foo%'::text)
> >>> (6 rows)
> >>>
> >>
> >> The reason the reason why startup_cost = input_startup_cost and not
> >> input_total_cost for aggregation by sorting is we don't need the whole
> >> input before the Group/Agg plan can produce the first row. But I think
> >> setting startup_cost = input_startup_cost is also not exactly correct.
> >> Before the plan can produce one row, it has to transit through all the
> >> rows belonging to the group to which the first row belongs. On an
> >> average it has to scan (total number of rows)/(number of groups)
> >> before producing the first aggregated row. startup_cost will be
> >> input_startup_cost + cost to scan (total number of rows)/(number of
> >> groups) rows + cost of transiting over those many rows. Total cost =
> >> startup_cost + cost of scanning and transiting through the remaining
> >> number of input rows.
> >
> > While that idea has some merit, I think it's inconsistent with current
> > practice.  cost_seqscan(), for example, doesn't include the cost of
> > reading the first page in the startup cost, even though that certainly
> > must be done before returning the first row.
>
> OTOH, while costing for merge join, initial_cost_mergejoin() seems to
> consider the cost of scanning outer and inner relations upto the first
> matching tuple on both sides in the startup_cost. Different policies
> at different places.
>
> > I think there have been
> > previous discussions of switching over to the practice for which you
> > are advocating here, but my impression (without researching) is that
> > the current practice is more like what Rushabh did.
> >
>
> I am not sure Rushabh's approach is correct. Here's the excerpt from my
> mail.
>
> >> The reason the reason why startup_cost = input_startup_cost and not
> >> input_total_cost for aggregation by sorting is we don't need the whole
> >> input before the Group/Agg plan can produce the first row.
>
>

> With Rushabh's approach the startup cost of aggregation by sorting
> would be way higher that it's right now. Secondly, it would match that
> of hash aggregation and thus forcing hash aggregation to be chosen in
> almost all the cases.
>

I understood you reasoning of why startup_cost = input_startup_cost and not
input_total_cost for aggregation by sorting. But what I didn't understand is
how come higher startup cost for aggregation by sorting would force hash
aggregation to be chosen? I am not clear about this part.

--
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



Regards.
Rushabh Lathia
www.EnterpriseDB.com

Reply via email to