On Tue, Mar 24, 2020 at 12:01 PM Kyotaro Horiguchi <horikyota....@gmail.com> wrote:
> At Mon, 23 Mar 2020 21:13:48 +0800, Richard Guo <guofengli...@gmail.com> > wrote in > > Hi all, > > > > With the following statements on latest master (c81bd3b9), I find > > negative cost for plan nodes. > > > > create table a (i int, j int); > > insert into a select i%100000, i from generate_series(1,1000000)i; > > analyze a; > > > > # explain select i from a group by i; > > QUERY PLAN > > ----------------------------------------------------------------- > > HashAggregate (cost=1300.00..-1585.82 rows=102043 width=4) > > Good catch! > > > Group Key: i > > Planned Partitions: 4 > > -> Seq Scan on a (cost=0.00..14425.00 rows=1000000 width=4) > > (4 rows) > > > > In function cost_agg, when we add the disk costs of hash aggregation > > that spills to disk, nbatches is calculated as 1.18 in this case. It is > > greater than 1, so there will be spill. And the depth is calculated as > > -1 in this case, with num_partitions being 4. I think this is where > > thing goes wrong. > > The depth is the expected number of iterations of reading the relation. > > > depth = ceil( log(nbatches - 1) / log(num_partitions) ); > Yes correct. > > I'm not sure what the expression based on, but apparently it is wrong > for nbatches <= 2.0. It looks like a thinko of something like this. > > depth = ceil( log(nbatches) / log(num_partitions + 1) ); > It seems to me we should use '(nbatches - 1)', without the log function. Maybe I'm wrong. I have sent this issue to the 'Memory-Bounded Hash Aggregation' thread. Thanks Richard