On 2/15/15 7:16 PM, Tomas Vondra wrote:
Hi,

On 16.2.2015 00:50, Andrew Gierth wrote:
"Tom" == Tom Lane <t...@sss.pgh.pa.us> writes:

I've now tried the attached patch to correct the bucketsize
estimates, and it does indeed stop the planner from considering the
offending path (in this case it just does the join the other way
round).

One thing I couldn't immediately see how to do was account for the
case where there are a lot of nulls in the table but a strict qual
(or an IS NOT NULL) filters them out; this patch will be overly
pessimistic about such cases. Do estimates normally try and take
things like this into account? I didn't find any other relevant
examples.

Improving the estimates is always good, but it's not going to fix the
case of non-NULL values (it shouldn't be all that difficult to create
such examples with a value whose hash starts with a bunch of zeroes).

  Tom> There may also be something we can do in the executor, but it
  Tom> would take closer analysis to figure out what's going wrong.  I
  Tom> don't think kluging the behavior for NULL in particular is the
  Tom> answer.

The point with nulls is that a hash value of 0 is currently special
in two distinct ways: it's always in batch 0 and bucket 0 regardless
of how many batches and buckets there are, and it's the result of
hashing a null. These two special cases interact in a worst-case
manner, so it seems worthwhile to avoid that.

I think you're right that this is a flaw in general - all it takes is a
sufficiently common value whose hash falls into the first batch (i.e.
either 0 or starting with enough zero bits to give batchno == 0).
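
For context, the bucket/batch assignment in nodeHash.c works roughly like
this (a paraphrase of ExecHashGetBucketAndBatch, not the exact source;
nbuckets and nbatch are both powers of 2, so modulo is just a mask):

     #include <stdint.h>

     static void
     get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, uint32_t nbatch,
                          int log2_nbuckets, uint32_t *bucketno, uint32_t *batchno)
     {
         /* the low bits pick the bucket */
         *bucketno = hashvalue & (nbuckets - 1);
         /* the bits just above the bucket bits pick the batch */
         *batchno = (nbatch > 1) ?
             ((hashvalue >> log2_nbuckets) & (nbatch - 1)) : 0;
     }

A hash value of 0 (which is what hashing a NULL produces here) masks to 0
on both counts for any nbuckets and nbatch, which is exactly why those
tuples keep piling up in batch 0 no matter how many times we split it.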

I think this might be solved by relaxing the check a bit. Currently we
do this (see nodeHash.c:735):

     if (nfreed == 0 || nfreed == ninmemory)
     {
         hashtable->growEnabled = false;
     }

which disables nbatch growth only if *all* the tuples stayed in the batch
(or all of them moved out). That's rather strict - a single tuple landing
on the other side is enough to keep the growth enabled.

With each nbatch increase we expect roughly a 50:50 split, i.e. half the
tuples staying in the batch and half moving to the new one. A much more
skewed ratio is suspicious and most likely means many tuples share the
same hash value, so what if we did something like this:

     if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
         (nfreed <= 0.1 * (nfreed + ninmemory)))
     {
         hashtable->growEnabled = false;
     }

which disables nbatch growth if either >= 90% of the tuples remained in
the first batch or >= 90% were moved out of it. The exact thresholds might
be set a bit differently, but I think 10:90 sounds about right.
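
The same test with the threshold factored out into a constant might look
like this (just a sketch - the constant name is made up, not anything in
the tree):

     #define HASHJOIN_SPLIT_SKEW_LIMIT 0.10   /* hypothetical name */

     double moved = (double) nfreed / (double) (nfreed + ninmemory);

     /* a split that moves (almost) nothing or (almost) everything is useless */
     if (moved <= HASHJOIN_SPLIT_SKEW_LIMIT ||
         moved >= 1.0 - HASHJOIN_SPLIT_SKEW_LIMIT)
     {
         hashtable->growEnabled = false;
     }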

A trivial patch implementing this is attached - with it the EXPLAIN
ANALYZE output looks like this:

test=# explain analyze select status, count(*) from q3 left join m3 on
(m3.id=q3.id) group by status;

                                QUERY PLAN
----------------------------------------------------------------------
  HashAggregate  (cost=64717.63..64717.67 rows=4 width=4) (actual
time=514.619..514.630 rows=5 loops=1)
    Group Key: m3.status
    ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
                     (actual time=75.260..467.911 rows=500108 loops=1)
          Hash Cond: (m3.id = q3.id)
          ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
                         (actual time=0.003..91.799 rows=1000000 loops=1)
          ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
                    (actual time=74.916..74.916 rows=500000 loops=1)
                Buckets: 65536 (originally 65536)
                Batches: 32 (originally 16)  Memory Usage: 8824kB
                ->  Seq Scan on q3
                    (cost=0.00..7943.00 rows=500000 width=33)
                    (actual time=0.005..27.395 rows=500000 loops=1)
  Planning time: 0.172 ms
  Execution time: 514.663 ms
(10 rows)

Without the patch this runs in ~240 seconds and the number of batches
explodes to ~131k.

In theory it might happen that there are just a few distinct hash values,
all of them identical in the first N bits (and thus falling into the first
batch) but differing in bit N+1, so the next split would actually help.
But I think we can leave that up to probability, just like the rest of the
hashing code.
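
To make the "N bits vs. N+1 bits" scenario concrete, here's a toy
standalone example (not PostgreSQL code) with two hash values that share
the first four batch bits but differ in the fifth, so they stay together
at nbatch=16 and separate at nbatch=32:

     #include <stdio.h>
     #include <stdint.h>

     /* Toy illustration: the batch number is taken from the hash bits just
      * above the bucket bits, so each doubling of nbatch consumes one more
      * bit of the hash. */
     static uint32_t
     batchno(uint32_t hashvalue, int log2_nbuckets, uint32_t nbatch)
     {
         return (hashvalue >> log2_nbuckets) & (nbatch - 1);
     }

     int main(void)
     {
         int      log2_nbuckets = 16;   /* 65536 buckets, as in the plan above */
         uint32_t a = 0x00000000;       /* e.g. the hash of a NULL */
         uint32_t b = 0x00100000;       /* same first 4 batch bits, 5th differs */

         printf("nbatch=16: a -> %u, b -> %u\n",
                batchno(a, log2_nbuckets, 16), batchno(b, log2_nbuckets, 16));
         printf("nbatch=32: a -> %u, b -> %u\n",
                batchno(a, log2_nbuckets, 32), batchno(b, log2_nbuckets, 32));
         return 0;
     }

This prints 0/0 for nbatch=16 and 0/16 for nbatch=32, i.e. the next split
really would separate them.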

Did anything happen with this, or with the patch Andrew posted?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


