Hi Stephen,

On 3.7.2014 20:10, Stephen Frost wrote:
> Tomas,
> 
> * Tomas Vondra (t...@fuzzy.cz) wrote:
>> However it's likely there are queries where this may not be the case,
>> i.e. where rebuilding the hash table is not worth it. Let me know if you
>> can construct such a query (I wasn't able to).
> 
> Thanks for working on this! I've been thinking on this for a while
> and this seems like it may be a good approach. Have you considered a
> bloom filter over the buckets..? Also, I'd suggest you check the

I know you've experimented with a bloom filter, but I haven't looked into
that yet.

> archives from about this time last year for test cases that I was
> using which showed cases where hashing the larger table was a better
> choice- those same cases may also show regression here (or at least
> would be something good to test).

Good idea, I'll look at the test cases - thanks.

> Have you tried to work out what a 'worst case' regression for this 
> change would look like? Also, how does the planning around this
> change? Are we more likely now to hash the smaller table (I'd guess
> 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you
> make any changes due to the rehashing cost?)?

The case I was thinking about is an underestimated cardinality of the inner
table combined with a small outer table. That leads to building a large hash
table but doing only very few lookups, so the cost of the rehash never pays
off. I.e. something like this:

  Hash Join
    ->  Seq Scan on small_table (rows=100) (actual rows=100)
    ->  Hash
          ->  Seq Scan on bad_estimate (rows=100) (actual rows=1000000000)
                Filter: ((a < 100) AND (b < 100))

But I wasn't able to reproduce this in a reasonable way, because in practice
the planner chooses a nested loop (or a similar plan) for such a query. That's
a planning issue, impossible to fix in the hashjoin code.
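
For the record, the kind of setup I was experimenting with looks roughly like
this (the names match the sketch above, but the join key, row counts and the
correlation trick are just illustrative - getting an estimate as bad as the
one above would need more tuning):

  -- two strongly correlated columns, so the planner underestimates the
  -- selectivity of ((a < 100) AND (b < 100)) by multiplying the two
  -- per-column selectivities as if they were independent
  CREATE TABLE bad_estimate AS
    SELECT i % 1000 AS a, i % 1000 AS b
      FROM generate_series(1, 10000000) s(i);

  CREATE TABLE small_table AS
    SELECT i AS id FROM generate_series(1, 100) s(i);

  ANALYZE bad_estimate;
  ANALYZE small_table;

  EXPLAIN ANALYZE
  SELECT *
    FROM small_table s
    JOIN bad_estimate b ON (b.a = s.id)
   WHERE b.a < 100 AND b.b < 100;

The important part is that the hash is built on the underestimated side, so
the hash table ends up much larger than expected while the outer side only
does ~100 lookups.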

Tomas

