> While investigating some performance problems recently I've had cause
> to think about the way PostgreSQL uses hash joins.  So here are a few
> thoughts.  Some of these have been brought up before.
>
> 1. When the hash is not expected to spill to disk, it preserves the
> pathkeys of the outer side of the join.  If the optimizer were allowed
> to assume that, it could produce significantly more efficient query
> plans in some cases.
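To make point 1 concrete, here is a hypothetical sketch (plain Python, not PostgreSQL source) of a single-batch hash join: the hash table is built on the inner side, and the outer side is probed in its existing order, so the output inherits the outer side's sort order. The row layout and key functions are illustrative assumptions.

```python
# Hypothetical sketch of a single-batch (in-memory) hash join.
# Build phase hashes the inner side; probe phase walks the outer side
# in order, so the result preserves the outer side's pathkeys.
from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    # Build phase: hash every inner row by its join key.
    table = defaultdict(list)
    for row in inner:
        table[inner_key(row)].append(row)
    # Probe phase: scan the outer side in its existing order.
    result = []
    for row in outer:
        for match in table.get(outer_key(row), []):
            result.append((row, match))
    return result

# Outer input sorted by id; the joined output stays sorted by outer id.
outer = [(1, "a"), (2, "b"), (3, "c")]
inner = [(3, "z"), (1, "x")]
joined = hash_join(outer, inner, lambda r: r[0], lambda r: r[0])
```

This ordering guarantee only holds for a single batch: once the join spills to disk, batches are emitted in hash order rather than outer-scan order, which is why the optimizer cannot assume it unconditionally.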
This is definitely possible, but you will have to dynamically modify the
execution path if the hash join ends up being more than one batch.

> 3. Avoid building the exact same hash table twice in the same query.
> This happens more often than you'd think.  For example, a table may
> have two columns creator_id and last_updater_id which both reference
> person (id).  If you're considering a hash join between paths A and B,
> you could conceivably check whether what is essentially a duplicate of
> B has already been hashed somewhere within path A.  If so, you can
> reuse that same hash table at zero startup cost.
>
> 4. As previously discussed, avoid hashing for distinct and then
> hashing the results for a hash join on the same column with the same
> operators.
>
> Thoughts on the value and/or complexity of implementation of any of
> these?

I would be interested in working with you on any of these changes to
hash join if you decide to pursue them.  I am especially interested in
looking at the hash aggregation code and potentially improving its
efficiency.

We have implemented a multi-way hash join (it can join more than 2
tables at a time) which may help with cases #3 and #4.  Performance
results look very good, and we are planning on building a patch for
this over the summer.

--
Ramon Lawrence

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers