PostgreSQL development community:
Our research group has been using the PostgreSQL code base to test new join algorithms. During testing, we noticed that the planner is not pushing down projections to the outer relation in a hash join. Although this makes sense for in-memory (1 batch) joins, for joins larger than memory (such as for TPC-H DSS), this causes the system to perform significantly more disk I/Os when reading/writing batches of the outer relation. A simple solution is to add a single line of code to src\backend\optimizer\plan\createplan.c after line 1771: disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath); This will always force the projection on the outer relation. A more complicated modification alternative is to add a state variable to allow the planner to know how many batches the hash join expects and only push down the projection if it is greater than one. However, pushing the projection on the outer relation is almost always the best choice as it eliminates unneeded attributes for operators above the hash join in the plan and will be robust in the case of poor estimates. We have been testing using TPC-H scale factor 1 GB. A sample query that demonstrates the behavior is: SELECT c_custkey, c_name, o_orderkey, o_orderdate FROM Customer, Orders WHERE c_custkey = o_custkey Note that EXPLAIN on this query will indicate that the projection is performed on the outer relation even though it is not done. We found the difference by modifying our code to track tuples and bytes output to disk, but it also can be detected by watching the size of the temporary files produced during the join. Sincerely, Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan http://people.ok.ubc.ca/rlawrenc/ E-mail: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>