The planner doesn't currently worry about work_mem restrictions when planning a hash join, figuring that the executor should be able to subdivide the data arbitrarily finely by splitting buckets at runtime. However, there's a thread here:

https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com

exhibiting a case where a hash join was chosen even though a single value accounts for three-quarters of the inner relation. Bucket splitting obviously can never separate multiple instances of the same value, so this choice forced the executor to try to load three-quarters of the (very large) inner relation into memory at once; unsurprisingly, it failed.
To fix this, I think we need to discourage use of hash joins whenever a single bucket is predicted to exceed work_mem, as in the attached draft patch. The patch changes one regression test case from a hash join to a merge join, which is fine; that case cares only about the join order, not the join types.

This might be overly aggressive: it will pretty much shut off any attempt to use a hash join on a large inner relation unless we have statistics for it (and those statistics are favorable). But having seen this example, I think we need to be worried.

I'm inclined to treat this as a bug and back-patch it, but I wonder whether anyone is concerned about possible plan destabilization in the back branches.

			regards, tom lane
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d01630f..9170b92 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2800,2805 ****
--- 2800,2806 ----
      double      hashjointuples;
      double      virtualbuckets;
      Selectivity innerbucketsize;
+     double      inner_bucket_rows;
      ListCell   *hcl;
  
      /* Mark the path with the correct row estimate */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2894,2899 ****
--- 2895,2914 ----
      }
  
      /*
+      * If even a single bucket would exceed work_mem, we don't want to hash
+      * unless there is really no other alternative, so apply disable_cost.
+      * (Because estimate_hash_bucketsize's estimate is mostly driven by the
+      * MCV frequency, this condition suggests that the executor will be unable
+      * to drive the batch size below work_mem no matter how much it splits
+      * buckets: splitting cannot separate values that are equal.)
+      */
+     inner_bucket_rows = clamp_row_est(inner_path_rows * innerbucketsize);
+     if (relation_byte_size(inner_bucket_rows,
+                            inner_path->pathtarget->width) >
+         (work_mem * 1024L))
+         startup_cost += disable_cost;
+ 
+     /*
       * Compute cost of the hashquals and qpquals (other restriction clauses)
       * separately.
       */
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2964,2970 ****
       */
      startup_cost += hash_qual_cost.startup;
      run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!         clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
  
      /*
       * Get approx # tuples passing the hashquals.  We use
--- 2979,2985 ----
       */
      startup_cost += hash_qual_cost.startup;
      run_cost += hash_qual_cost.per_tuple * outer_path_rows *
!         inner_bucket_rows * 0.5;
  
      /*
       * Get approx # tuples passing the hashquals.  We use
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4992048..b2270ca 100644
*** a/src/test/regress/expected/join.out
--- b/src/test/regress/expected/join.out
*************** where not exists (
*** 2429,2461 ****
  ) a1 on t3.c2 = a1.c1
  where t1.c1 = t2.c2
  );
!                        QUERY PLAN                        
! ---------------------------------------------------------
!  Hash Anti Join
!    Hash Cond: (t1.c1 = t2.c2)
!    ->  Seq Scan on tt4x t1
!    ->  Hash
!          ->  Merge Right Join
!                Merge Cond: (t5.c1 = t3.c2)
!                ->  Merge Join
!                      Merge Cond: (t4.c2 = t5.c1)
!                      ->  Sort
!                            Sort Key: t4.c2
!                            ->  Seq Scan on tt4x t4
!                      ->  Sort
!                            Sort Key: t5.c1
!                            ->  Seq Scan on tt4x t5
!                ->  Sort
!                      Sort Key: t3.c2
!                      ->  Merge Left Join
!                            Merge Cond: (t2.c3 = t3.c1)
                             ->  Sort
!                                  Sort Key: t2.c3
!                                  ->  Seq Scan on tt4x t2
                             ->  Sort
!                                  Sort Key: t3.c1
!                                  ->  Seq Scan on tt4x t3
! (24 rows)
  
  --
  -- regression test for problems of the sort depicted in bug #3494
--- 2429,2465 ----
  ) a1 on t3.c2 = a1.c1
  where t1.c1 = t2.c2
  );
!                           QUERY PLAN                           
! ---------------------------------------------------------------
!  Merge Anti Join
!    Merge Cond: (t1.c1 = t2.c2)
!    ->  Sort
!          Sort Key: t1.c1
!          ->  Seq Scan on tt4x t1
!    ->  Materialize
!          ->  Sort
!                Sort Key: t2.c2
!                ->  Merge Right Join
!                      Merge Cond: (t5.c1 = t3.c2)
!                      ->  Merge Join
!                            Merge Cond: (t4.c2 = t5.c1)
                             ->  Sort
!                                  Sort Key: t4.c2
!                                  ->  Seq Scan on tt4x t4
                             ->  Sort
!                                  Sort Key: t5.c1
!                                  ->  Seq Scan on tt4x t5
!                      ->  Sort
!                            Sort Key: t3.c2
!                            ->  Merge Left Join
!                                  Merge Cond: (t2.c3 = t3.c1)
!                                  ->  Sort
!                                        Sort Key: t2.c3
!                                        ->  Seq Scan on tt4x t2
!                                  ->  Sort
!                                        Sort Key: t3.c1
!                                        ->  Seq Scan on tt4x t3
! (28 rows)
  
  --
  -- regression test for problems of the sort depicted in bug #3494