I have a pretty simple parent-child relationship, where parents are segmented into many bins (actually states). I need to query over the (parent, child) join but filter based on aggregates of the parent. That is -- all parent, child pairs for parents that are in bin x and also have more than y children. Here's what the data looks like:
CREATE TABLE parent ( id SERIAL PRIMARY KEY, bin INTEGER ); CREATE INDEX foo ON parent(bin); CREATE TABLE child ( parent_id INTEGER REFERENCES parent(id), data INTEGER ); INSERT INTO parent (bin) SELECT s.a % 50 FROM generate_series(1, 100000) AS s(a); INSERT INTO CHILD (parent_id, data) SELECT id, floor(random() * 50) FROM parent; INSERT INTO CHILD (parent_id, data) SELECT id, floor(random() * 50) FROM parent; INSERT INTO CHILD (parent_id, data) SELECT id, floor(random() * 50) FROM parent; ANALYZE parent; ANALYZE child; And the query: (1) SELECT * FROM parent INNER JOIN child ON child.parent_id = parent.id LEFT JOIN (SELECT parent.id, COUNT(*) AS c FROM parent INNER JOIN child ON child.parent_id = parent.id WHERE child.data > 25 GROUP BY 1) agg1 ON agg1.id = parent.id WHERE parent.bin = 1 AND agg1.c >= 3; This does not perform very well, because the subquery is calculated across all bins, even the 49 that will be discarded by the base query. The query plan: http://explain.depesz.com/s/Ty4. I feel like the planner should be able to move the bin condition into the subquery, like this: (2) SELECT * FROM parent INNER JOIN child ON child.parent_id = parent.id LEFT JOIN (SELECT parent.id, COUNT(*) AS c FROM parent INNER JOIN child ON child.parent_id = parent.id WHERE child.data > 25 -- manually move the base query's condition into the subquery AND parent.bin = 1 GROUP BY 1) agg1 ON agg1.id = parent.id WHERE parent.bin = 1 AND agg1.c >= 3; This query produces the ideal query plan (http://explain.depesz.com/s/8aRo), but it feels like we're doing the planner's work for it. This SQL is generated from a reporting interface, so it would be nice if this stuff could be figured out automatically. I know there are other ways to write this query, but this style of joining an aggregation is really nice for reporting. I actually end up joining several different aggregations and produce a condition across all of them. Now, maybe the planner doesn't know about the primary key, and that if parent.id is the same, parent.bin must be the same. Let's try to give it this information as part of the join clause: (3) SELECT * FROM parent INNER JOIN child ON child.parent_id = parent.id LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c FROM parent INNER JOIN child ON child.parent_id = parent.id WHERE child.data > 25 GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin WHERE parent.bin = 1 AND agg1.c >= 3; This works! Well, at first. As soon as we say `bin IN (1, 2)` instead of `bin = 1`, the query plan falls down again: http://explain.depesz.com/s/u7R: (4) SELECT * FROM parent INNER JOIN child ON child.parent_id = parent.id LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c FROM parent INNER JOIN child ON child.parent_id = parent.id WHERE child.data > 25 GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin WHERE parent.bin IN (1, 2) AND agg1.c >= 3; Note that again, moving the condition inside the subquery produces the correct plan. It'd be nice if the planner could optimize the query (1) by turning it into (2). I understand that it might not be able to, but if it can pull the condition up in (3), why can't it in (4)? PS: This is on postgres 9.3