On May 3, 2013 12:15 PM, "Yang Zhang" <yanghates...@gmail.com> wrote:
> I have an `account` table with 5.3M rows, with primary key `id` of > type `text` (and 600+ columns if that matters). > > I'm trying to create a `newaccount` table with the same schema but > 600k newly imported rows, then insert all the old rows for which `id` > isn't already in the set of newly updated rows. > > => create table newaccount as select * from account limit 0; > > => \copy newaccount from stdin with (format csv) > [...copy 600k rows in...] > > => analyze newaccount; > > The most obvious query doesn't work so hot due to repeated execution > of the subplan: > > => explain insert into newaccount select * from account where id not > in (select id from account); > > QUERY PLAN > > ----------------------------------------------------------------------------------------- > Insert on newaccount (cost=0.00..458800400224.89 rows=2985976 width=5366) > -> Seq Scan on account (cost=0.00..458800400224.89 rows=2985976 > width=5366) > Filter: (NOT (SubPlan 1)) > SubPlan 1 > -> Materialize (cost=0.00..153276.00 rows=150000 width=32) > -> Seq Scan on newaccount (cost=0.00..151500.00 > rows=150000 width=32) > (6 rows) > > This works fine, though: > > => explain insert into newaccount select * from account where id in > (select id from account except select id from newaccount); > > QUERY PLAN > > ------------------------------------------------------------------------------------------------------------------------ > Insert on newaccount (cost=3749026.24..14034547.77 rows=2985976 > width=5366) > -> Hash Join (cost=3749026.24..14034547.77 rows=2985976 width=5366) > Hash Cond: (account.id = "ANY_subquery".id) > -> Seq Scan on account (cost=0.00..2263744.51 rows=5971951 > width=5366) > -> Hash (cost=3633552.85..3633552.85 rows=5971951 width=32) > -> Subquery Scan on "ANY_subquery" > (cost=3543223.59..3633552.85 rows=5971951 width=32) > -> SetOp Except (cost=3543223.59..3573833.34 > rows=5971951 width=19) > -> Sort (cost=3543223.59..3558528.47 > rows=6121951 width=19) > Sort Key: "*SELECT* 1".id > -> Append (cost=0.00..2476464.02 > rows=6121951 width=19) > -> Subquery Scan on "*SELECT* > 1" (cost=0.00..2323464.02 rows=5971951 width=19) > -> Seq Scan on account > (cost=0.00..2263744.51 rows=5971951 width=19) > -> Subquery Scan on "*SELECT* > 2" (cost=0.00..153000.00 rows=150000 width=32) > -> Seq Scan on > newaccount (cost=0.00..151500.00 rows=150000 width=32) > (14 rows) > > This is all in PG 9.1. > > This isn't a big deal as there's a straightforward workaround, but I > am curious what happened here. Googling turns up various mentions of > "NOT IN" with poor plans that involve subplans. Then again I have > read mention of hash anti-join which seems appropriate here(?), but > that wasn't used here (nor was the latter join used, though for > whatever reason it looks like a more complex/deeper plan tree than I > had expected, so maybe it was out of the plan generator's reach?). > > E.g. the following mentions cranking up work_mem, but I probably can't > crank up work_mem to meet the requirements of this example, and even > if it doesn't fit in memory, it'd be nice for the planner to not > degenerate to a pathological plan and still execute this join > efficiently while spilling to and from disk. > > > http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2 > > Thanks! > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >