Tom Lane, 24.07.2012 19:12:
Well, it would only help if you're running a PG version that's new
enough to recognize the NOT EXISTS as an anti-join; and even then,
it's possible that joining on a tid column forecloses enough plan
types that you don't get any real benefit.  But I'm just guessing.
Can you show exactly what you tried and what EXPLAIN ANALYZE results
you got?


I am using 9.1.4 (as I said in my initial post).

I finally found a solution that runs fine:

DELETE FROM dupes a
WHERE EXISTS (SELECT 1
              FROM  dupes b
              WHERE b.first_name = a.first_name
                AND b.last_name = a.last_name
                AND b.ctid > a.ctid);

The execution plan for this is:

Delete on public.dupes a  (cost=14575.95..16978.87 rows=25000 width=12) (actual 
time=2419.334..2419.334 rows=0 loops=1)
  Buffers: shared hit=18029
  ->  Merge Semi Join  (cost=14575.95..16978.87 rows=25000 width=12) (actual 
time=2043.674..2392.707 rows=17097 loops=1)
        Output: a.ctid, b.ctid
        Merge Cond: ((a.first_name = b.first_name) AND (a.last_name = 
b.last_name))
        Join Filter: (b.ctid > a.ctid)
        Buffers: shared hit=930
        ->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual 
time=1024.195..1030.051 rows=75000 loops=1)
              Output: a.ctid, a.first_name, a.last_name
              Sort Key: a.first_name, a.last_name
              Sort Method: quicksort  Memory: 8870kB
              Buffers: shared hit=465
              ->  Seq Scan on public.dupes a  (cost=0.00..1215.00 rows=75000 
width=20) (actual time=0.025..23.234 rows=75000 loops=1)
                    Output: a.ctid, a.first_name, a.last_name
                    Buffers: shared hit=465
        ->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual 
time=1019.148..1028.483 rows=105841 loops=1)
              Output: b.ctid, b.first_name, b.last_name
              Sort Key: b.first_name, b.last_name
              Sort Method: quicksort  Memory: 8870kB
              Buffers: shared hit=465
              ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 
width=20) (actual time=0.017..19.133 rows=75000 loops=1)
                    Output: b.ctid, b.first_name, b.last_name
                    Buffers: shared hit=465
Total runtime: 2420.953 ms

Which is a lot better than the plan using "WHERE ctid NOT IN (.....)":

Delete on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) (actual 
time=582515.094..582515.094 rows=0 loops=1)
  Buffers: shared hit=18027
  ->  Seq Scan on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) 
(actual time=1038.164..582332.927 rows=17097 loops=1)
        Output: dupes.ctid
        Filter: (NOT (SubPlan 1))
        Buffers: shared hit=930
        SubPlan 1
          ->  Materialize  (cost=1777.50..1890.00 rows=7500 width=20) (actual 
time=0.001..2.283 rows=35552 loops=75000)
                Output: (min(b.ctid)), b.first_name, b.last_name
                Buffers: shared hit=465
                ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=20) 
(actual time=90.964..120.228 rows=57903 loops=1)
                      Output: min(b.ctid), b.first_name, b.last_name
                      Buffers: shared hit=465
                      ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 
rows=75000 width=20) (actual time=0.008..25.515 rows=75000 loops=1)
                            Output: b.ctid, b.first_name, b.last_name
                            Buffers: shared hit=465
Total runtime: 582517.711 ms

Using "WHERE id NOT IN (...)" is the fastest way:

Delete on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) (actual 
time=187.949..187.949 rows=0 loops=1)
  Buffers: shared hit=18490
  ->  Seq Scan on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) 
(actual time=125.351..171.108 rows=17097 loops=1)
        Output: dupes.ctid
        Filter: (NOT (hashed SubPlan 1))
        Buffers: shared hit=930
        SubPlan 1
          ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=18) (actual 
time=73.131..93.421 rows=57903 loops=1)
                Output: min(b.id), b.first_name, b.last_name
                Buffers: shared hit=465
                ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 
width=18) (actual time=0.004..8.515 rows=75000 loops=1)
                      Output: b.id, b.first_name, b.last_name
                      Buffers: shared hit=465
Total runtime: 189.222 ms

Regards
Thomas



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to