Re: [PERFORM] Sorted group by

Matthew Wakeling Wed, 11 Aug 2010 06:47:02 -0700

Original query:

explain analyse select * from tracker where objectid < 1200000;
                          QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using tracker_objectid on tracker
        (cost=0.00..915152.62 rows=3684504 width=33)
        (actual time=0.061..5402.608 rows=3790872 loops=1)
   Index Cond: (objectid < 1200000)
 Total runtime: 9134.362 ms
(3 rows)


On Tue, 10 Aug 2010, hubert depesz lubaczewski wrote:

select distinct on (group) *
from table
order by group desc, number desc;

This solution is rather obvious, and works on older versions of Postgres.Thanks. However, the burden of sorting by two columns (actually, in ourapplication the group is two column, so sorting by three columns instead)makes this significantly slower than just copying the whole data throughour application (which effectively does a hash aggregation).

explain analyse select distinct on (objectid, fieldname) objectid,fieldname, sourcename, version from tracker where objectid < 1200000 orderby objectid, fieldname, version desc;


                            QUERY PLAN
--------------------------------------------------------------------------
 Unique  (cost=1330828.11..1357953.05 rows=361666 width=34)
         (actual time=12815.878..22452.737 rows=1782996 loops=1)
   ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34)
             (actual time=12815.873..16608.903 rows=3790872 loops=1)
         Sort Key: objectid, fieldname, version
         Sort Method:  quicksort  Memory: 420980kB
         ->  Index Scan using tracker_objectid on tracker
             (cost=0.00..936861.47 rows=3616658 width=34)
             (actual time=0.061..5441.050 rows=3790872 loops=1)
               Index Cond: (objectid < 1200000)
 Total runtime: 24228.724 ms
(7 rows)

On Tue, 10 Aug 2010, Thomas Kellerer wrote:

select group_, value_
from (
select group_, value_, number_, row_number() over (partition by group_order by value_ desc) as row_num
 from numbers
) t
where row_num = 1
order by group_ desc

This looks quite cute, however it is slightly slower than the DISTINCT ONapproach.

explain analyse select objectid, fieldname, sourcename from (selectobjectid, fieldname, sourcename, version, row_number() over (partition byobjectid, fieldname order by version desc) as row_num from tracker whereobjectid < 1200000) as t where row_num = 1;

                           QUERY PLAN
-------------------------------------------------------------------------
 Subquery Scan t  (cost=1330828.11..1457411.14 rows=18083 width=68)
                  (actual time=12835.553..32220.075 rows=1782996 loops=1)
   Filter: (t.row_num = 1)
   ->  WindowAgg  (cost=1330828.11..1412202.92 rows=3616658 width=34)
                  (actual time=12835.541..26471.802 rows=3790872 loops=1)
         ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34)
                   (actual time=12822.560..16646.112 rows=3790872 loops=1)
               Sort Key: tracker.objectid, tracker.fieldname, tracker.version
               Sort Method:  quicksort  Memory: 420980kB
               ->  Index Scan using tracker_objectid on tracker
                   (cost=0.00..936861.47 rows=3616658 width=34)
                   (actual time=0.067..5433.790 rows=3790872 loops=1)
                     Index Cond: (objectid < 1200000)
 Total runtime: 34002.828 ms
(9 rows)

On Tue, 10 Aug 2010, Kevin Grittner wrote:

select group, value from tbl x
 where not exists
       (select * from tbl y
         where y.group = x.group and y.number > x.number);


This is a join, which is quite a bit slower:

explain analyse select objectid, fieldname, sourcename from tracker as awhere not exists(select * from tracker as b where a.objectid = b.objectidand a.fieldname = b.fieldname and a.version < b.version and b.objectid <1200000) and a.objectid < 1200000;


                               QUERY PLAN
---------------------------------------------------------------------------
 Merge Anti Join  (cost=2981427.73..3042564.32 rows=2411105 width=30)
                  (actual time=24834.372..53939.131 rows=1802376 loops=1)
   Merge Cond: ((a.objectid = b.objectid) AND (a.fieldname = b.fieldname))
   Join Filter: (a.version < b.version)
   ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=34)
             (actual time=12122.478..15944.255 rows=3790872 loops=1)
         Sort Key: a.objectid, a.fieldname
         Sort Method:  quicksort  Memory: 420980kB
         ->  Index Scan using tracker_objectid on tracker a
             (cost=0.00..1096747.23 rows=3616658 width=34)
             (actual time=0.070..5403.235 rows=3790872 loops=1)
               Index Cond: (objectid < 1200000)
   ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=17)
             (actual time=12710.564..20952.841 rows=8344994 loops=1)
         Sort Key: b.objectid, b.fieldname
         Sort Method:  quicksort  Memory: 336455kB
         ->  Index Scan using tracker_objectid on tracker b
             (cost=0.00..1096747.23 rows=3616658 width=17)
             (actual time=0.084..5781.844 rows=3790872 loops=1)
               Index Cond: (objectid < 1200000)
 Total runtime: 55756.482 ms
(14 rows)

On Tue, 10 Aug 2010, Jonathan Blitz wrote:

Select groupfield,value
From tbl x1
Where number = (select max(number) from tbl x2 where x2.groupfield=
x1.groupfield)


This one effectively forces a nested loop join:

explain analyse select objectid, fieldname, sourcename from tracker as awhere version = (select max(version) from tracker as b where a.objectid =b.objectid and a.fieldname = b.fieldname) and a.objectid < 1200000;


                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using tracker_objectid on tracker a
       (cost=0.00..58443381.75 rows=18083 width=34)
       (actual time=6.482..59803.225 rows=1802376 loops=1)
   Index Cond: (objectid < 1200000)
   Filter: (version = (SubPlan 2))
   SubPlan 2
     ->  Result  (cost=15.89..15.90 rows=1 width=0)
                 (actual time=0.011..0.012 rows=1 loops=3790872)
           InitPlan 1 (returns $2)
             ->  Limit  (cost=0.00..15.89 rows=1 width=4)
                        (actual time=0.007..0.008 rows=1 loops=3790872)
                   ->  Index Scan Backward using tracker_all on tracker b
                        (cost=0.00..31.78 rows=2 width=4)
                        (actual time=0.005..0.005 rows=1 loops=3790872)
                         Index Cond: (($0 = objectid) AND ($1 = fieldname))
                         Filter: (version IS NOT NULL)
 Total runtime: 61649.116 ms
(11 rows)

On Tue, 10 Aug 2010, Jonathan Blitz wrote:

Select groupfield,value
From tbl x1
Where (groupfield,number) in (select groupfield,max(number) from tbl group
by groupfield)


This is another join.

explain analyse select objectid, fieldname, sourcename from tracker where(objectid, fieldname, version) in (select objectid, fieldname,max(version) from tracker group by objectid, fieldname);


I terminated this query after about an hour. Here is the EXPLAIN:

                              QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=55310973.80..72060974.32 rows=55323 width=30)
   Hash Cond: ((public.tracker.objectid = public.tracker.objectid)
           AND (public.tracker.fieldname = public.tracker.fieldname)
           AND (public.tracker.version = (max(public.tracker.version))))
   ->  Seq Scan on tracker  (cost=0.00..5310600.80 rows=293972480 width=34)
   ->  Hash  (cost=54566855.96..54566855.96 rows=29397248 width=40)
         ->  GroupAggregate
             (cost=50965693.08..54272883.48 rows=29397248 width=17)
               ->  Sort
                   (cost=50965693.08..51700624.28 rows=293972480 width=17)
                     Sort Key: public.tracker.objectid, public.tracker.fieldname
                     ->  Seq Scan on tracker
                         (cost=0.00..5310600.80 rows=293972480 width=17)
(8 rows)

Matthew

--
I quite understand I'm doing algebra on the blackboard and the usual response
is to throw objects...  If you're going to freak out... wait until party time
and invite me along                     -- Computer Science Lecturer

--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Sorted group by

Reply via email to