[GENERAL] Performance cost of a sort-merge join

2010-02-22 Thread Yang Zhang
Hi, I have the given tables and query, but I'm confused why the cost
of the query is so high. I've left it running over night. By
comparison, a select * from metarelcloud_transactionlog order by
transactionid takes on the order of seconds/minutes (at least in
MySQL). As far as I can tell, the complex query below basically just
consists of two sorts (three if PG doesn't in fact reuse the leaf
sorts). Why the enormous estimated page IO count (cost) on the second
sort? There are roughly 30 tuples per transactionid, so the join
shouldn't produce a vastly exploded dataset. Thanks in advance.

tpcc=# \d metarelcloud_graph
  Table public.metarelcloud_graph
  Column  | Type  | Modifiers
--+---+---
 tableid1 | character varying(20) | not null
 tupleid1 | integer   | not null
 tableid2 | character varying(20) | not null
 tupleid2 | integer   | not null
 node1| integer   | not null
 node2| integer   | not null
 weight   | integer   | not null
Indexes:
metarelcloud_graph_pkey PRIMARY KEY, btree (tableid1, tupleid1,
tableid2, tupleid2)

tpcc=# \d metarelcloud_transactionlog
   Table
public.metarelcloud_transactionlog
   Column| Type  |
   Modifiers
-+---+--
 id  | integer   | not null default
nextval('metarelcloud_transactionlog_id_seq'::regclass)
 transactionid   | integer   | not null
 queryid | smallint  | not null
 tableid | character varying(30) | not null
 tupleid | integer   | not null
 querytype   | character varying | not null
 graphpartition  | smallint  |
 replicatedpartition | smallint  |
 justifiedpartition  | smallint  |
 hashpartition   | smallint  |
 nodeid  | integer   |
 manualpartition | smallint  |
Indexes:
metarelcloud_transactionlog_pkey PRIMARY KEY, btree (id)
Check constraints:
metarelcloud_transactionlog_graphpartition_check CHECK
(graphpartition = 0)
metarelcloud_transactionlog_hashpartition_check CHECK (hashpartition = 0)
metarelcloud_transactionlog_justifiedpartition_check CHECK
(justifiedpartition = 0)
metarelcloud_transactionlog_manualpartition_check CHECK
(manualpartition = 0)
metarelcloud_transactionlog_querytype_check CHECK
(querytype::text = ANY (ARRAY['select'::character varying,
'insert'::character varying, 'delete'::character varying,
'update'::character varying]::text[]))
metarelcloud_transactionlog_replicatedpartition_check CHECK
(replicatedpartition = 0)

tpcc=# analyze metarelcloud_transactionlog;
ANALYZE

tpcc=# explain insert into metarelcloud_graph (node1, node2, tableid1,
tupleid1, tableid2, tupleid2, weight)
select 0, 0, a.tableid, a.tupleid, b.tableid, b.tupleid, count(*)
from metarelcloud_transactionlog a, metarelcloud_transactionlog b
where a.transactionid = b.transactionid
  and (a.tableid, a.tupleid)  (b.tableid, b.tupleid)
group by a.tableid, a.tupleid, b.tableid, b.tupleid;
  QUERY PLAN
--
 Subquery Scan *SELECT*  (cost=968062444010.30..1088362355018.20
rows=2673331355731 width=180)
   -  GroupAggregate  (cost=968062444010.30..1041579056292.91
rows=2673331355731 width=26)
 -  Sort  (cost=968062444010.30..974745772399.63
rows=2673331355731 width=26)
   Sort Key: a.tableid, a.tupleid, b.tableid, b.tupleid
   -  Merge Join  (cost=16817274.69..160416950669.79
rows=2673331355731 width=26)
 Merge Cond: (a.transactionid = b.transactionid)
 Join Filter: (ROW((a.tableid)::text, a.tupleid) 
ROW((b.tableid)::text, b.tupleid))
 -  Sort  (cost=8408637.34..8534662.95
rows=50410244 width=17)
   Sort Key: a.transactionid
   -  Seq Scan on metarelcloud_transactionlog
a  (cost=0.00..925543.44 rows=50410244 width=17)
 -  Materialize  (cost=8408637.34..9038765.39
rows=50410244 width=17)
   -  Sort  (cost=8408637.34..8534662.95
rows=50410244 width=17)
 Sort Key: b.transactionid
 -  Seq Scan on
metarelcloud_transactionlog b  (cost=0.00..925543.44 rows=50410244
width=17)
(14 rows)

-- 
Yang Zhang
http://www.mit.edu/~y_z/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Performance cost of a sort-merge join

2010-02-22 Thread Tom Lane
Yang Zhang yanghates...@gmail.com writes:
 Hi, I have the given tables and query, but I'm confused why the cost
 of the query is so high.

The reason the estimated cost is so high is that the estimated number of
rows out of the join is enormous.  It's going to take awhile.  One
question worth asking is what you've got work_mem set to.  Another
possible problem is that you're doing a lot of sorts/comparisons on
varchars, which could be pretty expensive if you're using a non-C locale.

 I've left it running over night. By
 comparison, a select * from metarelcloud_transactionlog order by
 transactionid takes on the order of seconds/minutes (at least in
 MySQL).

That's got approximately nothing to do with this query.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Performance cost of a sort-merge join

2010-02-22 Thread Yang Zhang
On Mon, Feb 22, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I've left it running over night. By
 comparison, a select * from metarelcloud_transactionlog order by
 transactionid takes on the order of seconds/minutes (at least in
 MySQL).

 That's got approximately nothing to do with this query.

Isn't that exactly what the leaf sorts are doing? By comparison,
select * from metarelcloud_transactionlog order by transactionid
takes much, much longer in PG (it's been running for 68 minutes now,
and still going, whereas MySQL took 6 minutes).
--
Yang Zhang
http://www.mit.edu/~y_z/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general