Kyle Bateman wrote:

Tom Lane wrote:


Before 8.2 the optimizer has no ability to rearrange the order of outer
joins.  Do you have time to try your test case against CVS HEAD?
I've done some more refinement on my accounting ledger system that has clarified some of the problems I was having with performance joining to a union. Here's an interesting test case. I just tried it with PG 8.2beta1. I don't claim to understand the new feature of reordering queries very well, but it seems like this is related to that feature. Since its still a performance problem in 8.2, I thought this might be a helpful test case during beta.

To see this demo, run create.sql on a clean database. It will create all the needed tables, views and test data.

Then run q1 and q2 which are very efficient. Then q3 is the slow one. The reason is, it does the union, producing 300,000 records before trying to select by project. It seems like the optimizer should internally rewrite the query to look more like what is in q4 (which is very fast).

Is there a way to make the optimizer do this?

Kyle Bateman

-- Make test schema for demonstrating how the postgres optimizer might improve
-- performance on joins with unions

-- Contains a record for each project (job-costing) code
-- Projects are arranged in a hierarchical structure (parent/child)
-- --------------------------------------------------------------
create table proj (
    proj_id		int4 primary key,
    title		varchar,
    par			int4 references proj on update cascade
);
create index i_proj_par on proj (par);

-- Contains a record for every 2 combinations of projects which are related
-- to each other in the hierarchical project tree 
-- (parent/child, ancestor/progenitor, etc.)
-- --------------------------------------------------------------
create table proj_rel (
    anst_id		int4 references proj on update cascade on delete cascade,	-- ancestor project number
    prog_id		int4 references proj on update cascade on delete cascade,	-- progenitor project number
    rel			int4,								-- 0=self, 1=child, 2=grandchild, etc.
    primary key (anst_id, prog_id)
);

-- Contains a record for each account number and an optional alpha code to identify a sub-ledger
-- --------------------------------------------------------------
create table acct (
    acct_id		int4 primary key,	-- account number
    title		varchar,		-- name of the account
    code		varchar			-- alpha code for the account
);
create index i_acct_code on acct (code);

-- Each sub-ledger contains transactions unique to a certain part of the business
-- In addiiton to the standard fields, they all share in common, each sub-ledger
-- contains additional fields that are unique to it (so they can not all be
-- stored in a single table).  In our actual implementation, these sub-ledgers
-- are actually implemented as views joining even lower level tables.
-- --------------------------------------------------------------
create table subledg_A (
    rid			int4 primary key,		-- record ID
    amount		numeric(14,2),
    proj		int4 references proj on update cascade on delete cascade,
    unique_A		varchar				-- some other data
);
create index i_subledg_A_proj on subledg_A (proj);
-- --------------------------------------------------------------
create table subledg_B (
    rid			int4 primary key,		-- record ID
    amount		numeric(14,2),
    proj		int4 references proj on update cascade on delete cascade,
    unique_B		varchar				-- some other data
);
create index i_subledg_B_proj on subledg_B (proj);
-- --------------------------------------------------------------
create table subledg_C (
    rid			int4 primary key,		-- record ID
    amount		numeric(14,2),
    proj		int4 references proj on update cascade on delete cascade,
    unique_C		varchar				-- some other data
);
create index i_subledg_C_proj on subledg_C (proj);

-- These views allow a standard account code to presented in the appropriate ledgers
-- --------------------------------------------------------------
create view subview_A as select
   'AP ' || rid as trans_id,
    l.amount, l.proj,
    a.acct_id as acct
    from	subledg_A	l
        join	acct		a on a.code = 'ap';
-- --------------------------------------------------------------
create view subview_B as select
   'AR ' || rid as trans_id,
    l.amount, l.proj,
    a.acct_id as acct
    from	subledg_B	l
        join	acct		a on a.code = 'ar';
-- --------------------------------------------------------------
create view subview_C as select
   'PR ' || rid as trans_id,
    l.amount, l.proj,
    a.acct_id as acct
    from	subledg_C	l
        join	acct		a on a.code = 'pr';

-- General ledger - this should contain all transactions from all subledgers
-- --------------------------------------------------------------
create view gen_ledg as
    select trans_id, amount, proj, acct from subview_A
    union 
    select trans_id, amount, proj, acct from subview_B
    union 
    select trans_id, amount, proj, acct from subview_C;

-- Populate the project table:
insert into proj (proj_id,title,par) values ( 1, 'The main parent project', null);
insert into proj (proj_id,title,par) values ( 2, 'First  child of 1', 1);
insert into proj (proj_id,title,par) values ( 3, 'Second child of 1', 1);
insert into proj (proj_id,title,par) values ( 4, 'First  child of 2', 2);
insert into proj (proj_id,title,par) values ( 5, 'second child of 2', 2);
insert into proj (proj_id,title,par) values ( 6, 'First  child of 5', 5);
insert into proj (proj_id,title,par) values ( 7, 'Second child of 5', 5);
insert into proj (proj_id,title,par) values ( 8, 'Third  child of 5', 5);
insert into proj (proj_id,title,par) values ( 9, 'Fourth child of 5', 5);
insert into proj (proj_id,title,par) select *,'Sample project',1 from generate_series(10,5000);

-- Populate the project relationships table:
insert into proj_rel (anst_id,prog_id,rel) select proj_id, proj_id, 0 from proj;		-- self
insert into proj_rel (anst_id,prog_id,rel) select par, proj_id, 1 from proj where proj_id != 1;	-- parents
insert into proj_rel (anst_id,prog_id,rel) select 1, proj_id, 2 from proj where proj_id in (4,5);
insert into proj_rel (anst_id,prog_id,rel) select 1, proj_id, 3 from proj where proj_id in (6,7,8,9);
insert into proj_rel (anst_id,prog_id,rel) select 2, proj_id, 2 from proj where proj_id in (6,7,8,9);

-- Populate the account table:
insert into acct (acct_id,title,code) values ( 100, 'Account 100', 'cash');
insert into acct (acct_id,title,code) values ( 101, 'Account 101', null);
insert into acct (acct_id,title,code) values ( 102, 'Account 102', null);
insert into acct (acct_id,title,code) values ( 103, 'Account 103', 'ar');
insert into acct (acct_id,title,code) values ( 104, 'Account 104', null);
insert into acct (acct_id,title,code) values ( 105, 'Account 105', null);
insert into acct (acct_id,title,code) values ( 106, 'Account 106', 'ap');
insert into acct (acct_id,title,code) values ( 107, 'Account 107', null);
insert into acct (acct_id,title,code) values ( 108, 'Account 108', 'pr');
insert into acct (acct_id,title,code) values ( 109, 'Account 109', null);

insert into subledg_A (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from generate_series(1,100000);
insert into subledg_B (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from generate_series(1,100000);
insert into subledg_C (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from generate_series(1,100000);
vacuum analyze;
-- When querying the general ledger for all transactions belonging to project 5,
-- this first query is very fast.  In spite of the union, the optimizer seems to apply 
-- the condition "proj = 5" to the inner tables first, and then append the results

--explain analyze
select * from gen_ledg where proj = 5;
-- It is also very fast when trying to find all transactions for the progeny of 5 
-- projects (5,6,7,8,9), but only if you are only querying a sub-ledger directly:

--explain analyze
select lg.*
    from 	subledg_A	lg
    join	proj_rel	pr	on pr.prog_id = lg.proj
    where	pr.anst_id = 5;
-- But when searching the general ledger for transactions belonging to the progeny 
-- of project 5, the optimizer produces the union first.  It then has to process 
-- 300,000 records to find the few it needs (even though the desired fields are 
-- indexed in the underlying tables).

--explain analyze
select lg.*
    from 	gen_ledg	lg
    join	proj_rel	pr	on pr.prog_id = lg.proj
    where	pr.anst_id = 5;
-- It would be nice if the optimizer could rewrite the query to work like this:
-- (which is very fast)

--explain analyze
select lg.*
    from 	subledg_A	lg
    join	proj_rel	pr	on pr.prog_id = lg.proj
    where	pr.anst_id = 5

union select lg.*
    from 	subledg_B	lg
    join	proj_rel	pr	on pr.prog_id = lg.proj
    where	pr.anst_id = 5

union select lg.*
    from 	subledg_C	lg
    join	proj_rel	pr	on pr.prog_id = lg.proj
    where	pr.anst_id = 5
;
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

Reply via email to