Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha



On Wed, 31 Mar 2010, Faheem Mitha wrote:


On Tue, 30 Mar 2010, Robert Haas wrote:


On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha fah...@email.unc.edu wrote:



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.


There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?


Certainly, but I don't know what you and Robert have in mind, and I'm not
experienced enough to make an educated guess. I'm open to specific
suggestions.


Try creating an index on geno on the columns that are being used for the 
join.


Ok, I'll try that. I guess the cols in question on geno are idlink_id and 
anno_id. I thought that I already had indexes on them, but no. Maybe I had 
indexes, but removed them.


Looking at this more closely, idlink_id and anno_id are primary keys, so 
they already have indexes on them, so my understanding (from the docs) is 
that there is no purpose in creating more. That's why I removed the indexes 
that were there (back last August, actually, according to my logs). Anyway, 
it doesn't look like there is anything I can do here. Does anyone have 
additions or corrections to this?


  Regards, Faheem.
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Eliot Gable
On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha fah...@email.unc.edu wrote:


 Looking at this more closely, idlink_id and anno_id are primary keys, so
 they already have indexes on them, so my understanding (from the docs) is
 that there is no purpose in creating more. That's why I removed the indexes
 that were there (back last August, actually, according to my logs). Anyway,
 it doesn't look like there is anything I can do here. Does anyone have
 additions or corrections to this?


When you do a join, you typically have a foreign key in one table
referencing a primary key in another table. While designating a foreign key
does put a constraint on the key to ensure referential integrity, it does
not put an index on the column that is being designated as a foreign key. If
I understand correctly, the scan done as the inner side of a nested loop
join is going to be over your foreign key column, not your primary
key column. Thus, if you have no index on the foreign key column, you will
be forced to do a sequential table scan to do the join. In that case the
hash-based join will almost certainly be faster (especially for such a large
number of rows). If you put an index on the foreign key, then the inner scan
can be an index scan and that might turn out to be faster than building the
hash indexes on all the table rows.

Somebody can correct me if I'm wrong.
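
(For illustration, a minimal sketch with made-up names -- in PostgreSQL,
declaring a foreign key does not create an index on the referencing column,
so it has to be added explicitly:

    -- child.parent_id REFERENCES parent(id); the FK constraint alone adds no index
    CREATE INDEX child_parent_id_idx ON child (parent_id);

With such an index in place, the planner at least has the option of an index
scan on the referencing column for the inner side of a nested loop join.)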


-- 
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our
children. ~David Brower

I decided the words were too conservative for me. We're not borrowing from
our children, we're stealing from them--and it's not even considered to be a
crime. ~David Brower

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
live to eat.) ~Marcus Tullius Cicero


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha


Hi Eliot,

Thanks for the comment.

On Thu, 1 Apr 2010, Eliot Gable wrote:


On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha fah...@email.unc.edu wrote:


Looking at this more closely, idlink_id and anno_id are primary keys, so 
they already have indexes on them, so my understanding (from the docs) is 
that there is no purpose in creating more. That's why I removed the indexes 
that were there (back last August, actually, according to my logs). 
Anyway, it doesn't look like there is anything I can do here. Does anyone 
have additions or corrections to this?


When you do a join, you typically have a foreign key in one table 
referencing a primary key in another table. While designating a foreign 
key does put a constraint on the key to ensure referential integrity, it 
does not put an index on the column that is being designated as a 
foreign key. If I understand correctly, the scan done as the inner side 
of a nested loop join is going to be over your foreign key 
column, not your primary key column. Thus, if you have no index on the 
foreign key column, you will be forced to do a sequential table scan to 
do the join. In that case the hash-based join will almost certainly be 
faster (especially for such a large number of rows). If you put an index 
on the foreign key, then the inner scan can be an index scan and that 
might turn out to be faster than building the hash indexes on all the 
table rows.



Somebody can correct me if I'm wrong.


I had set the foreign keys in question (on the geno table) to be primary 
keys. This is because this setup is basically a glorified spreadsheet, and 
I don't want more than one cell corresponding to a particular tuple of 
idlink.id and anno.id (the conceptual rows and cols). Since a primary key 
defines an index, I thought putting indexes on idlink_id and anno_id was 
redundant. However, it looks like (unsurprisingly) the index corresponding 
to the primary key is across both columns, which may not be what is wanted 
for the aforesaid join. I.e.


ALTER TABLE ONLY geno ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id)
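
One way to double-check which indexes actually exist on geno is psql's \d
geno, or (a sketch using the standard pg_indexes view):

    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'hapmap' AND tablename = 'geno';

In my case this should show only the composite geno_pkey index.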

(As a side comment, with respect to the indexes on the other side of the 
joins, in one case, we have idlink.id = geno.idlink_id, and idlink.id is a 
primary key too. In the other, namely geno.anno_id = 
dedup_patient_anno.id, dedup_patient_anno is a CTE, so no index on 
dedup_patient_anno.id. But maybe indexes aren't needed there.)


Here is the join

   SELECT decode_genotype(geno.snpval_id, %(allelea)s, %(alleleb)s) AS g,
          geno.idlink_id, geno.anno_id
   FROM geno
   INNER JOIN dedup_patient_anno
           ON geno.anno_id = dedup_patient_anno.id
   INNER JOIN idlink
           ON geno.idlink_id = idlink.id
   ORDER BY idlink_id, anno_id

Here is the table dump.


-- Name: geno; Type: TABLE; Schema: hapmap; Owner: snp; Tablespace:
--
CREATE TABLE geno (
    idlink_id integer NOT NULL,
    anno_id integer NOT NULL,
    snpval_id integer NOT NULL
)
WITH (autovacuum_enabled=true);

ALTER TABLE hapmap.geno OWNER TO snp;

--
-- Name: geno_pkey; Type: CONSTRAINT; Schema: hapmap; Owner: snp; Tablespace:
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id);

--
-- Name: geno_anno_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_anno_id_fkey FOREIGN KEY (anno_id) REFERENCES
    anno(id) ON UPDATE CASCADE ON DELETE CASCADE;

--
-- Name: geno_idlink_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_idlink_id_fkey FOREIGN KEY (idlink_id) REFERENCES
    idlink(id) ON UPDATE CASCADE ON DELETE CASCADE;

--
-- Name: geno_snpval_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_snpval_id_fkey FOREIGN KEY (snpval_id) REFERENCES
    snpval(val) ON UPDATE CASCADE ON DELETE CASCADE;


So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?
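
If so, I suppose that would be something like the following (just a sketch;
the index names are made up):

    CREATE INDEX geno_idlink_id_idx ON geno (idlink_id);
    CREATE INDEX geno_anno_id_idx ON geno (anno_id);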

  Regards, Faheem.


--
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our 
children. ~David Brower


I decided the words were too conservative for me. We're not borrowing 
from our children, we're stealing from them--and it's not even 
considered to be a crime. ~David Brower


Nice quotes.

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; 
not live to eat.) ~Marcus Tullius Cicero



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Robert Haas
On Thu, Apr 1, 2010 at 2:15 PM, Faheem Mitha fah...@email.unc.edu wrote:
 I had set the foreign keys in question (on the geno table) to be primary
 keys. This is because this setup is basically a glorified spreadsheet, and I
 don't want more than one cell corresponding to a particular tuple of
 idlink.id and anno.id (the conceptual rows and cols). Since a primary key
 defines an index, I thought putting indexes on idlink_id and anno_id was
 redundant. However, it looks like (unsurprisingly) the index corresponding
 to the primary key is across both columns, which may not be what is wanted
 for the aforesaid join

Actually it is what is wanted - that is good.

 So, should I add indexes on the individual foreign key cols idlink_id
 and anno_id after all?

I doubt that would help.

The bottom line may be that you're dealing with hundreds of millions
of rows here, so things are going to take a long time.  Of course you
can always get more/faster memory, a bigger I/O subsystem, faster
processors...  and it could be that with detailed study there are
optimizations that could be done even without spending money, but I
think I'm about tapped out on what I can do over an Internet mailing
list.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha



On Thu, 1 Apr 2010, Robert Haas wrote:


On Thu, Apr 1, 2010 at 2:15 PM, Faheem Mitha fah...@email.unc.edu wrote:


I had set the foreign keys in question (on the geno table) to be 
primary keys. This is because this setup is basically a glorified 
spreadsheet, and I don't want more than one cell corresponding to a 
particular tuple of idlink.id and anno.id (the conceptual rows and 
cols). Since a primary key defines an index, I thought putting indexes 
on idlink_id and anno_id was redundant. However, it looks like 
(unsurprisingly) the index corresponding to the primary key is across 
both columns, which may not be what is wanted for the aforesaid join



Actually it is what is wanted - that is good.


I see.


So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?


I doubt that would help.


You're sure of this?

The bottom line may be that you're dealing with hundreds of millions of 
rows here, so things are going to take a long time.  Of course you can 
always get more/faster memory, a bigger I/O subsystem, faster 
processors...  and it could be that with detailed study there are 
optimizations that could be done even without spending money, but I 
think I'm about tapped out on what I can do over an Internet mailing 
list.


Thanks for your assistance, Robert. It's been educational.

  Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Eliot Gable
On Thu, Apr 1, 2010 at 3:01 PM, Faheem Mitha fah...@email.unc.edu wrote:


  So, should I add indexes on the individual foreign key cols idlink_id
 and anno_id after all?


 I doubt that would help.


 You're sure of this?



It is always best to test and be certain.


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha



On Thu, 1 Apr 2010, Eliot Gable wrote:




On Thu, Apr 1, 2010 at 3:01 PM, Faheem Mitha fah...@email.unc.edu wrote:

  So, should I add indexes on the individual foreign key cols 
idlink_id
  and anno_id after all?


I doubt that would help.


You're sure of this?


It is always best to test and be certain.


Fair enough. I may also try disabling hash joins and see what happens...
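
Presumably something along these lines (a sketch only -- session-local, and
just for experimentation, not for production use):

    SET enable_hashjoin = off;
    EXPLAIN ANALYZE SELECT ... ;   -- the PED query under test
    RESET enable_hashjoin;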

  Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-31 Thread Matthew Wakeling

On Tue, 30 Mar 2010, Faheem Mitha wrote:

work_mem = 1 GB (see diag.{tex/pdf}).


Sure, but define sane setting, please. I guess part of the point is that I'm 
trying to keep memory low


You're trying to keep memory usage low, but you have work_mem set to 1GB?

Matthew

--
Prove to thyself that all circuits that radiateth and upon which thou worketh
are grounded, lest they lift thee to high-frequency potential and cause thee
to radiate also.  -- The Ten Commandments of Electronics

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-31 Thread Faheem Mitha



On Wed, 31 Mar 2010, Matthew Wakeling wrote:


On Tue, 30 Mar 2010, Faheem Mitha wrote:

work_mem = 1 GB (see diag.{tex/pdf}).


Sure, but define sane setting, please. I guess part of the point is that 
I'm trying to keep memory low


You're trying to keep memory usage low, but you have work_mem set to 1GB?


I'm trying to keep both runtime and memory usage low. I assume that with 
lower levels of memory, the runtime would be longer, other things being 
equal.

 Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-31 Thread Robert Haas
On Wed, Mar 31, 2010 at 6:10 AM, Faheem Mitha fah...@email.unc.edu wrote:

 [If Kevin Grittner reads this, please fix your email address. I am getting
 bounces from your email address.]

 On Tue, 30 Mar 2010, Robert Haas wrote:

 On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha fah...@email.unc.edu
 wrote:

 Sure, but define sane setting, please. I guess part of the point is that I'm
 trying to keep memory low, and it seems this is not part of the planner's
 priorities. That is, it does not take memory usage into consideration when
 choosing a plan. If that is wrong, let me know, but that is my understanding.

 I don't understand quite why you're confused here.  We've already
 explained to you that the planner will not employ a plan that uses
 more than the amount of memory defined by work_mem for each sort or
 hash.

 Typical settings for work_mem are between 1MB and 64MB.  1GB is enormous.

 I don't think I am confused. To be clear, when I said 'it does not take
 memory usage into consideration' I was talking about overall memory usage.
 Let me summarize:

 The planner will choose the plan with the minimum total cost, with the
 constraint that the amount of memory used for each of certain steps is less
 than work_mem. In other words, with k such steps it can use at most

 k(plan)*work_mem

 memory where k(plan) denotes that k is a function of the plan. (I'm assuming
 here that memory is not shared between the different steps). However,
 k(plan)*work_mem is not itself bounded. I fail to see how reducing work_mem
 significantly would help me. This would mean that the current plans I am
 using would likely be ruled out, and I would be left with plans which, by
 definition, would have larger cost and so longer run times. The current
 runtimes are already quite long - for the PED query, the best I can do with
 work_mem=1 GB is 2 1/2 hrs, and that is after splitting the query into two
 pieces.

 I might actually be better off *increasing* the memory, since then the
 planner would have more flexibility to choose plans where the individual
 steps might require more memory, but the overall memory sum might be lower.

OK, your understanding is correct.

 You might need to create some indices, too.

 Ok. To what purpose? This query picks up everything from the
 tables and the planner does table scans, so conventional wisdom
 and indeed my experience, says that indexes are not going to be so
 useful.

 There are situations where scanning the entire table to build up a
 hash table is more expensive than using an index.  Why not test it?

 Certainly, but I don't know what you and Robert have in mind, and I'm not
 experienced enough to make an educated guess. I'm open to specific
 suggestions.

 Try creating an index on geno on the columns that are being used for the
 join.

 Ok, I'll try that. I guess the cols in question on geno are idlink_id and
 anno_id. I thought that I already had indexes on them, but no. Maybe I had
 indexes, but removed them.

 If I understand the way this works, when you request, say, an INNER JOIN, the
 planner can choose different ways/algorithms to do this, as in
 http://en.wikipedia.org/wiki/Join_(SQL)#Nested_loops . It may choose a hash
 join, or a nested loop join, or something else, based on cost. If the
 indexes don't exist, that may make the nested loop join more expensive,
 tipping the balance in favor of using a hash join. However, I have no way to
 control which option it chooses, short of disabling e.g. the hash join
 option, which is not an option for production usage anyway. Correct?

Yep.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-30 Thread Kevin Grittner
Faheem Mitha fah...@email.unc.edu wrote:
 
 If you're concerned about memory usage, try reducing work_mem;
 you've probably got it set to something huge.
 
 work_mem = 1 GB (see diag.{tex/pdf}).
 
 The point isn't that I'm using so much memory. Again, my question
 is, why are these changes affecting memory usage so drastically?
 
Because the planner looks at a very wide variety of plans, some of
which may use many allocations of work_mem size, and some of which
don't.  The costs are compared and the lowest cost one is chosen. If
you are close to the tipping point then even a very small change
might affect which is chosen.  It pays to keep the work_mem setting
sane so that unexpected plan changes don't cause problems.
 
Look at the plans and their costs to get a feel for what's being
chosen and why.  Although it's a very bad idea to use these in
production, you can often shift the plan to something you *think*
would be better using the enable_* settings, to see what the planner
thinks such a plan will cost and where it thinks the cost would be;
that can help in tuning the settings.
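
(For example -- a sketch only; these settings are session-local and meant
purely for experimentation:

    SET enable_hashjoin = off;   -- or enable_seqscan, enable_nestloop, ...
    EXPLAIN SELECT ... ;         -- note the estimated cost of the forced plan
    RESET enable_hashjoin;

Comparing the estimated costs with and without the setting shows where the
planner thinks the expense lies.)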
 
 You might need to create some indices, too.
 
 Ok. To what purpose? This query picks up everything from the
 tables and the planner does table scans, so conventional wisdom
 and indeed my experience, says that indexes are not going to be so
 useful.
 
There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-30 Thread Faheem Mitha



On Tue, 30 Mar 2010, Kevin Grittner wrote:


Faheem Mitha fah...@email.unc.edu wrote:


If you're concerned about memory usage, try reducing work_mem;
you've probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question
is, why are these changes affecting memory usage so drastically?


Because the planner looks at a very wide variety of plans, some of
which may use many allocations of work_mem size, and some of which
don't.  The costs are compared and the lowest cost one is chosen. If
you are close to the tipping point then even a very small change
might affect which is chosen.  It pays to keep the work_mem setting
sane so that unexpected plan changes don't cause problems.


Sure, but define sane setting, please. I guess part of the point is that 
I'm trying to keep memory low, and it seems this is not part of the 
planner's priorities. That is, it does not take memory usage into 
consideration when choosing a plan. If that is wrong, let me know, but 
that is my understanding.



Look at the plans and their costs to get a feel for what's being
chosen and why.  Although it's a very bad idea to use these in
production, you can often shift the plan to something you *think*
would be better using the enable_* settings, to see what the planner
thinks such a plan will cost and where it thinks the cost would be;
that can help in tuning the settings.


Right. You mean to close off certain options to the planner using 'Planner 
Method Configuration'. I suppose one can also use 'Planner Cost Constants' 
to alter plan behaviour. I haven't tried changing these.
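
If I do experiment with the cost constants, I imagine it would be something
like this (a sketch; the value is purely illustrative, not a recommendation):

    SET random_page_cost = 2.0;   -- default is 4.0; lower values make index scans look cheaper
    EXPLAIN SELECT ... ;
    RESET random_page_cost;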



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.


There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?


Certainly, but I don't know what you and Robert have in mind, and I'm not 
experienced enough to make an educated guess. I'm open to specific 
suggestions.


 Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-30 Thread Faheem Mitha


One thing which I haven't really mentioned in this thread or in my writeup 
is that the planner's estimate of the number of rows in geno is way off base 
some of the time. It is around 800 million; the planner thinks it is 100 
million. I don't know if this is significant or not, or what to do about it.


E.g. in the ped_bigjoin EXPLAIN ANALYZE VERBOSE:

  ->  Sort  (cost=56855882.72..57144683.54 rows=115520330 width=42)
      (actual time=23027732.092..37113627.380 rows=823086774 loops=1)
Output: (CASE WHEN (hapmap.geno.snpval_id = (-1)) THEN '0 0'::text 
WHEN (hapmap.geno.snpval_id = 0) THEN 
(((dedup_patient_anno.allelea_id)::text || ' '::text) || 
(dedup_patient_anno.allelea_id)::text) WHEN (hapmap.geno.snpval_id = 1) 
THEN (((dedup_patient_anno.allelea_id)::text || ' '::text) || 
(dedup_patient_anno.alleleb_id)::text) WHEN (hapmap.geno.snpval_id = 2) 
THEN (((dedup_patient_anno.alleleb_id)::text || ' '::text) || 
(dedup_patient_anno.alleleb_id)::text) ELSE NULL::text END), 
hapmap.geno.idlink_id, hapmap.geno.anno_id, pheno.patientid, 
pheno.phenotype, sex.code


  Faheem.

On Tue, 30 Mar 2010, Kevin Grittner wrote:


Faheem Mitha fah...@email.unc.edu wrote:


If you're concerned about memory usage, try reducing work_mem;
you've probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question
is, why are these changes affecting memory usage so drastically?


Because the planner looks at a very wide variety of plans, some of
which may use many allocations of work_mem size, and some of which
don't.  The costs are compared and the lowest cost one is chosen. If
you are close to the tipping point then even a very small change
might affect which is chosen.  It pays to keep the work_mem setting
sane so that unexpected plan changes don't cause problems.

Look at the plans and their costs to get a feel for what's being
chosen and why.  Although it's a very bad idea to use these in
production, you can often shift the plan to something you *think*
would be better using the enable_* settings, to see what the planner
thinks such a plan will cost and where it thinks the cost would be;
that can help in tuning the settings.


You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.


There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?

-Kevin




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-30 Thread Robert Haas
On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha fah...@email.unc.edu wrote:
 Sure, but define sane setting, please. I guess part of the point is that I'm
 trying to keep memory low, and it seems this is not part of the planner's
 priorities. That is, it does not take memory usage into consideration when
 choosing a plan. If that is wrong, let me know, but that is my
 understanding.

I don't understand quite why you're confused here.  We've already
explained to you that the planner will not employ a plan that uses
more than the amount of memory defined by work_mem for each sort or
hash.

Typical settings for work_mem are between 1MB and 64MB.  1GB is enormous.
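
For concreteness, a sketch only (the value is illustrative): work_mem can be
changed per session, so this can be tested without touching postgresql.conf:

    SET work_mem = '64MB';
    EXPLAIN ANALYZE SELECT ... ;   -- does the plan, or the memory use, change?
    RESET work_mem;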

 You might need to create some indices, too.

 Ok. To what purpose? This query picks up everything from the
 tables and the planner does table scans, so conventional wisdom
 and indeed my experience, says that indexes are not going to be so
 useful.

 There are situations where scanning the entire table to build up a
 hash table is more expensive than using an index.  Why not test it?

 Certainly, but I don't know what you and Robert have in mind, and I'm not
 experienced enough to make an educated guess. I'm open to specific
 suggestions.

Try creating an index on geno on the columns that are being used for the join.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Robert Haas
On Thu, Mar 25, 2010 at 3:57 PM, Faheem Mitha fah...@email.unc.edu wrote:

 Hi everyone,

 I've been trying to reduce both memory usage and runtime for a query.
 Comments/suggestions gratefully received. Details are at

 http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

 See particularly Section 1 - Background and Discussion.

 If you want a text version, see

 http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

 For background see

 http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version
 http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and
 http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf

 Please CC any replies to me at the above email address. Thanks.

Didn't you (or someone) post about these queries before?

It's not really too clear to me from reading this what specific
questions you're trying to answer.  One random thought: WHERE
row_number() = 1 is not too efficient.  Try using LIMIT or DISTINCT ON
instead.
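
Roughly like this -- a sketch with made-up names, since I don't know the
exact shape of your CTE:

    -- instead of numbering rows and keeping WHERE row_number() = 1:
    SELECT DISTINCT ON (key_col) key_col, other_col
    FROM some_table
    ORDER BY key_col, some_ordering_col;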

If you're concerned about memory usage, try reducing work_mem; you've
probably got it set to something huge.

You might need to create some indices, too.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Faheem Mitha



On Mon, 29 Mar 2010, Robert Haas wrote:


On Thu, Mar 25, 2010 at 3:57 PM, Faheem Mitha fah...@email.unc.edu wrote:


Hi everyone,

I've been trying to reduce both memory usage and runtime for a query.
Comments/suggestions gratefully received. Details are at

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

See particularly Section 1 - Background and Discussion.

If you want a text version, see

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

For background see

http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version
http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and
http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf

Please CC any replies to me at the above email address. Thanks.


Didn't you (or someone) post about these queries before?


I did write to the list about an earlier version of these queries, yes. In 
fact you replied to that message.



It's not really too clear to me from reading this what specific
questions you're trying to answer.


Quote from opt.{tex/pdf}, Section 1:

If I have to I can use Section~\ref{ped_hybrid} and 
Section~\ref{tped_hybrid}, but I am left wondering why I get the 
performance I do out of the earlier versions. Specifically, why is 
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and 
why does the memory usage in Section~\ref{ped_phenoout} blow up relative 
to Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?



One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.


Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.

If you're concerned about memory usage, try reducing work_mem; you've 
probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why 
are these changes affecting memory usage so drastically?



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the tables and 
the planner does table scans, so conventional wisdom, and indeed my 
experience, says that indexes are not going to be so useful.


Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Robert Haas
On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha fah...@email.unc.edu wrote:
 It's not really too clear to me from reading this what specific
 questions you're trying to answer.

 Quote from opt.{tex/pdf}, Section 1:

 If I have to I can use Section~\ref{ped_hybrid} and
 Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
 I do out of the earlier versions. Specifically, why is
 Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
 why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
 Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?

Here and in the document, you refer to section numbers for the
hybrid version but I don't see where you define what the hybrid
version actually is.  And the differences between your queries are not
really clear either - first you say you took out pheno and sex because
they weren't necessary, but then you decide to put them back.  I don't
know what that means.  If they're not necessary, leave them out.

 One random thought: WHERE row_number() = 1 is not too efficient.
 Try using LIMIT or DISTINCT ON instead.

 Possibly. However, the CTE that uses

 WHERE row_number() = 1

 doesn't dominate the runtime or memory usage, so I'm not too concerned
 about it.

Hmm, you might be right.

 If you're concerned about memory usage, try reducing work_mem; you've
 probably got it set to something huge.

 work_mem = 1 GB (see diag.{tex/pdf}).

 The point isn't that I'm using so much memory. Again, my question is, why
 are these changes affecting memory usage so drastically?

Well each sort or hash can use an amount of memory that is limited
from above by work_mem.  So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead.  However, it doesn't look like any of
your queries include 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space.  RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.

 You might need to create some indices, too.

 Ok. To what purpose? This query picks up everything from the tables and the
 planner does table scans, so conventional wisdom and indeed my experience,
 says that indexes are not going to be so useful.

Well, a hash join is not usually the first thing that pops to mind
when dealing with a table that has 825 million rows (geno).  I don't
know if a nested loop with inner-indexscan would be faster, but it
would almost certainly use less memory.
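
Roughly -- and only if geno has a usable index on the join column -- the
plan shape would be something like this (a sketch, not output from your
actual query; the index name is made up):

    Nested Loop
      ->  Seq Scan on idlink
      ->  Index Scan using geno_idlink_id_idx on geno
            Index Cond: (geno.idlink_id = idlink.id)

That is, for each idlink row the matching geno rows are fetched through the
index, instead of hashing all of geno into memory first.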

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Faheem Mitha



On Mon, 29 Mar 2010, Robert Haas wrote:


On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha fah...@email.unc.edu wrote:

It's not really too clear to me from reading this what specific
questions you're trying to answer.


Quote from opt.{tex/pdf}, Section 1:

If I have to I can use Section~\ref{ped_hybrid} and
Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
I do out of the earlier versions. Specifically, why is
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?


Here and in the document, you refer to section numbers for the
hybrid version but I don't see where you define what the hybrid
version actually is.


It is defined later in the file. I don't know if you are looking at the 
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text 
file, I guess the easiest way would be to grep for the label ped_hybrid.


And the differences between your queries are not real clear either - 
first you say you took out pheno and sex because they weren't necessary, 
but then you decide to put them back.  I don't know what that means. 
If they're not necessary, leave them out.


I don't see where I say that pheno and sex weren't necessary. In fact, the 
word 'necessary' does not appear in the opt document. I took them out to 
see how it would affect performance. Which it does, dramatically. I say


So, I decided to remove the joins to tables corresponding to the patient 
data, namely pheno and sex, and the runtime dropped to 150 min, while the 
memory stayed around 5G.


Maybe I wasn't being sufficiently explicit here. Perhaps

So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance...

would have been better.


One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.


Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.


Hmm, you might be right.


If you're concerned about memory usage, try reducing work_mem; you've
probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why
are these changes affecting memory usage so drastically?


Well each sort or hash can use an amount of memory that is limited
from above by work_mem.  So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead.  However, it doesn't look like any of
your queries including 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space.  RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.



You might need to create some indices, too.



Ok. To what purpose? This query picks up everything from the tables and the
planner does table scans, so conventional wisdom and indeed my experience,
says that indexes are not going to be so useful.


Well, a hash join is not usually the first thing that pops to mind when 
dealing with a table that has 825 million rows (geno).  I don't know if 
a nested loop with inner-indexscan would be faster, but it would almost 
certainly use less memory.


Can you provide an illustration of what you mean? I don't know what a 
nested loop with inner-indexscan is in this context.


   Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] experiments in query optimization

2010-03-25 Thread Faheem Mitha


Hi everyone,

I've been trying to reduce both memory usage and runtime for a query. 
Comments/suggestions gratefully received. Details are at


http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

See particularly Section 1 - Background and Discussion.

If you want a text version, see

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

For background see

http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version 
http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and 
http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf


Please CC any replies to me at the above email address. Thanks.

 Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance