On Mon, 29 Mar 2010, Robert Haas wrote:

On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <fah...@email.unc.edu> wrote:
It's not really too clear to me from reading this what specific
questions you're trying to answer.

Quote from opt.{tex/pdf}, Section 1:

"If I have to I can use Section~\ref{ped_hybrid} and
Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
I do out of the earlier versions. Specifically, why is
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?"

Here and in the document, you refer to section numbers for the
"hybrid" version but I don't see where you define what the "hybrid"
version actually is.

It is defined later in the file. I don't know if you are looking at the pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text file, I guess the easist way would be to grep for the label ped_hybrid.

And the differences between your queries are not real clear either - first you say you took out pheno and sex because they weren't necessary, but then you decide to put them back. I don't know what that means. If they're not necessary, leave them out.

I don't see where I say that pheno and sex weren't necessary. In fact, the word 'necessary' does not appear in the opt document. I took them out to see how it would affect performance. Which is does, dramatically. I say

"So, I decided to remove the joins to tables corresponding to the patient data, namely pheno and sex, and the runtime dropped to 150 min, while the memory stayed around 5G."

Maybe I wasn't being sufficiently explicit here. Perhaps

"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance..."

would have been better.

One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.

Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.

Hmm, you might be right.

If you're concerned about memory usage, try reducing work_mem; you've
probably got it set to something huge.

work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why
are these changes affecting memory usage so drastically?

Well each sort or hash can use an amount of memory that is limited
from above by work_mem.  So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead.  However, it doesn't look like any of
your queries including 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space.  RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.

You might need to create some indices, too.

Ok. To what purpose? This query picks up everything from the tables and the
planner does table scans, so conventional wisdom and indeed my experience,
says that indexes are not going to be so useful.

Well, a hash join is not usually the first thing that pops to mind when dealing with a table that has 825 million rows (geno). I don't know if a nested loop with inner-indexscan would be faster, but it would almost certainly use less memory.

Can you provide an illustration of what you mean? I don't know what a "nested loop with inner-indexscan" is in this context.

                                                       Regards, Faheem.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to