Re: [HACKERS] new correlation metric

Greg Stark Sun, 26 Oct 2008 09:04:04 -0700

I haven't look at the patch yet -- I'm actually on a train now. I'msorry if these questions are answered in the patch.


I think there are three questions here:
A) what metric are we going to use
B) how do we estimate thy metric given a sample
C) how do we draw the conclusions we need based on this metric

I looked recently on this topic and I found papers on estimating twometrics based on samples: longest sorted subsequence and number ofsorted subsequence. The latter of which sounds like it might beequivalent to what you have.

I didn't did any papers on drawing conclusions from either of thesemetrics though.

I think we can just look at block number to avoid intra-block disorderconfusing us. It does mean we have effectively a much smaller samplethough.

I don't think the other objection is worth worrying about though. Ifour sample has 15263748 then it may be that readahead will save us inthat particular instance but it's clear the table isn't well clusteredand it's just luck that those blocks were intermixed. In a veryparticular way. The rest of the table is unlikely to be so lucky.



greg

On 26 Oct 2008, at 03:51 PM, Tom Lane <[EMAIL PROTECTED]> wrote:

Martijn van Oosterhout <[EMAIL PROTECTED]> writes:

On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:

I worked with Nathan Boley to come up with what we think is a better
metric for measuring this cost.

I think the code is in the right direction, but I think want you want
is some kind of estimate of "given I've looked for tuple X, how many
tuples in the next k pages are near this one".


ISTM that some experimental studies would be required to justify any
proposal in this area.

Having said that ... one of the things I never liked about theexistingcode is that it pays no attention to block-boundary effects. Itdoesn't

matter to an indexscan how badly tuples within a single block are
misordered --- what matters is how many block reads you have to do.
So I think that any changes here ought to include measuring the

correlation on the basis of block numbers not tuple numbers. Butwhat'snot too clear to me is whether our sampling methods would mess thatup.


           regards, tom lane

--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] new correlation metric

Reply via email to