I haven't look at the patch yet -- I'm actually on a train now. I'm sorry if these questions are answered in the patch.

I think there are three questions here:
A) what metric are we going to use
B) how do we estimate thy metric given a sample
C) how do we draw the conclusions we need based on this metric

I looked recently on this topic and I found papers on estimating two metrics based on samples: longest sorted subsequence and number of sorted subsequence. The latter of which sounds like it might be equivalent to what you have.

I didn't did any papers on drawing conclusions from either of these metrics though.

I think we can just look at block number to avoid intra-block disorder confusing us. It does mean we have effectively a much smaller sample though.

I don't think the other objection is worth worrying about though. If our sample has 15263748 then it may be that readahead will save us in that particular instance but it's clear the table isn't well clustered and it's just luck that those blocks were intermixed. In a very particular way. The rest of the table is unlikely to be so lucky.


greg

On 26 Oct 2008, at 03:51 PM, Tom Lane <[EMAIL PROTECTED]> wrote:

Martijn van Oosterhout <[EMAIL PROTECTED]> writes:
On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
I worked with Nathan Boley to come up with what we think is a better
metric for measuring this cost.

I think the code is in the right direction, but I think want you want
is some kind of estimate of "given I've looked for tuple X, how many
tuples in the next k pages are near this one".

ISTM that some experimental studies would be required to justify any
proposal in this area.

Having said that ... one of the things I never liked about the existing code is that it pays no attention to block-boundary effects. It doesn't
matter to an indexscan how badly tuples within a single block are
misordered --- what matters is how many block reads you have to do.
So I think that any changes here ought to include measuring the
correlation on the basis of block numbers not tuple numbers. But what's not too clear to me is whether our sampling methods would mess that up.

           regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to