Re: [PERFORM] change sample size for statistics

2011-06-10 Thread Nathan Boley
[ Sorry, forgot to cc list ]

 It is said to be 10%. I would like to raise that, because we are getting bad
 estimations for n_distinct.

 More to the point, the estimator we use is going to be biased for many
 ( probably most ) distributions no matter how large your sample size
 is.

 If you need to fix ndistinct, a better approach may be to do it manually.
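
 For example, something along these lines ( an untested sketch: it assumes
 PostgreSQL 9.0+, where per-column n_distinct overrides exist, plus psycopg2;
 the table and column names are made up ):

# pin n_distinct for one column and re-analyze
# ( sketch; assumes PostgreSQL >= 9.0 and psycopg2, names are made up )
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
# a negative value is taken as a fraction of the row count
cur.execute("ALTER TABLE my_table ALTER COLUMN my_col SET (n_distinct = -0.05)")
cur.execute("ANALYZE my_table")
conn.commit()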

 Best,
 Nathan




Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-16 Thread Nathan Boley
 The accesses to an index are far more likely to be clustered than the
 accesses to the underlying table, because the index is organized in a
 way that's application-meaningful and the table not so much.

So, to clarify, are you saying that if queries were actually requesting
rows uniformly at random, then there would be no reason to suspect that
index accesses would have hotspots? It seems like the index structure
( ie, the top nodes in a b-tree ) could also get in the way.

Best,
Nathan



Re: [PERFORM] Performance

2011-04-13 Thread Nathan Boley
 If you model the costing to reflect the reality on your server, good
 plans will be chosen.

 Wouldn't it be better to derive those costs from actual performance
 data measured at runtime?

 Say, pg could measure random/seq page cost, *per tablespace* even.

 Has that been tried?

FWIW, a while ago I wrote a simple script to measure this and found
that the *actual* random_page / seq_page cost ratio was much higher
than 4/1.
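
The script was nothing fancy; the idea was roughly this ( a from-memory
sketch, not the original script; the file path and read counts are made
up ):

# compare sequential vs random 8kb reads on one big file
# ( sketch; pick a file larger than RAM to dodge the page cache )
import os, random, time

PATH = "/path/to/some/multi_gb_file"
BLOCK = 8192
N = 5000

nblocks = os.path.getsize(PATH) // BLOCK

def avg_read_time(block_numbers):
    fd = os.open(PATH, os.O_RDONLY)
    start = time.time()
    for b in block_numbers:
        os.lseek(fd, b * BLOCK, os.SEEK_SET)
        os.read(fd, BLOCK)
    os.close(fd)
    return (time.time() - start) / len(block_numbers)

first = random.randrange(nblocks - N)
seq = avg_read_time(range(first, first + N))
rand = avg_read_time([random.randrange(nblocks) for _ in range(N)])
print("random/seq ratio ~ %.1f" % (rand / seq))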

The problem is that caching has a large effect on the time it takes to
access a random page, and caching behavior is very workload dependent.
So anything automated would probably need to optimize the parameter
values over a set of 'typical' queries, which is exactly what a good
DBA does when they set random_page_cost...

Best,
Nathan



Re: [PERFORM] how explain works

2011-04-11 Thread Nathan Boley
 How does explain work as math equations to estimate cost with constant query
 parameters such as cpu_tuple_cost, random_page_cost, ...etc?
 I want the maths expressions in order to know how these parameters will
 affect the cost.

The expressions are complicated, and they are certainly not linear as
you seem to think from your previous post.
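
If it helps to have one concrete piece, the simplest case - a plain
sequential scan - is roughly linear ( this is a simplification from
memory, not the actual planner code, which lives in costsize.c and has
more terms ):

# rough sketch of sequential-scan costing with the default parameters
def seq_scan_cost(relpages, reltuples, nquals,
                  seq_page_cost=1.0, cpu_tuple_cost=0.01,
                  cpu_operator_cost=0.0025):
    io_cost = seq_page_cost * relpages
    cpu_cost = (cpu_tuple_cost + nquals * cpu_operator_cost) * reltuples
    return io_cost + cpu_cost

Index-scan and join costs are where things stop being linear: they fold
in caching corrections against effective_cache_size, repeated-fetch
estimates, and so on.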

 Please, can anyone help me?

What do you need this for? If your goal is to optimize a real
application, then you should just vary the cost parameters and measure
the resulting change in query times. If your interests are academic,
there were some excellent suggestions for places to start in response
to your previous post.
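
Something as simple as this is usually enough ( a sketch, assuming
psycopg2; the query text and parameter values are placeholders for
whatever is representative of your workload ):

# time one representative query under a few cost settings
# ( sketch; query and table names are placeholders )
import time
import psycopg2

QUERY = "SELECT count(*) FROM my_table WHERE some_col = 42"

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
for rpc in (1.0, 2.0, 4.0, 8.0):
    cur.execute("SET random_page_cost = %s", (rpc,))
    start = time.time()
    cur.execute(QUERY)
    cur.fetchall()
    print("random_page_cost=%s -> %.3fs" % (rpc, time.time() - start))

Run EXPLAIN on the same query at each setting if you also want to see
where the plan actually changes.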

Best,
Nathan



Re: [PERFORM] Shouldn't we have a way to avoid risky plans?

2011-03-25 Thread Nathan Boley
 mergejoinscansel doesn't currently try to fix up the histogram bounds by
 consulting indexes.  At the time I was afraid of the costs of doing
 that, and I still am; but it would be a way to address this issue.


Another, cheaper but less accurate, way to deal with this is to note
that we are trying to estimate the max of the population by using the
max of the sample, which obviously has a negative bias. If we could
correct the bias ( through the bootstrap, or an analytical correction
under some parametric assumption ( ie, that the distribution is uniform
in the last bucket ) ), then we should get better estimates at the cost
of some analyze time. But this wouldn't even address Josh's particular
problem, since it's due to out-of-date stats rather than sampling
error...
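
To make that concrete ( a toy sketch with invented data; this is just
the textbook bias correction plus the uniform-tail analytic version,
nothing postgres-shaped ):

# toy bias corrections for "estimate the population max from a sample"
import random

random.seed(0)
true_max = 1000.0
sample = [random.uniform(0, true_max) for _ in range(300)]
n, m = len(sample), max(sample)

# bootstrap: bias ~ mean(bootstrap maxes) - sample max,
# corrected estimate = sample max - bias = 2*m - mean(bootstrap maxes)
boot_maxes = [max(random.choice(sample) for _ in range(n)) for _ in range(1000)]
boot_est = 2 * m - sum(boot_maxes) / len(boot_maxes)

# analytic: if the tail is uniform, E[sample max] = n/(n+1) * true max
uniform_est = m * (n + 1) / n

print("sample max  %.1f" % m)
print("bootstrap   %.1f" % boot_est)
print("uniform     %.1f" % uniform_est)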

-Nathan



Re: [PERFORM] Shouldn't we have a way to avoid risky plans?

2011-03-24 Thread Nathan Boley
 This could be GUC-controllable, like plan_safety=0..1 with a low default value.
 This could influence the costs of plans whose cost changes dramatically with small
 table changes and/or where the statistics are uncertain. It could also be used as a
 direct hint for such dangerous queries by changing the GUC for a session/single
 query.

 ISTM if you add statistics miss and 'risk margin' to the things the
 planner would have to consider while generating a plan, you are
 greatly increasing the number of plan paths that would have to be
 considered for any non-trivial query.


FWIW, these ideas are well established in the statistical community.

Currently, we are essentially using maximum likelihood estimators.
We estimate a bunch of selectivities by choosing what is most likely,
plug them into our objective function, and then minimize based upon
the plugged-in values. In a wide variety of cases, MLEs can be shown
to be asymptotically optimal. That is, as our sample distribution
approaches the true distribution, the best we can possibly do is to
use the MLE. This is pretty sensible - if we actually knew all of the
selectivities then the results aren't really random anymore. However,
MLEs often perform very poorly with small sample sizes - particularly
if the loss function is very sensitive to relatively small
fluctuations in the parameter estimates ( which is certainly the case
for postgres - switching from a hash join to a nest-loop can be
disastrous ).

Using the estimate that minimizes the worst-case performance is
precisely a minimax estimator. There, the goal is to minimize the risk
function ( iow, the plan cost ) under the worst possible conditions.
Wikipedia has a pretty good treatment - just think "plan cost"
whenever you see "risk".

Another approach, one that hasn't been suggested yet, is some Bayesian
update method. There, rather than calculating a specific parameter
value ( like ndistinct ), you try to store the entire distribution and
choose the plan that minimizes the cost averaged over all of the
possible parameter values.

Example ( please excuse the unrealistic numbers ):

Rather than estimating the selectivity of the join
( relation1.col1 = relation2.col1 ) to be 0.01, we would say it is 0.1
w/ probability 0.2 and 0.001 w/ probability 0.8. Here is how we would
choose the plan now, using the single point estimate 0.01:

cost( nestloop | selectivity = 0.01 ) = 1
cost( hashjoin | selectivity = 0.01 ) = 2
cost( mergejoin | selectivity = 0.01 ) = 50

Here is what the Bayesian approach would look at:

cost( nestloop | selectivity = 0.001 ) = 0.1
cost( hashjoin | selectivity = 0.001 ) = 1
cost( mergejoin | selectivity = 0.001 ) = 50

cost( nestloop | selectivity = 0.1 ) = 10
cost( hashjoin | selectivity = 0.1 ) = 3
cost( mergejoin | selectivity = 0.1 ) = 50

So, the Bayesian ( expected ) costs are:

nestloop:  0.1*0.8 + 10*0.2 = 2.08
hashjoin:  1.0*0.8 +  3*0.2 = 1.4
mergejoin:  50*0.8 + 50*0.2 = 50

so the hashjoin would be chosen.

For completeness, the minimax costs would be:

nestloop:  max( 0.1, 10 ) = 10
hashjoin:  max( 1, 3 )    = 3
mergejoin: max( 50, 50 )  = 50

So, again, the hashjoin is chosen.
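
In code the whole decision rule is tiny ( a toy sketch using the
numbers above; the cost table stands in for the planner's cost model ):

# toy sketch of the three decision rules, using the numbers above
# cost[plan][selectivity]
cost = {
    "nestloop":  {0.001: 0.1, 0.01: 1,  0.1: 10},
    "hashjoin":  {0.001: 1.0, 0.01: 2,  0.1: 3},
    "mergejoin": {0.001: 50,  0.01: 50, 0.1: 50},
}
posterior = {0.001: 0.8, 0.1: 0.2}   # selectivity -> probability

def mle_choice(point_estimate=0.01):
    return min(cost, key=lambda p: cost[p][point_estimate])

def bayes_choice():
    expected = dict((p, sum(prob * cost[p][s] for s, prob in posterior.items()))
                    for p in cost)
    return min(expected, key=expected.get)

def minimax_choice():
    worst = dict((p, max(cost[p][s] for s in posterior)) for p in cost)
    return min(worst, key=worst.get)

print(mle_choice())      # -> nestloop ( cost 1 at the point estimate )
print(bayes_choice())    # -> hashjoin ( expected cost 1.4 vs 2.08 and 50 )
print(minimax_choice())  # -> hashjoin ( worst case 3 vs 10 and 50 )

The posterior dict here is exactly the "pair of arrays with
selectivities and probabilities" I mention below.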

I obviously have a bias towards the Bayesian approach, but that's not
because I expect it to necessarily perform a whole lot better; rather,
it's because it reduces to the other two approaches as special cases.
If we want the current behavior, we simply store the MLE selectivity
with probability 1. If we want the minimax estimate, we choose the
worst possible value. Or anything in between.

Also ( not that I have anywhere near the experience / expertise to
make this claim, so take it with a grain of salt ), it seems that
the code changes would be substantial but pretty straightforward and
easy to localize. Rather than passing a single selectivity, pass a pair
of arrays with selectivities and probabilities. Initially, we could keep
the current estimates ( every passed array would be of length one )
and then make changes as problems appear ( like Josh's ).

I hope my little estimation-procedure tutorial has been helpful;
please feel free to contact me off list if you have questions or want
references.

Best,
Nathan Boley



Re: [PERFORM] plpgsql arrays

2009-04-03 Thread Nathan Boley
 Uh, no, it wouldn't.  Visually:

        L1      -
        L2      ---
        L3      -

        R1      -------------------

 At L2, you'd conclude that you're done matching R1.


No, you should conclude that you're done matching L2. You conclude
you're done matching R1 when you reach L4 ( or, more precisely, when
there exists a j s.t. Lj.start > R1.end, or equivalently Lj is strictly
greater than R1 ).

FWIW, this is a very common problem in bioinformatics. I've implemented
this mostly in Python and C. The code is available at
encodestatistics.org. Look in encode.py at the overlap method of the
feature_region class, or ( for the C version ) region_overlap in
block_bootstrap.c ( the svn is more up to date for the C ).
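
The core of the python version is just a sorted sweep; simplified from
memory ( not a verbatim copy of what's in encode.py, and the intervals
here are plain (start, end) tuples ), it looks something like:

# find all overlapping pairs between two interval lists sorted by start
def overlapping_pairs(L, R):
    pairs = []
    j = 0
    for r in R:
        # L intervals that end strictly before r starts can never match
        # this r, or any later r, so drop them for good
        while j < len(L) and L[j][1] < r[0]:
            j += 1
        # we are only done matching r once an L interval starts past r's end
        k = j
        while k < len(L) and L[k][0] <= r[1]:
            if L[k][1] >= r[0]:          # genuine overlap
                pairs.append((L[k], r))
            k += 1
    return pairs

# e.g. three short L intervals and one long R interval overlapping all three
print(overlapping_pairs([(1, 4), (3, 10), (12, 14)], [(2, 13)]))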

-Nathan
