Re: [PERFORM] change sample size for statistics
[ Sorry, forgot to cc list ]

> It is said to be 10%. I would like to raise that, because we are getting bad estimations for n_distinct.

More to the point, the estimator we use is going to be biased for many ( probably most ) distributions, no matter how large the sample size is. If you need to fix n_distinct, a better approach may be to set it manually.

Best,
Nathan
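To illustrate the bias Nathan is describing, here is a toy Python simulation ( not from the thread; the heavy-tailed column and sample fractions are made up, and duj1() is the Haas & Stokes formula that ANALYZE's n_distinct estimate is based on ):

import random

# Toy illustration of n_distinct underestimation on a skewed column.
# duj1() is the Haas & Stokes estimator ( roughly what ANALYZE uses );
# the column data below is synthetic.

def duj1(sample, total_rows):
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)                                   # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    return n * d / (n - f1 + f1 * n / total_rows)

random.seed(0)
N = 200_000
column = [int(random.paretovariate(1.2)) for _ in range(N)]   # heavy-tailed column
true_distinct = len(set(column))

for frac in (0.01, 0.1, 0.5):
    sample = random.sample(column, int(N * frac))
    print(f"{frac:4.0%} sample: estimate {duj1(sample, N):9.0f}   true {true_distinct}")

For skewed data like this the estimate tends to stay well below the true distinct count even as the sample fraction grows, which is why overriding a column's n_distinct by hand can be the more practical fix.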
Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan
> The accesses to an index are far more likely to be clustered than the accesses to the underlying table, because the index is organized in a way that's application-meaningful and the table not so much.

So, to clarify: are you saying that if the query were actually requesting rows uniformly at random, then there would be no reason to expect the index accesses to have hotspots? It seems like the index structure itself ( ie, the upper nodes of a b-tree, which every lookup passes through ) could also produce hotspots.

Best,
Nathan
Re: [PERFORM] Performance
>> If you model the costing to reflect the reality on your server, good plans will be chosen.

> Wouldn't it be better to derive those costs from actual performance data measured at runtime? Say, pg could measure random/seq page cost, *per tablespace* even. Has that been tried?

FWIW, a while ago I wrote a simple script to measure this and found that the *actual* random_page / seq_page cost ratio was much higher than 4/1. The problem is that caching has a large effect on the time it takes to access a random page, and caching effects are very workload dependent. So anything automated would probably need to optimize the parameter values over a set of 'typical' queries, which is exactly what a good DBA does when they set random_page_cost...

Best,
Nathan
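The script itself isn't in the thread; the sketch below is one way such a measurement could look ( the file path, page size, and page counts are placeholders, and as noted above the result is heavily distorted by OS caching unless the file is much larger than RAM ):

import os, random, time

PAGE_SIZE = 8192                       # assume PostgreSQL-sized pages
PATH = "/tmp/seq_vs_random_testfile"   # hypothetical file, ideally much larger than RAM
N_PAGES = 10_000

def avg_read_time(offsets):
    # time single-page reads at the given byte offsets
    fd = os.open(PATH, os.O_RDONLY)
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, PAGE_SIZE, off)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed / len(offsets)

file_pages = os.path.getsize(PATH) // PAGE_SIZE
sequential = [i * PAGE_SIZE for i in range(N_PAGES)]
scattered = [random.randrange(file_pages) * PAGE_SIZE for _ in range(N_PAGES)]

seq_t = avg_read_time(sequential)
rand_t = avg_read_time(scattered)
print(f"sequential: {seq_t * 1e6:.1f} us/page, random: {rand_t * 1e6:.1f} us/page, "
      f"ratio: {rand_t / seq_t:.1f}")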
Re: [PERFORM] how explain works
> How does EXPLAIN work as math equations to estimate cost with constant query parameters such as cpu_tuple_cost, random_page_cost, etc.? I want the math expressions in order to know how these parameters will affect the cost.

The expressions are complicated, and they are certainly not linear, as you seem to think from your previous post.

> Please, can anyone help me?

What do you need this for? If your goal is to optimize a real application, then you should just vary the cost parameters and measure the resulting change in query times. If your interests are academic, there were some excellent suggestions for places to start in response to your previous post.

Best,
Nathan
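A minimal sketch of that "vary and measure" loop, assuming psycopg2 is available and using a placeholder connection string and query ( both would need to be replaced with your own ):

import time
import psycopg2   # assumes psycopg2 is installed

DSN = "dbname=mydb"                                          # placeholder
QUERY = "SELECT count(*) FROM my_table WHERE some_col = 42"  # placeholder

conn = psycopg2.connect(DSN)
cur = conn.cursor()
for rpc in (1.0, 2.0, 4.0, 8.0):
    # change the parameter for this session only, then time the query
    cur.execute("SELECT set_config('random_page_cost', %s, false)", (str(rpc),))
    start = time.perf_counter()
    cur.execute(QUERY)
    cur.fetchall()
    print(f"random_page_cost = {rpc}: {time.perf_counter() - start:.3f} s")
cur.close()
conn.close()

In practice you would repeat each run several times and watch out for caching between runs, but it shows the idea: the knobs get tuned against observed query times rather than derived from a formula.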
Re: [PERFORM] Shouldn't we have a way to avoid risky plans?
> mergejoinscansel doesn't currently try to fix up the histogram bounds by consulting indexes. At the time I was afraid of the costs of doing that, and I still am; but it would be a way to address this issue.

Another, cheaper but less accurate, way to deal with this is to note that we are trying to estimate the maximum of the population using the maximum of the sample, which obviously has a negative bias. If we could correct the bias ( through the bootstrap, or an analytical correction under some parametric assumption, ie that the distribution is uniform in the last bucket ), then we should get better estimates at the cost of some ANALYZE time.

But this wouldn't even deal with Josh's particular problem, since that's due to out-of-date stats rather than sampling error...

-Nathan
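For concreteness, here is a small sketch of the bootstrap version of that correction ( my illustration; nothing like this is in ANALYZE, and the numbers are synthetic ):

import random

def bias_corrected_max(sample, n_boot=1000):
    # Estimate the bias of "sample max" by resampling the sample with
    # replacement and seeing how far the resample maxima fall short of it,
    # then push the estimate upward by that amount.
    sample_max = max(sample)
    boot_maxes = [max(random.choices(sample, k=len(sample))) for _ in range(n_boot)]
    bias = sum(boot_maxes) / n_boot - sample_max      # always <= 0
    return sample_max - bias

random.seed(1)
population = [random.uniform(0, 1000) for _ in range(1_000_000)]
sample = random.sample(population, 300)               # a small sample for illustration
print("true max       :", round(max(population), 1))
print("sample max     :", round(max(sample), 1))
print("bias-corrected :", round(bias_corrected_max(sample), 1))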
Re: [PERFORM] Shouldn't we have a way to avoid risky plans?
> This could be GUC-controllable, like plan_safety=0..1 with a low default value. It could influence the costs of plans where the cost changes dramatically with small table changes and/or the statistics are uncertain. It could also be used as a direct hint for such dangerous queries, by changing the GUC for a session/single query.

> ISTM if you add statistics miss and 'risk margin' to the things the planner would have to consider while generating a plan, you are greatly increasing the number of plan paths that would have to be considered for any non-trivial query.

FWIW, these ideas are well established in the statistical community.

Currently, we are essentially using maximum likelihood estimators. We estimate a bunch of selectivities by choosing what is most likely, plug them into our objective function, and then minimize based upon the plugged-in values. In a wide variety of cases, MLEs can be shown to be asymptotically optimal. That is, as our sample distribution approaches the true distribution, the best we can possibly do is to use the MLE. This is pretty sensible - if we actually knew all of the selectivities, then the results aren't really random anymore. However, MLEs often perform very poorly with small sample sizes - particularly if the loss function is very sensitive to relatively small fluctuations in the parameter estimates ( which postgres certainly is - switching from a hash join to a nest-loop can be disastrous ).

Using the estimate that minimizes the worst-case performance is precisely a minimax estimator. There, the goal is to minimize the risk function ( iow, plan cost ) under the worst possible conditions. Wikipedia has a pretty good treatment - just think "plan cost" whenever you see "risk".

Another approach, which hasn't been suggested yet, is a Bayesian update method. There, rather than calculating a specific parameter value ( like ndistinct ), you try to store the entire distribution and choose the plan that minimizes cost averaged over all of the possible parameter values.

Example ( please excuse the unrealistic numbers ): rather than estimating the selectivity of the join ( relation1.col1 = relation2.col1 ) to be 0.01, we would say it is 0.1 with probability 0.2 and 0.001 with probability 0.8.

Here is how we would choose the plan now:

cost( nestloop  | selectivity = 0.01 ) = 1
cost( hashjoin  | selectivity = 0.01 ) = 2
cost( mergejoin | selectivity = 0.01 ) = 50

Here would be the Bayesian approach:

cost( nestloop  | selectivity = 0.001 ) = 0.1
cost( hashjoin  | selectivity = 0.001 ) = 1
cost( mergejoin | selectivity = 0.001 ) = 50

cost( nestloop  | selectivity = 0.1 ) = 10
cost( hashjoin  | selectivity = 0.1 ) = 3
cost( mergejoin | selectivity = 0.1 ) = 50

So, the Bayesian ( expected ) costs are:

nestloop:  0.1*0.8 + 10*0.2 = 2.08
hashjoin:  1.0*0.8 +  3*0.2 = 1.4
mergejoin:  50*0.8 + 50*0.2 = 50

so the hashjoin would be chosen.

For completeness, the minimax costs would be:

nestloop:  max( 0.1, 10 ) = 10
hashjoin:  max( 1, 3 )    = 3
mergejoin: max( 50, 50 )  = 50

So, again, the hashjoin is chosen.

I obviously have a bias towards the Bayesian approach, but it's not because I expect it to necessarily perform a whole lot better; rather, it's because it reduces to the other two approaches. If we want the current behavior, simply store the MLE selectivity with probability 1. If we want the minimax estimate, choose the worst possible value. Or anything in between.
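The same arithmetic, worked through in a short Python sketch ( an illustration only, using the toy numbers above; nothing like this exists in the planner ):

selectivities = [0.001, 0.1]          # possible selectivity values
probabilities = [0.8, 0.2]            # their probabilities

# cost( plan | selectivity ), straight from the toy example above
costs = {
    "nestloop":  {0.001: 0.1, 0.01: 1,  0.1: 10},
    "hashjoin":  {0.001: 1,   0.01: 2,  0.1: 3},
    "mergejoin": {0.001: 50,  0.01: 50, 0.1: 50},
}

def mle_cost(plan, point_estimate=0.01):
    # current behavior: plug in the single most likely selectivity
    return costs[plan][point_estimate]

def expected_cost(plan):
    # Bayesian: average the cost over the selectivity distribution
    return sum(p * costs[plan][s] for s, p in zip(selectivities, probabilities))

def minimax_cost(plan):
    # minimax: cost under the worst possible selectivity
    return max(costs[plan][s] for s in selectivities)

for rule in (mle_cost, expected_cost, minimax_cost):
    best = min(costs, key=rule)
    print(rule.__name__, {plan: rule(plan) for plan in costs}, "->", best)

The MLE rule picks the nestloop; the expected-cost and minimax rules both pick the hashjoin, matching the numbers above.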
Also ( not that I have anywhere near the experience / expertise to make this claim, so take it with a grain of salt ), it seems that the code changes would be substantial but pretty straightforward and easy to localize. Rather than passing a single selectivity, pass a pair of arrays with selectivities and probabilities. Initially, we could keep the current estimates ( every passed array would be of length one ) and then make changes as problems appear ( like Josh's ).

I hope my little estimation-procedure tutorial has been a little helpful; please feel free to contact me off-list if you have questions or want references.

Best,
Nathan Boley
Re: [PERFORM] plpgsql arrays
> Uh, no, it wouldn't. Visually: L1 - L2 --- L3 - R1 At L2, you'd conclude that you're done matching R1.

No, you should conclude that you're done matching L2. You conclude that you're done matching R1 when you reach L4 ( ie, when there exists a j such that Lj.start > R1.end, or equivalently Lj is strictly greater than R1 ).

FWIW, this is a very common problem in bioinformatics. I've mostly implemented this in Python and C. The code is available at encodestatistics.org. Look in encode.py at the overlap method of the feature_region class, or ( for the C version ) at region_overlap in block_bootstrap.c ( the svn version is more up to date for the C code ).

-Nathan
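A small Python sketch of the rule being described ( just an illustration, not the encode.py code ): both lists are sorted by start, a left interval is dropped for good once it ends before the current right interval starts, and the scan for a given right interval stops as soon as a left interval starts past its end.

def overlapping_pairs(left, right):
    """Yield (L, R) pairs of overlapping closed intervals (start, end).

    Both lists must be sorted by start position.
    """
    start_idx = 0
    for r_start, r_end in right:
        # Left intervals that end before this R starts can never match
        # this or any later R, so discard them permanently.
        while start_idx < len(left) and left[start_idx][1] < r_start:
            start_idx += 1
        # Scan until an L starts past R's end -- the stopping rule above.
        i = start_idx
        while i < len(left) and left[i][0] <= r_end:
            if left[i][1] >= r_start:             # the intervals overlap
                yield left[i], (r_start, r_end)
            i += 1

# toy example: all three left intervals overlap the single right interval
print(list(overlapping_pairs([(1, 4), (2, 10), (5, 6)], [(3, 7)])))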