Re: [HACKERS] Measure Theoretic Data Types in Postgresql
> However, by realizing that the bounds on the ranges have a linear ordering, one can speed this up to O(m) using windowing functions on common table expressions. So what I am proposing is formalizing this optimization into a class of data types that will hide the implementation details.

Could this not also be handled by extending merge join to work with an overlap operator?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
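For what it's worth, the idea behind the O(m) claim can be sketched outside SQL ( a hypothetical illustration, not the proposed implementation ): once the range bounds are sorted, a single pass over the endpoint "events" finds every overlapping pair, so with the bounds pre-ordered the join work itself is linear.

```python
def count_overlaps(ranges_a, ranges_b):
    """Count overlapping pairs between two sets of closed ranges by
    sweeping the sorted endpoints -- the same trick the windowing-function
    rewrite exploits. The sort is O(m log m); the pass itself is O(m)."""
    events = []                      # (coordinate, 0=open/1=close, side)
    for lo, hi in ranges_a:
        events.append((lo, 0, 'a'))
        events.append((hi, 1, 'a'))
    for lo, hi in ranges_b:
        events.append((lo, 0, 'b'))
        events.append((hi, 1, 'b'))
    events.sort()                    # opens sort before closes at a tie
    open_a = open_b = pairs = 0
    for _, kind, side in events:
        if kind == 0:                          # a range opens ...
            if side == 'a':
                pairs += open_b                # ... overlapping every open b
                open_a += 1
            else:
                pairs += open_a
                open_b += 1
        elif side == 'a':
            open_a -= 1
        else:
            open_b -= 1
    return pairs
```

With closed ranges, touching endpoints count as overlap, e.g. `count_overlaps([(1, 3)], [(3, 5)])` is 1.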
Re: [HACKERS] Collect frequency statistics for arrays
>> What about MCV's? Will those be removed as well?
> Sure. Those seem even less useful.

Ya, this will destroy the performance of several queries without some heavy tweaking. Maybe this is bad design, but I've gotten in the habit of storing sequences as arrays, and I commonly join on them. I looked through my code this morning, and I only have one 'range' query ( of the form described up-thread ), but there are tons of the form

SELECT att1, attb2 FROM rela, relb WHERE rela.seq_array_1 = relb.seq_array;

I can provide some examples if that would make my argument more compelling.

Sorry to be difficult,
Nathan
Re: [HACKERS] Collect frequency statistics for arrays
[ sorry Tom, reply all this time... ]

> What do you mean by storing sequences as arrays?

So, a simple example is, for transcripts ( sequences of DNA that are turned into proteins ), we store each of the connected components as an array of the form [ exon_type, splice_type, exon_type ], where exon_type is in [1,6] and splice_type is in [1,3]. ~99% of the elements are of the form [ [1,3], 1, [1,3] ], so I almost always get a hash or merge join ( correctly ), but for the rare junction types ( which are usually more interesting as well ) I correctly get nest loops with an index scan.

> Can you demonstrate that the existing stats are relevant at all to the query you're worried about?

Well, if we didn't have MCV's and just relied on ndistinct to estimate the '=' selectivities, either my low selectivity quals would use the index, or my high selectivity quals would use a table scan, either of which would be wrong. I guess I could wipe out the stats and get some real numbers tonight, but I can't see how the planner would be able to distinguish *without* MCV's...

Best,
Nathan
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Alexander Korotkov aekorot...@gmail.com writes:
>> On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:
>>> I've attached a new version that includes the UINT64_FMT fix, some edits of your newest comments, and a rerun of pgindent on the new files. I see no other issues precluding commit, so I am marking the patch Ready for Committer. If I made any of the comments worse, please post another update.
>> Changes looks reasonable for me. Thanks!
>
> I am starting to look at this patch now. I'm wondering exactly why the decision was made to continue storing btree-style statistics for arrays, in addition to the new stuff.

If I understand your suggestion, queries of the form

SELECT * FROM rel WHERE ARRAY[ 1,2,3,4 ] <= x AND x <= ARRAY[ 1, 2, 3, 1000 ];

would no longer use an index. Is that correct? Are you suggesting removing MCV's in lieu of MCE's as well?

-Nathan
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Nathan Boley npbo...@gmail.com writes:
>> On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> I am starting to look at this patch now. I'm wondering exactly why the decision was made to continue storing btree-style statistics for arrays, in addition to the new stuff.
>> If I understand your suggestion, queries of the form SELECT * FROM rel WHERE ARRAY[ 1,2,3,4 ] <= x AND x <= ARRAY[ 1, 2, 3, 1000 ]; would no longer use an index. Is that correct?
> No, just that we'd no longer have statistics relevant to that, and would have to fall back on default selectivity assumptions.

Which, currently, would mean queries of that form would typically use a table scan, right?

> Do you think that such applications are so common as to justify bloating pg_statistic for everybody that uses arrays?

I have no idea, but it seems like it will be a substantial regression for the people that are.

What about MCV's? Will those be removed as well?

Best,
Nathan
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics. The computer is here: http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm $PGDATA is on a 5 disk SATA software raid 5. Is there anything else that would be useful to know? ( Sorry, this isn't a subject that I'm very familiar with ) -Nathan
Re: [HACKERS] WIP: collect frequency statistics for arrays
>> Version of patch with few more comments and some fixes.
> Where are we on the idea of better statistics for arrays?

I need to finish reviewing the patch: I'll try to send in something tomorrow night. So far it looks good.

Best,
Nathan
Re: [HACKERS] Collect frequency statistics for arrays
> Rebased with head.

FYI, I've added myself as the reviewer for the current commitfest.

Best,
Nathan Boley
Re: [HACKERS] WIP: collect frequency statistics for arrays
> Looking now, I see that Alexander wasn't Cc'd on the review, so it's possible he missed the message?

We've corresponded off list and have discussed my review at some length. Alex submitted an updated patch on Sep 22 to me personally ( although not to the list? Alex? ), with the promise of a new version with improved comments.

Best,
Nathan
Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays
> Actually, not MCV slot is used but MCELEM. It is a different slot. pg_stats view maps both into the same fields.

Yes, you're right. I am sorry about the error.

> Surely, non-standard use of histogram slot should be avoided.

Agreed.

Best,
Nathan
[HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539

I glanced through the code and played around with the feature and, in general, it looks pretty good. But I have some comments/questions.

Design:

First, it makes me uncomfortable that you are using the MCV and histogram slot kinds in a way that is very different from other data types. I realize that tsvector uses MCV in the same way that you do, but:
1) I don't like that very much either.
2) tsvector is different in that equality ( in the btree sense ) doesn't make sense, whereas it does for arrays.

Using the histogram slot for the array lengths is also very surprising to me. Why not just use a new STA_KIND? It's not like we are running out of room, and this will be the second 'container' type that splits the container and stores stats about the elements.

Second, I'm fairly unclear about what the actual underlying statistical method is and what assumptions it makes. Before I can really review the method, I will probably need to see some documentation, as I say below. Do you have any tests/simulations that explore the estimation procedure that I can look at? When will it fail? In what cases does it improve estimation?

Documentation:

Seems a bit lacking - especially if you keep the non-standard usage of mcv/histograms. Also, it would be really nice to see some update to the row-estimation-examples page ( in chapter 55 ).

The code comments are good in general, but there are not enough high-level comments. It would be really nice to have a comment that gives an overview of what each of the following functions is supposed to do:

mcelem_array_selec
mcelem_array_contain_overlap_selec
and especially mcelem_array_contained_selec

Also, in compute_array_stats you reference compute_tsvector_stats for the algorithm, but I think that it would be smarter to either copy the relevant portions of the comment over or to reference a published document. If compute_tsvector_stats changes, I doubt that anyone will remember to update the comment.

Finally, it would be really nice to see, explicitly, and in mathematical notation:
1) The data that is being collected and summarized ( the statistic ).
2) The estimate being derived from the statistic ( ie the selectivity ), including any assumptions being made ( ie independence of elements within an array ).

For instance, the only note I could find that actually addressed the estimator and the assumptions underneath it was

+* mult * dist[i] / mcelem_dist[i] gives us probability probability
+* of qual matching from assumption of independent element occurence
+* with condition that length = i.

and that's pretty cryptic. That should be expanded upon and placed more prominently.

A couple nit picks:
1) In calc_distr you go to some lengths to avoid round-off errors. Since it is certainly just the order of the estimate that matters, why not just perform the calculation in log space?
2) compute_array_stats is really long. Could you break it up? ( I know that the other analyze functions are the same way, but why continue in that vein? )

Best,
Nathan
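The log-space suggestion in nitpick (1) amounts to this ( a toy illustration with made-up probabilities, not the patch's code ): a naive product of many small per-element probabilities underflows to zero, while a sum of logs keeps the order of magnitude, which is all the costing needs.

```python
import math

probs = [1e-25] * 20          # per-element match probabilities (made up)

# Naive product underflows: 1e-500 is far below the double range.
naive = 1.0
for p in probs:
    naive *= p                # silently becomes 0.0 partway through

# Summing logs preserves the magnitude of the estimate.
log_prob = sum(math.log(p) for p in probs)   # about -500 * ln(10)
```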
Re: [HACKERS] WIP: cross column correlation ...
> Personally, I think the first thing we ought to do is add a real, bona fide planner hint to override the selectivity calculation manually, maybe something like this: WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);

If you're going to go that far, why not just collect statistics on that specific predicate? ie,

ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1);

Then it won't fall subject to all of the pitfalls that Tom outlines below. Selectivities are easy to estimate if we know the predicate. They only become hard when they have to work for every possible predicate.

Best,
Nathan
Re: [HACKERS] Range Types: empty ranges
>> FWIW, a very informal survey of probabilists didn't yield any reason for trying to put an order on the empty set ( unless the metric was cardinality or other equivalence relation ). I think the problem here is that the idea of union and intersection forming a ring over sets is being conflated with the order relation. Clearly, combining the two notions can be inconsistent. However...
>>
>> A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C)
> But the basic range type isn't even closed under UNION.

An excellent point. Allow me to move the target a little:

WHERE A <@ B AND A <@ C

and:

WHERE A <@ (B INTERSECT C)

That seems like a logically sound transformation, but if (B INTERSECT C) is empty, it relies on the empty range for those two to be equivalent. Now, I agree that lack of closure on UNION exhibits many of the problems that I am pointing out related to forbidding empty ranges. However, I'm not sure if that means we should give up on either. This seems potentially very useful, because we can transform WHERE A <@ B AND A <@ C from a bitmap scan into WHERE A <@ (B INTERSECT C), a simple index scan. In the union case ( even if we had a type that supported disjoint intervals ), I doubt we would ever make that transformation because the index will probably still be over connected intervals.

So, +1 for keeping it how it is ( but maybe with a better error message ).

Best,
Nathan
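A toy model of that transformation, with `None` standing in for the empty range ( all names here are invented for illustration ): the two predicates agree whether or not the intersection is empty, but only because "empty" is representable as a value.

```python
def intersect(b, c):
    """Intersection of closed ranges (lo, hi); None plays the empty range."""
    lo, hi = max(b[0], c[0]), min(b[1], c[1])
    return (lo, hi) if lo <= hi else None

def contained(a, b):
    """a <@ b. Without an empty-range value the right-hand form below
    would be undefined whenever b and c are disjoint."""
    if b is None:
        return False      # nothing non-empty fits inside the empty range
    return b[0] <= a[0] and a[1] <= b[1]

def equivalent(A, B, C):
    lhs = contained(A, B) and contained(A, C)    # A <@ B AND A <@ C
    rhs = contained(A, intersect(B, C))          # A <@ (B INTERSECT C)
    return lhs == rhs

checks = [equivalent((4, 5), (1, 6), (3, 9)),    # nonempty intersection
          equivalent((4, 5), (1, 2), (5, 9))]    # empty intersection
```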
Re: [HACKERS] estimating # of distinct values
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
> Dne 19.1.2011 20:25, Robert Haas napsal(a):
>> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
>>> Yes, I was aware of this problem (amount of memory consumed with lots of updates), and I kind of hoped someone will come up with a reasonable solution.
>> As far as I can see, periodically sampling some or all of the table is really the only thing that has a chance of working OK. You could try to track changes only when they're not too big, but I think that's still going to have nasty performance consequences.
> Yes, sampling all the table is the only way to get reliable ndistinct estimates.

IMHO continuing to focus on worst case results is a dead end. Yes, for some distributions, ndistinct is very hard to estimate well. However, let us not forget that the current ndistinct estimator is very bad but the postgres planner is, on the whole, very good. As far as I can see this is due to two facts:

1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics )

2) The choice of plan is fairly robust to mis-estimation of ndistinct, because there are only a few things the executor can choose. It doesn't usually matter if a value composes 5% or 20% ( or 99% ) of the table; we still usually need to scan the entire table.

Again, in my *very* humble opinion, if the ultimate goal is to improve the planner, we should try to cut down on the number of cases in which a poor estimate of ndistinct gives a really bad plan instead of chasing after marginal gains from a better ndistinct estimator. Maybe ( as I've advocated for before ) this means parameterizing the distribution of values ( ie, finding that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different. But ( and the literature seems to support me ) pounding away at the ndistinct estimator seems like a dead end. If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why don't we just store the full distribution and be done with it?

Best,
Nathan
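For reference, a sketch of the kind of sample-based estimator under discussion: the Duj1 form of Haas and Stokes, which I believe is the one ANALYZE cites. The constants and the skewed test table below are made up; the point is only that the estimate is exact on tame data and can be far off on heavy-tailed data.

```python
import random

def duj1(sample, total_rows):
    """Haas-Stokes Duj1 ndistinct estimator from a uniform row sample."""
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)                                  # distinct in sample
    f1 = sum(1 for c in counts.values() if c == 1)   # seen exactly once
    if f1 == n:            # every sampled value unique: no real signal
        return total_rows
    return n * d / (n - f1 + f1 * n / total_rows)

random.seed(1)
# A heavy-tailed (Pareto) column: the sample badly underdetermines
# the true number of distinct values -- the worst case the thread
# keeps circling back to.
table = [int(random.paretovariate(0.5)) + 1 for _ in range(50_000)]
est = duj1(random.sample(table, 500), len(table))
true_d = len(set(table))   # est is typically far from this
```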
Re: [HACKERS] estimating # of distinct values
>> If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why don't we just store the full distribution and be done with it?
> - the best you could do is to average the individual probabilities which gives ... well, 1/ndistinct.

That is certainly *not* the best you could do in every case. The mean is only the best estimate under L2 ( squared-error ) loss, which is definitely not the case here. Consider a table with 10K values, 9,990 of which are 1 and the rest of which are 2, 3, ..., 10, versus a table that has the same 10 distinct values evenly distributed. For a simple equality query, in the first case, a bitmap scan might be best. In the second case, a sequential scan would always be best.

This is precisely the point I was trying to make in my email: the loss function is very complicated. Improving the ndistinct estimator could significantly improve the estimates of ndistinct ( in the quadratic loss sense ) while only marginally improving the plans.

-Nathan
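The example above in numbers ( a throwaway simulation ): both tables get the same flat 1/ndistinct estimate, but the true per-value selectivities on the skewed table are off by orders of magnitude in both directions, which is what pushes the planner toward different scan types.

```python
from collections import Counter

# 10,000 rows, 10 distinct values, two very different shapes.
skewed  = [1] * 9991 + list(range(2, 11))
uniform = [v for v in range(1, 11) for _ in range(1000)]

def selectivity(table, value):
    return Counter(table)[value] / len(table)

flat_estimate = 1 / len(set(skewed))     # 1/ndistinct = 0.1 for both tables

sel_common = selectivity(skewed, 1)      # 0.9991: seq-scan territory
sel_rare   = selectivity(skewed, 5)      # 0.0001: index-scan territory
```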
Re: [HACKERS] estimating # of distinct values
> Not true. You're missing the goal of this effort - to get ndistinct estimate for combination of multiple columns. Which is usually pathological in case of dependent columns.

snip

> Again, don't think about a single column (although even in that case there are known fail cases). Think about multiple columns and the number of distinct combinations. With attribute value independence assumption, this is usually significantly underestimated. And that's a problem as it often leads to an index scan instead of sequential scan (or something like that).

My point was that, in the case of single columns, we've done pretty well despite the poor ndistinct estimates. I suspect the same will be true in the case of multiple columns; good heuristics will trump minimax estimators.

>> ( as I've advocated for before ) this means parameterizing the distribution of values ( ie, find that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different. But: ( and the literature seems to support me )
> Which literature? I've read an awful lot of papers on this topic lately, and I don't remember anything like that. If there's an interesting paper, I'd like to read it. Yes, all the papers state estimating ndistinct is a particularly tricky task, but that's somehow expected?

It is definitely expected - non-parametric minimax results are always *very* weak. The papers that you've cited seem to confirm this; bounding ndistinct estimation error is tricky in the general case ( even with very simple loss functions, which we do not have ). The same is true for other non-parametric methods. Think about kernel density estimation: how many data points do you need to estimate the density of a normal distribution well? What about if you *know* that the data is normal ( or even that it comes from a big family )?

> No, I'm not trying to do this just to improve the ndistinct estimate. The goal is to improve the cardinality estimate in case of dependent columns, and one of the papers depends on good ndistinct estimates. And actually it does not depend on ndistinct for the columns only, it depends on ndistinct estimates for the combination of columns. So improving the ndistinct estimates for columns is just a necessary first step (and only if it works reasonably well, we can do the next step).

I think that any approach which depends on precise estimates of ndistinct is not practical.

I am very happy that you've spent so much time on this, and I'm sorry if my previous email came off as combative. My point was only that simple heuristics have served us well in the past and, before we go to the effort of new, complicated schemes, we should see how well similar heuristics work in the multiple-column case. I am worried that if the initial plan is too complicated then nothing will happen and, even if something does happen, it will be tough to get it committed ( check the archives for cross column stat threads - there are a lot ).

Best,
Nathan Boley
Re: [HACKERS] estimating # of distinct values
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote:
> On Jan20, 2011, at 02:41 , Nathan Boley wrote:
>>>> If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why don't we just store the full distribution and be done with it?
>>> - the best you could do is to average the individual probabilities which gives ... well, 1/ndistinct.
>> That is certainly *not* the best you could do in every case. The mean is only the best estimate under L2, which is definitely not the case here.
> No, not in every case. But on average it comes out best, no? In the sense of minimizing the average plan cost over all values?

Definitely not. Consider the trivial case where a bitmap scan is the cost of a sequential scan + epsilon.

>> Consider a table with 10K values, 9,990 of which are 1 and the rest of which are 2, 3, ..., 10, versus a table that has the same 10 distinct values evenly distributed. For a simple equality query, in the first case, a bitmap scan might be best. In the second case, a sequential scan would always be best.
> True. But selectivity estimates alone don't show the difference there.

Yet the full distribution would - supporting my argument that even in cases where we don't know a specific value, the full distribution is more informative.

> Also, in my personal experience this isn't really the area we've got problems now. The problem cases for me always were queries with a rather large number of joins (7 to 10 or so) plus rather complex search conditions. The join order, not the access strategy, then becomes the deciding factor in plan performance. And the join order *is* driven purely off the selectivity estimates, unlike the access strategy which even today takes other factors into account (like clusteredness, I believe)

I think I'm losing you. Why would ndistinct be completely sufficient for estimating the optimal join order?

>> This is precisely the point I was trying to make in my email: the loss function is very complicated. Improving the ndistinct estimator could significantly improve the estimates of ndistinct ( in the quadratic loss sense ) while only marginally improving the plans.
> The point of this exercise isn't really to improve the ndistinct estimates for single columns. Rather, we'd like to get ndistinct estimates for groups of columns because that allows us to use the uniform bayesian approach to multi-column selectivity estimation that Tomas found literature on. Which on the whole seems like it *does* have a chance of greatly improving the plans in some cases. We could, of course, estimate multi-column ndistinct the same way we estimate single-column ndistincts, but quite a few people raised concerns that this wouldn't work due to the large error our current ndistinct estimates have.

Sure. But estimating multi-column ndistinct is surely not the only way to approach this problem. The comment I made, which you objected to, was that it's silly to look at the whole table and then throw away all information *except* ndistinct. If we really think that looking at the whole table is the best approach, then shouldn't we be able to get better statistics? Is this really an open question?

-Nathan
Re: [HACKERS] Why don't we accept exponential format for integers?
> print int(1e+01)
> 10

That isn't building an integer: it is creating a float and casting to an int. Try:

int( 1e100 )

Best,
Nathan
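The failure mode is precision, not syntax: a double can represent 10^10 exactly but not 10^100, so the cast silently lands on a nearby 100-digit integer rather than the one you wrote.

```python
# 1e10 is exactly representable as a double, so the round trip is exact.
assert int(1e10) == 10**10

# 1e100 is not: the literal becomes the nearest double to 10**100, and
# casting that to int gives a 100-digit number that differs from 10**100.
assert int(1e100) != 10**100
difference = int(1e100) - 10**100   # large and nonzero
```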
Re: [HACKERS] proposal : cross-column stats
>> The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the probabilities needed to compute the selectivity. Or am I missing something?
> Well, I'm not real familiar with contingency tables, but it seems like you could end up needing to store a huge amount of data to get any benefit out of it, in some cases. For example, in the United States, there are over 40,000 postal codes, and some even larger number of city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if I'm repeating something that's already been said )

My intuition is that storing the covariance structure will be unworkable both technically and statistically for all but the simplest of cases. That being said, I don't think the problem is a lack of ways to parameterize the covariance structure ( there are several papers on multi-dim histogram estimators, at least one of which talks about databases explicitly, not to mention approaches like CART [1] ), but a complete lack of infrastructure to do anything with the estimates. So keep working on it ;-) ( but try to make the framework general enough to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR operators, since they seem like the most common cases and union and intersection commute ( although it's important to remember that the estimate variances do NOT ). That is, what about estimating the selectivity of the condition WHERE X=A AND Y=B by

f(A,B) = x_1*( selectivity(X = A) + selectivity( Y = B ) )
       + x_2*selectivity(X = A)*selectivity( Y = B )
       + x_3

where x_{1,2,3} are parameters to be estimated from the data.

Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single column statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was looking at this a couple years back ).

Best,
Nathan

[1] http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2] http://en.wikipedia.org/wiki/Copula_%28statistics%29
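The parametric form above is easy to state in code ( the function name and the coefficient values here are placeholders; in practice x_1, x_2, x_3 would have to be fit from observed data ):

```python
def blended_selectivity(sel_x, sel_y, x1, x2, x3):
    """Combine two single-column selectivities as a learned blend of an
    additive term, the independence-assumption product, and an intercept,
    clamped to a valid probability."""
    s = x1 * (sel_x + sel_y) + x2 * (sel_x * sel_y) + x3
    return min(1.0, max(0.0, s))

# With x = (0, 1, 0) this degenerates to today's independence assumption:
indep = blended_selectivity(0.2, 0.3, 0.0, 1.0, 0.0)   # ~0.06

# Weight on the additive term can capture positive correlation that
# the bare product would badly underestimate:
corr = blended_selectivity(0.2, 0.3, 0.4, 0.0, 0.0)    # ~0.2
```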
Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die
> Robert explained why having more MCVs might be useful because we use the frequency of the least common MCV as an upper bound on the frequency of any value in the MCV.

Where is that being used? The only non-MCV frequency estimate that I recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously changing the number of MCV's affects this by lowering n_ndistinct_rows, but it's always a pretty coarse estimate.

> Binding the length of the MCV list to the size of the histogram is arbitrary but so would any other value

Wouldn't the best approach be to stop adding MCV's/histogram buckets when adding new ones doesn't decrease your prediction error 'substantially'?

One very hacky threshold heuristic is to stop adding MCV's when a simple equality select ( SELECT col FROM table WHERE col = VALUE ) would switch the plan from an index to a sequential scan ( or vice versa, although with the current code this would never happen ). ie, if the non-MCV frequency estimate is 0.1% ( producing an index scan ), but adding the MCV gives us an estimate of 5% ( probably producing a seq scan ), then add that value as an MCV.

More sophisticated variations might also consider plan changes to very suboptimal joins; even more sophisticated would be to stop when MAX( (curr plan - optimal plan) / optimal plan ) was below some threshold, say 20%, over a bunch of recently executed queries. A similar approach would work for histogram bins, except the queries of interest are inequality rather than equality selections.

-Nathan
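The hacky heuristic can be sketched directly ( every name and the 1% cutoff are invented for illustration ): a candidate value earns an MCV slot only when the flat non-MCV estimate and its true frequency would drive the planner to different scan types.

```python
def pick_mcvs(value_counts, total_rows, seqscan_cutoff=0.01):
    """Keep a candidate MCV only if recording it would flip the plan
    chosen for `SELECT ... WHERE col = value` between an index scan and
    a sequential scan. seqscan_cutoff is a made-up selectivity above
    which the planner would pick a seq scan."""
    by_freq = sorted(value_counts.items(), key=lambda kv: -kv[1])
    mcvs, mcv_rows = [], 0
    for val, count in by_freq:
        rest_distinct = len(by_freq) - len(mcvs)
        # flat per-value estimate if this value stays OUT of the MCV list
        flat_sel = (total_rows - mcv_rows) / (rest_distinct * total_rows)
        true_sel = count / total_rows
        if (flat_sel > seqscan_cutoff) != (true_sel > seqscan_cutoff):
            mcvs.append(val)          # the MCV actually changes the plan
            mcv_rows += count
    return mcvs
```

Note the slightly counterintuitive consequence: a hugely common value whose plan is a seq scan either way earns no slot, while the rare values ( whose flat estimate would wrongly force a seq scan ) do.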
Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die
>> That one's used, too, but the other is used as an upper bound. n_distinct tends to come out too small on large tables, so that formula is prone to overestimation.
> Actually, both formulas are prone to overestimation.

Right - thanks.

> When this happens depends on the values of a whole boat-load of GUCs...

Well, then doesn't the 'proper' number of buckets depend on a whole boat-load of GUCs...

-Nathan
Re: [HACKERS] extended operator classes vs. type interfaces
> The advantage of specifying a + and a - in the type interface is that the unit definition can then be specified as part of the type declaration itself. So you can do:
>
> CREATE TYPE ts_sec AS RANGE OVER timestamp (UNIT = '1s');
> CREATE TYPE ts_min AS RANGE OVER timestamp (UNIT = '1m');
>
> All of the stuff about defining + and - is hidden from the user - it's part of the type interface, which is pre-created.

The disadvantage is that it does not permit irregularly spaced units.

-Nathan
Re: [HACKERS] plpython3
> On the basis of all of the foregoing, I don't think we can consider this patch further for this CommitFest and will update commitfest.postgresql.org accordingly.

FWIW, I am very excited about this patch and would be happy to review it but have been very busy over the past month. If I can promise a review by Thursday morning, could we keep it active? Hopefully, at the very least, I can provide some useful feedback and spawn some community interest.

I am worried that there is a bit of a chicken-and-egg problem with this patch. I code nearly exclusively in Python and C, but I have often found pl/python to be very unwieldy. For this reason I often use pl/perl or pl/pgsql for problems that, outside of postgres, I would always use Python for. From the documentation, this patch seems like an enormous step in the right direction.

-Nathan
Re: [HACKERS] plpython3
> I think it would be great for you to review it... I doubt that will cause it to get committed for 9.0, but my doubt is no reason for you to hold off reviewing it.

I assumed so, but the pretense of a chance will probably help to motivate me :-) I'll have something by Thursday, and then 'Returned with Feedback' will at least be factual.

Best,
Nathan
Re: [HACKERS] Range types
>> Because intervals (mathematical not SQL) can be open or closed at each end point we need to know what the next and previous value would be at the specified granularity. And while you can do some operations without knowing this, there are many you can't. For instance you could not tell whether two [] or () ranges were adjacent, or be able to coalesce an array of ranges.
> This statement seems to me to demonstrate that you don't actually understand the concept of open and closed ranges. It has nothing whatsoever to do with assuming that the data type is discrete; these concepts are perfectly well defined for the reals, for example. What it is about is whether the inclusion conditions are < bound or <= bound.

IMHO the first question is whether, for integers, [1,2] UNION [3,5] should be equal to [1,5]. In math this is certainly true, and defining 'next' seems like a reasonable way to establish this in postgres.

The next question is whether, for floats, [1,3-FLT_EPSILON] UNION [3,5] should be [1,5].

And the next question is whether, for numeric(6,2), [1,2.99] UNION [3,5] should be [1,5].

FWIW, I would answer yes, no, yes to those three questions.

-Nathan
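The integer case is the easy one precisely because 'next' exists. A toy sketch ( closed integer ranges as (lo, hi) tuples; the helper name is invented ):

```python
def union_int(a, b):
    """Union of two closed integer ranges, as a single range if they
    overlap or are adjacent. Adjacency is exactly where the successor
    ('next value') of a bound is needed: succ(hi) == other lo."""
    if a[0] <= b[1] + 1 and b[0] <= a[1] + 1:
        return (min(a[0], b[0]), max(a[1], b[1]))
    return None              # the union is not a single connected range

# [1,2] UNION [3,5] = [1,5] because succ(2) == 3 ...
assert union_int((1, 2), (3, 5)) == (1, 5)
# ... but [1,2] UNION [4,5] stays disconnected.
assert union_int((1, 2), (4, 5)) is None
```

For floats or numeric there is no universally agreed successor, which is exactly why the three questions above have to be answered per type.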
Re: [HACKERS] Range types
Right, I don't think strings are any more or less countable than integers. (and yes, it's a bit OT). Well, if I remember my algebra, I think the issue is that integers are locally finite whereas strings are not ( assuming standard orderings, of course ). -Nathan
Re: [HACKERS] Python 3.1 support
Again, I'm only one user. But so far I haven't seen anyone else speak up here, and clearly accepting this for inclusion will need nontrivial convincing. Well, FWIW, I am excited about better type integration. Also, I am a little skeptical about this patch. I am sorry if this has already been discussed, but would this mean that I need to choose whether pl/python is built against Python 2.* or Python 3.*? -Nathan
Re: [HACKERS] Python 3.1 support
Here's the patch to support Python >= 3.1 with PL/Python. The compatibility code is mostly in line with the usual 2-to-3 C porting practice and is documented inline. I took a cursory look at this patch and, while the logic seems sound and roughly in line with the suggested Python porting procedure, I'm not quite certain what this implies for potential future patches. For instance, if I wanted to write a type converter for bytea - the Python 3 bytes type - would the expectation be that I ensure that it works in Python 2? Or is an ifdef that ignores it in the case of Python 2 OK, so we can just put a note in the docs? Also, how far back do we want to maintain 2.x compatibility? 2.0? If I wanted to submit a patch that makes use of the list sort method, do I need to ensure that it can use either the cmp argument or the key argument? What if I wanted to implement a set-returning function that made use of an iterator's next() method? Would I just put ifdefs around the code, or a preprocessor definition that defines NEXT as next() for Python 2.x and __next__() for 3.x? I guess that my first impression is that Python broke compatibility for a reason, and that either plpython can't evolve, or it will quickly become impossible to maintain. That being said, I mostly buy the maintenance arguments from the previous discussion, but if we want to have plpython and plpython3, a bunch of defines and ifdefs does not seem like the best way to do this. Would a better approach be to maintain a compatibility layer? i.e., plpython_compat.h/c, plpython2.c, plpython3.c. Then patches that apply to Python 3 can be applied to plpython3.c, and any changed function can be ripped out of plpython_compat and moved into plpython2. I'm sorry to snipe from the sidelines like this.
If we didn't expect plpython to evolve then this patch seems like the correct approach, but there is clearly some desire to expand plpython and following this path seems like it will end in a much more painful split in the future or a necessary rewrite. If there is some consensus that this is the best approach, then I will do a more comprehensive review. -Nathan
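As an aside, for the next()/__next__() case specifically, pure-Python code can stay version-agnostic without any ifdef-style forking, since the builtin next() exists from 2.6 on. A sketch ( illustrative only, not from the patch ):

```python
# A version-agnostic way to advance an iterator, instead of hard-coding
# it.next() (Python 2) or it.__next__() (Python 3).  The builtin next()
# is available from Python 2.6 on, so one spelling covers both lines.

def advance(it, default=None):
    """Return the next item from an iterator, or a default when exhausted."""
    try:
        return next(it)
    except StopIteration:
        return default

rows = iter([('a', 1), ('b', 2)])
first = advance(rows)   # ('a', 1)
second = advance(rows)  # ('b', 2)
empty = advance(rows)   # None: iterator exhausted
```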
Re: [HACKERS] operator exclusion constraints
After reading the docs in the patch I don't believe you're going to all this trouble to ensure two circles don't overlap. Can you give some better examples of what you're trying to achieve and why anyone else would care? (I'm busy, so are others). Non-overlapping time intervals are one use case - think about room scheduling. I personally want to use it to ensure the consistency of genomic annotations. -Nathan
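For the curious, the heart of the room-scheduling case is just pairwise interval overlap - the predicate such a constraint would enforce. A toy sketch ( plain Python with made-up numbers, not the backend code ):

```python
def overlaps(a, b):
    """Half-open intervals (start, end) overlap iff each starts before
    the other ends -- the idea behind an && operator on time ranges."""
    return a[0] < b[1] and b[0] < a[1]

# Room scheduling: a constraint would reject a booking that overlaps
# an existing one for the same room.
existing = (9, 11)                     # 9:00-11:00
conflict = overlaps(existing, (10, 12))  # True: 10:00-12:00 collides
ok = overlaps(existing, (11, 12))        # False: back-to-back is fine
```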
Re: [HACKERS] Multi-Dimensional Histograms
I'm finding myself unable to follow all the terminology on this thread. What's dimension reduction? What's PCA? ( Sorry for the jargon - thanks Josh ) It feels like what you might need is statistics for colB (MCVs and/or a histogram) for certain particular values of colA. Certainly - this is exactly what a multi-dimensional histogram would store. Unfortunately, in the general case the set of values of colA for which you need these statistics might be inconveniently large. Which is why, in the general case, we need some sort of dimension reduction. If we could say that, for instance, the distribution of colB is roughly the same for all values of colA less than 100, and it is roughly the same for all values of colA >= 100, then we would have effectively reduced the dimension from ndistinct(colB)*ndistinct(colA) to 2*ndistinct(colB). The multidimensional histogram in the above paper is somewhat akin to what you suggested and I just described - it attempts to select contiguous regions that are the same. PCA is a much different approach, but it is still, broadly, a dimension reduction technique. At the end of the day, we can't do any dimension reduction unless the ordering encodes some sort of useful information, and the data type being in R^n is certainly no guarantee. Consider, for instance, the cross-correlation of zip codes and area codes - you would really want to order those by some geographic relation. I think that is why cross-column stats is so hard in the general case. To clarify my response to Tom, directly above: if, as in Robert's example, all of the values in colA < 100 were different than all of the values >= 100 with respect to colB, then it is easy to represent the structure in something manageable. i.e., we store two histograms for colB: one for colA < 100, one for colA >= 100.
However, if there are the same two classes in colB ( let's say C1 and C2 ), but rather than C1 being associated with values in colA < 100, it is associated with 100 completely random values in colA ( i.e. 1, 22, 47, 89, 91, 92, 107, ... ), there is not a whole lot we can do besides storing a list of all of those values. We could maybe use the ratio of classes to improve the average plan choice, but you would still get a lot of bad plans. Could anyone provide a concrete example ( or better yet, a data set ) where this occurs? -Nathan
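To illustrate the manageable case with a toy example ( made-up data, not from a real catalog ): one global estimate for colB is badly wrong on both sides of the colA split, while two conditional histograms get it exactly right.

```python
from collections import Counter

# Toy data: colB is 'x' exactly when colA < 100, 'y' otherwise,
# so colB's distribution depends sharply on which side of 100 colA is.
rows = [(a, 'x' if a < 100 else 'y') for a in range(200)]

def eq_fraction(pairs, value):
    """Fraction of rows whose colB equals `value` -- a selectivity estimate."""
    counts = Counter(b for _, b in pairs)
    return counts[value] / len(pairs)

# One global histogram: estimates 0.5 for 'x' regardless of colA.
global_est = eq_fraction(rows, 'x')

# Two conditional histograms, one per colA region, as described above.
low_rows = [r for r in rows if r[0] < 100]
cond_est = eq_fraction(low_rows, 'x')  # given colA < 100
```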
Re: [HACKERS] Multi-Dimensional Histograms
For things like PostGIS, which will want to index in 4 dimensions (x, y, z, t), we might want to have multi-dimensional selectivity histograms and some way to use same. Another use case is cross column statistics. Anybody here qualified to check out this paper on the subject, please speak up :) Well, I don't know if I'm qualified per se, but here goes... It seems like a very reasonable approach to multidimensional histograms. For those who haven't had time to read the paper, it describes SGRID, a greedy algorithm for partitioning an n-dimensional space into non-overlapping n-dimensional boxes and 'outliers' ( points that don't fit into any boxes, but don't warrant their own boxes ). Then, they compare their method to a regression technique, where the space is fit into a fixed-grid histogram and then the grid densities are estimated via regression. They also compare a plane-splitting algorithm, called MHIST, where the planes are constrained to be orthogonal to the naive dimension vectors. They dismiss singular value decomposition and the discrete wavelet transform as being too parametric ( which is silly, IMHO ) and briefly mention a couple standard clustering techniques ( which are probably not appropriate for us, given their complexity ). Unsurprisingly, it does well on the two test sets that they consider. I think the general idea is fine, but it would certainly need some modification to work for postgres. In particular, using the naive dimension vectors ( i.e., north and east for geo data, or column a and column b for cross-column stats ) in combination with the box constraint will probably lead to problems. Consider, for example, a river that goes in a straight line north-east, and a table that stores camp sites which are mostly along the river. Because SGRID can only create boxes with north- and east-facing edges, you would end up with a bunch of tiny boxes along the river where one long, north-east-pointing box would do.
We would probably want to replace the error term that SGRID uses to determine when to create a new box by a statistical test ( maybe homogeneity of means? ) ( also, the same holds for the outliers ). Finally, this creates the partition but ( AFAICT ) it doesn't describe a method for locating the histogram estimate given a point ( although that doesn't seem too difficult ). -Nathan
Re: [HACKERS] Multi-Dimensional Histograms
Finally, this creates the partition but ( AFAICT ) it doesn't describe a method for locating the histogram estimate given a point ( although that doesn't seem too difficult ). Is that not difficult, in terms of the math that needs doing, or not difficult, in terms of how well PostgreSQL is already set up to implement, or...? I only meant that any implementation would need to address this, but I can think of simple ways to do it ( for instance, use the fixed width grid method, and then store a reference to all of the intersecting boxes ). But I am sure that there are much better ways ( nested containment lists perhaps? ). -Nathan
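A sketch of the fixed-width grid idea I mean ( all names and numbers illustrative, 2-D for brevity ): each grid cell stores references to the boxes that intersect it, so a point lookup is a cell index plus a scan of that cell's candidates.

```python
# Fixed-grid lookup over non-overlapping boxes, a la SGRID's partition.
# A box is ((xlo, xhi), (ylo, yhi), density); everything here is a toy.

CELL = 10  # grid cell width on a [0, 100) x [0, 100) space

def cell_of(x, y):
    return (int(x // CELL), int(y // CELL))

def build_index(boxes):
    """Map each grid cell to the list of boxes intersecting it."""
    index = {}
    for box in boxes:
        (xlo, xhi), (ylo, yhi), _ = box
        for cx in range(int(xlo // CELL), int(xhi // CELL) + 1):
            for cy in range(int(ylo // CELL), int(yhi // CELL) + 1):
                index.setdefault((cx, cy), []).append(box)
    return index

def estimate(index, x, y):
    """Density of the box containing (x, y), or 0.0 for an outlier point."""
    for (xlo, xhi), (ylo, yhi), dens in index.get(cell_of(x, y), []):
        if xlo <= x <= xhi and ylo <= y <= yhi:
            return dens
    return 0.0

idx = build_index([((0, 25), (0, 25), 0.4), ((50, 99), (50, 99), 0.1)])
```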
Re: [HACKERS] Multi-Dimensional Histograms
On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane t...@sss.pgh.pa.us wrote: David Fetter da...@fetter.org writes: On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote: ... They dismiss singular value decomposition and the discrete wavelet transform as being too parametric ( which is silly, IMHO ) Should we have a separate discussion about eigenvalues? Wavelets? I think it'd be a short discussion: what will you do with non-numeric datatypes? We probably don't really want to assume anything stronger than that the datatype has a total ordering. Well, in the general case, we could use their ranks. At the end of the day, we can't do any dimension reduction unless the ordering encodes some sort of useful information, and the data type being in R^n is certainly no guarantee. Consider, for instance, the cross-correlation of zip codes and area codes - you would really want to order those by some geographic relation. I think that is why cross-column stats is so hard in the general case. That being said, for geographic data in particular, PCA or similar could work well. -Nathan
Re: [HACKERS] hist boundary duplicates bug in head and 8.3
Surely the most important point in the OP was that ineqsel does not correctly binary search in the presence of duplicates. It would have been if I were correct :-( . Looking at it again, that was from a bug in my code. Thanks for your time, and sorry about the noise. -Nathan
Re: [HACKERS] hist boundary duplicates bug in head and 8.3
For heavy tailed distributions, it is possible for analyze to duplicate histogram boundaries. I don't think this is a bug. hmmm... Well, I assumed it was a bug from a comment in analyze. From ( near ) line 2130 in analyze.c * least 2 instances in the sample. Also, we won't suppress values * that have a frequency of at least 1/K where K is the intended * number of histogram bins; such values might otherwise cause us to * emit duplicate histogram bin boundaries. */ If this is expected, I'm also not sure what the use of maxmincount in analyze is... Thanks for the response, Nathan
Re: [HACKERS] benchmarking the query planner
One more direction could be implementing MCV for range of values (group values and interpolate in between). Consider statistics on timestamp column that says that for 2008-December there are as many X rows, for 2008-November as many as Y, etc. That could be used for rather accurate cardinality estimation of between cases, while keeping number of entries in MCV list small. I suggested this earlier ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ). If there's interest, I'd be happy to clean up my patch and submit it for 8.5 -Nathan
Re: [HACKERS] benchmarking the query planner
What is the specific difference between what you are talking about and what scalarineqsel already implements? Hmm... Nothing new. Sorry for bothering you. I did not realize histograms are implemented. Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket ( which we don't track ). IMHO, the whole idea of increasing mcv's seems a mistake. Why not use the limited storage in pg_statistic to try and estimate the selectivity for ranges of values rather than a single value? That gives way better coverage of the distribution. If a value's count is too high to fit in a single bucket, we put it in an mcv slot anyways. *That* should be the mechanism by which the number of mcv's increases. I guess this is a bit off topic for the middle of a commit fest though. -Nathan
Re: [HACKERS] benchmarking the query planner
Thanks for the response. Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket. Really? Well, we would obviously also care about the total number of values in the buckets if we were trying to use the histogram in eqsel. Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? That's currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets. IMHO, the whole idea of increasing mcv's seems a mistake. Why not use the limited storage in pg_statistic to try and estimate the selectivity for ranges of values rather than a single value? MCVs are useful for questions that are not related to ranges of values --- an example of something highly useful we do with them is to try to estimate the fraction of a column that satisfies a LIKE or regex pattern. Good point. I guess I was responding to the eqsel benchmarks. I should remember that I don't necessarily know all the places that mcv's are used. In fact, as I was just pointing out to Bruce, we can compute them and do useful things with them for datatypes that don't have a defined sort order and so the whole concept of range is meaningless. Another good point. But don't we have bigger stat problems for datatypes without an ordering relation? IIRC, analyze doesn't even fully compute the mcv list, as that would be N^2 in the sample size. Now, if your point is that it'd be more flexible to not tie the MCV list length to the histogram length, you're right. No, my point is just the opposite. I think the two should be *more* tightly linked. It seems that ( at least for eqsel and scalarineqsel ) mcv's should be the values that the histogram can't deal with effectively.
i.e., there's no ordering relation, there are too many values to fit into a histogram bucket, the histogram eqsel estimate does an especially bad job for a relatively common value, etc. Even now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? That just bloats pg_statistic and slows down joins unnecessarily. I'll submit a patch for 8.5 and then, hopefully, some simple benchmarks can make my point. -Nathan
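To spell out the per-bucket estimate I'm proposing ( toy numbers, not from a real benchmark ): the selectivity of "x = v" becomes the bucket's row fraction divided by the distinct count within that bucket, instead of one table-wide ndistinct for every non-mcv value.

```python
# Proposed estimate: sel(x = v) ~ (fraction of rows in v's bucket)
#                                  / (ndistinct within that bucket).
# Today's estimate uses a single table-wide ndistinct instead.

def eqsel_per_bucket(bucket_fraction, ndistinct_in_bucket):
    return bucket_fraction / ndistinct_in_bucket

# Two equal-population buckets (each 50% of the rows) with very
# different distinct counts: a skewed bucket of 2 values and a
# sparse one of 500 values.
skewed = eqsel_per_bucket(0.5, 2)    # high selectivity region
sparse = eqsel_per_bucket(0.5, 500)  # low selectivity region
flat_global = 1.0 / 502              # one number for both regions
```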
Re: [HACKERS] benchmarking the query planner
Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? That's currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets. But the histogram buckets are (meant to be) equal-population, so it should come out the same either way. Why is it the same either way? The histogram buckets have equal population in the number of total values, not the number of distinct values. Consider [0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be [0,2) and [2,+oo), but they would have 2 and 5 distinct values respectively. The removal of MCVs from the population will knock any possible variance in ndistinct down to the point where I seriously doubt that this could offer a win. Well, if the number of MCVs is large enough then it certainly won't matter. But isn't that pointlessly inefficient for most distributions? I provided some empirical evidence of this in an earlier post ( http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ) for normal distributions. An even bigger problem is that this requires estimation of ndistinct among fractions of the population, which will be proportionally less accurate than the overall estimate. Accurate ndistinct estimation is *hard*. Agreed, but I'm not convinced that the overall error rate will go up for our current estimator. Ndistinct estimation is hardest for populations with a large ndistinct count variation. If the assumptions underlying this are founded ( namely, that ndistinct distributions are related to the ordering relation in a predictable way ) then ndistinct estimation should be easier, because the ndistinct distribution will be more uniform. But that's certainly something that should be tested on real data.
now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? Didn't you just contradict yourself? The cutoff would be 1% not 0.1%. No, I'm saying that if the number of histogram buckets is 100, then even if the mcv list is set to length 2, every value that occupies 1% or more of the table will be an mcv. However, if the size is fixed to 100 I think that it will be common to see values with relative frequencies much lower than 1% near the bottom of the list. In any case there's already a heuristic to cut off the MCV list at some shorter length (ie, more than 1% in this example) if it seems not worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD) in analyze.c. Given an increase in the default stats target, limits like the 25% are exactly what I think there should be more of. Can anyone suggest a good data set to test this sort of question on? -Nathan
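Working through the [0,0,0,0,1,2,3,4,5,6] example from above in Python: the two buckets hold the same number of rows but very different numbers of distinct values.

```python
# The example from the discussion: equal-population histogram buckets
# need not hold equal numbers of *distinct* values.
values = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]

low = [v for v in values if v < 2]    # bucket [0, 2)
high = [v for v in values if v >= 2]  # bucket [2, +oo)

assert len(low) == len(high) == 5     # equal population, as intended

distinct_low = len(set(low))          # {0, 1} -> 2 distinct values
distinct_high = len(set(high))        # {2, 3, 4, 5, 6} -> 5 distinct values
# So a per-bucket eqsel estimate would differ by a factor of 2.5
# between the two buckets, where a global ndistinct sees no difference.
```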
Re: [HACKERS] Simple postgresql.conf wizard
Thanks for putting out pgtune - it's a sorely needed tool. I had a chance to look over the source code and have a few comments, mostly about Python-specific coding conventions. - The Windows-specific try block ( line 16 ) raises a ValueError vs ImportError on my Debian machine. Maybe it would be more appropriate to explicitly test platform.system() == 'Windows'? - from ctypes import * ( 18 ) makes the block difficult to read and pollutes the namespace. - on line 45, the try block should probably catch exceptions derived from Exception ( to avoid catching SystemExit and KeyboardInterrupt errors ), i.e. except Exception: return None. Also, printing out the exception in debug mode would probably be a good idea ( i.e. except Exception, e: print e; return None ) - all classes ( 58, 135, 205 ) are 'old-style' classes. I don't see any reason to use classic classes ( unless Python 2.1 is a support goal? ). To make classes 'new-style' classes ( http://www.python.org/doc/2.5.2/ref/node33.html ) they should inherit from object, i.e. class PGConfigLine(object): - The doc strings ( 59, 136, 206 ) don't follow standard conventions, described here: http://www.python.org/dev/peps/pep-0257/. - Functions also support doc strings ( 342, 351, etc. ) - Tuple unpacking doesn't require the tuple boundaries ( 446 and others ), i.e. options, args = ReadOptions() works. This is more of a style comment about the 'Java-ish interface' ( line 112 ), feel free to ignore it. Overloading __str__ and __repr__ is the Python way to return string representations of an object, i.e., instead of toString use __str__, and then ( on 197 ) print l or print str(l) instead of print l.toString(). For the other methods ( getValue, getLineNumber, isSetting ) I'm assuming you didn't call the attributes directly because you didn't want them to be accidentally overwritten. Have you considered the use of properties ( http://www.python.org/download/releases/2.2.3/descrintro/#property )?
Also, it would be more clear to mark attributes as private ( i.e. _name or __name, _readable, _lineNumber, _setsParameter ) if you don't want them to be accessed directly. Hope my comments are useful! Thanks again for writing this. -Nathan P.S. I'd be happy to officially review this if it gets to that.
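To roll those suggestions into one place, here is a hypothetical, trimmed-down version of such a class ( this is not the actual pgtune code, just an illustration of the conventions ): a new-style class with a docstring, __str__ instead of toString(), and a property over an underscore-prefixed backing attribute.

```python
# Illustrative only: a config-line class written with the conventions
# suggested in the review (new-style class, docstring, __str__, property).

class PGConfigLine(object):
    """One line of postgresql.conf, parsed into a name/value pair."""

    def __init__(self, name, value, line_number):
        self._name = name                # "private" by convention
        self._value = value
        self._line_number = line_number

    @property
    def line_number(self):
        """Read-only access without a Java-style getLineNumber()."""
        return self._line_number

    def __str__(self):
        return "%s = %s" % (self._name, self._value)

line = PGConfigLine("shared_buffers", "128MB", 42)
text = str(line)  # works with print, str(), "%s" formatting, etc.
```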
Re: [HACKERS] Cross-column statistics revisited
I still need to go through backend/utils/adt/selfuncs.c to figure out exactly how we use the one-dimensional values. Here's a page that helped me figure all this out. http://www.postgresql.org/docs/8.1/static/planner-stats-details.html 2) Do we want to fold the MCV's into the dependence histogram? That will cause problems in our copula approach but I'd hate to have to keep an N^d histogram dependence relation in addition to the copula. Yeah, if we're already trying to figure out how to compress copulae, having also to compress MCV matrices seems painful and error-prone. But I'm not sure why it would cause problems to keep them in the copula -- is that just because we are most interested in the copula modeling the parts of the distribution that are most sparsely populated? The problem I was thinking of is that we don't currently store the true marginal distribution. As it stands, histograms only include non mcv values. So we would either need to take the mcv's separately ( which would assume independence between mcv's and non-mcv values ) or store multiple histograms. 4) How will this approach deal with histogram buckets that have scaling count sizes ( ie -0.4 )? I'm not sure what you mean here. That was more a note to myself, and should have been numbered 3.5. ndistinct estimates currently start to scale after a large enough row/ndistinct ratio. If we try to model ndistinct, we need to deal with scaling ndistinct counts somehow. But that's way off in the future, it was probably pointless to mention it.
Re: [HACKERS] Cross-column statistics revisited
Right now our histogram values are really quantiles; the statistics_target T for a column determines a number of quantiles we'll keep track of, and we grab values from it into an ordered list L so that approximately 1/T of the entries in that column fall between values L[n] and L[n+1]. I'm thinking that multicolumn statistics would instead divide the range of each column up into T equally sized segments. Why would you not use the same histogram bin bounds derived for the scalar stats (along each axis of the matrix, of course)? This seems to me to be arbitrarily replacing something proven to work with something not proven. Also, the above forces you to invent a concept of equally sized ranges, which is going to be pretty bogus for a lot of datatypes. Because I'm trying to picture geometrically how this might work for the two-column case, and hoping to extend that to more dimensions, and am finding that picturing a quantile-based system like the one we have now in multiple dimensions is difficult. I believe those are the same difficulties Gregory Stark mentioned having in his first post in this thread. But of course that's an excellent point, that what we do now is proven. I'm not sure which problem will be harder to solve -- the weird geometry or the equally sized ranges for data types where that makes no sense. Look at copulas. They are a completely general method of describing the dependence between two marginal distributions. It seems silly to rewrite the stats table in terms of joint distributions when we'll still need the marginals anyways. Also, it might be easier to think of the dimension reduction problem in that form.
Re: [HACKERS] Cross-column statistics revisited
I'm still working my way around the math, but copulas sound better than anything else I've been playing with. I think the easiest way to think of them is, in 2-D finite spaces, they are just a plot of the order statistics against one another. Feel free to mail me off list if you have any math questions. I've previously thought that, at least in the 2D case, we could use image compression algorithms to compress the copula, but recently I've realized that this is a change point problem. In terms of compression, we want to decompose the copula into regions that are as homogenous as possible. I'm not familiar with change point problems in multiple dimensions, but I'll try and ask someone that is, probably late next week. If you decide to go the copula route, I'd be happy to write the decomposition algorithm - or at least work on the theory. Finally, a couple points that I hadn't seen mentioned earlier that should probably be considered- 1) NULL's need to be treated specially - I suspect the assumption of NULL independence is worse than other independence assumptions. Maybe dealing with NULL dependence could be a half step towards full dependence calculations? 2) Do we want to fold the MCV's into the dependence histogram? That will cause problems in our copula approach but I'd hate to have to keep an N^d histogram dependence relation in addition to the copula. 3) For equality selectivity estimates, I believe the assumption that the ndistinct value distribution is uniform in the histogram will become worse as the dimension increases. I proposed keeping track of ndistinct per histogram bucket earlier in the marginal case, partially motivated by this exact scenario. Does that proposal make more sense in this case? If so we'd need to store two distributions - one of the counts and one of ndistinct. 4) How will this approach deal with histogram buckets that have scaling count sizes ( ie -0.4 )?
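To make the "plot of the order statistics against one another" view concrete, a minimal sketch ( toy data, no binning or compression ): rank-transform each column to (0, 1] and look at the joint distribution of the ranks.

```python
# Minimal empirical-copula sketch: map each column to its normalized
# ranks; the joint distribution of the rank pairs is (an estimate of)
# the copula.  With perfectly co-monotone columns, all mass sits on
# the diagonal u = v, regardless of the marginal distributions.

def ranks(xs):
    """Normalized ranks in (0, 1], assuming no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank / len(xs)
    return r

col_a = [3, 1, 4, 1.5, 9]
col_b = [30, 10, 40, 15, 90]  # strictly increasing function of col_a
u, v = ranks(col_a), ranks(col_b)
on_diagonal = all(abs(ui - vi) < 1e-9 for ui, vi in zip(u, v))
```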
Re: [HACKERS] IN, BETWEEN, spec compliance, and odd operator names
Yes, but always in relation to operator classes, so from BTree's opclass or such, which I referred to as the context of index searches, as I don't really see any theoretical need for opclass if it's not for indexing. IIRC analyze uses the btree opclass to build the selectivity histogram. -Nathan
Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
Assuming that the threshold for switching to an indexscan is somewhere around selectivity 0.005 (I am not certain offhand, but it's in that general area), this cannot possibly require more than 200 MCV slots, and for most data distributions it'd be a whole lot less. That's a really good point. Given such an MCV list, the planner will always make the right choice of whether to do an index or seq scan. Given that, wouldn't it be smarter to consider a value as an mcv candidate iff it has a density greater than 0.005, rather than having a count greater than 1.5*average? This would allow people to raise the hard mcv limit without having to worry as much about including worthless mcv values... Cheers, Nathan
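A toy sketch of the density cutoff I'm suggesting ( the threshold and sample are made up ): only values whose sample frequency clears the index-vs-seqscan threshold get MCV slots, so thinly spread values never bloat the list.

```python
from collections import Counter

# Admit a value to the MCV list only if its density clears a fixed
# selectivity threshold (~0.005 here), rather than merely beating
# some multiple of the average frequency.

def mcvs_by_density(sample, threshold=0.005):
    n = len(sample)
    return {v for v, c in Counter(sample).items() if c / n > threshold}

# A 1000-row sample: one value at 10% of the rows, the other 900
# values spread thin at 0.1% each.
sample = ['hot'] * 100 + ['v%d' % i for i in range(900)]
chosen = mcvs_by_density(sample)  # only 'hot' clears the threshold
```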
Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. That's a good point, but I don't see a clear solution. Maybe we could look at past queries and keep track of how often they return empty result sets? It seems that, in some ways, we care about the distribution of the query values in addition to the column values... I think for low ndistinct values we will want to know the exact value + counts and not a bin. So I think we will want additional stats rows that represent value = 'a1' stats. Isn't that what our most frequent values list does? Maybe? Do we have the relevant stats for each? But the trick is to then exclude those values from the histogram bins. Currently, the histogram is only made up of non-mcv values.
Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. That's a good point, but I don't see a clear solution. Maybe we could I think that MCVs are the solution, right? Only if they cover the entire range of values in the table. A low ndistinct means that those values will likely be MCVs. Yes, but I don't think that's the point. If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows because it ( implicitly ) assumes that every query will return at least 1 record.
Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows because it ( implicitly ) assumes that every query will return at least 1 record. That's intentional and should not be changed. Why? What if ( somehow ) we knew that there was a 90% chance that a query would return an empty result set on a big table with 20 non-mcv distinct values? Currently the planner would always choose a seq scan, where an index scan might be better. Better yet, couldn't that be optimized to *if record exists, execute seq scan*? That being said, I think queries are generally searching for values that exist in the table. I can't see the value of allowing fractional-row estimates anyway. Neither can I, but I could probably think of cases where knowing the SD of the result set could result in better plans. -Nathan
Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
> > Why? What if (somehow) we knew that there was a 90% chance that a query would return an empty result set on a big table with 20 non-MCV distinct values? Currently the planner would always choose a seq scan, where an index scan might be better.
>
> (1) On what grounds do you assert the above? For a table with 100 non-mcv rows, the planner estimates a result set of cardinality 100/20 = 5, not 1. (2) What makes you think that an estimate of zero rather than one row would change the plan?

I see where the confusion is coming from. When I said

> > What if (somehow) we knew that there was a 90% chance that a query would return an empty result set

I meant that the planner doesn't know that information. And how could it? The estimate for ndistinct is an estimate of the number of distinct values in the table, not an estimate of the number of distinct values that will be queried for. My original point was that we sometimes care about the distribution of what's being queried for, and not just what's in the table. But this is all silly anyway: if this was really a concern, you would write a function along the lines of "if values exist, return values, else return none".

> (In fact, I don't think the plan would change, in this case. The reason for the clamp to 1 row is to avoid foolish results for join situations.)

Which makes sense. My point certainly wasn't, in any way, a criticism of clamping selectivity to 1.

Cheers, Nathan
Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
> Your argument seems to consider only columns having a normal distribution.

My example was based upon normally distributed data because people usually know what they are and they are reasonably common.

> How badly does it fall apart for non-normal distributions?

This should work to the extent that there is a correlation between the number of distinct values in a histogram bin and the 'size' of the bin. I know that this isn't a very good answer, but it's a hard problem in the general case. However, for any non-uniform probability distribution there exists a measure under which there is a non-zero correlation between these. I wrote up a hand-wavy semi-formal argument at the end, but it's not tough to see intuitively. Just think about the graph of the CDF. The histogram boundaries are horizontal lines that are equally spaced from top to bottom. If we rescale the x-axis such that there is a fixed distance between each distinct value, and define the distance as (ndistinct in a given interval)/(total ndistinct), then the only place where this doesn't hold is when the CDF is f(x) = x, i.e. the distribution is uniform. And even then we just get a 0 coefficient, which is exactly what we are always assuming now. Obviously we run into problems when
a) we have a poor estimate for ndistinct - but then we have worse problems
b) our length measure doesn't correspond well with ndistinct in an interval

Expanding on b), your mention of Zipfian distributions is a particularly good example of where this could be poor. Right now (someone correct me if I'm wrong) words are sorted alphabetically. However, if we wanted this estimator to do a good job, we would sort them by their length or, better yet, frequency in the English language (which is highly correlated with length).

> (For instance, Zipfian distributions seem to be pretty common in database work, from what I've seen.)

This should work well for any power law distribution.
If people are curious, I can rerun my previous example using a power law distribution instead of a normal one. However, the easy way of thinking about all of this is that we're building a linear model between ndistinct and histogram bucket width. It's intuitive to expect there to be a correlation between the two, and so the model should have some predictive power.

-Nathan

Somewhat formal argument - probably difficult to read without some basic probability theory:

To see this, first note that we can alternatively define the uniform distribution on [a,b] as the distribution whose CDF is a straight line that passes through both (a,0) and (b,1) (just integrate the PDF). So any non-uniform distribution will have a CDF whose slope is both below and above 1/(b-a) at some set of points, implying the existence of an interval [i1, i2] s.t. ( CDF(i2) - CDF(i1) ) > ( i2 - i1 )/(b-a). Then, under the constraints of the probability measure, there exists a second disjoint interval s.t. ( CDF(i2') - CDF(i1') ) < ( i2' - i1' )/(b-a). In other words,

Next, assume that the number of potential distinct values in our interval scales linearly with the length of the interval. Although it seems as if this assumption could be somewhat burdensome, there always exists a measure under which this is true for sets with a defined order relation (as remarked earlier by Tom, we are already making this assumption). To see this, consider defining length(i1, i2) as (the number of distinct values in [i1, i2])/(total number of distinct values), where the set of distinct values is { v | v >= i1 and v <= i2 }.

Next, note that the joint distribution of identically distributed, independent random variables is multinomial with cell probabilities given by the value of the PDF at each distinct point. Next, I'll state without proof that for an IID RV the expected number of distinct values is maximized for a uniform distribution (this is pretty easy to see: think about the binomial case.
Do you want your cell probabilities to be (1.0, 0) or (0.5, 0.5)?) Finally, note that the number of expected distinct values decreases faster than linearly in the length of the interval. This is pretty clear when we consider the sparse case. As the number of potential entries (in this case, the interval length) approaches infinity, the probability of a new entry being distinct approaches 1. This means that, in this limit, every new entry ends up being distinct, i.e. the number of distinct values scales linearly in the number of new entries. As the interval shrinks, new entries have some probability of being repeats. As the interval shrinks to 0, there is a zero probability of new entries being unique. Since
a) there doesn't exist a linear relationship that contains the two boundary points,
b) the multinomial distribution of the PDF is continuous, and
c) the relationship is clearly decreasing,
we can surmise that it is sub-linear. Therefore, we have two intervals that have sub- and super-linear slopes that cancel one another. However,
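The intuitive claim above (wider equi-depth buckets hold more distinct values for a non-uniform distribution) is easy to check empirically; a sketch using a rounded normal sample with the same mean/SD as the heights example (all names and sample sizes here are mine, purely illustrative):

```python
# Draw heights, round to the nearest cm, build a 10-bucket equi-depth
# histogram, and correlate bucket width with per-bucket ndistinct.
import random

random.seed(1)
vals = sorted(round(random.gauss(1.7526, 0.0762), 2) for _ in range(100000))

nbuckets = 10
per = len(vals) // nbuckets
widths, ndistincts = [], []
for b in range(nbuckets):
    bucket = vals[b * per:(b + 1) * per]
    widths.append(bucket[-1] - bucket[0])      # bucket "size"
    ndistincts.append(len(set(bucket)))        # distinct values in bucket

# Pearson correlation between bucket width and per-bucket ndistinct
mw = sum(widths) / nbuckets
md = sum(ndistincts) / nbuckets
cov = sum((w - mw) * (d - md) for w, d in zip(widths, ndistincts))
sw = sum((w - mw) ** 2 for w in widths) ** 0.5
sd = sum((d - md) ** 2 for d in ndistincts) ** 0.5
corr = cov / (sw * sd)
print(corr)  # strongly positive: the wide edge buckets carry many distinct values
```

The same experiment swapped to a power-law sample would show whether the correlation survives, which is the open question in the thread.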
[HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
Currently eqsel assumes that, except for the values stored as MCVs, the number of times that a value appears in a table column is independent of its value. Unfortunately, this often yields inappropriate selectivity estimates and frequently leads to inappropriate plans.

As an example, consider an insurance company that keeps a record of patient heights. Assume there are 10^6 patient heights in this column, and they are distributed normally with a mean of 1.7526 and a standard deviation of 0.0762 (meters). Furthermore, assume that the heights are only measured to the nearest centimeter. Then, we'd expect there to be about 73 distinct heights, with an SD of 1.5. Ignoring the effects of MCVs, the planner expects

SELECT height FROM heights WHERE height = 1.75;

to yield roughly 13000 results. However, given that we know the underlying distribution, we would expect to see ~52000 results. Similarly, the planner expects to see 13000 results from

SELECT 1.75 FROM heights WHERE height = 2.05;

while we expect to see ~2.7.

Obviously this example is not totally convincing: if I were to post this to pg-general looking for advice, I'm sure that everyone would tell me to just increase the size of my MCV stats. However, in cases where the number of distinct values is higher, this isn't always feasible. Also, why store a list of 50 values and their frequencies when 10 extra would provide the same plans without bloating pg_statistics?

To combat this problem, I have two different proposals.

Idea 1: Keep an array of stadistinct values, one per histogram bucket.
In the example above (again ignoring MCVs), the quantile data is

 quantile:    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
 boundary:  1.38  1.66  1.69  1.71  1.73  1.75  1.77  1.79  1.82  1.85  2.12

with per-bucket ndistinct values of (respectively)

 29  2  2  2  2  2  2  3  3  25

For the two above examples, this new approach would yield row estimates of

 (10^6/10)/2 = 50000   ( vs an actual ED of ~52000 )

and

 (10^6/10)/25 = 4000   ( vs an actual ED of ~2.7 )

Furthermore, this is done without MCVs. Since MCVs would make the histogram more sensitive at the edges, the estimates with MCVs should be correspondingly better. There are two potential problems that I see with this approach:
1) It assumes that '=' is equivalent to '>= and <='. This is certainly true for real numbers, but is it true for every equality relation that eqsel predicts for?
2) It bloats the stats table.

Idea 2: Keep a correlation statistic between ndistinct and bucket size.

This addresses problem #2. In lieu of keeping an actual list of ndistinct per histogram bucket, we store the linear scaling coefficient between histogram bucket width and ndistinct/(avg ndistinct). To visualize this, it is easiest to consider plotting bucket width versus ndistinct. The scaling coefficient is the slope of the line that passes through the origin and minimizes the square of the difference between its estimate for ndistinct and the actual value. When I apply this method to the above data I find a coefficient of 13.63 for an average ndistinct of 72/10. This provides row estimates, for the above two examples, of

 (10^6/10)/( 13.63*7.2*(1.77 - 1.75) ) = 50950   ( vs an actual ED of ~52000 )

and

 (10^6/10)/( 13.63*7.2*(2.12 - 1.85) ) = 3774   ( vs an actual ED of ~2.7 )

Although this yields better results than Idea 1 for this particular example, it will be much more sensitive to weird distributions. Obviously there are some special cases to consider: we wouldn't want the stats to be skewed such that they provide really bad plans.
However, with some carefully designed caps I believe that we could ensure that the estimates are at least as good as they are now. In fact, I'm not certain that an R^2 penalty is the correct loss function. Ideally, we want to minimize the extra time that the db spends by choosing an incorrect plan. Maybe slight overestimates are better than slight underestimates? Maybe the cost of the occasional (really) bad plan is less than the cost of a bunch of kinda-bad plans?

Finally, we aren't limited to just one coefficient. We could also store multiple coefficients to improve our estimates, and provide a compromise between Ideas 1 and 2. Food for future thought...

In addition to the previous benefits, I think that this method has the potential to make the process by which MCVs are chosen (or not chosen) smarter. Now the planner chooses a value to be an MCV candidate if its frequency is greater than 1.25 * the average frequency. Given that this improved selectivity estimate is implemented, maybe a better way would be to include a value as an MCV only if it is a) above a certain threshold and b) the histogram selectivity estimator does a poor job on it.

What are people's thoughts on Idea 1 vs Idea 2? Am I missing any relevant details about the planner's operation? Do people think that the improved estimates would be worth the additional overhead?

-Nathan
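For anyone who wants to check the Idea 2 arithmetic, here is a sketch of the through-origin least-squares coefficient as I read the description above (plain Python, not planner code; the bucket data is from the worked example):

```python
# Reproduce the Idea 2 coefficient from the example's histogram data:
# fit ndistinct/(avg ndistinct) against bucket width with a line through
# the origin, then use it to estimate rows for an equality probe.
bounds = [1.38, 1.66, 1.69, 1.71, 1.73, 1.75, 1.77, 1.79, 1.82, 1.85, 2.12]
ndist  = [29, 2, 2, 2, 2, 2, 2, 3, 3, 25]

widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
avg_nd = sum(ndist) / len(ndist)                     # 7.2
scaled = [d / avg_nd for d in ndist]

# through-origin least squares: c = sum(x*y) / sum(x*x)
c = sum(w * s for w, s in zip(widths, scaled)) / sum(w * w for w in widths)

rows_per_bucket = 10**6 / 10

def estimate(lo, hi):
    """Estimated rows for an equality probe landing in bucket [lo, hi]."""
    return rows_per_bucket / (c * avg_nd * (hi - lo))

print(round(c, 2))                  # ~13.63, matching the coefficient in the post
print(round(estimate(1.75, 1.77)))  # ~50950, as in the post
print(round(estimate(1.85, 2.12)))  # ~3774, as in the post
```

Note how one stored float (plus avg ndistinct, which ANALYZE already has) recovers nearly the same estimates as Idea 1's full per-bucket array, which is the space/accuracy trade-off being proposed.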