Re: [HACKERS] Measure Theoretic Data Types in Postgresql

2012-10-12 Thread Nathan Boley
 However, by realizing that the bounds on the ranges have a linear ordering
 one can speed this up to O(m) using windowing functions on common table
 expressions.

 So what I am proposing is formalizing this optimization into a class of data
 types, that will hide the implementation details.

Could this not also be handled by extending merge join to work with an
overlap operator?




Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Nathan Boley
 What about MCV's? Will those be removed as well?

 Sure.  Those seem even less useful.

Ya, this will destroy the performance of several queries without some
heavy tweaking.

Maybe this is bad design, but I've gotten in the habit of storing
sequences as arrays and I commonly join on them. I looked through my
code this morning, and I only have one 'range' query ( of the form
described up-thread ), but there are tons of the form

SELECT att1, attb2 FROM rela, relb WHERE rela.seq_array_1 = relb.seq_array;

I can provide some examples if that would make my argument more compelling.

Sorry to be difficult,
Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Nathan Boley
[ sorry Tom, reply all this time... ]

 What do you mean by storing sequences as arrays?

So, a simple example is, for transcripts ( sequences of DNA that are
turned into proteins ), we store each of the connected components as
an array of the form:

exon_type in [1,6]
splice_type in [1,3]

and then the array elements are

[ exon_type, splice_type, exon_type ]

~ 99% of the elements are of the form [ [1,3], 1, [1,3] ],

so I almost always get a hash or merge join ( correctly ) but for the
rare junction types ( which are usually more interesting as well ) I
correctly get nest loops with an index scan.

 Can you demonstrate
 that the existing stats are relevant at all to the query you're worried
 about?

Well, if we didn't have mcv's and just relied on ndistinct to estimate
the '=' selectivities, every value would get the same estimate: either
my common-value quals would ( wrongly ) use the index, or my rare-value
quals would ( wrongly ) use a table scan.

I guess I could wipe out the stats and get some real numbers tonight,
but I can't see how the planner would be able to distinguish *without*
mcv's...

Best,
Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Nathan Boley
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alexander Korotkov aekorot...@gmail.com writes:
 On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:
 I've attached a new version that includes the UINT64_FMT fix, some edits of
 your newest comments, and a rerun of pgindent on the new files.  I see no
 other issues precluding commit, so I am marking the patch Ready for
 Committer.
 If I made any of the comments worse, please post another update.

 Changes looks reasonable for me. Thanks!

 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.

If I understand your suggestion, queries of the form

SELECT * FROM rel
WHERE ARRAY[ 1,2,3,4 ] <= x
 AND x <= ARRAY[ 1, 2, 3, 1000];

would no longer use an index. Is that correct?

Are you suggesting removing MCV's in lieu of MCE's as well?

-Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Nathan Boley
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Nathan Boley npbo...@gmail.com writes:
 On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.

 If I understand your suggestion, queries of the form

 SELECT * FROM rel
 WHERE ARRAY[ 1,2,3,4 ] <= x
      AND x <= ARRAY[ 1, 2, 3, 1000];

 would no longer use an index. Is that correct?

 No, just that we'd no longer have statistics relevant to that, and would
 have to fall back on default selectivity assumptions.

Which, currently, would mean queries of that form would typically use
a table scan, right?

 Do you think that
 such applications are so common as to justify bloating pg_statistic for
 everybody that uses arrays?

I have no idea, but it seems like it will be a substantial regression
for the people that are.

What about MCV's? Will those be removed as well?

Best,
Nathan



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Nathan Boley
 I actually don't know much about the I/O subsystem, but, no, WAL is
 not separated from data.  I believe $PGDATA is on a SAN, but I don't
 know anything about its characteristics.

The computer is here:
http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm

$PGDATA is on a 5 disk SATA software raid 5.

Is there anything else that would be useful to know? ( Sorry, this
isn't a subject that I'm very familiar with )

-Nathan



Re: [HACKERS] WIP: collect frequency statistics for arrays

2011-11-28 Thread Nathan Boley
 Version of patch with few more comments and some fixes.

 Where are we on the idea of better statistics for arrays?

I need to finish reviewing the patch: I'll try to send in something
tomorrow night. So far it looks good.

Best,
Nathan



Re: [HACKERS] Collect frequency statistics for arrays

2011-11-15 Thread Nathan Boley
 Rebased with head.

FYI, I've added myself as the reviewer for the current commitfest.

Best,
Nathan Boley



Re: [HACKERS] WIP: collect frequency statistics for arrays

2011-10-15 Thread Nathan Boley
 Looking now, I see that Alexander wasn't Cc'd on the review, so it's
 possible he missed the message?


We've corresponded off list and have discussed my review at some length.

Alex submitted an updated patch on Sep 22 to me personally ( although
not to the list? Alex? ), with the promise of a new version with
improved comments.

Best,
Nathan



Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-15 Thread Nathan Boley
 Actually, it is not the MCV slot that is used but MCELEM. It is a different
 slot. The pg_stats view maps both into the same fields.

Yes, you're right. I am sorry about the error.

 Surely, non-standard use of histogram slot
 should be avoided.

Agreed.

Best,
Nathan



[HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-14 Thread Nathan Boley
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539

I glanced through the code and played around with the feature and,
in general, it looks pretty good. But I have some comments/questions.

Design:

First, it makes me uncomfortable that you are using the MCV and histogram slot
kinds in a way that is very different from other data types.

I realize that tsvector uses MCV in the same way that you do but:

1) I don't like that very much either.
2) TS vector is different in that equality ( in the btree sense )
doesn't make sense, whereas it does for arrays.

Using the histogram slot for the array lengths is also very surprising to me.

Why not just use a new STA_KIND? It's not like we are running out of
room, and this will be the second 'container' type that splits the container
and stores stats about the elements.


Second, I'm fairly unclear about what the actual underlying statistical
method is and what assumptions it makes. Before I can really review
the method, I will probably need to see some documentation, as I say below.
Do you have any tests/simulations that explore the estimation procedure
that I can look at? When will it fail? In what cases does it improve
estimation?

Documentation:

Seems a bit lacking - especially if you keep the non-standard usage of
mcv/histograms. Also, it would be really nice to see some update to
the row-estimation-examples page ( in chapter 55 ).

The code comments are good in general, but there are not enough high-level
comments. It would be really nice to have a comment that gives an overview
of what each of the following functions are supposed to do:

mcelem_array_selec
mcelem_array_contain_overlap_selec

and especially

mcelem_array_contained_selec

Also, in compute_array_stats you reference compute_tsvector_stats for
the algorithm, but I think that it would be smarter to either copy the
relevant portions of the comment over or to reference a published document.
If compute_tsvector_stats changes, I doubt that anyone will remember to
update the comment.

Finally, it would be really nice to see, explicitly and in
mathematical notation:

1) The data that is being collected and summarized ( the statistic ).
2) The estimate being derived from the statistic ( ie the selectivity ).
   i) Any assumptions being made ( ie independence of elements within
      an array ).

For instance, the only note I could find that actually addressed the
estimator and the assumptions underneath it was

+ * mult * dist[i] / mcelem_dist[i] gives us probability
+ * of qual matching from assumption of independent element occurence
+ * with condition that length = i.

and that's pretty cryptic. That should be expanded upon and placed
more prominently.

A couple nit picks:

1) In calc_distr you go to some lengths to avoid round off errors. Since it is
   certainly just the order of the estimate that matters, why not just
   perform the calculation in log space? ( see the sketch after this list )

2) compute_array_stats is really long. Could you break it up? ( I know that
   the other analyze functions are the same way, but why continue in
   that vein? )
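
If it helps, here is a rough Python sketch of what I mean by working in
log space ( the probabilities are invented and this is not the actual
calc_distr logic, just the general trick ):

import math

def log_product(probs):
    # Return log(prod(probs)); -inf if any factor is zero. Working in
    # logs preserves the order of magnitude even when the product itself
    # would underflow a double.
    total = 0.0
    for p in probs:
        if p == 0.0:
            return float('-inf')
        total += math.log(p)
    return total

probs = [1e-5] * 80

# Naive multiplication underflows to exactly 0.0 ...
naive = 1.0
for p in probs:
    naive *= p
print(naive)               # 0.0

# ... while the log-space result still tells us the magnitude ( ~1e-400 ).
print(log_product(probs))  # about -921, ie 80 * ln(1e-5)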

Best,
Nathan



Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Nathan Boley
 Personally, I think the first thing we ought to do is add a real, bona
 fide planner hint to override the selectivity calculation manually,
 maybe something like this:

 WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);



If you're going to go that far, why not just collect statistics on
that specific predicate?

ie, ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1);

Then it won't fall subject to all of the pitfalls that Tom outlines below.

Selectivities are easy to estimate if we know the predicate. They only
become hard when they have to work for every possible predicate.
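
As a rough illustration ( a Python sketch; the synthetic rows and the
sampling scheme are just mine, not a proposed implementation ), the
statistic for a known predicate is just the matching fraction of a sample:

import random

def estimate_predicate_selectivity(rows, predicate, sample_size=10000):
    # Estimate the selectivity of a *known* predicate by sampling:
    # count the fraction of sampled rows that satisfy it.
    sample = random.sample(rows, min(sample_size, len(rows)))
    matches = sum(1 for row in sample if predicate(row))
    return matches / len(sample)

# Hypothetical rows with columns (x, y); the predicate mirrors the
# example above: x < 5 AND y = 1.
rows = [(random.randint(0, 99), random.randint(0, 9)) for _ in range(100000)]
print(estimate_predicate_selectivity(rows, lambda r: r[0] < 5 and r[1] == 1))
# close to 0.05 * 0.1 = 0.005 for this synthetic data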

Best,
Nathan



Re: [HACKERS] Range Types: empty ranges

2011-02-11 Thread Nathan Boley
FWIW, a very informal survey of probabilists didn't yield any reason
for trying to put an order on the empty set ( unless the metric was
cardinality or other equivalence relation ).

I think the problem here is that the idea of union and intersection
forming a ring over sets is being conflated with the order relation.
Clearly, combining the two notions can be inconsistent.  However...

   A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C)

 But the basic range type isn't even closed under UNION.

 An excellent point. Allow me to move the target a little:

  WHERE A && B AND A && C
 and:
  WHERE A && (B INTERSECT C)

 That seems like a logically sound transformation, but if (B INTERSECT C)
 is empty, it relies on the empty range for those two to be equivalent.

 Now, I agree that lack of closure on UNION exhibits many of the problems
 that I am pointing out related to forbidding empty ranges. However, I'm
 not sure if that means we should give up on either.

This seems potentially very useful, because we can transform WHERE A
&& B AND A && C from a bitmap scan into WHERE A && (B INTERSECT C), a
simple index scan. In the union case ( even if we had a type that
supported disjoint intervals), I doubt we would ever make that
transformation because the index will probably still be over connected
intervals.

So, +1 for keeping it how it is ( but maybe with a better error message ).

Best,
Nathan



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 19.1.2011 20:25, Robert Haas wrote:
 On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Yes, I was aware of this problem (amount of memory consumed with lots of
 updates), and I kind of hoped someone will come up with a reasonable
 solution.

 As far as I can see, periodically sampling some or all of the table is
 really the only thing that has a chance of working OK.  You could try
 to track changes only when they're not too big, but I think that's
 still going to have nasty performance consequences.

 Yes, sampling all the table is the only way to get reliable ndistinct
 estimates.

IMHO continuing to focus on worst case results is a dead end. Yes, for
some distributions, ndistinct is very hard to estimate well. However,
let us not forget that the current ndistinct estimator is very bad but
the postgres planner is, on the whole, very good.  As far as I can see
this is due to two facts:

1) The distribution of values in a table is rarely pathological, and
usually follows one of several common patterns. ( IOW, we have good
heuristics )

2) The choice of plan is fairly robust to mis-estimation of ndistinct,
because there are only a few things the executor can choose. It
doesn't usually matter if a value composes 5% or 20% ( or 99% ) of
the table, we still usually need to scan the entire table.

Again, in my *very* humble opinion, If the ultimate goal is to improve
the planner, we should try to cut down on the number of cases in which
a poor estimate of ndistinct gives a really bad plan instead of
chasing after marginal gains from a better ndistinct estimator. Maybe
( as I've advocated for before ) this means parameterizing the
distribution of values ( ie, find that the values are roughly uniform
), maybe this means estimating the error of our statistics and using
the most robust rather than the best plan, or maybe it means something
totally different. But: ( and the literature seems to support me )
pounding away at the ndistinct estimator seems like a dead end. If you
think about it, it's a bit ridiculous to look at the whole table
*just* to estimate ndistinct - if we go that far why don't we just
store the full distribution and be done with it?

Best,
Nathan



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
 If you think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

 - the best you could do is to average the
 individual probabilities which gives ... well, 1/ndistinct.


That is certainly *not* the best you could do in every case. The mean
is only the best estimate under L2 ( quadratic ) loss, which is definitely
not the loss we are facing here.

Consider a table with 10K values, 9,990 of which are 1 and the rest of
which are 2, 3, ..., 10, versus a table that has the same 10 distinct
values evenly distributed. For a simple equality query, in the first
case, a bitmap scan might be best. In the second case, a sequential
scan would always be best.
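
Here is a toy Python sketch of the point, using the two tables above:
the average of the per-value selectivities is 1/ndistinct for both
tables, even though the individual selectivities ( and hence the
sensible plans ) are completely different.

from collections import Counter

def per_value_selectivities(values):
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

skewed  = [1] * 9990 + list(range(2, 11))                 # ~10K rows, value 1 dominates
uniform = [v for v in range(1, 11) for _ in range(1000)]  # same 10 values, evenly spread

for name, table in (("skewed", skewed), ("uniform", uniform)):
    sels = per_value_selectivities(table)
    avg = sum(sels.values()) / len(sels)
    print(name, round(avg, 3), round(sels[1], 4))
# skewed  0.1 0.9991   <- average is 1/ndistinct, but value 1 is ~99.9%
# uniform 0.1 0.1      <- average is 1/ndistinct, and every value really is 10%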

This is precisely the point I was trying to make in my email: the loss
function is very complicated. Improving the ndistinct estimator could
significantly improve the estimates of ndistinct ( in the quadratic
loss sense ) while only marginally improving the plans.

-Nathan



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley

 Not true. You're missing the goal of this effort - to get ndistinct
 estimate for combination of multiple columns. Which is usually
 pathological in case of dependent columns.

snip

 Again, don't think about a single column (although even in that case
 there are known fail cases). Think about multiple columns and the number
 of distinct combinations. With attribute value independence assumption,
 this is usually significantly underestimated. And that's a problem as it
 often leads to an index scan instead of sequential scan (or something
 like that).

My point was that, in the case of single columns, we've done pretty
well despite the poor ndistinct estimates. I suspect the same will be
true in the case of multiple columns; good heuristics will trump
minimax estimators.

 ( as I've advocated for before ) this means parameterizing the
 distribution of values ( ie, find that the values are roughly uniform
 ), maybe this means estimating the error of our statistics and using
 the most robust rather than the best plan, or maybe it means something
 totally different. But: ( and the literature seems to support me )

 Which literature? I've read an awful lot of papers on this topic lately,
 and I don't remember anything like that. If there's an interesting
 paper, I'd like to read it. Yes, all the papers state estimating a
 ndistinct is a particularly tricky task, but that's somehow expected?

It is definitely expected - non-parametric minimax results are always
*very* weak. The papers that you've cited seem to confirm this;
bounding ndistinct estimation error is tricky in the general case (
even with very simple loss functions, which we do not have ). The same
is true for other non-parametric methods. Think about kernel density
estimation: how many data points do you need to estimate the density
of a normal distribution well? What about if you *know* that the data
is normal ( or even that it comes from a big family )?

 No, I'm not trying to do this just to improve the ndistinct estimate.
 The goal is to improve the cardinality estimate in case of dependent
 columns, and one of the papers depends on good ndistinct estimates.

 And actually it does not depend on ndistinct for the columns only, it
 depends on ndistinct estimates for the combination of columns. So
 improving the ndistinct estimates for columns is just a necessary first
 step (and only if it works reasonably well, we can do the next step).

I think that any approach which depends on precise estimates of
ndistinct is not practical.

I am very happy that you've spent so much time on this, and I'm sorry
if my previous email came off as combative. My point was only that
simple heuristics have served us well in the past and, before we go to
the effort of new, complicated schemes, we should see how well similar
heuristics work in the multiple column case. I am worried that if the
initial plan is too complicated then nothing will happen and, even if
something does happen, it will be tough to get it committed ( check
the archives for cross column stat threads - there are a lot ).

Best,
Nathan Boley



Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote:
 On Jan20, 2011, at 02:41 , Nathan Boley wrote:
 If you think about it, it's a bit ridiculous to look at the whole table
 *just* to estimate ndistinct - if we go that far why dont we just
 store the full distribution and be done with it?

 - the best you could do is to average the
 individual probabilities which gives ... well, 1/ndistinct.

 That is certainly *not* the best you could do in every case. The mean
 is only the best estimate in L2, which is definitely not the case
 here.

 No, not in every case. But on average it comes out best, no?

In the sense of minimizing the average plan cost over all values?
Definitely not. Consider the trivial case where a bitmap scan costs
just epsilon more than a sequential scan.


 Consider a table with 10K values, 9,990 of which are 1 and the rest of
 which are 2, 3, ..., 10, versus a table that has the same 10 distinct
 values evenly distributed. For a simple equality query, in the first
 case, a bitmap scan might be best. In the second case, a sequential
 scan would always be best.

 True. But selectivity estimates alone don't show the difference there.

Yet the full distribution would - supporting my argument that even in
cases where we don't know a specific value, the full distribution is
more informative.


 Also, in my personal experience this isn't really the area we've got
 problems now. The problem cases for me always were queries with a rather
 large number of joins (7 to 10 or so) plus rather complex search
 conditions. The join order, not the access strategy, then becomes the
 deciding factor in plan performance. And the join order *is* driven purely
 off the selectivity estimates, unlike the access strategy which even today
 takes other factors into account (like clusteredness, I believe)


I think I'm losing you. Why would ndistinct be completely sufficient
for estimating the optimal join order?

 This is precisely the point I was trying to make in my email: the loss
 function is very complicated. Improving the ndistinct estimator could
 significantly improve the estimates of ndistinct ( in the quadratic
 loss sense ) while only marginally improving the plans.


 The point of this exercise isn't really to improve the ndistinct estimates
 for single columns. Rather, we'd like to get ndistinct estimates for
 groups of columns because that allows us to use the uniform bayesian
 approach to multi-column selectivity estimation that Tomas found literature
 on. Which on the whole seems like it *does* have a chance of greatly
 improving the plans in some cases. We could, of course, estimate multi-column
 ndistinct the same way we estimate single-column ndistincts, but quite a few
 people raised concerns that this wouldn't work due to the large error our
 current ndistinct estimates have.

Sure. But estimating multi-column ndistinct is surely not the only way
to approach this problem.

The comment I made, which you objected to, was that it's silly to look
at the whole table and then throw away all information *except*
ndistinct. If we really think that looking at the whole table is the
best approach, then shouldn't we be able to get better statistics? Is
this really an open question?

-Nathan



Re: [HACKERS] Why don't we accept exponential format for integers?

2010-12-17 Thread Nathan Boley
 print int(1e+01)
 10



That isn't building an integer: it is creating a float and casting to an int.

try:

int( 1e100 )
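
For what it's worth, a quick Python 2 session makes the float detour
visible ( the exact low-order digits aren't worth reproducing here ):

>>> type(1e100)
<type 'float'>
>>> int(1e100) == 10**100
False    # 1e100 is the nearest double to 10**100, not 10**100 itself
>>> len(str(int(1e100)))
101      # right magnitude, wrong low-order digits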

Best,
Nathan



Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Nathan Boley
 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.  For example, in the United States,
 there are over 40,000 postal codes, and some even larger number of
 city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if I'm repeating something
that's already been said )

My intuition is that storing the covariance structure will be
unworkable both technically and statistically for all but the simplest
of cases. That being said, I don't think the problem is a lack of ways
to parameterize the covariance structure ( there are several papers on
multi-dim histogram estimators, at least one of which talks about
databases explicitly, not to mention approaches like CART [1] ), but a
complete lack of infrastructure to do anything with the estimates. So
keep working on it ;-) ( but try to make the framework general enough
to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR
operators since they seem like the most common cases and union and
intersection commute ( although it's important to remember that the
estimate variances do NOT ). That is, what about estimating the
selectivity of the condition

WHERE X=A and Y=B

by

f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) +
x_2*selectivity(X = A)*selectivity( Y = B ) + x_3

where x_{1,2,3} are parameters to be estimated from the data.
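
Just to sketch what I mean, here is one ( of many possible ) ways the
parameters could be fit; the training data is hypothetical and ordinary
least squares is just the simplest choice:

import numpy as np

# Hypothetical training data: single-column selectivities plus the
# observed joint selectivity for a handful of sampled (A, B) pairs.
sel_x = np.array([0.10, 0.02, 0.30, 0.05, 0.20])   # selectivity(X = A)
sel_y = np.array([0.05, 0.40, 0.10, 0.25, 0.15])   # selectivity(Y = B)
joint = np.array([0.02, 0.015, 0.08, 0.02, 0.05])  # observed selectivity of X = A AND Y = B

# Design matrix matching f(A,B) = x_1*(sx + sy) + x_2*(sx*sy) + x_3
design = np.column_stack([sel_x + sel_y, sel_x * sel_y, np.ones_like(sel_x)])
(x1, x2, x3), *_ = np.linalg.lstsq(design, joint, rcond=None)

def f(sx, sy):
    return x1 * (sx + sy) + x2 * (sx * sy) + x3

print(x1, x2, x3)
print(f(0.10, 0.05))   # estimated joint selectivity for a new predicate pair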

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Best,
Nathan

[1] http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2] http://en.wikipedia.org/wiki/Copula_%28statistics%29



Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Nathan Boley
 Robert explained why having more MCVs might be useful because we use
 the frequency of the least common MCV as an upper bound on the
 frequency of any value in the MCV.

Where is that being used? The only non-MCV frequency estimate that I
recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously
changing the number of mcv's affects this by lowering
n_ndistinct_rows, but it's always a pretty coarse estimate.

  Binding the length of the MCV list to the size of the histogram is
 arbitrary but so would any other value

Wouldn't the best approach be to stop adding MCV's/histogram buckets
when adding new ones doesn't decrease your prediction error
'substantially'?

One very hacky threshold heuristic is to stop adding MCV's when a
simple equality select ( SELECT col FROM table WHERE col = VALUE )
would switch the plan from an index to a sequential scan ( or vice
versa, although with the current code this would never happen ). ie,
if the non_mcv frequency estimate is 0.1% ( producing an index scan ),
but adding the MCV gives us an estimate of 5% ( probably producing a seq
scan ) then add that value as an mcv. More sophisticated variations
might also consider plan changes to very suboptimal joins; even more
sophisticated would be to stop when MAX( ( current plan cost - optimal
plan cost ) / optimal plan cost ) was below some threshold, say 20%,
over a bunch of recently executed queries.

A similar approach would work for histogram bins, except the queries
of interest are inequality rather than equality selections.
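
Here is a rough Python sketch of the equality heuristic; the cost model
is a deliberately crude stand-in and the 2% crossover point is invented:

# Crude stand-in cost model: below some selectivity an index scan wins,
# above it a sequential scan wins.
PLAN_CROSSOVER_SELECTIVITY = 0.02

def plan_for(selectivity):
    return "index scan" if selectivity < PLAN_CROSSOVER_SELECTIVITY else "seq scan"

def should_add_mcv(value_freq, non_mcv_estimate):
    # Keep a candidate MCV only if knowing its true frequency would
    # change the chosen plan relative to the generic non-MCV estimate.
    return plan_for(value_freq) != plan_for(non_mcv_estimate)

# Matches the example above: generic estimate 0.1% -> index scan, true
# frequency 5% -> seq scan, so the value is worth an MCV slot.
print(should_add_mcv(0.05, 0.001))    # True
# A 0.1% value when the generic estimate is already 0.15% changes nothing.
print(should_add_mcv(0.001, 0.0015))  # False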

-Nathan



Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Nathan Boley
 That one's used, too, but the other is used as an upper bound.
 n_distinct tends to come out too small on large tables, so that
 formula is prone to overestimation.  Actually, both formulas are prone
 to overestimation.


Right - thanks.

 When this happens depends on the values of a whole boat-load of GUCs...

Well, then doesn't the 'proper' number of buckets depend on a whole
boat-load of GUCs...

-Nathan



Re: [HACKERS] extended operator classes vs. type interfaces

2010-04-09 Thread Nathan Boley
 The advantage of specifying a + and a - in the type interface is that
 the unit definition can then be specified as part of the type
 declaration itself.  So you can do:

 CREATE TYPE ts_sec AS RANGE OVER timestamp (UNIT = '1s');
 CREATE TYPE ts_min AS RANGE OVER timestamp (UNIT = '1m');

 All of the stuff about defining + and - is hidden from the user - it's
 part of the type interface, which is pre-created.


The disadvantage is that it does not permit irregularly spaced units.

-Nathan



Re: [HACKERS] plpython3

2010-02-01 Thread Nathan Boley
 On the basis of all of the foregoing, I don't think we can consider
 this patch further for this CommitFest and will update
 commitfest.postgresql.org accordingly.

FWIW, I am very excited about this patch and would be happy to review
it but have been very busy over the past month. If I can promise a
review by Thursday morning could we keep it active? Hopefully, at the
very least, I can provide some useful feedback and spawn some
community interest.

I am worried that there is a bit of a chicken and an egg problem with
this patch. I code nearly exclusively in python and C, but I have
often found pl/python to be very unwieldy.  For this reason I often
use pl/perl or pl/pgsql for problems that, outside of postgres, I
would always use python. From the documentation, this patch seems like
an enormous step in the right direction.

-Nathan



Re: [HACKERS] plpython3

2010-02-01 Thread Nathan Boley
 I think it would be great for you to review it... I doubt that will
 cause it to get committed for 9.0, but my doubt is no reason for you
 to hold off reviewing it.

I assumed so, but the pretense of a chance will probably help to motivate me :-)

I'll have something by Thursday, and then 'Returned with Feedback'
will at least be factual.

Best,
Nathan



Re: [HACKERS] Range types

2009-12-14 Thread Nathan Boley
 Because intervals (mathematical not SQL) can be open or closed at each
 end point we need to know what the next an previous value would be at
 the specified granularity. And while you can do some operations without
 knowing this, there are many you can't. For instance you could not tell
 whether two [] or () ranges were adjacent, or be able to coalesce an
 array of ranges.

 This statement seems to me to demonstrate that you don't actually
 understand the concept of open and closed ranges.  It has nothing
 whatsoever to do with assuming that the data type is discrete;
 these concepts are perfectly well defined for the reals, for example.
 What it is about is whether the inclusion conditions are < bound
 or <= bound.

IMHO the first question is whether, for integers, [1,2] UNION [3,5]
should be equal to [1,5]. In math this is certainly true, and defining
'next' seems like a reasonable way to establish this in postgres.

The next question is whether, for floats, [1,3-FLT_EPSILON] UNION
[3,5] should be [1,5].

And the next question is whether, for numeric(6,2), [1,2.99] UNION
[3,5] should be [1,5].

FWIW, I would answer yes, no, yes to those three questions.
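
To sketch the intuition behind those answers ( the 'granule' formulation
below is just mine, not a proposed range-type API ):

from fractions import Fraction

def adjacent(upper_a, lower_b, granule):
    # Closed ranges [.., upper_a] and [lower_b, ..] are adjacent iff the
    # successor of upper_a ( upper_a + granule ) equals lower_b.
    # granule is the type's step: 1 for integers, Fraction(1, 100) for
    # numeric(6,2), and None for a type treated as continuous ( floats ).
    if granule is None:
        return False                      # no well-defined 'next' value
    return upper_a + granule == lower_b

print(adjacent(2, 3, 1))                                  # True:  [1,2] UNION [3,5] = [1,5]
print(adjacent(Fraction(299, 100), 3, Fraction(1, 100)))  # True:  [1,2.99] UNION [3,5] = [1,5]
print(adjacent(3.0 - 1e-7, 3.0, None))                    # False: float ranges don't coalesce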

-Nathan



Re: [HACKERS] Range types

2009-12-14 Thread Nathan Boley
 Right, I don't think strings are any more or less countable than integers.
 (and yes, it's a bit OT).

Well, if I remember my algebra, I think the issue is that integers are
locally finite whereas strings are not ( assuming standard orderings,
of course ).

-Nathan



Re: [HACKERS] Python 3.1 support

2009-11-18 Thread Nathan Boley
 Again, I'm only one user.  But so far I haven't seen anyone else speak
 up here, and clearly accepting this for inclusion will need nontrivial
 convincing.

Well, FWIW, I am excited about better type integration.

Also, I am a little skeptical about this patch. I am sorry if this has
already been discussed, but would this mean that I need to choose
whether pl/python is built against Python 2.* or Python 3.*?

-Nathan



Re: [HACKERS] Python 3.1 support

2009-11-18 Thread Nathan Boley
 Here's the patch to support Python =3.1 with PL/Python.  The
 compatibility code is mostly in line with the usual 2-3 C porting
 practice and is documented inline.


I took a cursory look at this patch and, while the logic seems sound
and roughly in line with the suggested python porting procedure, I'm
not quite certain what this implies for potential future patches.

For instance, if I wanted to write a type converter for bytea <-> the
Python 3 bytes type, would the expectation be that I ensure that it
works in Python 2? Or is an ifdef that ignores it in the case of
Python 2 OK, so that we can just put a note in the docs?

Also, how far back do we want to maintain 2.x compatibility? 2.0? If I
wanted to submit a patch that makes use of the list sort method, do I
need to ensure that it can either use the cmp argument or a key
argument?

What if I wanted to implement a set returning function that made use
of an iterator's next() method? Would I just put ifdefs around the code
or a preprocessor definition that defines NEXT as next() for Python
2.x and __next__() for 3.x?

I guess that my first impression is that Python broke compatibility
for a reason, and that either plpython can't evolve, or it will
quickly become impossible to maintain. That being said, I mostly buy
the maintenance arguments from the previous discussion, but if we want
to have plpython and plpython3, a bunch of defines and ifdefs does not
seem like the best way to do this.

Would a better approach be to maintain a compatibility layer? ie
plpython_compat.h/c
plpython2.c
plpython3.c

Then patches that apply to Python 3 can be applied to plpython3.c and
any changed function can be ripped out of plpython_compat and moved
into plpython2.c.

I'm sorry to snipe from the sidelines like this. If we didn't expect
plpython to evolve then this patch seems like the correct approach,
but there is clearly some desire to expand plpython and following this
path seems like it will end in a much more painful split in the future
or a necessary rewrite.

If there is some consensus that this is the best approach, then I will
do a more comprehensive review.

-Nathan



Re: [HACKERS] operator exclusion constraints

2009-11-01 Thread Nathan Boley
 After reading the docs in the patch I don't believe you're going to all
 this trouble to ensure two circles don't overlap. Can you give some
 better examples of what you're trying to achieve and why anyone else
 would care? (I'm busy, so are others).


Non-overlapping time intervals are one use case - think about room
scheduling. I personally want to use it to ensure the consistency of
genomic annotations.

-Nathan



Re: [HACKERS] Multi-Dimensional Histograms

2009-06-30 Thread Nathan Boley
 I'm finding myself unable to follow all the terminology on this thead.
  What's dimension reduction?  What's PCA?

( Sorry for the jargon - thanks Josh )

 It feels like what you might need is statistics for colB (MCVs and/or a
 histogram) for certain particular values of colA.

Certainly - this is exactly what a multi-dimensional histogram would store.

 Unfortunately, in
 the general case the set of values of colA for which you need these
 statistics might be inconveniently large.


Which is why, in the general case, we need some sort of dimension reduction.

If we could say that, for instance, the distribution of colB is
roughly the same for all values of colA less than 100, and it is
roughly the same for all values of colA >= 100, then we would have
effectively reduced the dimension from ndistinct(colB)*ndistinct(colA)
to 2*ndistinct(colB).
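
Here is a rough Python sketch of that reduction on synthetic data ( the
split at colA = 100 mirrors Robert's example ):

from collections import Counter
import random

random.seed(0)
# Synthetic rows (colA, colB): colB's distribution depends only on whether
# colA is below 100, which is exactly the structure we want to exploit.
rows = [(a, random.choice("xy") if a < 100 else random.choice("yz"))
        for a in (random.randrange(200) for _ in range(10000))]

# Instead of one cell per (colA, colB) combination -- up to
# ndistinct(colA) * ndistinct(colB) of them -- keep just two conditional
# histograms of colB: one for colA < 100 and one for colA >= 100.
hist_low  = Counter(b for a, b in rows if a < 100)
hist_high = Counter(b for a, b in rows if a >= 100)

def conditional_freq(a, b):
    hist = hist_low if a < 100 else hist_high
    return hist.get(b, 0) / sum(hist.values())

print(conditional_freq(42, "x"))   # ~0.5: 'x' is common when colA < 100
print(conditional_freq(150, "x"))  # 0.0:  'x' never occurs when colA >= 100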

The multidimensional histogram in the above paper is somewhat akin to
what you suggested and I just described - it attempts to select
contiguous regions that are the same. PCA is a much different
approach, but it is still, broadly, a dimension reduction technique.

 At the end of the day, we can't do any dimension reduction unless the
 ordering encodes some sort of useful information, and the data type
 being in R^n is certainly no guarantee. Consider, for instance, the
 cross correlation of zip-codes and area codes - you would really want
 to order those by some geographic relation. I think that is why
 cross-column stats is so hard in the general case.

To clarify my response to Tom, directly above, if as in Robert's
example, all of the values in colA < 100 were different than all of
the values >= 100 with respect to colB, then it is easy to represent
the structure in something manageable. Ie, we store two histograms for
colB: one for colA < 100, one for colA >= 100. However, if there are the
same two classes in colB ( let's say C1 and C2 ) but rather than C1
being associated with values in colA < 100, it is associated with 100
completely random values in colA ( ie 1, 22, 47, 89, 91, 92, 107, ...
) there is not a whole lot we can do besides storing a list of all of
those values. We could maybe use the ratio of classes to improve the
average plan choice, but you would still get a lot of bad plans.

Could anyone provide a concrete example ( or better yet a data set )
where this occurs?

-Nathan



Re: [HACKERS] Multi-Dimensional Histograms

2009-06-29 Thread Nathan Boley
 For things like PostGIS, which will want to index in 4 dimensions
 (x, y, z, t), we might want to have multi-dimensional selectivity
 histograms and some way to use same.


Another use case is cross column statistics.

 Anybody here qualified to check out this paper on the subject, please
 speak up :)


Well, I don't know if I'm qualified per se, but here goes...

It seems like a very reasonable approach to multidimensional histograms.

For those who haven't had time to read the paper, it describes SGRID,
a greedy algorithm for partitioning an n-dimensional space into
non-overlapping n-dimensional boxes and 'outliers' ( points that don't
fit into any boxes, but don't warrant their own boxes ). Then, they
compare their method to a regression technique,  where the space is
fit into a fixed grid histogram and then the grid densities are
estimated via regression. They also compare a plane splitting
algorithm, called MHIST, where the planes are constrained to be
orthogonal to the naive dimension vectors. They dismiss singular value
decomposition and the discrete wavelet transform as being too
parametric ( which is silly, IMHO ) and briefly mention a couple
standard clustering techniques ( which are probably not appropriate
for us, given their complexity ). Unsurprisingly, it does well on the
two test sets that they consider.

I think the general idea is fine, but it would certainly need some
modification to work for postgres.

In particular,

Using the naive dimension vectors ( ie, north and east for geo data,
or column a and column b for cross-column stats ) in combination with
the box constraint will probably lead to problems. Consider, for
example, a river that goes in a straight line north-east, and a table
that stores camp sites which are mostly along the river. Because SGRID
can only create boxes with north east edges, you would end up with a
bunch of tiny boxes on the river where one long, north-east-pointing box
would do.

We would probably want to replace the error term that SGRID uses to
determine when to create a new box by a statistical test ( maybe
homogeneity of means? ) ( also, the same holds for the outliers ).

Finally, this creates the partition but ( AFAICT ) it doesn't describe
a method for locating the histogram estimate given a point ( although
that doesn't seem too difficult ).

-Nathan



Re: [HACKERS] Multi-Dimensional Histograms

2009-06-29 Thread Nathan Boley
 Finally, this creates the partition but ( AFAICT ) it doesn't describe
 a method for locating the histogram estimate given a point ( although
 that doesn't seem too difficult ).

 Is that not difficult, in terms of the math that needs doing, or
 not difficult, in terms of how well PostgreSQL is already set up to
 implement, or...?


I only meant that any implementation would need to address this, but I
can think of simple ways to do it ( for instance, use the fixed width
grid method, and then store a reference to all of the intersecting
boxes ). But I am sure that there are much better ways ( nested
containment lists perhaps? ).
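
For example, a minimal Python sketch of the fixed-width-grid lookup
( the boxes, densities and cell size are invented ):

# Each histogram 'box' is ((x_lo, x_hi), (y_lo, y_hi), density).
boxes = [((0, 10), (0, 10), 0.50),
         ((10, 40), (0, 5), 0.30),
         ((10, 40), (5, 10), 0.20)]

CELL = 5.0  # fixed grid cell width

# Precompute, for every grid cell, the boxes that intersect it.
grid = {}
for box in boxes:
    (x_lo, x_hi), (y_lo, y_hi), _ = box
    for gx in range(int(x_lo // CELL), int((x_hi - 1e-9) // CELL) + 1):
        for gy in range(int(y_lo // CELL), int((y_hi - 1e-9) // CELL) + 1):
            grid.setdefault((gx, gy), []).append(box)

def lookup(x, y):
    # Jump to (x, y)'s grid cell, then scan the few candidate boxes there.
    for (x_lo, x_hi), (y_lo, y_hi), density in grid.get((int(x // CELL), int(y // CELL)), []):
        if x_lo <= x < x_hi and y_lo <= y < y_hi:
            return density
    return None   # an 'outlier' point, not covered by any box

print(lookup(3, 7))    # 0.5
print(lookup(25, 2))   # 0.3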

-Nathan



Re: [HACKERS] Multi-Dimensional Histograms

2009-06-29 Thread Nathan Boley
On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 David Fetter da...@fetter.org writes:
 On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
 ... They dismiss
 singular value decomposition and the discrete wavelet transform as
 being too parametric ( which is silly, IMHO )

 Should we have a separate discussion about eigenvalues?  Wavelets?

 I think it'd be a short discussion: what will you do with non-numeric
 datatypes? We probably don't really want to assume anything stronger
 than that the datatype has a total ordering.

Well, in the general case, we could use their ranks.

At the end of the day, we can't do any dimension reduction unless the
ordering encodes some sort of useful information, and the data type
being in R^n is certainly no guarantee. Consider, for instance, the
cross correlation of zip-codes and area codes - you would really want
to order those by some geographic relation. I think that is why
cross-column stats is so hard in the general case.

That being said, for geographic data in particular, PCA or similar
could work well.

-Nathan



Re: [HACKERS] hist boundary duplicates bug in head and 8.3

2009-01-07 Thread Nathan Boley
 Surely the most important point in the OP was that ineqsel does not
 correctly binary search in the presence of duplicates.


It would have been if I were correct :-( .

Looking at it again, that was from a bug in my code. Thanks for your
time, and sorry about the noise.

-Nathan



Re: [HACKERS] hist boundary duplicates bug in head and 8.3

2009-01-06 Thread Nathan Boley
 For heavy tailed distributions, it is possible for analyze to
 duplicate histogram boundaries.

 I don't think this is a bug.

hmmm... Well, I assumed it was a bug from a comment in analyze.

From ( near ) line 2130 in analyze.c

 * least 2 instances in the sample.  Also, we won't suppress values
 * that have a frequency of at least 1/K where K is the intended
 * number of histogram bins; such values might otherwise cause us to
 * emit duplicate histogram bin boundaries.
 */

If this is expected, I'm also not sure what the use of maxmincount in
analyze is...

Thanks for the response,

Nathan



Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
 One more direction could be implementing MCV for range of values (group
 values and interpolate in between). Consider statistics on timestamp column
 that says that for 2008-December there are as many X rows, for
 2008-November as many as Y, etc.  That could be used for rather accurate
 cardinality estimation of between cases, while keeping number of entries
 in MCV list small.


I suggested this earlier (
http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ).
If there's interest, I'd be happy to clean up my patch and submit it
for 8.5

-Nathan



Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
 What is the specific difference between what you are talking about and
 what scalarineqsel already implements?

 Hmm... Northing new. Feel sorry for bothering you. I did not realize
 histograms are implemented.


Well, ISTM there is a profound difference. For scalarineqsel we care
about the total number of values in a bucket. For eqsel  we care about
the total number of *distinct* values in each bucket ( which we don't
track ).

IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
the limited storage in pg_statistic to try and estimate the
selectivity for ranges of values rather than a single value? That
gives way better coverage of the distribution. If a value occurs too
often to fit in a single histogram bucket, we put it in an mcv slot
anyways. *That* should be the mechanism by which the number of mcv's
increases.

I guess this is a bit off topic for the middle of a commit fest though.

-Nathan



Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
Thanks for the response.

 Well, ISTM there is a profound difference. For scalarineqsel we care
 about the total number of values in a bucket. For eqsel  we care about
 the total number of *distinct* values in each bucket

 Really?


Well, we would obviously also care about the total number of values in
the buckets if we were trying to use the histogram in eqsel.

Isn't a selectivity estimate of x = v as ( the number of values in v's
histogram bucket ) / ( number of distinct values in v's histogram
bucket ) pretty rational? That's currently what we do for non-mcv
values, except that we look at ndistinct over the whole table instead
of individual histogram buckets.

 IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
 the limited storage in pg_statistic to try and estimate the
 selectivity for ranges of values rather than a single value?

 MCVs are useful for questions that are not related to ranges of values
 --- an example of something highly useful we do with them is to try to
 estimate the fraction of a column that satisfies a LIKE or regex
 pattern.


Good point. I guess I was responding to the eqsel benchmarks. I should
remember that I don't necessarily know all the places that mcv's are
used.

 In fact, as I was just pointing out to Bruce, we can compute them and
 do useful things with them for datatypes that don't have a defined sort
 order and so the whole concept of range is meaningless.


Another good point. But don't we have bigger stat problems for
datatypes without an ordering relation? IIRC, analyze doesn't even
fully compute the mcv list, as that would be N^2 in the sample size.

 Now, if your point is that it'd be more flexible to not tie the MCV list
 length to the histogram length, you're right.

No, my point is just the opposite. I think the two should be *more*
tightly linked. It seems that ( at least for eqsel and scalarineqsel )
mcv's should be the values that the histogram can't deal with
effectively. ie, there's no ordering relation, there are too many
values to fit into a histogram bucket, the histogram eqsel estimate
does an especially bad job for a relatively common value, etc.  Even
now, if there are 100 histogram buckets then any values that occupy
more than 1% of the table will be mcv's regardless - why force a value
to be an mcv if it only occupies 0.1% of the table? That just bloats
pg_statistic and slows down joins unnecessarily.

I'll submit a patch for 8.5 and then, hopefully, some simple
benchmarks can make my point..

-Nathan



Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
 Isn't a selectivity estimate of x = v as ( the number of values in v's
 histogram bucket ) / ( number of distinct values in v's histogram
 bucket ) pretty rational? Thats currently what we do for non-mcv
 values, except that we look at ndistinct over the whole table instead
 of individual histogram buckets.

 But the histogram buckets are (meant to be) equal-population, so it
 should come out the same either way.

Why is it the same either way? The histogram buckets have equal
population in the number of total values, not the number of distinct
values. Consider [0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be
[0,2) and [2,+oo), but they would have 2 and 5 distinct values
respectively.
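
Working that example through in Python:

values = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
buckets = {"[0,2)": [v for v in values if v < 2],
           "[2,+oo)": [v for v in values if v >= 2]}

# Whole-table estimate: every non-mcv value gets rows/ndistinct rows.
print(len(values) / len(set(values)))        # 10 / 7 ~= 1.43 rows per value

# Per-bucket estimate: rows in bucket / distinct values in bucket.
for name, vals in buckets.items():
    print(name, len(vals) / len(set(vals)))  # [0,2): 5/2 = 2.5, [2,+oo): 5/5 = 1.0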

 The removal of MCVs from the
 population will knock any possible variance in ndistinct down to the
 point where I seriously doubt that this could offer a win.

Well, if the number of MCV's is large enough then it certainly won't
matter. But isn't that pointlessly inefficient for most distributions?
I provided some empirical evidence of this in an earlier post (
http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php )
for normal distributions.

 An even
 bigger problem is that this requires estimation of ndistinct among
 fractions of the population, which will be proportionally less accurate
 than the overall estimate.  Accurate ndistinct estimation is *hard*.


Agreed, but I'm not convinced that the overall error rate will go up
for our current estimator. Ndistinct estimation is hardest for
populations with a large ndistinct count variation. If the assumptions
underlying this are founded ( namely that ndistinct distributions are
related to the ordering relation in a predictable way ) then ndistinct
estimation should be easier because the ndistinct distribution will be
more uniform. But that's certainly something that should be tested on
real data.

 now, if there are 100 histogram buckets then any values that occupy
 more than 1% of the table will be mcv's regardless - why force a value
 to be an mcv if it only occupies 0.1% of the table?

 Didn't you just contradict yourself?  The cutoff would be 1% not 0.1%.

No, I'm saying that if the number of histogram buckets is 100, then
even if the mcv list is set to length 2, every value that occupies 1%
or more of the table will be an mcv. However, if the size is fixed to
100 I think that it will be common to see values with relative
frequencies much lower than 1% near the bottom of the list.

 In any case there's already a heuristic to cut off the MCV list at some
 shorter length (ie, more than 1% in this example) if it seems not
 worthwhile to keep the last entries.  See lines 2132ff (in CVS HEAD)
 in analyze.c.

Given an increase in the default stats target, limits like the 25% are
exactly what I think there should be more of.

Can anyone suggest a good data set to test this sort of question on?

-Nathan



Re: [HACKERS] Simple postgresql.conf wizard

2008-12-05 Thread Nathan Boley
Thanks for putting out pgtune - it's a sorely needed tool.

I had a chance to look over the source code and have a few comments,
mostly about Python-specific coding conventions.

- The Windows-specific try block ( line 16 ) raises a ValueError vs
ImportError on my Debian machine. Maybe it would be more appropriate
to explicitly test platform.system() == 'Windows'?

- from ctypes import * ( 18 ) makes the block difficult to read and
pollutes the namespace.

- on line 45, the try block should probably catch exceptions derived
from Exception ( to avoid catching SystemExit and KeyboardInterrupt
errors ). ie, except Exception: return None. Also, printing out the
exception in debug mode would probably be a good idea ( ie except
Exception, e: print e; return None )

- all classes ( 58, 135, 205 ) are 'old-style' classes. I don't see
any reason to use classic classes ( unless Python 2.1 is a support
goal? ). To make classes 'new style' classes (
http://www.python.org/doc/2.5.2/ref/node33.html ) they should inherit
from object, i.e. class PGConfigLine(object):

- The doc strings ( 59, 136, 206 ) don't follow standard conventions,
described here http://www.python.org/dev/peps/pep-0257/.

- Functions also support doc strings ( 342, 351, etc. )

- Tuple unpacking doesn't require the tuple boundaries ( 446 and
others ). ie, options, args = ReadOptions()  works.

This is more of a style comment about the 'Java-ish interface' ( line
112 ), feel free to ignore it.

Overloading __str__ and __repr__ is the Python way to return string
representations of an object. ie, instead of toString use __str__ and
then ( on 197 ) print l or print str(l) instead of print l.toString().

For the other methods ( getValue, getLineNumber, isSetting ) I'm
assuming you didn't call the attributes directly because you didn't want
them to be accidentally overwritten. Have you considered the use of
properties ( http://www.python.org/download/releases/2.2.3/descrintro/#property
)? Also, it would be clearer to mark attributes as private ( i.e.
_name or __name, _readable, _lineNumber, _setsParameter ) if you don't
want them to be accessed directly.
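
To make those two suggestions concrete, here is a small sketch; the
attribute names follow the ones mentioned above, but the rest of the
class is invented and is not the actual pgtune code:

class PGConfigLine(object):          # new-style class: inherits from object
    def __init__(self, text, line_number):
        self._text = text            # 'private' by convention: leading underscore
        self._line_number = line_number

    @property
    def line_number(self):
        # Read-only access instead of a getLineNumber() accessor.
        return self._line_number

    @property
    def sets_parameter(self):
        return "=" in self._text and not self._text.lstrip().startswith("#")

    def __str__(self):
        # print l / str(l) now work; no toString() needed
        return "%d: %s" % (self._line_number, self._text)

line = PGConfigLine("shared_buffers = 24MB", 112)
print(line)                 # 112: shared_buffers = 24MB
print(line.line_number)     # 112
print(line.sets_parameter)  # True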

Hope my comments are useful! Thanks again for writing this.

-Nathan

P.S. I'd be happy to officially review this if it gets to that.



Re: [HACKERS] Cross-column statistics revisited

2008-10-19 Thread Nathan Boley
 I still need to go through backend/utils/adt/selfuncs.c
 to figure out exactly how we use the one-dimensional values.


Here's a page that helped me figure all this out.

http://www.postgresql.org/docs/8.1/static/planner-stats-details.html


 2) Do we want to fold the MCV's into the dependence histogram? That
 will cause problems in our copula approach but I'd hate to have to
 keep an N^d histogram dependence relation in addition to the copula.

 Yeah, if we're already trying to figure out how to compress copulae,
 having also to compress MCV matrices seems painful and error-prone.
 But I'm not sure why it would cause problems to keep them in the
 copula -- is that just because we are most interested in the copula
 modeling the parts of the distribution that are most sparsely
 populated?


The problem I was thinking of is that we don't currently store the
true marginal distribution. As it stands, histograms only include
non-mcv values. So we would either need to treat the mcv's separately (
which would assume independence between mcv's and non-mcv values ) or
store multiple histograms.

 4) How will this approach deal with histogram buckets that have
 scaling count sizes ( ie -0.4 )?

 I'm not sure what you mean here.


That was more a note to myself, and should have been numbered 3.5.
ndistinct estimates currently start to scale after a large enough
row/ndistinct ratio. If we try to model ndistinct, we need to deal
with scaling ndistinct counts somehow. But that's way off in the
future, it was probably pointless to mention it.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
 Right now our
 histogram values are really quantiles; the statistics_target T for a
 column determines a number of quantiles we'll keep track of, and we
 grab values from into an ordered list L so that approximately 1/T of
 the entries in that column fall between values L[n] and L[n+1]. I'm
 thinking that multicolumn statistics would instead divide the range of
 each column up into T equally sized segments,

 Why would you not use the same histogram bin bounds derived for the
 scalar stats (along each axis of the matrix, of course)?  This seems to
 me to be arbitrarily replacing something proven to work with something
 not proven.  Also, the above forces you to invent a concept of equally
 sized ranges, which is going to be pretty bogus for a lot of datatypes.

 Because I'm trying to picture geometrically how this might work for
 the two-column case, and hoping to extend that to more dimensions, and
 am finding that picturing a quantile-based system like the one we have
 now in multiple dimensions is difficult. I believe those are the same
 difficulties Gregory Stark mentioned having in his first post in this
 thread. But of course that's an excellent point, that what we do now
 is proven. I'm not sure which problem will be harder to solve -- the
 weird geometry or the equally sized ranges for data types where that
 makes no sense.


Look at copulas. They are a completely general method of describing
the dependence between two marginal distributions. It seems silly to
rewrite the stats table in terms of joint distributions when we'll
still need the marginals anyway. Also, it might be easier to think of
the dimension reduction problem in that form.
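
( For anyone who hasn't run into the term: Sklar's theorem - a textbook
fact, not anything from this thread - says any joint CDF factors as
F(x,y) = C( Fx(x), Fy(y) ), where Fx and Fy are the per-column marginal
CDFs and C is the copula, a distribution on the unit square that carries
only the dependence. In planner terms, the existing one-column histograms
already describe Fx and Fy, so only C would need to be stored per column
pair. )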

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
 I'm still working my way around the math, but copulas sound better
 than anything else I've been playing with.

I think the easiest way to think of them is, in 2-D finite spaces,
they are just a plot of the order statistics against one another. Feel
free to mail me off list if you have any math questions.
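
A rough sketch of what I mean, as Python pseudocode rather than anything
ANALYZE would actually run ( ties are broken arbitrarily, and the grid
size t is just a stand-in for the stats target ):

    def empirical_copula(xs, ys, t=10):
        """Return a t x t grid of counts; cell (i, j) counts the rows whose
        x value falls in the i-th rank-quantile and y value in the j-th."""
        n = len(xs)
        # order statistics: the position of each row when sorted by x, by y
        rx = [0] * n
        ry = [0] * n
        for rank, i in enumerate(sorted(range(n), key=lambda j: xs[j])):
            rx[i] = rank
        for rank, i in enumerate(sorted(range(n), key=lambda j: ys[j])):
            ry[i] = rank
        grid = [[0] * t for _ in range(t)]
        for i in range(n):
            grid[rx[i] * t // n][ry[i] * t // n] += 1
        return grid

Under independence every cell hovers around n/t^2; the compression /
change point question is how to summarize the grid compactly when it
doesn't.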

I've previously thought that, at least in the 2D case, we could use
image compression algorithms to compress the copula, but recently I've
realized that this is a change point problem. In terms of compression,
we want to decompose the copula into regions that are as homogenous as
possible.  I'm not familiar with change point problems in multiple
dimensions, but I'll try and ask someone that is, probably late next
week. If you decide to go the copula route, I'd be happy to write the
decomposition algorithm - or at least work on the theory.

Finally, a couple of points that I hadn't seen mentioned earlier that
should probably be considered:

1) NULL's need to be treated specially - I suspect the assumption of
NULL independence is worse than other independence assumptions. Maybe
dealing with NULL dependence could be a half step towards full
dependence calculations?

2) Do we want to fold the MCV's into the dependence histogram? That
will cause problems in our copula approach but I'd hate to have to
keep an N^d histogram dependence relation in addition to the copula.

3) For equality selectivity estimates, I believe the assumption that
the ndistinct value distribution is uniform in the histogram will
become worse as the dimension increases. I proposed keeping track of
ndistinct per histogram bucket earlier in the marginal case, partially
motivated by this exact scenario. Does that proposal make more sense
in this case? If so we'd need to store two distributions - one of the
counts and one of ndistinct.

4) How will this approach deal with histogram buckets that have
scaling count sizes ( ie -0.4 )?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IN, BETWEEN, spec compliance, and odd operator names

2008-08-25 Thread Nathan Boley
 Yes, but always in relation to operator classes, so from BTrees opclass or
 such, which I refered to as the context of index searches, as I don't
 really see any theorical need for opclass if it's not for indexing.

IIRC analyze uses the btree op class to build the selectivity histogram.

-Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Nathan Boley
 Assuming that the threshold
 for switching to an indexscan is somewhere around selectivity 0.005
 (I am not certain offhand, but it's in that general area), this cannot
 possibly require more than 200 MCV slots, and for most data
 distributions it'd be a whole lot less.

Thats a really good point.

 Given such an MCV list, the planner will always make the right choice
 of whether to do index or seqscan

Given that, wouldn't it be smarter to consider a value as an mcv
candidate iff it has a density greater than 0.005, rather than having
a count greater than 1.5*average? This would allow people to raise the
hard mcv limit without having to worry as much about including
worthless mcv values...
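
A toy illustration of the difference ( the 0.005 and 1.5 figures are just
the numbers from this thread, and the selection logic is simplified
compared to what analyze.c really does ):

    def mcv_by_count(counts, ratio=1.5):
        # current-style rule: count must exceed ratio * the average count
        avg = float(sum(counts.values())) / len(counts)
        return [v for v, c in counts.items() if c > ratio * avg]

    def mcv_by_density(counts, nrows, min_freq=0.005):
        # proposed rule: keep a value only if its frequency alone matters
        return [v for v, c in counts.items() if float(c) / nrows > min_freq]

    # 99,000 singletons plus 100 values seen 10 times each
    counts = dict(('rare%d' % i, 1) for i in range(99000))
    counts.update(('mild%d' % i, 10) for i in range(100))
    nrows = sum(counts.values())                 # 100,000 rows
    print(len(mcv_by_count(counts)))             # 100 - all the 'mild' values
    print(len(mcv_by_density(counts, nrows)))    # 0 - none are dense enough to matter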

Cheers,
Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
  One more problem with low ndistinct values is that the condition might 
  very well
  hit no rows at all. But Idea 1 will largely overestimate the number of 
  hits.

That's a good point, but I don't see a clear solution. Maybe we could
look at past queries and keep track of how often they return empty
result sets?

It seems that, in some ways, we care about the distribution of the
query values in addition to the column values...

  I think for low ndistinct values we will want to know the exact
  value + counts and not a bin. So I think we will want additional stats rows
  that represent value 'a1' stats.

 Isn't that what our most frequent values list does?

 Maybe ? Do we have the relevant stats for each ?
 But the trick is to then exclude those values from the histogram bins.

Currently, the histogram is only made up of non-mcv values.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
   One more problem with low ndistinct values is that the condition might 
   very well
   hit no rows at all. But Idea 1 will largely overestimate the number of 
   hits.

 That's a good point, but I don't see a clear solution. Maybe we could

 I think that MCVs are the solution, right?

Only if they cover the entire range of values in the table.

 A low ndistinct means that those values will likely be MCVs.

Yes, but I don't think that's the point.

If we query on values that aren't in the table, the planner will
always overestimate the expected number of returned rows because it (
implicitly ) assumes that every query will return at least 1 record.
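
( For reference, the flat estimate in question is roughly

    rows = ( nrows - rows covered by mcv's ) / ( ndistinct - number of mcv's )

which, as far as I can tell, is what eqsel falls back to for any constant
that isn't in the mcv list - whether or not the value exists in the table
at all. )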

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
 If we query on values that aren't in the table, the planner will
 always overestimate the expected number of returned rows because it (
 implicitly ) assumes that every query will return at least 1 record.

 That's intentional and should not be changed.

Why?  What if ( somehow ) we knew that there was a 90% chance that a
query would return an empty result set on a big table with 20 non-mcv
distinct values. Currently the planner would always choose a seq scan,
where an index scan might be better. Better yet, couldn't that be
optimized to *if record exists, execute seq scan*. That being said, I
think queries are generally searching for values that exist in the
table.

 I can't see the value of allowing fractional-row estimates anyway.

Neither can I, but I could probably think of cases where knowing the
SD of the result set could result in better plans.

-Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
 Why?  What if ( somehow ) we knew that there was a 90% chance that
 a query would return an empty result set on a big table with 20 non-mcv
 distinct values. Currently the planner would always choose a seq scan,
 where an index scan might be better.

 (1) On what grounds do you assert the above?

For a table with 100,000 non-mcv rows, the planner estimates a result
set of cardinality 100,000/20 = 5,000, not 1.

 (2) What makes you think that an estimate of zero rather than one row
 would change the plan?

I see where the confusion is coming from. When I said

 What if ( somehow ) we knew that there was a 90%
 chance that query would return an empty result set

I meant that the planner doesn't know that information. And how could it?

The estimate for ndistinct is an estimate for the number of distinct
values in the table, not an estimate for the number of distinct values
that will be queried for.  My original point was that we sometimes
care about the distribution of what's being queried for and not just
what's in the table.

But this is all silly anyway: if this were really a concern you would
write a function

if values exist
   return values
else return none

 (In fact, I don't think the plan would change, in this case.  The reason
 for the clamp to 1 row is to avoid foolish results for join situations.)

Which makes sense. My point certainly wasn't, in any way, a criticism
of clamping the row estimate to 1.

Cheers,
Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Nathan Boley
 Your argument seems to consider only columns having a normal
 distribution.

My example was based upon normally distributed data because people
usually know what they are and they are reasonably common.

 How badly does it fall apart for non-normal distributions?

This should work to the extent that there is a correlation between the
number of distinct values in a histogram bin and the 'size' of the
bin. I know that this isn't a very good answer, but it's a hard
problem in the general case.

However, for any non-uniform probability distribution there exists a
measure under which there is a non-zero correlation between these. I
wrote up a hand-wavy, semi-formal argument at the end, but it's not
tough to see intuitively.

Just think about the graph of the cdf. The histogram boundaries are
horizontal lines that are equally spaced from top to bottom. If we
rescale the x-axis st there is a fixed distance between each distinct
value, and define the distance as ( ndistinct in a given
interval)/(total ndistinct) then the only place where this doesn't
hold is when the CDF is f(x) = x, aka the dist is uniform. And even
then we just get a 0 coefficient, which is exactly what we are always
assuming now.

Obviously we run into problems when
a) we have a poor estimate for ndistinct - but then we have worse problems
b) our length measure doesn't correspond well with ndistinct in an interval

expanding on b)...

Your mention of Zipfian distributions is a particularly good example of
where this could be poor. Right now ( someone correct me if I'm wrong
) words are sorted alphabetically. However, if we wanted this
estimator to do a good job, we would sort them by their length or,
better yet, frequency in the english language ( which is highly
correlated with length ).

 (For instance, Zipfian distributions seem to be pretty common in database 
 work, from what I've seen.)

This should work well for any power law distribution. If people are
curious, I can rerun my previous example using a power law
distribution instead of normal.

However, the easy way of thinking about all of this is that we're
building a linear model between ndistinct and histogram bucket width.
It's intuitive to expect there to be a correlation between the two,
and so the model should have some predictive power.

-Nathan


somewhat formal - probably will be difficult to read without some
basic probability theory
To see this, first note that we can alternatively define the uniform
distribution on [a,b] as the distribution whose CDF is a straight line
that passes through both (a,0) and (b,1) ( just integrate the PDF ).
So any non-uniform distribution will have a CDF with slope that is
both below and above 1/(b-a) at some set of points, implying the
existence of an interval [i1, i2] st ( CDF(i2) - CDF(i1) ) < ( i2 - i1
)/(b-a). Then, under the constraints of the probability measure, there
exists a second disjoint interval st ( CDF(i2') - CDF(i1') ) > ( i2' -
i1' )/(b-a). In other words, the CDF must fall below the uniform line on
one interval and above it on another.

Next, assume that the number of potential distinct values in our
interval scales linearly with the length of the interval. Although it
seems as if this assumption could be somewhat burdensome, there always
exists a measure under which this is true for sets with a defined
order relation.  ( As remarked earlier by Tom, we are already making
this assumption ). To see this, consider defining length(i1, i2)
as ( the number of distinct values in [i1, i2] )/( total num distinct
values ), where the distinct values in [i1, i2] are the set { v
| v >= i1 and v <= i2 }.

Next, note that the joint distribution of identically distributed,
independent random variables is multinomial with cell probabilities
given by the value of the pdf at each distinct point.  Next, I'll
state without proof that for an IID RV  the expected number of
distinct values is maximized for a uniform distribution ( this is
pretty easy to see: think about the binomial case. Do you want your
cell probabilities to be ( 1.0, 0 ) or ( 0.5, 0.5 ) )

Finally, note that the number of expected distinct values decreases
faster than linearly in the length of the interval. This is pretty
clear when we consider the sparse case. As the number of potential
entries ( in this case, the interval length) approaches infinity, the
probability of a new entry being distinct approaches 1. This means
that,  in this limit, every new entry ends up being distinct, aka the
number of distinct values scales linearly in the number of new
entries. As the interval shrinks, new entries have some probability of
being repeats. As the interval shrinks to 0, there is a zero
probability of new entries being unique. Since,

a) there doesn't exist a linear relationship that contains the two
boundary points
b) the multinomial distribution of the PDF is continuous
c) the relationship is clearly decreasing

we can surmise that it is sub-linear.

Therefore, we have two intervals that have sub and super linear slopes
that cancel one another. However, 

[HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Nathan Boley
Currently eqsel assumes that, except for the values stored as mcv's,
the number of times that a value appears in a table column is
independent of it's value. Unfortunately, this often yields
inappropriate selectivity estimates and frequently leads to
inappropriate plans.

As an example, consider an insurance company that keeps a record of
patient heights. Assume there are 1,000,000 patient heights in this
column, and they are distributed normally with a mean of 1.7526 and a
standard deviation of 0.0762. Furthermore, assume that the heights are
only measured to the nearest centimeter. Then, we'd expect there to be
about 73 distinct heights, with a SD of 1.5.

Ignoring the effects of MCV's, the planner expects
  SELECT height FROM heights WHERE height = 1.75;
to yield roughly 13000 results. However, given that we know the
underlying distribution, we would expect to see ~52000 results.

Similarly, the planner expects to see 13000 results from
  SELECT height FROM heights WHERE height = 2.05;
while we actually expect to see about 2.7.
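
A quick way to reproduce the shape of this example ( a sketch using
numpy, just to check the arithmetic - the exact counts wobble from run
to run ):

    import numpy as np

    np.random.seed(0)
    # 1,000,000 heights ~ N(1.7526, 0.0762), measured to the nearest cm
    heights = np.round(np.random.normal(1.7526, 0.0762, 1000000), 2)

    ndistinct = len(np.unique(heights))        # ~73 distinct values
    print(len(heights) / ndistinct)            # ~13,700: the flat per-value guess
    print((heights == 1.75).sum())             # ~52,000 rows actually at the mean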

Obviously this example is not totally convincing: if I were to post
this to pg-general looking for advice I'm sure that everyone would
tell me to just increase the size of my mcv stats. However, in cases
where the number of distinct values is higher, this isn't always
feasible. Also, why store a list of 50 values and their frequencies
when 10 extra per-bucket counts would provide the same plans without
bloating pg_statistic?

To combat this problem, I have two different proposals.

Idea 1: Keep an array of stadistinct values, one per histogram bucket.

In the example above, ( again ignoring mcv's ) the quantile data is

quantile:  0%    10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
bound:     1.38  1.66  1.69  1.71  1.73  1.75  1.77  1.79  1.82  1.85  2.12

with numdistinct values per bucket of ( respectively )

29  2  2  2  2  2  2  3  3  25

For the two above examples, this new approach would yield selectivity
estimates of

(1,000,000/10)/2 = 50,000  ( vs an actual ED of ~52000 )
and
(1,000,000/10)/25 = 4,000  ( vs an actual ED of ~2.7 )

Furthermore, this is done without mcvs. Since mcv's would make the
histogram more sensitive to the edges, the estimates with mcv's should
be correspondingly better.
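
To make Idea 1 concrete, here is a small sketch of the estimator used
above ( illustrative only - the real eqsel works from the stats tuple and
has to deal with mcv's, NULLs, operator semantics, etc. ):

    # histogram bounds and the proposed per-bucket ndistinct counts from above
    bounds = [1.38, 1.66, 1.69, 1.71, 1.73, 1.75, 1.77, 1.79, 1.82, 1.85, 2.12]
    bucket_ndistinct = [29, 2, 2, 2, 2, 2, 2, 3, 3, 25]
    nrows = 1000000

    def eqsel_idea1(value):
        # rows per bucket divided by the distinct values seen in that bucket
        rows_per_bucket = float(nrows) / len(bucket_ndistinct)
        for i in range(len(bucket_ndistinct)):
            if bounds[i] <= value <= bounds[i + 1]:
                return rows_per_bucket / bucket_ndistinct[i]
        return 0.0                     # outside the histogram range

    print(eqsel_idea1(1.75))           # 50,000  ( actual ~52,000 )
    print(eqsel_idea1(2.05))           # 4,000   ( actual ~2.7 )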


There are two potential problems that I see with this approach:

1) It assumes that = is equivalent to ( >= and <= ). This is certainly
true for real numbers, but is it true for every equality relation that
eqsel predicts for?

2) It bloats the stats table.

Idea 2: Keep a correlation statistic between ndistinct and bucket size

This addresses problem #2.

In lieu of keeping an actual list of ndistinct per histogram bucket,
we store the linear scaling coefficient between histogram bucket width
and ndistinct/(avg ndistinct). To visualize this, it is easiest to
consider plotting the bucket width versus ndistinct. The scaling
coefficient is the slope of the line that passes through the origin and
minimizes the squared difference between its estimate for
ndistinct and the actual value.

When I apply this method to the above data I find a coefficient of
13.63 for an average ndist of 72/10. This provides selectivity
estimates, for the above two examples, of
(1,000,000/10)/( 13.63*7.2*(1.77 - 1.75) ) = 50950 ( vs an actual ED of ~52000 )
and
(1,000,000/10)/( 13.63*7.2*(2.12 - 1.85) ) = 3774  ( vs an actual ED of ~2.7 )
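
For anyone checking the arithmetic, the 13.63 comes from a least-squares
fit through the origin of ndistinct/(avg ndistinct) against bucket width,
and the estimates follow from it ( a sketch of my own calculation, not
planner code ):

    bounds = [1.38, 1.66, 1.69, 1.71, 1.73, 1.75, 1.77, 1.79, 1.82, 1.85, 2.12]
    ndist = [29, 2, 2, 2, 2, 2, 2, 3, 3, 25]

    widths = [bounds[i + 1] - bounds[i] for i in range(len(ndist))]
    avg_ndist = float(sum(ndist)) / len(ndist)            # 7.2
    scaled = [d / avg_ndist for d in ndist]

    # slope of the least-squares line through the origin: sum(x*y) / sum(x*x)
    coef = sum(w * s for w, s in zip(widths, scaled)) / sum(w * w for w in widths)
    print(coef)                                           # ~13.6

    def eqsel_idea2(lo, hi, nrows=1000000, nbuckets=10):
        # rows per bucket / predicted ndistinct for a bucket of this width
        return (float(nrows) / nbuckets) / (coef * avg_ndist * (hi - lo))

    print(eqsel_idea2(1.75, 1.77))     # ~50,950
    print(eqsel_idea2(1.85, 2.12))     # ~3,774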

Although this yields better results than idea 1 for this particular
example, it will be much more sensitive to weird distributions.

Obviously there are some special cases to consider: we wouldn't want
the stats to be skewed such that they provide really bad plans.
However, with some carefully designed caps I believe that we could
ensure that the estimates are at least as good as they are now. In
fact, I'm not certain that an R^2 penalty is the correct loss
function. Ideally, we want to minimize the extra time that the db
spends by choosing an incorrect plan. Maybe slight overestimations are
better than slight underestimations? Maybe the cost of the occasional
(really) bad plan is less than the cost of a bunch of kinda bad plans?

Finally, we aren't limited to just one coefficient. We could also
store multiple coefficients to improve our estimates, and provide a
compromise between ideas 1 and 2.

Food for future thought...

In addition to the previous benefits, I think that this method has the
potential to make the process by which MCVs are chosen (or not chosen)
smarter. Now the planner chooses a value to be an mcv candidate if
its frequency is greater than 1.25 * the average frequency. Given
that this improved selectivity estimate is implemented, maybe a better
way would be to include a value as an mcv if a) it is above a certain
threshold and b) the histogram selectivity estimator does a poor
job.

What are peoples thoughts on idea 1 vs idea 2?

Am I missing any relevant details about the planner's operation?

Do people think that the improved estimates would be worth the
additional overhead?

-Nathan

-- 
Sent via pgsql-hackers