Re: [HACKERS] potential bug in trigger with boolean params

2011-05-11 Thread tv
 Hi,
 I was trying to create a trigger with parameters. I've found a potential
 bug when the param is boolean.

 Here is code replicating the bug:

 CREATE TABLE x(x TEXT);

 CREATE OR REPLACE FUNCTION trigger_x() RETURNS TRIGGER AS $$
 BEGIN
 RETURN NEW;
 END; $$ LANGUAGE PLPGSQL;

 CREATE TRIGGER trig_x_text BEFORE INSERT ON x FOR EACH ROW
   EXECUTE PROCEDURE trigger_x('text');
 CREATE TRIGGER trig_x_int BEFORE INSERT ON x FOR EACH ROW
   EXECUTE PROCEDURE trigger_x(10);
 CREATE TRIGGER trig_x_float BEFORE INSERT ON x FOR EACH ROW
   EXECUTE PROCEDURE trigger_x(42.0);
 CREATE TRIGGER trig_x_bool BEFORE INSERT ON x FOR EACH ROW
   EXECUTE PROCEDURE trigger_x(true);

 ERROR:  syntax error at or near true
 LINE 1: ... INSERT ON x FOR EACH ROW EXECUTE PROCEDURE trigger_x(true);

The docs clearly state what the valid values are and the literal 'true' is
not one of them (TRUE is). See this:

http://www.postgresql.org/docs/9.0/interactive/datatype-boolean.html
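
BTW a simple workaround is to quote the value - trigger arguments are
passed to the function as text (TG_ARGV) anyway, so something like this
should work:

  CREATE TRIGGER trig_x_bool BEFORE INSERT ON x FOR EACH ROW
    EXECUTE PROCEDURE trigger_x('true');

  -- inside the function TG_ARGV[0] then holds the text value 'true',
  -- which can be cast explicitly when a boolean is needed,
  -- e.g. TG_ARGV[0]::boolean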

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread tv
 On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
 1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

 Yeah, when I looked at the fork stuff I was disappointed to find out
 there's essentially no support for dynamically adding forks. There's two
 other possible uses for that I can think of:

 - Forks are very possibly a more efficient way to deal with TOAST than
 having separate tables. There's a fair amount of overhead we pay for the
 current setup.
 - Dynamic forks would make it possible to do a column-store database, or
 at least something approximating one.

 Without some research, there's no way to know if either of the above makes
 sense; but without dynamic forks we're pretty much dead in the water.

 So I wonder what it would take to support dynamically adding forks...

Interesting ideas, but a bit out of scope. I think I'll go with one fork
containing all the estimators for now, although it might be inconvenient
in some cases. I was thinking about rebuilding a single estimator with
increased precision - in that case the size changes so that all the other
data has to be shifted. But this won't be very common (usually all the
estimators will be rebuilt at the same time), and it's actually doable.

So the most important question is how to intercept the new/updated rows,
and where to store them. I think each backend should maintain its own
private list of new records and forward them only at commit time. Does that
sound reasonable?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread tv
 On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote:
 the problem is you will eventually need to drop the results and rebuild
 it, as the algorithms do not handle deletes (ok, Florian mentioned an
 algorithm L_0 described in one of the papers, but I'm not sure we can
 use
 it).

 Yes, but even then you can start with much better cards if you already
 have an estimate of what it looks like, based on the fact that you did
 continuous updating of it. For example you'll have a pretty good
 estimate of the bounds of the number of distinct values, while if you
 really start from scratch you have nothing to go on and must cope with
 the complete range between 'all values are distinct' and 'there are only
 a few of them'.

Sure, using the previous estimate is a good idea. I just wanted to point
out there is no reasonable way to handle deletes, so you have to drop the
stats and rebuild them from scratch.

The biggest problem is not choosing reasonable parameters (with a
reasonable choice the estimators can handle a few billion distinct values
with something like 128kB of memory and less than 5% error). The really
serious concern is the I/O generated by rebuilding the stats.

 Another thing I'm not sure about is where to store those intermediate
 stats (used to get the current estimate, updated incrementally).

 The forks implementation proposed in other responses is probably the
 best idea if usable. It will also solve the problem of memory limitations
 for you, at the expense of more resources used if the structure grows too
 big, but it will still probably be fast enough if it stays small and in
 cache.

Hm, the forks seem to be an interesting option. It's probably much better
than keeping all of that directly in memory (not a good idea if there is a
lot of data). And you don't really need the data to get the estimate - it's
needed just when updating the stats.

So the last thing I'm not sure about is how to store the changed rows, so
that the update process can get a list of new values. Someone already
offered LISTEN/NOTIFY, but I guess we could just write the ctids into a
file (maybe another fork)?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread tv
 On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:
 How is an incremental ANALYZE going to work at all?

 How about a kind of continuous analyze ?

 Instead of analyzing just once and then dropping the intermediate
 results, keep them on disk for all tables, and then piggyback on the
 background writer (or have a dedicated process if that's not
 algorithmically feasible) and, before writing out stuff, update the
 statistics based on the values found in modified buffers. Probably it
 could take a random sample of buffers to minimize overhead, but if it is
 done by a background thread the overhead could be minimal anyway on
 multi-core systems.

Hi,

the problem is you will eventually need to drop the results and rebuild
it, as the algorithms do not handle deletes (ok, Florian mentioned an
algorithm L_0 described in one of the papers, but I'm not sure we can use
it).

I'm not sure a constantly running background process is a good idea. I'd
prefer storing info about the modified tuples somewhere and starting the
analyze only when a given threshold is reached. I'm not sure how to do
that, though.

Another thing I'm not sure about is where to store those intermediate
stats (used to get the current estimate, updated incrementally). I was
thinking about pg_stats but I'm not sure it's the right place - depending
on the algorithm, this may be a few kilobytes up to several megabytes. And
it's not needed except when updating it. Any ideas?

regards
Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv

 The simple truth is

 1) sampling-based estimators are a dead-end

 The Charikar and Chaudhuri paper does not, in fact, say that it is
 impossible to improve sampling-based estimators as you claim it does. In
 fact, the authors offer several ways to improve sampling-based
 estimators.  Further, 2000 was hardly the end of sampling-estimation
 paper publication; there are later papers with newer ideas.

Well, the paper states that there is a lower bound on the possible error
of a sampling-based estimator, depending on the sample size. The actual
inequality is

   error(d) >= sqrt( (n-r)/(2r) * (1/q) )

where error(d) is the ratio error (d is the estimated number of distinct
values, D is the actual number of distinct values)

   error(d) = max{ D/d, d/D }

and the inequality holds with probability q (which you can choose).

Say you have a table with 1,000,000 rows and you use a sample of 1,000
rows to do the estimate. In that case you get

  error(d) >= 99   with   q = 0.05
  error(d) >= 70   with   q = 0.1
  error(d) >= 31   with   q = 0.5

and if you scan 10% of the table, you get this

  error(d) >= 9.5  with   q = 0.05
  error(d) >= 6.7  with   q = 0.1
  error(d) >= 3    with   q = 0.5

So even with 10% of the table, there's a 10% probability of getting an
estimate that's 7x overestimated or underestimated. With lower probability
the interval is much wider.
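
Just to make the numbers above easy to reproduce, here is a scratch query
plugging n = 1,000,000 into the inequality (nothing more than arithmetic):

  SELECT r AS sample_rows, q AS probability,
         round(sqrt((1000000 - r) / (2.0 * r) / q), 1) AS min_ratio_error
    FROM (VALUES (1000), (100000)) AS s(r),
         (VALUES (0.05), (0.1), (0.5)) AS p(q);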

 For example, I still think we could tremendously improve our current
 sampling-based estimator without increasing I/O by moving to block-based
 estimation*.  The accuracy statistics for block-based samples of 5% of
 the table look quite good.

Well, that's certainly possible. But at best you can achieve the error(d)
lower bound consistently (for all datasets) - you can't do better. And
they've already presented an estimator that does exactly that (called AE,
Adaptive Estimator, in the paper).

 I would agree that it's impossible to get a decent estimate of
 n-distinct from a 1% sample.  But there's a huge difference between 5%
 or 10% and a majority of the table.

Sure we can do better. But there's a limit we can't cross no matter what
estimator we choose and how large the sample will be.

I've seen several post-2000 papers on sample-based estimators, but those
are mostly hybrid estimators, i.e. estimators composed of several simple
estimators. So in the end it's pretty complicated, you need to gather a
lot of different stats, and still you can't get better error than the
lower bound :-(

So yes, we can improve the current estimator (making it more complex), but
it does not actually solve the problem.

 Again, don't let this discourage you from attempting to write a
 stream-based estimator.  But do realize that you'll need to *prove* its
 superiority, head-to-head, against sampling-based estimators.

 [* http://www.jstor.org/pss/1391058 (unfortunately, no longer
 public-access)]

It's available here http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and
Chaudhuri paper - it's still 'just' a sample-based estimator, although they
draw the sample at the block level. But yes, that's a good idea and I've
already mentioned it in the cross-column stats thread, I think.

The question is whether a sample obtained this way will be as good as the
current samples. You could get quite separate 'clusters' of values, one
cluster for each block, even though in reality the values are uniformly
distributed. And I guess the resulting histogram would be crippled by this
too.

But if you know about interesting papers on sample-based estimators
(especially post-2000), let me know. I've searched for them, but those I
found were not very interesting IMHO.

Tomas



Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv
 t...@fuzzy.cz wrote:

 So even with 10% of the table, there's a 10% probability to get an
 estimate that's 7x overestimated or underestimated. With lower
 probability the interval is much wider.

 Hmmm...  Currently I generally feel I'm doing OK when the estimated
 rows for a step are in the right order of magnitude -- a 7% error
 would be a big improvement in most cases.  Let's not lose track of
 the fact that these estimates are useful even when they are not
 dead-on accurate.

Well, but that's not 7%, that's 7x! And the theorem says 'greater or
equal', so this is actually the minimum - you can get a much bigger
difference with lower probability. So you can easily get an estimate that
is a few orders of magnitude off.

Anyway I really don't want precise values, just a reasonable estimate. As
I said, we could use the AE estimator they proposed in the paper. It has
the nice feature that it actually reaches the lower bound (thus the
inequality changes to equality). The downside is that there are estimators
with better behavior on some datasets.

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread tv
 2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these number to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

 I have the impression that both of you are forgetting that there are 8
 bits in a byte. 10 bits per element = 1.25MB per million elements.

We are aware of that, but we really needed to do some very rough estimates
and it's much easier to do the calculations with 10. Actually, according to
wikipedia it's not 10 bits per element but 9.6, etc. But it really does not
matter whether it is 10MB or 20MB of data - it's still a lot of data ...
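
For the record, the usual sizing formula for a Bloom filter with an
optimal number of hash functions is m/n = -ln(p) / (ln 2)^2 bits per
element, which is where the ~9.6 bits (roughly 1.2MB per million elements)
figure comes from - a scratch query:

  SELECT p AS false_positive_rate,
         round((-ln(p) / ln(2)^2)::numeric, 1)     AS bits_per_element,
         round((-ln(p) / ln(2)^2 / 8)::numeric, 2) AS mb_per_million_elements
    FROM (VALUES (0.01), (0.10)) AS t(p);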

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
 It seems to me you're missing one very important thing - this was not
 meant as a new default way to do estimates. It was meant as an option
 when the user (DBA, developer, ...) realizes the current solution gives
 really bad estimates (due to correlation). In that case he could create
 'cross-column' statistics on those columns, and the optimizer would use
 that info to do the estimates.

 I do understand that. I just have the nagging feeling that there is a
 way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
 to apply the uniform bayesian approach or to assume the columns are
 unrelated.

I doubt there is a way to make this decision with just the dist(A),
dist(B) and dist(A,B) values. Well, we could go with a rule

  if [dist(A) == dist(A,B)] then [A => B]

but that's very fragile. Think about estimates (we're not going to work
with exact values of dist(?)), and then about data errors (e.g. a city
matched to an incorrect ZIP code or something like that).
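
Just to illustrate what the three numbers look like on real data, using a
hypothetical 'addresses' table with 'city' and 'zip' columns:

  SELECT count(DISTINCT city)        AS dist_city,
         count(DISTINCT zip)         AS dist_zip,
         count(DISTINCT (city, zip)) AS dist_city_zip
    FROM addresses;
  -- dist_zip == dist_city_zip would mean each ZIP maps to a single city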

So for a real-world dataset, the condition [dist(A)==dist(A,B)] will
almost never hold. And about the same holds for the uniform correlation
assumption which is the basis for the formula I posted.

So actually we're looking for a formula that produces reasonable estimates
and is robust even in cases where the correlation is not uniform or the
estimates are somewhat imprecise.

 This motivates the definition

 F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [dist(A,B) * ( dist(B) - 1)]

 (You can probably drop the -1, it doesn't make much of a difference
 for larger values of dist(B).)

 F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
 dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
 a value of 1 indicates that dist(A) == dist(A,B).

 So F(A,B) is a suitable measure of Implicativeness - it's higher
 if the table (A,B) looks more like a function A -> B.

 You might use that to decide if either A->B or B->A looks function-like
 enough to use the uniform bayesian approach. Or you might even go further,
 and decide *which* bayesian formula to use - the paper you cited always
 averages

   P(A=x|B=y)*P(B=y) and
   P(B=y|A=x)*P(A=x)

 but they offer no convincing reason for that other than "We don't know
 which to pick".

Well, the reason why they chose the sum/2 approach is they were unable to
infer which of the values is 'better' and the sum/2 limits the errors.

I haven't studied this thoroughly, but my impression is that you are going
in the same direction as they did, i.e. while they've done

   P(A,B) = ( P(A|B)*P(B) + P(B|A)*P(A) ) / 2

with P(A|B) ~ dist(B) / dist(A,B), you've chosen P(A|B) ~ F(B,A) or
something like that.

You'll probably object that you could compute F(A,B) and F(B,A) and then
use only the part corresponding to the larger value, but what if the
F(A,B) and F(B,A) are about the same?

This is the reason why they choose to always combine the values (with
varying weights).

 I'd like to find a statistical explanation for that definition of
 F(A,B), but so far I couldn't come up with any. I created a Maple 14
 worksheet while playing around with this - if you happen to have a
 copy of Maple available I'd be happy to send it to you..

No, I don't have Maple. Have you tried Maxima
(http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/)? Sage
even has an online notebook - that seems like a very comfortable way to
exchange this kind of data.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug f...@phlo.org wrote:
 You might use that to decide if either A->B or B->A looks function-like
 enough to use the uniform bayesian approach. Or you might even go further,
 and decide *which* bayesian formula to use - the paper you cited always
 averages

  P(A=x|B=y)*P(B=y) and
  P(B=y|A=x)*P(A=x)

 but they offer no convincing reason for that other than "We don't know
 which to pick".

 Ideally you want to somehow make this a continuous transition between
 the available formulas rather than a discrete transition, I think.  If
 F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
 if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
 what?  A naive approach would be to estimate P(A=x AND B=y) = P(A=x) *
 (1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
 P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
 we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
 = 0.055.  Of course I'm just hand-waving here, and this is without any
 mathematical basis, being just the simplest formula I could think of
 that gets the endpoints right and plots some sort of smooth curve
 between them in the middle.  A similar formula with a believable
 argument to back it up seems like it would be a big step forward for
 this method.

This somehow reminds me of how the various t-norms in fuzzy logic evolved.
I'm not saying we should use fuzzy logic here, but the requirements are
very similar, so it might be an interesting inspiration. See for example
http://plato.stanford.edu/entries/logic-fuzzy (chapter 4).

And there's one additional - IMHO very important - requirement. The whole
thing should easily extend to more than two columns. The "IF (F(A,B) >
F(B,A)) THEN ..." approach probably is not a good solution in this regard.

For example, given 3 columns A,B,C, would you do that comparison for each
pair of columns, or would you do it for A vs (B,C)? Or maybe a completely
different approach? Because that would require collecting a lot more data
(the number of distinct values in each combination) etc.

Say, for example, there is a table with (C=A+B):

  A | B | C
 ===
  1 | 1 | 2
  1 | 2 | 3
  1 | 3 | 4
  2 | 1 | 3
  2 | 2 | 4
  2 | 3 | 5
  3 | 1 | 4
  3 | 2 | 5
  3 | 3 | 6

So dist(A)=dist(B)=3, dist(C)=5 and dist(A,B,C)=dist(A,B)=9. Given
the paper, you get something like

 P(A,B,C) = [dist(A)*P(A) + dist(B)*P(B) + dist(C)*P(C)] / [3*dist(A,B,C)]
          = [3*P(A) + 3*P(B) + 5*P(C)] / 27

so for example

 P(A=3,B=2,C=5) = [3*(1/3) + 3*(1/3) + 5*(2/9)] / 27 = (28/9)/27 = 28/243

which is almost correct - the exact value is 1/9 = 27/243, so the estimate
is off by just 1/243.
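
A scratch check of those numbers, using a throwaway temp table:

  CREATE TEMP TABLE abc AS
    SELECT a, b, a + b AS c
      FROM generate_series(1,3) a, generate_series(1,3) b;

  SELECT count(DISTINCT c)         AS dist_c,     -- 5
         count(DISTINCT (a, b, c)) AS dist_abc,   -- 9
         (SELECT count(*) FROM abc WHERE a = 3 AND b = 2 AND c = 5)
             / count(*)::numeric   AS exact_p     -- 1/9
    FROM abc;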

Don't get me wrong - I'm not a fanatic who thinks this particular formula
is the best formula possible. I'm just saying we could end up with a
formula that works beautifully in 2D, but once we get to 3 columns it
fails miserably.

Hmmm, maybe we could give this possibility (to identify two separate
groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
user would say 'build stats for (A,B) and (C)' - this actually represents a
priori knowledge of dependencies supplied by the user.

In that case we could search for 'implicativeness' between those two
groups (and not within the groups), and we could restrict ourselves to 2D
(and thus use a more sophisticated formula).

But we should be able to do some basic analysis even when the user
supplies a list of columns without such a priori knowledge.

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
 I doubt there is a way to make this decision with just the dist(A),
 dist(B) and dist(A,B) values. Well, we could go with a rule

  if [dist(A) == dist(A,B)] then [A => B]

 but that's very fragile. Think about estimates (we're not going to work
 with exact values of dist(?)), and then about data errors (e.g. a city
 matched to an incorrect ZIP code or something like that).

 Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
 kind of fragility without penalizing the non-correlated case...

Yes, I understand the intention, but I'm not sure how exactly you want to
use the F(?,?) function to compute P(A,B) - which is the value we're
looking for.

If I understand it correctly, you proposed something like this

  IF (F(A,B) > F(B,A)) THEN
P(A,B) := c*P(A);
  ELSE
P(A,B) := d*P(B);
  END IF;

or something like that (I guess c=dist(A)/dist(A,B) and
d=dist(B)/dist(A,B)). But what if F(A,B)=0.6 and F(B,A)=0.59? This may
easily happen due to data errors / imprecise estimate.

And this actually matters, because P(A) and P(B) may actually be
significantly different. So this would be really vulnerable to slight
changes in the estimates etc.

 This is the reason why they choose to always combine the values (with
 varying weights).

 There are no varying weights involved there. What they do is to express
 P(A=x,B=y) once as

 ...

   P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
              = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

 That averaging steps add *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the values
with different weights (just as you explained).

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
 On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
 This is the reason why they choose to always combine the values (with
 varying weights).

 There are no varying weights involved there. What they do is to express
 P(A=x,B=y) once as

 ...

  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
             = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
             = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

 That averaging steps add *no* further data-dependent weights.

 Sorry, by 'varying weights' I didn't mean that the weights are different
 for each value of A or B. What I meant is that they combine the values
 with different weights (just as you explained).

 I'm still not sure we're on the same page here. The resulting formula
 is *not* a weighted average of P(A=x) and P(B=y), since in general
 dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
 syntactically, but that's about it.

OK, another crazy usage of 'weights' on my side :-(

What I meant is that in the end you have two estimates of P(A,B):

  P(A=x|B=y)*P(B=y) = dist(B)*P(B=y)/dist(A,B)
  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and you need to combine those two estimates. They did that by averaging,
as they don't know which of the estimates is better.

Generally I think that is a good solution, unless you know one of the
estimates is much more reliable (although I'm not sure we should
completely omit the other estimate).

 The resulting formula instead is an *unweighted* (weights 1) average of
 the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
 as well estimate P(A=x,B=y) with

   P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

 and it'd *still* be the very same uniform bayesian approach, just no
 longer symmetric in A and B. Which may easily be preferable if you
 have reasons to believe that this estimate is more correct than the
 one obtained by swapping A and B. The original paper doesn't deal with
 that case simply because they don't mention how P(A=x) and P(B=y)
 are obtained at all. The postgres estimator, on the other hand,
 knows quite well how it derived P(A=x) and P(B=y) and may have much
 higher confidence in one value than in the other.

OK, good point. I hadn't realized that one of the estimates may be much
more reliable.

But let's assume both estimates are about the same (regarding reliability)
and let's see the following example

 A | B
 =
 1 | 1
 1 | 1
 1 | 1
 1 | 2
 2 | 1
 2 | 2
 2 | 2
 2 | 2

Thus dist(A)=dist(B)=2, dist(A,B)=4 and

  P(A=1)=P(A=2)=P(B=1)=P(B=2)=1/2
  P(A=1,B=1)=P(A=2,B=2)=3/8
  P(A=1,B=2)=P(A=2,B=1)=1/8

According to the formula presented in the paper, the partial estimates for
P(A=1,B=2) are

  P(A=1|B=2)*P(B=2) = dist(B)/dist(A,B) * P(B=2) = 2/4 * 1/2 = 1/4
  P(B=2|A=1)*P(A=1) = dist(A)/dist(A,B) * P(A=1) = 2/4 * 1/2 = 1/4

Thus P(A=1,B=2) = (1/4 + 1/4)/2 = 1/4, so it's overestimated (2x)

 A | B
 =
 1 | 1
 1 | 2
 1 | 2
 1 | 2
 2 | 1
 2 | 1
 2 | 1
 2 | 2

This obviously has exactly the same features (regarding number of distinct
values), and the produced estimate is exactly the same. But in this case

  P(A=1,B=2)=P(A=2,B=1)=3/8
  P(A=1,B=1)=P(A=2,B=2)=1/8

and thus the 1/4 is an underestimate (compared to 3/8).

The problem is that F(A,B) does not change at all. It's very simple to
construct examples (just use more rows) where F(A,B) returns exactly the
same value, but the estimates are off. The averaging somehow smooths this
out ...

But I think I'm missing something about how to use the F(?,?) to derive
the final estimate. So maybe the resulting estimate would be better.

Say there are two tables

   A | B | number of such rows
  ==
   1 | 1 | 1000
   1 | 2 | 1000
   2 | 1 | 1000
   2 | 2 | 1000

   A | B | number of such rows
  ==
   1 | 1 | 1
   1 | 2 | 1999
   2 | 1 | 1999
   2 | 2 | 1

How would you estimate P(A=1,B=1) in those cases? Assume that both
estimates are equally reliable - i.e. deduced from a histogram or MCV.
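
(For the record, the paper's formula gives the same answer for both tables:
dist(A)=dist(B)=2, dist(A,B)=4 and P(A=1)=P(B=1)=1/2 in both cases, so

   P(A=1,B=1) ~ (dist(A)*P(A=1) + dist(B)*P(B=1)) / (2*dist(A,B))
              = (2*1/2 + 2*1/2) / 8 = 1/4

while the actual value is 1000/4000 = 1/4 in the first table and 1/4000 in
the second.)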


 Assume for example that you're preparing the statement

   SELECT * FROM T WHERE A = ? AND B = 1

 We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
 without an actual value for the parameter ?. The estimate for P(B=1),
 on the other hand, can use the histogram, and will thus very likely be
 much more precise. The two estimates for P(A=?,B=1) in this case are

   P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
   P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).

 There's a good chance that the former beats the latter, and thus also
 the average, then.

OK, good point. I was not thinking about prepared statements. In this case
it makes sense to use only one of the estimates ...

regards
Tomas



Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-19 Thread tv
 Tomas Vondra t...@fuzzy.cz writes:
 I've done several small changes to the patch, namely

 - added docs for the functions (in SGML)
 - added the same thing for background writer

 So I think now it's 'complete' and I'll add it to the commit fest in a
 few minutes.

 Please split this into separate patches for database-level and
 table-level tracking, because I think it's highly likely that the latter
 will get rejected.  We have had multiple complaints already about the
 size of the stats file for databases with many tables.  I don't believe
 that it's worth bloating the per-table entries even further with this
 information.  Tracking reset time it per-database doesn't seem like a
 problem though.

Sorry, it's not possible to simply split this patch in two and then apply
just one of the pieces. I could split it, but using just the
'per-database' part would not make sense.

The problem is that right now, when stats are reset for a single table
or function, the timestamp is set just for that one item, not for the
whole database. So a blind split of the patch (using just the
per-database part) would mean the info about tables/functions is lost.

I can prepare an alternative patch, using just per-database timestamps, so
that even a reset for a single table/function would set the timestamp for
the whole database.

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread tv
 On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
 Well, not really - I haven't done any experiments with it. For two
 columns selectivity equation is

  (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

 where A and B are columns, dist(X) is number of distinct values in
 column X and sel(X) is selectivity of column X.

 Huh? This is the selectivity estimate for A = x AND B = y? Surely,
 if A and B are independent, the formula must reduce to sel(A) * sel(B),
 and I cannot see how that'd work with the formula above.

Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
conditional probability, as

   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

and the uniform correlation assumption, which makes it possible to replace
the conditional probabilities with constants. And those constants are then
estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

So it does not reduce to sel(A)*sel(B) exactly, as dist(A)/dist(A,B) is
just an estimate of P(B|A). The paper states that this works best for
highly correlated data, while for weakly correlated data it (at least)
matches the usual estimates.
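
For example, if A and B are independent and the values are distributed
roughly uniformly - so dist(A,B) ~ dist(A)*dist(B), sel(A) ~ 1/dist(A) and
sel(B) ~ 1/dist(B) - the equation above gives

   (dist(A)*sel(A) + dist(B)*sel(B)) / (2*dist(A,B))
      ~ (1 + 1) / (2*dist(A)*dist(B))
      = sel(A) * sel(B)

i.e. exactly the usual independence estimate.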

I don't say it's perfect, but it seems to produce reasonable estimates.

Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread tv
 On 2010-12-13 03:28, Robert Haas wrote:
 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.  For example, in the United States,
 there are over 40,000 postal codes, and some even larger number of
 city names, and doesn't the number of entries go as O(m*n)?  Now maybe
 this is useful enough anyway that we should Just Do It, but it'd be a
 lot cooler if we could find a way to give the planner a meaningful
 clue out of some more compact representation.
 A sparse matrix that holds only 'implicative' (P(A|B) > P(A*B)?)
 combinations? Also, some information might be deduced from others. For
 Heikki's city/region example, for each city it would be known that it is
 100% in one region. In that case it suffices to store only that
 information, since 0% in all other regions can be deduced. I wouldn't be
 surprised if storing implicatures like this would reduce the size to O(n).

OK, but I'll leave this for the future. My plan is to build a small PoC,
just to see whether the contingency-table + probability-estimates approach
works for the failure case mentioned by Heikki. I'd like to do this by the
end of this week, if possible.

I'll read the articles mentioned by Nathan Boley (thanks for those links -
if you have more of them, just let me know).

Once we have a solution that solves (or significantly improves) these
failure cases, we can make further plans (how to actually do that in the
code etc.).

BTW thanks for all the comments!

regards
Tomas



Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread tv
 On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Dne 13.12.2010 03:00, Robert Haas napsal(a):
 Well, the question is what data you are actually storing.  It's
 appealing to store a measure of the extent to which a constraint on
 column X constrains column Y, because you'd only need to store
 O(ncolumns^2) values, which would be reasonably compact and would
 potentially handle the zip code problem - a classic hard case rather
 neatly.  But that wouldn't be sufficient to use the above equation,
 because there A and B need to be things like column X has value x,
 and it's not going to be practical to store a complete set of MCVs for
 column X for each possible value that could appear in column Y.

 O(ncolumns^2) values? You mean collecting such stats for each possible
 pair of columns? Well, I meant something different.

 The proposed solution is based on contingency tables, built for selected
 groups of columns (not for each possible group). And the contingency
 table gives you the ability to estimate the probabilities needed to
 compute the selectivity. Or am I missing something?

 Well, I'm not real familiar with contingency tables, but it seems like
 you could end up needing to store a huge amount of data to get any
 benefit out of it, in some cases.  For example, in the United States,
 there are over 40,000 postal codes, and some even larger number of
 city names, and doesn't the number of entries go as O(m*n)?  Now maybe
 this is useful enough anyway that we should Just Do It, but it'd be a
 lot cooler if we could find a way to give the planner a meaningful
 clue out of some more compact representation.

Yes, storing a complete contingency table is not feasible in such cases.
My original proposal actually did not address this particular issue
(cities and ZIP codes) as it was based on simplified contingency tables
(with bins corresponding to current histograms, not to individual values).
So the number of values to store would grow much slower.

On the other hand, this generalization really makes it unusable in some
cases, and the issue we're discussing here (cities and ZIP codes) is one
of them. I think in such cases we could build a contingency table for MCV
and then use it to estimate those conditional probabilities we need, but I
expect it to be very tricky.

Thanks for the comments.

Tomas



[HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-11 Thread tv
Hi everyone,

I just wrote my first patch, and I need to know whether I missed something
or not. I haven't used C for a really long time, so sickbags on standby,
and if you notice something really stupid don't hesitate to call me an
asshole (according to Simon Phipps that proves we are a healthy open
community).

So what does the patch do (or what should it do)? It tracks when the stats
for a given object (database, table or function) were last reset. This is
useful when you take snapshots of the stats for analysis - when comparing
two snapshots, you have to know whether the stats were reset in between (in
that case the comparison usually yields random noise and automatic tools
get confused by this).

Tom Lane already recommended a workaround - firing the DBA who randomly
resets statistics - but I don't think that's a good solution. First, you
have to be the DBA's superior to be able to fire him ;-) Second, everyone
makes a mistake from time to time. Third, when there are functions to
reset stats, it's nice to provide such info as it makes life much easier.

And there are cases when you don't reset the stats explicitly but the data
are actually gone - e.g. after a restore or upgrade (OK, this is solvable
using pg_postmaster_start_time).

In short, I think it's a useful feature (I need it and I think there are
others). And I think it's not disruptive.

So what the patch actually does:

- extends PgStat_StatDBEntry, PgStat_StatTableEntry and
PgStat_StatFuncEntry with a new field (stat_reset_timestamp)

- adds functions to read current value from these fields
(pg_stat_get_db_last_stat_reset_time, pg_stat_get_last_stat_reset_time and
pg_stat_get_function_last_stat_reset_time)

- extends the system views with calls to these functions
(pg_stat_database, pg_stat_user_functions and pg_stat_all_tables)

The values are set like this:

- when a database is created, current timestamp is stored in
PgStat_StatDBEntry.stat_reset_timestamp
- by default all tables/functions inherit this timestamp
- when stats for a given table / function are reset, current timestamp is
stored in the stat_reset_timestamp (this happens in
pgstat_recv_resetsinglecounter)
- when stats for the whole database are reset, everything starts from
scratch (this happens in pgstat_recv_resetcounter)
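
With the patch applied, the new info is meant to be visible simply as (the
column name is the one used in the attached diff):

  SELECT datname, last_stat_reset FROM pg_stat_database;
  SELECT relname, last_stat_reset FROM pg_stat_all_tables;
  SELECT funcname, last_stat_reset FROM pg_stat_user_functions;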

What I'm not sure about:

- I really am not sure about the changes made in pg_proc.h. I'm not sure
how to assign OIDs to the new functions (I've simply chosen values that
are were not used in this file), and I'm not sure about the other columns
(I've copied and modified another function with the same parameter/return
types)

- I'm not sure if there are any other ways the stat entries can be
created. I've found two ways - directly (when asked for the stats e.g.
from pgstat_get_tab_entry), and indirectly (when processing stats from a
backend e.g. in pgstat_recv_tabstat).

regards
Tomas

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 346eaaf..0ee59b1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -310,6 +310,7 @@ CREATE VIEW pg_stat_all_tables AS
 pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
 pg_stat_get_last_analyze_time(C.oid) as last_analyze,
 pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
+   pg_stat_get_last_stat_reset_time(C.oid) as last_stat_reset,
 pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
 pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
 pg_stat_get_analyze_count(C.oid) AS analyze_count,
@@ -502,7 +503,8 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
 pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
 pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
-pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted
+pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
+pg_stat_get_db_last_stat_reset_time(D.oid) AS last_stat_reset
 FROM pg_database D;
 
 CREATE VIEW pg_stat_user_functions AS
@@ -512,7 +514,8 @@ CREATE VIEW pg_stat_user_functions AS
 P.proname AS funcname,
 pg_stat_get_function_calls(P.oid) AS calls,
 pg_stat_get_function_time(P.oid) / 1000 AS total_time,
-pg_stat_get_function_self_time(P.oid) / 1000 AS self_time
+pg_stat_get_function_self_time(P.oid) / 1000 AS self_time,
+pg_stat_get_function_last_stat_reset_time(P.oid) AS last_stat_reset
 FROM pg_proc P LEFT JOIN pg_namespace N ON (N.oid = P.pronamespace)
 WHERE P.prolang != 12  -- fast check to eliminate built-in functions
   AND pg_stat_get_function_calls(P.oid) IS NOT NULL;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c3c136a..f0b3453 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -249,6 

Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)

2010-12-11 Thread tv
 Hello

 you have to respect pg coding style:

 a) not too long lines
 b) not C++ line comments

OK, thanks for the notice. I've fixed those two problems.

regards
Tomas

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 346eaaf..0ee59b1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -310,6 +310,7 @@ CREATE VIEW pg_stat_all_tables AS
 pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
 pg_stat_get_last_analyze_time(C.oid) as last_analyze,
 pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
+   pg_stat_get_last_stat_reset_time(C.oid) as last_stat_reset,
 pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
 pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
 pg_stat_get_analyze_count(C.oid) AS analyze_count,
@@ -502,7 +503,8 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
 pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
 pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
-pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted
+pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
+pg_stat_get_db_last_stat_reset_time(D.oid) AS last_stat_reset
 FROM pg_database D;
 
 CREATE VIEW pg_stat_user_functions AS
@@ -512,7 +514,8 @@ CREATE VIEW pg_stat_user_functions AS
 P.proname AS funcname,
 pg_stat_get_function_calls(P.oid) AS calls,
 pg_stat_get_function_time(P.oid) / 1000 AS total_time,
-pg_stat_get_function_self_time(P.oid) / 1000 AS self_time
+pg_stat_get_function_self_time(P.oid) / 1000 AS self_time,
+pg_stat_get_function_last_stat_reset_time(P.oid) AS last_stat_reset
 FROM pg_proc P LEFT JOIN pg_namespace N ON (N.oid = P.pronamespace)
 WHERE P.prolang != 12  -- fast check to eliminate built-in functions
   AND pg_stat_get_function_calls(P.oid) IS NOT NULL;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c3c136a..f20a9ea 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -249,6 +249,8 @@ static void pgstat_sighup_handler(SIGNAL_ARGS);
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
 static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
 Oid tableoid, bool create);
+static PgStat_StatFuncEntry *pgstat_get_func_entry(PgStat_StatDBEntry *dbentry,
+Oid funcoid, bool create);
 static void pgstat_write_statsfile(bool permanent);
 static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
 static void backend_read_statsfile(void);
@@ -3129,7 +3131,9 @@ pgstat_get_db_entry(Oid databaseid, bool create)
result->n_tuples_updated = 0;
result->n_tuples_deleted = 0;
result->last_autovac_time = 0;
-
+
+   result->stat_reset_timestamp = GetCurrentTimestamp();
+
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
@@ -3196,11 +3200,48 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid 
tableoid, bool create)
result->autovac_vacuum_count = 0;
result->analyze_count = 0;
result->autovac_analyze_count = 0;
+
+   /* inherit stats reset time from dbentry */
+   result->stat_reset_timestamp = dbentry->stat_reset_timestamp;
+
}
 
return result;
 }
 
+/*
+ * Lookup the hash table entry for the specified function. If no hash
+ * table entry exists, initialize it, if the create parameter is true.
+ * Else, return NULL.
+ */
+static PgStat_StatFuncEntry *
+pgstat_get_func_entry(PgStat_StatDBEntry *dbentry, Oid funcoid, bool create)
+{
+   PgStat_StatFuncEntry *result;
+   bool        found;
+   HASHACTION  action = (create ? HASH_ENTER : HASH_FIND);
+
+   /* Lookup or create the hash table entry for this function */
+   result = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
+                                                 (void *) &(funcoid),
+                                                 action, &found);
+
+   if (!create && !found)
+       return NULL;
+
+   /* If not found, initialize the new one. */
+   if (!found)
+   {
+       result->f_numcalls = 0;
+       result->f_time = 0;
+       result->f_time_self = 0;
+
+       /* inherit stats reset time from dbentry */
+       result->stat_reset_timestamp = 

Re: [HACKERS] [GENERAL] Postgres 9.1 - Release Theme

2010-04-01 Thread tv
 Following a great deal of discussion, I'm pleased to announce that the
 PostgreSQL Core team has decided that the major theme for the 9.1
 release, due in 2011, will be 'NoSQL'.


Please provide me with your address so I can forward you the health care
bills I had to pay due to the heart attack I suffered this morning (while
reading your post).

BTW the PostgreSQL core team is not alone in realizing how obsolete
relational databases are:
http://thedailywtf.com/Articles/Announcing-APDB-The-Worlds-Fastest-Database.aspx

Tomas

